Thank you all for subscribing and supporting this newsletter. The community just grew to over 14,000 subscribers and followers. What brought you here? What are you most interested in? Let me know in the comments.
Rapid organic growth means a diversity of views and interests. It also means some of you won’t agree with everything I write, and that’s a good thing. If you only read newsletters and content that make you feel good about your ideas but don’t make you continuously reevaluate them with new information, it’s impossible to grow. I appreciate you being part of a community that challenges and disrupts.
Oh, Great, Another Benchmark
Why are we suddenly seeing so many AI agent benchmarks and standards being proposed? Why so much jockeying to define terms and own those definitions?
Back in 2015, a legacy Big Data influencer told me that if I wanted to do anything big in data science, I would have to start creating standards for the role. What we're seeing now with AI agent benchmarking is the exact same philosophy. It wouldn't have worked then, and it’s not working now. Just like that influencer’s advice, it’s all marketing with zero substance.
However, the economic and power-dynamics logic behind creating BS standards and benchmarks is sound. Proposing a standard and convincing enough people to adopt it gives you significant power to influence the direction of AI. That power can be used to put your chosen solution(s) on the throne, and there’s a lot of money to be made.
That's why so many AI labs, universities, companies, and research groups are trying to take the high ground. The problem with every benchmark that's been proposed is the people behind it. They are undeniably experts in the field of AI. However, I rarely see anyone who's an expert in the workflows they're benchmarking AI against.
The AI Last Mile Problem Strikes Again
Rejecting BS standards in 2015 forced me to solve tough problems. Data scientist was “the sexiest job of the 21st century” because we could make money fall from the sky. That’s the way the role was sold, so that became my benchmark: if the promise was real, data science should deliver more cash than alternative technologies.
Holding to that standard led to tough realizations. Data science wasn’t the best approach for every business problem and customer need. That forced me to develop strategy frameworks and opportunity discovery methods that match each technology to the problems where it creates more value than the alternatives. I had to address model reliability, and that completely changed the way I trained models.
I realized that modern businesses would run on platforms that leveraged multiple technologies. That forced me to create platform and product strategy frameworks that enabled multiple technologies to amplify each other. I had to address model integration and product design, which had even bigger impacts on my technical methods.
The danger of BS standards is that they let us avoid solving tough problems. If the standard is real-world utility, we are forced to confront the complexities of the last mile. If the standard is a contrived benchmark, we can get away with delivering the perception of utility.
Benchmarking Agents
Agents should be benchmarked against people, because that’s how they are being sold. However, most benchmarks still follow an academic paradigm, where standards must be consistent and repeatable, and that’s precisely the flaw in every benchmark that's being proposed.
Goodhart's Law explains that as soon as a standard or benchmark gains widespread adoption, it will be gamed. Gaming a benchmark means building solutions for the benchmark instead of the thing the benchmark is designed to test. I wrote an article about this same dynamic in social media marketing and consulting. Controlling the benchmarks means controlling the bar for success and failure, but a gamed benchmark quickly strays from actually measuring success.
As soon as you propose a stable benchmark that is widely accepted, every major AI lab will game the system by building their agentic product so it always passes or scores higher than competitors. As we've seen with recent scandals from companies like Meta, gaming those benchmarks, while lucrative, doesn't actually indicate any improved functionality in the AI itself. Benchmarks quickly devolve into marketing tools.
Most new agentic products that show groundbreaking progress on benchmarks are widely laughed at by the people who do the job the agent is allegedly capable of doing. Within weeks of every major release, practitioners and domain experts are shredding the new product online. For two years, whenever one of my posts about this goes viral, people have reached out to explain how the latest agent is wholly inadequate for their needs.
This is the AI Last Mile Problem. People building the standards know it's easier to deliver the perception of an outcome than it is to deliver the actual outcome. There’s easy money in the perception of value without all the work and expense of delivering that value. The major AI labs and agentic platform providers know the game and are willing to pay to play.
BS Standards Will Become Barriers
AI won't make any forward progress until we overcome this barrier. The work people get paid well to do is complex. Many of our workflows aren't stable or reproducible. There's nothing academic about the enterprise, so our benchmarks will never fit that paradigm. AI is a new product category that’s capable of managing new, more complex workflows.
That's why the standards that we've used to benchmark model accuracy and reliability for the last 15 to 25 years are wholly inadequate for the tasks we are putting them to. The only benchmark or standard we can bank on is human-level performance and real-world outcomes. Agentic systems are entering unfamiliar territory, and again, BS standards will fail us.
When we teach from the book of BS standards, the next generation of AI engineers and information architects gain skills that aren’t relevant to the jobs that must be done. They enter the workforce convinced that the wrong way is the best practice. Companies build job descriptions around BS standards, and teams are filled with people who think the same way.
I saw the same thing happen with data science, data engineering, and now AI engineering. This ends with another survey saying that over 80% of AI projects fail to deliver value, and customers rejecting AI products because they don’t work. If we let BS standards take over AI, we won’t solve the hard problems, and AI will fail to live up to its potential.
An Old Approach To Benchmarks
When I tested complex models for clients, I would run the model “in shadow.” Google Search used the same approach to deploy the machine learning that supported its golden goose, because it had to work. Methods change when real-world outcomes become the standard.
People would continue to do the work, but behind the scenes, the model was serving inference. We compared the model output to the human output. When the model looked like it was as good as the person doing the job, we would switch it on in production.
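Here’s a minimal sketch of what that shadow phase can look like in code. Everything in it is illustrative, not a specific framework: the `model.predict` call, the task shape, and the JSONL log file are all assumptions.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ShadowRecord:
    """One side-by-side comparison of human and model output for a task."""
    task_id: str
    human_output: str
    model_output: str
    match: bool
    timestamp: float

def run_in_shadow(task, human_output, model, log_path="shadow_log.jsonl"):
    """Serve inference behind the scenes while the person still does the work.

    The human output remains the system of record; the model's output is
    only logged for offline comparison and is never shown to the customer.
    """
    model_output = model.predict(task)  # hypothetical inference call
    record = ShadowRecord(
        task_id=task["id"],
        human_output=human_output,
        model_output=model_output,
        match=model_output.strip() == human_output.strip(),
        timestamp=time.time(),
    )
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return human_output  # production still ships the person's work
```

The key design choice is the last line: nothing the model does in shadow can affect the customer, which is what makes the comparison honest.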
People would continue to do their work, and they would be shown the model’s outputs. Part of their workflow was being automated, but they maintained autonomy and oversight. We logged how they audited the model and any changes they made to its outputs. That was the training data for another pass of model improvement.
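A hedged sketch of that audit logging, continuing the JSONL convention from the shadow sketch above; the `was_audited` flag and the pair-building logic are illustrative choices, not a prescribed pipeline.

```python
import json

def log_audit(task_id, model_output, final_output, was_audited=True,
              audit_log="audits.jsonl"):
    """Record whether a reviewer inspected and/or changed the model's output."""
    record = {
        "task_id": task_id,
        "model_output": model_output,
        "final_output": final_output,
        "was_audited": was_audited,
        # An edit only counts if someone actually looked at the output.
        "was_edited": was_audited and model_output != final_output,
    }
    with open(audit_log, "a") as f:
        f.write(json.dumps(record) + "\n")

def build_training_pairs(audit_log="audits.jsonl"):
    """Turn logged audits into (model output, corrected output) examples."""
    with open(audit_log) as f:
        records = [json.loads(line) for line in f]
    # Edited records carry the correction signal for the next training pass;
    # unchanged audited records are implicit approvals.
    return [(r["model_output"], r["final_output"])
            for r in records if r["was_edited"]]
```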
As the model’s reliability improved, the number of changes dropped to nearly zero. Once workers got comfortable with the model, the percentage of outputs they audited dropped too. That’s when we knew people could hand autonomy over to the model, with continuous monitoring replacing constant human oversight.
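The handover decision can then be gated on those two observed signals rather than a benchmark score. A sketch, reusing the audit records from above; the thresholds are illustrative assumptions, not recommendations:

```python
def ready_for_autonomy(records, change_rate_max=0.02, audit_rate_max=0.20):
    """Decide whether to hand autonomy to the model with continuous monitoring.

    change_rate: fraction of audited outputs the reviewer modified
                 (reliability signal; it should approach zero).
    audit_rate:  fraction of outputs reviewers inspected at all
                 (trust signal; it falls as comfort grows).
    """
    audited = [r for r in records if r["was_audited"]]
    if not records or not audited:
        return False  # no evidence yet, keep humans in the loop
    audit_rate = len(audited) / len(records)
    change_rate = sum(r["was_edited"] for r in audited) / len(audited)
    return change_rate <= change_rate_max and audit_rate <= audit_rate_max
```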
The “in shadow” paradigm became a new dimension of my product strategy maturity models. The most amazing AI will fail if no one adopts it. From a product design perspective, AI must be implemented to accelerate adoption. People won’t adopt products they don’t understand, even if they work. Customers don’t hand over autonomy to products they don’t trust, even if they work.
People interact with AI differently than other technology product categories. We must design and implement them differently, but you only learn that when your benchmark is real-world user adoption.
The same process works with agents. However, real-world benchmarks are neither cheap nor easy. Integrating models into workflows and reengineering processes to maximize the value delivered is complex work. The iterative, longitudinal method doesn’t fit the quick-and-easy BS AI benchmark paradigm. This is challenging, but it’s what actually works.
Eventually, we will develop simulations and playgrounds that automate these loops. NVIDIA and Salesforce are both working in this direction, but we’re still in the early days. Simulations will become a new phase of model training and a more effective approach to synthetic data. Simulations will reduce the “in shadow” and human-machine teaming validation times, but not eliminate the need for real-world validation prior to shipping an agent.
Agentic Systems Engineer Outcomes
The purpose of agentic systems is to detect intent and deliver the desired outcome. That’s what agents do that isn’t possible with other technology. BS AI benchmarks take the wind out of agentic platforms’ sails by focusing them on outcomes that don’t matter. You can’t know what information to gather if you’re not tracking the right outcomes.
In the next post in this series, I will explain two equally critical paradigms for agentic system design: information architecture and process reengineering.
How To Support This Newsletter
Consider subscribing to show your support for the community.
My award-winning, instructor-led certification course series continues with a new cohort of my AI Product Strategy Certification starting on July 12th.
If the instructor-led format doesn’t work for your schedule, I also offer self-paced versions of all my certifications. Learn more and enroll here.