Dear CIO,
The AI era has given leaders something we have always had a soft spot for: more numbers. We can now measure how much code assistants generate, how many suggestions get accepted, how many pull requests get opened, how often people commit, and even how many tokens get consumed along the way. In a lot of organizations, those numbers have quickly become the shorthand for progress. More AI-written code must mean more productivity. More PRs must mean we’re moving faster. More tokens must mean more output. It is a neat story, but it is also the wrong one.
Deming spent a good part of his life warning us about this exact pattern, which confuses visible activity with real improvement. He was not interested in whether a part of the system looked busier. He cared about whether the system as a whole was actually getting better. By that standard, a lot of today’s conversation around AI metrics feels less like progress and more like a familiar mistake. We are counting what is easy to count and treating it like insight.
Best Regards,
John, Your Enterprise AI Advisor

The Metrics Trap in the Age of AI
What Deming Would Say About AI Metrics

“I’d say maybe 20%, 30% of the code that is inside of our repos today and some of our projects are probably all written by software.”
“Today, more than a quarter of all new code at Google is generated by AI, then reviewed and accepted by engineers. This helps our engineers do more and move faster.”
“There is zero human authoring. Engineers review and approve, but the code is written entirely by AI agents.”
When everything becomes measurable
One of the big shifts with AI tools is that they have made software work more visible. Things that used to sit inside judgment calls or messy engineering trade-offs now show up on dashboards: suggested lines, accepted completions, generated diffs, token usage, commit counts, PR volume. Once those numbers exist, they start to shape the narrative. If the graphs are going up, it feels like things must be improving, but visibility is not the same as meaning.
A number can be precise and still tell you nothing useful. Worse, it can point you in the wrong direction. That’s the issue with a lot of current AI metrics. The counts themselves aren’t necessarily wrong, but the conclusions we draw from them often are. More AI-generated code does not tell us the software is better. More commits do not tell us that delivery has improved. More pull requests don’t mean customers are seeing value any faster. And token consumption, by itself, tells us almost nothing. These are not measures of improvement. They are measures of motion.
Throughput is not flow
This is where the 2025 DORA framing becomes especially useful. DORA now distinguishes between throughput and instability in software delivery performance. Throughput concerns how changes move: lead time, deployment frequency, and recovery time. Instability concerns what happens when they land: change failure rate and rework. The point is straightforward but profound. Speed alone does not describe delivery performance. Neither does volume. You cannot understand the health of the system by looking only at how much is moving. (dora.dev)
“Improvements made anywhere besides the bottleneck are an illusion”
That distinction gets very close to what Deming would have cared about: the difference between throughput and flow. Throughput is easy to celebrate, but flow is harder. Flow asks whether work moves through the whole system with less friction, less delay, less harmful variation, and less waste. Does a change move cleanly from idea to review, from review to test, from test to deployment, from deployment to stable operation? Or does faster code generation simply create more review burden, more integration problems, more operational noise, and more downstream rework?
This is the central weakness in the current AI productivity story. What many organizations are calling productivity is really just throughput in disguise. They are measuring how much material enters the pipeline, not whether value flows more effectively through the system. DORA’s 2025 work makes the warning explicit: AI can improve throughput while increasing instability if the underlying system is weak. That is almost a Deming thesis in modern language. A subsystem can become more productive while the whole becomes less predictable. (dora.dev)
More code is not the same as more value
One of the clearest signs of confusion in the AI metrics conversation is the fascination with code volume. The assumption seems obvious: if AI helps teams produce more code, then teams must be becoming more productive. Deming would have rejected that logic outright. More code may mean more value, but it may also mean more churn, more duplication, more speculative implementation, more maintenance burden, more risk, and more defects introduced upstream and discovered later. It may mean the system is generating artifacts faster while degrading the quality of the decisions embedded in them.
Code has never been a clean proxy for value. AI makes the problem worse by dramatically lowering the cost of producing code-like output. When generation becomes easier, the presence of more output becomes less informative. That is why percentages like “how much of our code was written by AI” are so weak as management indicators. They tell us that something was produced. They do not tell us whether the product improved, whether customers benefited, whether maintainability increased, or whether the organization’s capacity to change software got better. Deming would see this clearly as managers mistaking increased local production for improved system performance.
Optimizing the part, not the whole
Suppose AI helps developers draft software faster. At first glance, that sounds like progress. Deming, though, would immediately ask what happens next. Does review get easier or harder? Does integration smooth out, or become more fragile? Does testing shrink or expand? Does operational stability improve, or does the system pay for faster generation with more incidents and more cleanup?
These questions matter because local speedups often push costs elsewhere. A faster upstream process can flood downstream steps with more work, greater variation, and longer delays. The system appears faster in one place while becoming worse overall. That is one of Deming’s oldest lessons: improving a part does not necessarily improve the whole. In fact, it often damages the whole when managers fail to understand the system’s interdependencies.
This is why AI metrics focused on generation are so incomplete. They treat drafting as the story. But drafting is only one stage in a long chain of engineering work. If AI reduces the effort of producing code while increasing the effort of reviewing, validating, integrating, and operating it, then the organization has not necessarily improved at all. It may simply have moved labor into less visible parts of the process.
When the metric becomes the target
Deming also understood that measurement does not simply describe behavior. It shapes it. Once leaders begin celebrating AI-generated code share, commit volume, or PR count, teams adapt. Engineers split work into more pull requests. Commits become more frequent because frequency is noticed. Boilerplate expands because output is rewarded. Model usage rises because visible AI activity becomes culturally synonymous with productivity. The organization learns to perform for the dashboard.

This is not an accidental side effect. It is one of the predictable consequences of using narrow numerical targets without a theory of the system behind them. People do what the measure seems to reward. Over time, the metric stops reflecting the work and starts distorting it.
That is the danger of many current AI metrics. They encourage performative productivity: more measurable output without corresponding gains in quality, predictability, or customer value. A team can improve at producing artifacts while worsening at engineering.
Token spend is not output
Nothing reveals the confusion more clearly than the emerging use of token cost as a performance signal. Some organizations now talk about token consumption as if it were evidence of work accomplished, or even a marker of elite engineering. More spending, the thinking goes, implies more engineering activity. Token efficiency is discussed as though it were analogous to labor productivity.
Deming would likely find this absurd. Token consumption is not output. It is an input. It is much closer to electricity used on a factory floor than finished goods leaving the line. High token usage might reflect effective exploration of a hard problem. It might also reflect duplicated effort, poor prompting, weak context management, or indiscriminate dependence on the model.
By itself, it tells us almost nothing about whether the system improved. At best, token cost can function as a secondary process indicator when paired with outcomes that matter: lower lead time, fewer escaped defects, reduced rework, better documentation, faster onboarding, and cleaner migrations. But treated as a standalone sign of increased output, it is almost meaningless. Deming would likely call it a category error.
What he would ask instead
If Deming were looking at AI in engineering organizations, he would redirect attention from artifacts to performance. Are defects going down? Is rework shrinking? Is delivery becoming more predictable? Is operational stability improving? Are customers seeing faster resolution of real problems? Are teams learning faster? Is the variation between teams narrowing? Is dependence on heroics being reduced? Is the organization becoming more capable, or merely more active?
These are harder questions, but they are better ones. They also point toward the deeper promise of AI. The real promise is not that AI can produce more code. It is that AI might help organizations understand systems faster, reduce friction in change, improve feedback loops, strengthen learning, and build software more effectively. If those things are not improving, then a rise in generated output is mostly noise.
The problem is not measurement. It is bad measurement.
Deming was never against data. He was against shallow managerial use of data. He would not say that AI-generated code share, pull request counts, commit volume, or token spend should never be measured. He would say they are being overpromoted. They are secondary signals, not primary evidence of improvement. They may help diagnose behavior in the process, but they should never be mistaken for outcomes.
Used carefully, they can inform inquiry. Used carelessly, they become theater. A Deming-compatible approach to AI metrics would subordinate them to measures that reflect the health of the system: lead time, deployment frequency, recovery time, rework, reliability, maintainability, customer outcomes, and organizational learning. DORA’s current guidance points in the same direction: software delivery performance should be understood through multiple indicators, not collapsed into a single, simplistic productivity story.
What would Deming say now?
If Deming had to summarize the problem in one sentence, it might be this: You are measuring what is easy to count, not what matters.
And if he wanted to be blunter, perhaps this:
If AI results in more code, more commits, more pull requests, and more token consumption, that may be evidence of improvement, or of the system becoming better at producing noise. Without understanding quality, variation, and the performance of the whole, you do not know which.

That is the real weakness in today’s AI metrics. They reward throughput without proving flow. They highlight the rate at which artifacts are produced while leaving unanswered the harder question of whether the system delivers better software with less instability, less waste, and less hidden cost.
Deming would not be anti-AI, but he would be anti-delusion. Much of today’s AI measurement discourse, however sophisticated it appears, is still dangerously close to mistaking output for progress.


Artificial Intelligence didn’t start in Silicon Valley. It began with centuries of thinkers who refused to treat intelligence as something mystical. Inspired by Rebels of Reason, this live 8-part biweekly course (starting April 22nd) traces AI as a long intellectual journey, not a hype cycle, exploring how machines learned to count, reason, search, learn, and ultimately generate language through the ideas that made it possible. With no heavy math or prerequisites, it focuses on the breakthroughs that shaped modern AI, perfect for technologists, leaders, students, and anyone trying to understand what’s really behind today’s AI moment. Register Here.
The Artificially Intelligent Enterprise explains why your AI tools are missing context.
AI Tangle covers OpenClaw’s ChatGPT moment and a $25 Billion chip factory.

Dear CIO is part of the AIE Network, a network of over 250,000 business professionals who are learning and thriving with Generative AI. Our network extends beyond the AI CIO to The Artificially Intelligent Enterprise for AI and business strategy; AI Tangle, a twice-a-week update on AI news; The AI Marketing Advantage; and The AIOS, for busy professionals who are looking to learn how AI works.



