In partnership with

The Problem With Measuring AI by How It Feels

Most engineering leaders I talk to have the same problem right now: their teams are using AI tools, things feel faster, but when the CEO asks “what’s the ROI on our AI investment?”, they struggle to give a confident answer.

They don’t have the numbers. Or worse, they have the wrong ones. One study shows developers completing tasks 56% faster with AI. Another shows them working 19% slower, while thinking they were 20% faster. And yet another finds that 95% of AI initiatives at companies fail entirely. The 2025 DORA Report calls this the “AI Mirror Effect”: AI amplifies both good and bad engineering practices. Teams with strong fundamentals get better. Teams without them get worse.

If you’re tracking “lines of code generated by AI” or “number of Copilot completions accepted”, you’re measuring activity, not impact.

Here’s what to actually measure and where to find the data.

What Leaders Think They Should Track

The instinct is to measure AI output directly. Completions accepted. Prompts per day. Percentage of code written by AI. These feel tangible.

But they tell you almost nothing useful.

A recent study from Multitudes tracked 500+ developers across multiple organizations through their AI rollouts, combining telemetry data, surveys, and interviews. One of the main findings? Engineers merged 27.2% more PRs after AI adoption, but also did 19.6% more out-of-hours commits. The productivity number looks great in a slide deck. The out-of-hours number tells a different story about what’s actually happening on your teams.

If you’re only tracking throughput, you’re missing half the picture.

The Metrics That Actually Matter

Think about AI metrics in three tiers: adoption, velocity, and quality. You need all three, and they need to be read together.

Tier 1: Adoption (Are people actually using it?)

  • Active usage rate. What percentage of your engineers are using AI tools weekly? Not installed. Not enabled. Actually using. Source: Copilot’s metrics API, Cursor/Claude Code admin dashboards, or engineering intelligence platforms if you’re multi-tool.

  • Workflow integration depth. Are developers using AI for just fancy autocomplete, or also for test generation, code review, documentation, and debugging?
    Source: A short quarterly survey (3-5 questions) asking which workflows engineers use AI for. Self-reported data is genuinely more useful here than telemetry.
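If you end up pulling raw usage data yourself instead of relying on a vendor dashboard, the weekly active rate is just a set union over daily activity. Here’s a minimal sketch; the input shape is hypothetical, so adapt it to whatever your tool’s metrics API actually returns:

```python
from datetime import date

def weekly_active_rate(daily_engaged, total_engineers):
    """Percentage of engineers who used an AI tool at least once this week.

    daily_engaged: dict mapping date -> set of usernames active that day
                   (hypothetical shape; adapt to your tool's API response).
    """
    weekly_users = set()
    for users in daily_engaged.values():
        weekly_users |= users
    return round(100 * len(weekly_users) / total_engineers, 1)

# Example: 3 days of activity across a 10-person team
daily = {
    date(2025, 6, 2): {"ana", "ben"},
    date(2025, 6, 3): {"ana", "cara"},
    date(2025, 6, 4): {"ben", "dev"},
}
print(weekly_active_rate(daily, 10))  # 40.0
```

The union matters: counting daily actives and summing them would double-count your regulars and overstate adoption.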

Adoption alone isn’t a success metric. It’s a prerequisite. The Multitudes research confirmed something important: buying AI tooling doesn’t guarantee adoption. The same tool, Cursor, showed wildly different adoption curves at two different organizations. The difference wasn’t the tool, it was how leadership supported the rollout.

One more thing worth watching: AI usage follows a Pareto distribution. In one org studied, the top 10% of users accounted for nearly 57% of total AI costs. Know who your super-users are; they’re your best internal teachers. Source: Your billing data. Sort AI spend by individual or team. The distribution will jump out immediately.
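Seeing that distribution takes one sort over your billing export. A quick sketch, assuming you can get per-user spend out of your billing data (the names and amounts below are made up):

```python
def top_decile_share(spend_by_user):
    """Fraction of total AI spend attributable to the top 10% of users."""
    amounts = sorted(spend_by_user.values(), reverse=True)
    top_n = max(1, len(amounts) // 10)  # at least one user
    return sum(amounts[:top_n]) / sum(amounts)

# Hypothetical monthly spend export for a 10-person team
spend = {"u1": 520, "u2": 80, "u3": 60, "u4": 40, "u5": 30,
         "u6": 20, "u7": 20, "u8": 20, "u9": 15, "u10": 15}
print(round(top_decile_share(spend), 2))  # 0.63
```

If that number is well above 0.1, usage is concentrated: go talk to the people at the top of the list.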

Tier 2: Velocity (Is it moving the needle?)

  • Merge frequency delta. The study showed a 27.2% increase in merge frequency across 372 contributors. But some of that may just be people working longer hours. Normalize for time worked.
    Source: Merged PR counts from GitHub/GitLab’s API. Multitudes, LinearB, or Sleuth can show pre/post AI trendlines. The key is having a baseline from before your rollout.

  • Time-in-phase reduction. Where is time actually shrinking? Coding? Review? Testing? This tells you where AI helps, not just whether it helps.
    Source: Phase-level cycle time breakdowns from Multitudes, LinearB or Sleuth.
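To make “normalize for time worked” and “have a baseline” concrete, here’s a minimal sketch of a merge frequency delta computed from merged-PR dates (pulled from your Git host’s API); the rollout date, window length, and dates are hypothetical:

```python
from datetime import date, timedelta

def merges_per_week(merge_dates, start, end):
    """Average merged PRs per week inside the half-open window [start, end)."""
    count = sum(start <= d < end for d in merge_dates)
    return count / ((end - start).days / 7)

def merge_frequency_delta(merge_dates, rollout, window_days=28):
    """Percent change in merge frequency, comparing equal-length
    windows before and after the AI rollout date."""
    w = timedelta(days=window_days)
    before = merges_per_week(merge_dates, rollout - w, rollout)
    after = merges_per_week(merge_dates, rollout, rollout + w)
    return 100 * (after - before) / before

# Hypothetical merge dates around an April 1 rollout
merges = [date(2025, 3, 5), date(2025, 3, 10), date(2025, 3, 18), date(2025, 3, 25),
          date(2025, 4, 2), date(2025, 4, 5), date(2025, 4, 10),
          date(2025, 4, 15), date(2025, 4, 20), date(2025, 4, 25)]
print(merge_frequency_delta(merges, date(2025, 4, 1)))  # 50.0
```

Equal-length windows are the point: comparing a 2-week “after” to a 12-week “before” (or ignoring holidays and crunch periods) will give you a delta that flatters whichever side you want it to.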

Here’s the trap: Faros AI tracked 10,000+ developers and found that individually, developers completed 21% more tasks and merged 98% more PRs. But organizational delivery stayed flat. Review time increased 91%. The gains got absorbed by downstream bottlenecks. Compare each team to its own baseline and always pair velocity metrics with quality and wellbeing signals.

Tier 3: Quality (Is it helping or hurting?)

  • PR size trends. PR sizes typically increase after AI rollouts because AI-generated code tends to be verbose. Larger PRs mean more bugs and harder reviews, which makes rising PR size a leading indicator of quality erosion.
    Source: GitHub’s PR data (additions/deletions). Multitudes and LinearB track trends over time.

  • Change Failure Rate (CFR). What percentage of deployments cause production incidents? If AI-generated code is shipping faster but breaking more often, CFR is where that shows up.
    Source: Sleuth calculates this from actual deployment events. LinearB and Multitudes also track it as a core DORA metric.

  • Mean Time to Repair (MTTR). How quickly can your team fix bugs that make it to production? If AI is helping developers write code faster but that code is harder to debug or introduces more defects, MTTR will spike. Especially important to track alongside PR size increases: verbose AI-generated code in larger PRs can make bugs harder to locate and fix.
    Source: Track incident/bug tickets from creation to resolution in Jira, Linear, or GitHub issues. Sleuth ties this directly to deployment events.

  • Out-of-hours work. A 19.6% increase in after-hours commits is a sustainability metric. If “productivity gains” come from people just working more, that’s not an AI win; it’s a possible burnout risk.
    Source: Multitudes tracks this against configured working hours. Or pull commit timestamps from Git and bucket them using the commit author’s timezone.
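If you go the raw-Git route for the out-of-hours metric, here’s a rough sketch. The 9-to-6 weekday window is an assumption you should tune per team; git records each commit timestamp with the author’s local UTC offset, so the local hour comes for free:

```python
from datetime import datetime, timezone, timedelta

def out_of_hours_share(commit_times, start_hour=9, end_hour=18):
    """Fraction of commits made outside working hours, judged in the
    author's local timezone (git stores each timestamp with its offset)."""
    out = sum(
        not (start_hour <= t.hour < end_hour) or t.weekday() >= 5  # weekend
        for t in commit_times
    )
    return out / len(commit_times)

# Timestamps as git stores them: local wall time plus a UTC offset
commits = [
    datetime(2025, 6, 2, 10, 30, tzinfo=timezone(timedelta(hours=-5))),  # Mon 10:30, in hours
    datetime(2025, 6, 2, 22, 15, tzinfo=timezone(timedelta(hours=-5))),  # Mon 22:15, out
    datetime(2025, 6, 7, 11, 0, tzinfo=timezone(timedelta(hours=1))),    # Saturday, out
    datetime(2025, 6, 3, 14, 0, tzinfo=timezone(timedelta(hours=1))),    # Tue 14:00, in hours
]
print(out_of_hours_share(commits))  # 0.5
```

Report this at team level only; per-person after-hours charts are exactly the kind of surveillance that erodes trust.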

These metrics matter because the quality evidence keeps growing. Carnegie Mellon studied 800+ repos using Cursor and found static analysis warnings rose 30% post-adoption. GitClear’s analysis of 153M+ lines found 41% higher code churn in AI-generated code. The code compiles. It passes basic tests. But it creates maintenance debt that shows up months later.

If velocity goes up but quality goes down, you don’t have a productivity gain. You have a delayed cost.

The Quality Lever Most Leaders Underestimate

Here’s the most encouraging finding from the research: one organization actually decreased PR sizes by 8.5% after their AI rollout while everyone else saw increases.

How? Two things: strong code review norms and clear expectations from leadership. Their leaders explicitly told teams they wouldn’t review PRs that were too long. So engineers figured out how to use AI to write tighter code instead of more code. They ran AI reviews before human reviews to catch obvious issues. The culture pointed the tool in the right direction.

The takeaway: with the right norms, leaders can steer AI toward quality, not just speed. DORA’s findings back this up. The teams that benefited most from AI were the ones that already had strong engineering fundamentals in place.

How to Actually Implement This

Start with three metrics. Pick one from each tier: active weekly usage rate (adoption), merge frequency delta (velocity) and MTTR (quality). Run for 30 days against your pre-AI baselines. Layer in more based on what questions come up.

Missed the baseline window? Most teams did. Tools like Multitudes, LinearB, or Sleuth can retroactively calculate metrics from your Git history. Your commit data is your baseline. If that’s not possible, start now and compare quarter-over-quarter.
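Rebuilding a baseline from history can be as simple as bucketing merge-commit timestamps by quarter. A small sketch, assuming you feed it the output of `git log --merges --format=%cI` (one ISO-8601 timestamp per line); the sample timestamps are made up:

```python
from collections import Counter
from datetime import datetime

def merges_by_quarter(iso_timestamps):
    """Bucket merge-commit dates into calendar quarters.

    Feed it lines from:  git log --merges --format=%cI
    """
    counts = Counter()
    for ts in iso_timestamps:
        d = datetime.fromisoformat(ts)
        counts[f"{d.year}-Q{(d.month - 1) // 3 + 1}"] += 1
    return dict(counts)

lines = ["2025-01-15T10:00:00+00:00", "2025-02-01T09:30:00+00:00",
         "2025-04-20T16:45:00+00:00"]
print(merges_by_quarter(lines))  # {'2025-Q1': 2, '2025-Q2': 1}
```

Note the `--merges` flag only counts merge commits; if your team squash-merges, count PR merge events from your Git host’s API instead.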

Instrument, don’t surveil. Frame metrics as team-level learning, not individual tracking. The moment someone sees “Developer X accepted 40% fewer AI suggestions than Developer Y”, trust evaporates.

Watch for hidden costs. Delivery pressure actually slowed AI adoption among senior engineers in the Multitudes research. Under crunch, seniors reverted to familiar workflows while juniors kept experimenting. If you’re rolling out AI during a high-pressure period, your adoption numbers may be misleading.

Report outcomes, not activity. Don’t say “Copilot usage is up 30%.” Say “Merge frequency increased 27%, PR sizes held steady, and out-of-hours work stayed flat.” That sentence describes what AI is actually doing for the business, not how busy the tool is.

Leadership Action Item of the Week

Before your next planning cycle, ask yourself: am I measuring AI activity or AI outcomes? And do I know the hidden costs alongside the gains?

The organizations pulling ahead are tracking productivity, quality and team wellbeing together. Pick one metric from each tier. Share them with your team next week, not as a mandate, but as a conversation starter about what AI is actually doing for you. Revisit in 30 days.

The leaders who figure out AI measurement now will know, earlier than everyone else, where to double down and where to pull back.

What’s Next?

  • Setting Clear Expectations in Growing Teams

  • How to Scale Without Burning Out Your ICs

  • Velocity vs Durability: Pick Both

  • AI in Interviews: What’s Working and What’s Not

  • Code Reviews That Actually Catch What Matters

  • Developer Productivity Beyond AI Coding Tools

Want something covered? Hit reply and tell me. I love hearing what you’re dealing with.

Work With Me

Resume Review

A detailed review of your resume with specific, actionable feedback to strengthen your story, highlight impact, and position you for Engineering IC or Leadership roles.

Mock Interviews

A practice session tailored to Engineering IC or Leadership roles. You’ll get structured feedback, real scenarios, and clarity on what interviewers actually look for.

1:1 Mentorship

A session focused on your career growth, navigating leadership challenges and building a roadmap toward your next role.

📬 Reply back to this email to book a 30 min session (free for subscribers!)

Meme of The Week

Have you tried OpenClaw yet? Be careful of the security risks! 💀

When it all clicks.

Why does business news feel like it’s written for people who already get it?

Morning Brew changes that.

It’s a free newsletter that breaks down what’s going on in business, finance, and tech — clearly, quickly, and with enough personality to keep things interesting. The result? You don’t just skim headlines. You actually understand what’s going on.

Try it yourself and join over 4 million professionals reading daily.

That’s a wrap for this week’s issue of CodingBeenz! 👩‍💻

If your AI metrics dashboard only shows activity, you’re telling half the story. Track adoption, velocity, and quality together. That’s how you protect your budget and your team.

Until next time,

Sabeen

P.S.

If you’re new here, grab your virtual beanbag, settle in, and feel free to share this with a fellow leader trying to prove their AI investment is actually working. 💡
