We’re building Weave: an ML-powered tool for measuring engineering output that actually understands engineering output!
Why? Here’s the thing: almost every eng leader already measures output - either openly or behind closed doors. But they rely on metrics like lines of code (correlation with effort: ~0.3), number of PRs, or story points (slightly better at ~0.35). These metrics are, frankly, terrible proxies for productivity.
We’ve developed a custom model that analyzes code and its impact directly, with a far better 0.94 correlation. The result? A standardized engineering output metric that doesn’t reward vanity. Even better, you can benchmark your team’s output against peers while keeping everything private.
Although this one metric is much better than anything else out there, of course it still doesn't tell the whole story. In the future, we’ll build more metrics that go deeper into things like code quality and technical leadership. And we'll build actionable suggestions on top of all of it to help teams improve and track progress.
After testing with several startups, the feedback has been fantastic, so we’re opening it up today. Connect your GitHub and see what Weave can tell you: https://app.workweave.ai/welcome.
I’ll be around all day to chat, answer questions, or take a beating. Fire away!
This metric has no opinion on the nature of that functionality (i.e. it's not evaluating product decisions). So it doesn't tell the whole story, but it tells a much more accurate story than LOC or whatever other metrics people are using currently!
But if the metric is a PROXY for the goal, then the metric becomes the objective (not the actual goal).
In this case, whatever this AI is measuring is going to be exactly what every dev drops into their GitHub Copilot prompt instructions.
Metrics are harder on software engineers: the good ones delete code, and the best ones make sure useless code never gets written in the first place. How do you measure that?
https://github.com/PostHog/posthog/pull/25056: 15.266 (Adds backend, frontend, and tests for a new feature)
https://github.com/microsoft/vscode/pull/222315: 8.401 (Refactors code to use a new service and adds new tests)
https://github.com/facebook/react/pull/27977: 5.787 (Small change with extensive, high effort tests; approximately 1 day of work for expert engineer)
https://github.com/microsoft/vscode/pull/213262: 1.06 (Mostly straightforward refactor; well under 1 day of work)
For example, the first PR is correlated with ~15 "hours of work for an expert engineer".
Looking at the PR, it was opened on Sept 18th and merged on Oct 2nd. That's two weeks, or 10 working days, later.
Between the initial code, the follow up PR feedback, and merging with upstream (8 times), I would wager that this took longer than 15 hours of work on the part of the author.
It doesn't _really_ matter, as long as the metrics are proportional, but it may be better to refer to them as isolated complexity hours, as context-switching doesn't seem to be properly accounted for.
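A quick sketch of the arithmetic behind that comparison, for anyone who wants to check it. The year (2024) and the 8-hour working day are assumptions not stated in the thread; the weekday count comes out the same for any 14-day span.

```python
from datetime import date, timedelta


def working_days(start: date, end: date) -> int:
    """Count Monday-Friday days in the half-open range [start, end)."""
    days = 0
    current = start
    while current < end:
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            days += 1
        current += timedelta(days=1)
    return days


# Assumption: the year is 2024; the comment only says "Sept 18th" and "Oct 2nd".
opened = date(2024, 9, 18)
merged = date(2024, 10, 2)

elapsed_days = working_days(opened, merged)

# Assumption: an 8-hour working day, used only to put both numbers in hours.
HOURS_PER_DAY = 8
calendar_budget = elapsed_days * HOURS_PER_DAY

# Score quoted in the thread for the first PR, read as "expert hours".
score_hours = 15.266
share = score_hours / calendar_budget

print(f"{elapsed_days} working days, {calendar_budget} h elapsed, score covers {share:.0%}")
```

Under those assumptions the score covers roughly a fifth of the elapsed working time, which is consistent with the point that review cycles and context-switching are not part of the number.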
Ultimately, though, what matters isn't the absolute number but the relative difference (e.g. from PR to PR, or from team to team) - that's why we show industry benchmarks and make it easy to compare across teams!
What we don't capture is any product or communication overhead - however, our platform has other metrics that can help determine whether these are causing inefficiencies :)
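To make the "relative, not absolute" reading concrete, here is a minimal sketch that rescales the four example scores quoted earlier in the thread against the smallest one. The `relative_to` helper and the short PR labels are hypothetical names for illustration, not part of the Weave platform.

```python
# Scores as quoted in the thread for the four example PRs.
scores = {
    "PostHog/posthog#25056": 15.266,
    "microsoft/vscode#222315": 8.401,
    "facebook/react#27977": 5.787,
    "microsoft/vscode#213262": 1.06,
}


def relative_to(baseline_key: str, scores: dict[str, float]) -> dict[str, float]:
    """Express every score as a multiple of the chosen baseline PR's score."""
    base = scores[baseline_key]
    return {pr: value / base for pr, value in scores.items()}


# Hypothetical choice of baseline: the "mostly straightforward refactor" PR.
ratios = relative_to("microsoft/vscode#213262", scores)

for pr, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{pr}: {ratio:.1f}x the baseline")
```

Read this way, the unit ("hours") drops out entirely and only the ordering and the ratios between PRs carry meaning.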
Building this has been a wild ride. The challenge of measuring engineering output in a way that’s fair and useful is something we’ve thought deeply about—especially because so many of the existing metrics feel fundamentally broken.
The 0.94 correlation is based on rigorous validation with several teams (happy to dive into the details if anyone’s curious). We’re also really mindful that even the best metrics only tell part of the story—this is why our focus is on building a broader set of signals and actionable insights as the next step.
Would love to hear your thoughts, feedback, or even skepticism—it’s all helpful as we keep refining the product.
> Is it productive if I have tried many changes in branches and none of them made it to prod?
Our metric measures displacement, not distance - under the assumption that the end state is the part that matters the most. It will notice if the resulting change has a higher cognitive load and evaluate it accordingly - but if there is no resulting change then ultimately there's no output to measure.
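A toy illustration of the displacement-versus-distance analogy, modelling each state of a branch as a set of lines. This is only a sketch of the analogy under that simplifying assumption, not a description of how the actual model scores anything.

```python
def distance(snapshots: list[frozenset[str]]) -> int:
    """Total churn: lines that differ between each consecutive pair of snapshots."""
    return sum(len(a ^ b) for a, b in zip(snapshots, snapshots[1:]))


def displacement(snapshots: list[frozenset[str]]) -> int:
    """Net change: lines that differ between the first and the last snapshot."""
    return len(snapshots[0] ^ snapshots[-1])


# Hypothetical branch history: an experiment is added and then fully reverted.
start = frozenset({"def login(): ..."})
attempt = start | {"cache = {}", "def warm_cache(): ..."}
reverted = start

history = [start, attempt, reverted]

print("distance:", distance(history))          # churn across all attempts
print("displacement:", displacement(history))  # what the end state actually changed
```

In this toy history the author did real work (non-zero distance), but the end state is identical to the start, so a displacement-style measure sees nothing to score.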
Let me know if you have any questions that aren't answered there!
> We’ve developed a custom model that analyzes code and its impact directly...
This is a bold claim, all things considered. Don't you need to fine-tune this model for every customer, as their business metrics are likely vastly different? How do you measure the impact of refactoring? What about regressions or design mistakes that surface themselves after months or even years?
> How do you measure the impact of refactoring?
The metric def gives credit for refactoring
> What about regressions or design mistakes that surface themselves after months or even years?
Not captured (part of why it's only an important part of the story, not the whole story :))
I totally get this - that's how I felt initially, but I was shocked to find that the vast majority of orgs are using bad metrics like LOC or commit counts anyway. Our belief is that replacing those with something much more accurate can help the entire industry.
How am I going to do that with your blackbox metrics when this need arises?
Also, I don't have a Google account, so I can't even get past your front page, which has no info?
But you see, the AI scored your productivity at 47%, barely "meets expectations", while we expect everyone to score at least 72%, "exceeds expectations". How is that calculated? The AI is a state of the art proprietary model, I don't know the details...
Anyways, we've got to design a Personal Improvement Plan for you. Here's what our AI recommends. We'll start with the TPS reports..."
To be clear we're not claiming this is 1 number to holistically evaluate an entire engineer. Rather we're giving a much more accurate picture of output, which most orgs are already measuring (with terrible accuracy). It should be an important part of the picture but certainly not the whole story!
And fwiw I think the scores are pretty transparent - in the platform, you can drill down into any number and see the actual PRs and their output measurements. Of course the underlying model is more complex but unfortunately simpler models are not sufficient to capture the way engineering output works.
Is this generally just sniffing surface quality and quantity of written code, or is consideration given to how architecturally sound the system is built, whether the features introduced and their implementations make sense, how that power is exposed to users and whether the UI is approachable and efficient, user-feedback resulting from the effort, long-term sustainability and technical debt left behind (inadvertently or with deliberation), healthy practices for things like passwords & sensitive data, etc?
I'm glad to see an effort at capturing better metrics, but my own feeling is trying to precisely measure developer productivity is like trying to measure IQ - it's a flawed errand and all you wind up capturing is one corner of a larger picture. Your website shares zero information prior to login, and I'm looking forward to you elaborating a little more on your offering!
EDIT: Would also love to hear feedback from developers at the startups you tested at - did they like it and feel it better reflected their efforts during periods they felt productive vs. not, was there any initial or ongoing resistance & skepticism, did it make managers more aware of factors not traditionally captured by the alternative metrics you mentioned, etc.
Evaluated on a proprietary data set of manually labelled PRs
> Is this generally just sniffing surface quality and quantity of written code...
Somewhere in between the two :) A PR with a poorly and quickly implemented login will have a lower output score than a PR with a robust, well-designed and tested login, simply because the latter is more effort. But there isn't (yet!) a metric to quantify the relative quality. So our metric doesn't tell the full story, but it gives more info than would have previously been available.
Some such changes have been among my most impactful ones.
https://blog.pragmaticengineer.com/the-product-minded-engine...
The information for effort is not available at the code level - sorry to burst your bubble.
But to your broader point, I think there certainly is information about effort at the code level. Consider for example these two PRs: https://github.com/PostHog/posthog/pull/23858 and https://github.com/microsoft/vscode/pull/209557. It's pretty easy to tell which one was more effort even if you don't know anything about the process for how they were implemented.
Do you have any shareable examples you want me to test out? Or of course you can try it yourself :)