Code Review for Claude Code

(claude.com)

83 points · adocomplete · 3mo ago · 48 comments

48 comments · 15 top-level

CharlesW3mo ago· 13 in thread

Interesting: "Reviews are billed on token usage and generally average $15–25, scaling with PR size and complexity."

cbovis3mo ago

This cost seems wild. For comparison, GitHub Copilot Code Review is four cents per review once you're outside of the credits included with your subscription.

8cvor6j844qw_d63mo ago

Same thoughts.

For comparison, Greptile charges $30 per month for 50 reviews, with $1 per additional review.

At an average of $15–25 per review, this is way more expensive.
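
A rough sketch of that comparison, using only the prices quoted in this thread. The 100 reviews per month is a made-up volume, and the Copilot figure counts overage only and ignores the subscription fee, so treat the output as an order-of-magnitude check and not a real quote.

```python
# Prices are the ones quoted in this thread; real plans may differ.
COPILOT_OVERAGE_PER_REVIEW = 0.04   # per review once included credits are used up
GREPTILE_BASE = 30.0                # per month, covers the first 50 reviews
GREPTILE_INCLUDED = 50
GREPTILE_EXTRA_PER_REVIEW = 1.0
CLAUDE_PER_REVIEW = (15.0, 25.0)    # quoted average range per review


def greptile_monthly(reviews: int) -> float:
    """Flat fee plus $1 for every review beyond the included 50."""
    extra = max(0, reviews - GREPTILE_INCLUDED)
    return GREPTILE_BASE + extra * GREPTILE_EXTRA_PER_REVIEW


def claude_monthly(reviews: int) -> tuple[float, float]:
    """Low and high monthly estimate from the quoted per-review range."""
    low, high = CLAUDE_PER_REVIEW
    return reviews * low, reviews * high


def copilot_overage_monthly(reviews: int) -> float:
    """Overage cost only, assuming every review is outside the included credits."""
    return reviews * COPILOT_OVERAGE_PER_REVIEW


n = 100  # hypothetical number of reviews per month
low, high = claude_monthly(n)
print(f"Copilot overage: ${copilot_overage_monthly(n):,.2f}")
print(f"Greptile:        ${greptile_monthly(n):,.2f}")
print(f"Claude:          ${low:,.2f} to ${high:,.2f}")
```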

SkyPuncher3mo ago

Yea, but copilot review is useless. The noise it generates easily costs that much in wasted time.

satvikpendem3mo ago

Not sure about that, I find it's generally pretty good at finding niche issues that aren't as easily caught by humans. Especially with newer LLMs it gets even better.

Bnjoroge3mo ago

eh works fine for me, much better than I expected.

gnorst3mo ago

I don't know how good Claude's reviews are but I have yet to get a worthwhile GitHub Copilot review.

SkyPuncher3mo ago

Senior+ engineers easily make $100+ an hour. This is equivalent to 15 minutes of their time max.

I run a PR review via Claude on my own code before I push. It’s exceptionally good. $20 becomes an incredibly easy sell when I can review a PR in 10 minutes instead of an hour.
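
The back-of-envelope math here can be written down as a small sketch. The $100/hour rate, the $20 to $25 review cost, and the hour-versus-10-minutes figures come from the comment above; the function names are made up for illustration.

```python
def review_cost_in_minutes(review_cost: float, hourly_rate: float) -> float:
    """How many minutes of engineer time one review costs."""
    return review_cost / hourly_rate * 60


def net_saving(hourly_rate: float, minutes_without: float,
               minutes_with: float, review_cost: float) -> float:
    """Value of the time saved on one PR, minus what the review cost."""
    hours_saved = (minutes_without - minutes_with) / 60
    return hourly_rate * hours_saved - review_cost


# A $25 review at $100/hour is worth 15 minutes of engineer time.
print(review_cost_in_minutes(25, 100))

# Reviewing in 10 minutes instead of 60, with a $20 review.
print(round(net_saving(100, 60, 10, 20), 2))
```

The saving only holds if the human review really does shrink by that much; with no time saved, the second function just returns minus the review cost.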

Twixes3mo ago

Average _per review_? Insane costs, that's potentially thousands per developer. Am I missing something?

remus3mo ago

I haven't used it so just spitballing, but surely it depends on the quality of the review? If it picks up lots of issues and prevents downtime then it could work out as worthwhile. What would it cost an engineer with deep knowledge of the codebase to do a similar job? You could spend an hour really digging into a PR, poking around, testing stuff out etc. I'm guessing most engineers are paid more than $15-25/hr, not to mention the opportunity cost.

duskdozer3mo ago

Now imagine what it will be when they actually need to make money

karmakaze3mo ago

At those prices I wonder if it also reviews the design for performance inefficiencies or poor decomposition into maintainable units, besides catching the bugs.

Also the examples are weird IMO. Unless it was an edge/corner case, the authentication bug would be caught in even a smoke test. And for the ZFS encryption refactor I'd expect a statically typed language to catch type errors unless they're casting from `void*` or something. Seems like they picked examples by how important/newsworthy the areas were rather than by the technicality of the finds.

alexsmirnov3mo ago

This mostly matches my own estimates for the pr-review command that I use. But it's pretty sophisticated: 6 specialized agents, best practices skills, CVE database, a bunch of scripts. To reduce cost, most of the agents use cheap open source models.

atonse3mo ago

Wait, what? So if I'm a paying Max user, I'd still have to pay more? Don't see the value. Would rather have a repo skill to do the code review with existing Claude Max tokens.

Bnjoroge3mo ago· 6 in thread

what are the implications for the tens of code review platforms that have recently raised on sky high valuations?

satvikpendem3mo ago

Same as all the other companies that built on top of the API and then were obsoleted after the API provider made it a built-in feature.

https://finance.yahoo.com/news/claude-just-killed-startup-sf...

sixothree3mo ago

I'm guessing people need to quickly realize Claude is a platform.

Bnjoroge3mo ago

bitter lesson applied to platforms

8cvor6j844qw_d63mo ago

Reminds me of a post here (or maybe it was on another forum) about someone who built a tool for their own use but deliberately chose not to develop it into a product, citing insufficient moat. Seems like they read the room correctly.

lbreakjai3mo ago

There might still be plenty of room to compete, especially at $15-$25 per review. I'm starting to feel like the right harness makes more difference than the right model.

The real competition for both claude and the platforms is a skill running locally against the very same code.

Bnjoroge3mo ago

agree on it potentially being a big market, just not sure it's big enough for multiple unicorn startups trying to justify their valuation. competing on price, especially when code review is the main bottleneck seems misguided imo

lowsong3mo ago· 4 in thread

> Reviews are billed on token usage and generally average $15–25, scaling with PR size and complexity.

You've got to be completely insane to use AI coding tools at this point.

This is the subsidised cost to get users to use it; it could trivially end up ten times this amount. Plus, you've got the ultimate perverse incentive where the company that is selling you the model time to create the PRs is also selling you the review of the same PR.

rorychatt3mo ago

The bet is that compute gets cheap enough before the crunch that it won't matter. You should model it at 10x - but you also need to factor in NPV and opportunity cost. Even if pricing spikes later, the value extracted at today's rates might still put you ahead overall.

The relevant comparison for most enterprises isn't whether $15/PR is subsidised - it's whether it beats the alternative. For most shops that's cheap offshore labour plus the principal engineer time spent reviewing it, managing it, and fixing what got merged anyway. Most enterprise code is trivial CRUD - if the LLM generates it and reviews it to an equivalent standard, you're already ahead.

slopinthebag3mo ago

Nah, you're insane if you totally change your workflow to the point where you're reliant on them and your skills atrophy though.

MattDamonSpace3mo ago

Depends on the skills, I can’t read assembly

eddiekm3mo ago

you don't pay to compile to assembly either

xlii3mo ago· 3 in thread

> We've been running Code Review internally for months: on large PRs (over 1,000 lines changed), 84% get findings, averaging 7.5 issues. On small PRs under 50 lines, that drops to 31%, averaging 0.5 issues. Engineers largely agree with what it surfaces: less than 1% of findings are marked incorrect.

So the take would be that 84% of heavily Claude-driven PRs are riddled with ~7.5 issue-worthy bugs.

Not a great ad for agent-based development quality.

jgraettinger13mo ago

I ask Claude or codex to review staged work regularly, as part of my workflow. This is often after I’ve reviewed myself, so I’m asking it to catch issues I missed.

It will _always_ find about 8 issues. The number doesn’t change, but it gets a bit … weird if it can’t really find a defect. Part of the art of using the tool is recognizing this is happening, and understanding it’s scraping the bottom of its barrel.

However, if there _are_ defects, it’s quite good at finding and surfacing them prominently.

Kuxe3mo ago

How many bugs does a human introduce in 1000 line PRs and 50 line PRs?

slopinthebag3mo ago

Zero

simianwords3mo ago· 3 in thread

nice but why is this not a system prompt? what's the value add here?

NoahZuniga3mo ago

You're paying the same token rate for this as you would if it was just a system prompt. Clearly the scaffolding adds something.

(They mention their github action which seems more like a system prompt)

simianwords3mo ago

seems like a very small value add. why is this a blog post - i could do this myself.

sixothree3mo ago

Does this only work with GitHub Actions? What about DevOps and GitLab?

higheun3mo ago· 1 in thread

Interesting to see this formalized. I've been running controlled experiments on why context separation improves LLM review quality — something I'm calling Cross-Context Review (CCR).

Setup: 30 artifacts (code, docs, scripts), 150 injected errors, 4 review conditions, 360 total reviews using Claude Opus 4.6.

Results:

- Cross-Context Review (artifact only, no production history): F1 28.6%

- Same-session self-review: F1 24.6% (p=0.008 vs CCR)

- Same-session repeated review (SR2): F1 21.7%

The SR2 result is the key finding — reviewing twice in the same session doesn't help (p=0.11 vs single review). The model generates more noise, not more signal. This rules out "two looks are better than one" as an explanation. It's the context separation itself that matters.

The gap is widest on critical errors: 40% detection for CCR vs 29% for same-session review.

Mechanism: production context introduces anchoring bias + sycophancy + context rot. A fresh session eliminates all three simultaneously by removing the conditioning tokens.

What Anthropic is doing here — dispatching independent agents that never saw the production context — is essentially this principle at industrial scale. Working on a paper but not published yet.
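
For readers unfamiliar with the metric: F1 is the harmonic mean of precision and recall, so it drops both when a reviewer misses injected errors and when it reports things that are not errors. A minimal sketch follows; the counts in the example are hypothetical and are not taken from the experiment described above.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = harmonic mean of precision and recall.

    true_pos:  injected errors the review found
    false_pos: findings that were not real errors (noise)
    false_neg: injected errors the review missed
    """
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 30 real errors found, 20 noisy findings, 70 missed.
baseline = f1_score(true_pos=30, false_pos=20, false_neg=70)

# Same hits, but twice the noise: precision falls, and F1 falls with it.
noisier = f1_score(true_pos=30, false_pos=40, false_neg=70)

print(f"baseline F1: {baseline:.3f}")
print(f"noisier  F1: {noisier:.3f}")
```

This is why extra findings without extra true hits lower the score, which is consistent with the "more noise, not more signal" reading of the repeated-review condition.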

higheun3mo ago

Update: the paper is now on arXiv — https://arxiv.org/abs/2603.12123

"Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions"

nemo44x3mo ago· 1 in thread

So their business model is to deliver me buggy code and then charge me to fix it?

MattDamonSpace3mo ago

But like, a lot of it

nolanl3mo ago· 1 in thread

The concept of "AI will review AI-authored PRs" seems completely wrong to me. Why didn't the AI write the correct code in the first place?

If it takes 17 rounds of review from 5 different models/harnesses – I don't care. Just spit out the right code the first time. Otherwise I'm wasting my time clicking "review this" over and over until the PR is worth actually having a human look at.

raflueder3mo ago

Because the code generated is only as good as the initial description of what you want. It's not too different from "standard" coding where you have a first go at solving it and then iterate and polish as you go along.

I've had multiple situations where things "just worked", and at other times you just have to steer it in the right direction a few times. Having another agent do the review works really well (with the right guardrails); it's like having someone with no other intent or bias review your code.

Unless you're talking about "vibe coding", in which case "correct" doesn't really matter as you're not even looking at the output, you just let it go back and forth until something that works comes out. I haven't had much success working this way, or even enjoyed it as much. It took me a couple of months to find the sweet spot (my sweet spot, I think it'll be different for everyone).

cpncrunch3mo ago· 1 in thread

Does AI review of AI generated code even make sense?

cpncrunch3mo ago

Just to be clear, I'm referring to using AI review instead of human review (not alongside it to find extra issues).

raflueder3mo ago

Or just spin up your own review workflow. I've been doing this for the past couple of months after experimenting with Greptile and it works pretty well. Example setup below:

https://gist.github.com/rlueder/a3e7b1eb40d90c29f587a4a8cb7c...

An average of $0.04/review (200+ PRs with roughly two rounds each), for a total of $19.50 using Opus 4.6 over February.

It fills in a gap of working on a solo project and not having another set of eyes to look at changes.
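
A quick sanity check on those numbers, as a sketch. The $19.50 total and two rounds per PR are from the comment above; since "200+" is only a lower bound on the PR count, the 244 below is an illustrative guess at what would land on about $0.04.

```python
def cost_per_review(total_spend: float, prs: int, rounds_per_pr: int) -> float:
    """Average spend per review round across all PRs."""
    return total_spend / (prs * rounds_per_pr)


# Lower bound from the comment: exactly 200 PRs, two rounds each.
print(f"200 PRs: ${cost_per_review(19.50, 200, 2):.3f} per review")

# Illustrative guess: a PR count in the "200+" range that gives about $0.04.
print(f"244 PRs: ${cost_per_review(19.50, 244, 2):.3f} per review")
```

Either way the per-review cost is a few cents, which is the point of the comparison with $15–25.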

nnennahacks3mo ago

Yeah, the "$15-20 a PR is cheaper than a great engineer" idea is doing a lot of hand‑waving here...

If you're a big shop pushing, say, 2,000 PRs a week and reviews average $15–25, that’s on the order of $30k–$50k a week in AI review spend, or $1.5-2.5M a year. That is quite a line item to justify.
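
Spelling that arithmetic out as a small sketch: the 2,000 PRs a week and the $15–25 range are from the paragraph above, and the weeks-per-year value is an assumption. The $1.5–2.5M figure corresponds to roughly 50 weeks; a full 52 weeks gives slightly more.

```python
def review_spend(prs_per_week: int, cost_per_review: int,
                 weeks_per_year: int = 52) -> tuple[int, int]:
    """Return (weekly spend, yearly spend) in dollars."""
    weekly = prs_per_week * cost_per_review
    return weekly, weekly * weeks_per_year


for cost in (15, 25):
    weekly, yearly_52 = review_spend(2000, cost)
    _, yearly_50 = review_spend(2000, cost, weeks_per_year=50)
    print(f"${cost}/review: ${weekly:,}/week, "
          f"${yearly_50:,}/year (50 weeks), ${yearly_52:,}/year (52 weeks)")
```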

"It's $20 cheaper than a senior engineer’s hourly rate,"... so what are you actually doing with your human reviewers once you add this on?

If you keep your existing review culture and just bolt this on, then you've effectively said "we’re willing to add $1–2M+ a year to the budget." That might be fine, but then you should be able to point to fewer incidents, shorter lead times, higher coverage, something like that.

Either this is a replacement story (fewer humans, different risk profile) or it's an augmentation story (same humans, bigger bill, hopefully better outcomes). "It’s cheaper than a great engineer" by itself skips over the fact that, at scale, you’re stacking this cost on top of the engineers you already have in the org.

toniantunovi3mo ago

When a tool flags 8 issues on clean code and 8 issues on broken code, it's not a reviewer, it's a random number generator with a UI. The approach we've found more tractable is to separate concerns: let deterministic tools (linters, SAST, SCA) handle what they're definitively good at - style, known vuln patterns, dependency CVEs, secrets - and reserve the AI layer for things humans actually need help reasoning about. Running the deterministic tools locally as a pre-push or CI step means you catch the boring 80% before it ever reaches a $25 AI review. You're not paying Claude to tell you your import is unused - you're paying it to reason about whether your auth flow has a TOCTOU issue. That's a very different and much more valuable question.
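
A minimal sketch of that gating idea, under loud assumptions: `no_debug_prints` is a toy stand-in for a real linter or SAST tool, and `fake_ai_review` is a stub standing in for whatever paid review you call. Neither is a real tool's API; the point is only the ordering, where the expensive step runs when the cheap checks come back clean.

```python
from typing import Callable, List

# A check takes a unified diff as text and returns a list of findings.
Check = Callable[[str], List[str]]


def added_lines(diff: str) -> List[str]:
    """Lines added by the diff, skipping the '+++ b/file' header lines."""
    return [line[1:] for line in diff.splitlines()
            if line.startswith("+") and not line.startswith("+++")]


def no_debug_prints(diff: str) -> List[str]:
    """Toy deterministic check: flag added lines that call print()."""
    return [f"debug print: {line.strip()}"
            for line in added_lines(diff) if "print(" in line]


def gated_review(diff: str, cheap_checks: List[Check], ai_review: Check) -> List[str]:
    """Run cheap checks first; only call the paid review if they find nothing."""
    findings: List[str] = []
    for check in cheap_checks:
        findings.extend(check(diff))
    if findings:
        return findings  # fix these first, no paid review yet
    return ai_review(diff)


ai_calls: List[str] = []


def fake_ai_review(diff: str) -> List[str]:
    """Hypothetical stub that records that the expensive review was invoked."""
    ai_calls.append(diff)
    return []


noisy = "+++ b/app.py\n+print('here')\n+x = 1\n"
clean = "+++ b/app.py\n+x = 1\n"

print(gated_review(noisy, [no_debug_prints], fake_ai_review))  # cheap finding only
print(gated_review(clean, [no_debug_prints], fake_ai_review))  # falls through to the stub
print(len(ai_calls))  # the stub ran once, for the clean diff only
```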

neuronexmachina3mo ago

I'm curious how this compares to just setting up a claude-code-action with one of Anthropic's existing code-review plugins:

* https://github.com/anthropics/claude-plugins-official/tree/m...

jjmarr3mo ago

I shipped this parallel agent workflow with validator agents as an internal tool months ago. It got a 20% reduction in time-to-merge, had a similar cost, and left 10x as many comments as any existing AI review tool.

It's totally worth it.

denisdev13mo ago

My experience has been similar. LLM reviews are useful, but they tend to always produce findings. Even on small or very clean changes you will still get a list of suggestions.

So part of the workflow becomes filtering signal vs noise.
