I kept hearing the same pattern: some teams are shipping 10-15 AI PRs daily without issues. Others tried once, broke production, and gave up entirely.
The difference wasn't what I expected: it wasn't about model choice or prompt engineering.
---
One team shipped an AI-generated PR that took down their checkout flow.
Their tests and CI passed, but the AI had "optimized" their payment processing by changing `queueAnalyticsEvent()` to `analytics.track()`. That put a direct call to the analytics service on the payment path. The analytics service has a 2-second timeout, so when it's slow, payment processing times out.
In prod, under real load, 95th percentile latency went from 200ms to 8 seconds. They ended up with 3 hours of downtime and $50k in lost revenue.
Everyone on that team knew you queue analytics events asynchronously, but that wasn't documented anywhere. It's just something they learned when analytics had an outage years ago.
*The pattern*
Traditional CI/CD catches syntax errors, type mismatches, test failures.
The catch is that AIs rarely make these mistakes (or at least, tests and linters catch them before they get committed). What AI does instead is generate syntactically perfect code that violates your system's unwritten rules.
*The institutional knowledge problem*
Every codebase has landmines that live in engineers' heads, accumulated through incidents.
AIs can't know these, so they fall into the traps. It's then on the code reviewer to spot them.
*What the successful teams do differently*
They write constraints in plain English. Then AI enforces them semantically on every PR. E.g. "All routes in /billing/* must pass requireAuth and include orgId claim".
AI reads your code, understands the call graph, and blocks merges that violate the rules.
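To make the example rule concrete, here is a deliberately naive, deterministic check in Python. It is not how these teams enforce constraints (they describe an AI reviewer reading the code and the call graph); it only shows what the rule means once it's written down. The `Route` type, its fields, and the sample routes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical description of one registered route."""
    path: str
    middleware: tuple[str, ...] = ()
    required_claims: tuple[str, ...] = ()


def billing_violations(routes: list[Route]) -> list[str]:
    """Return one message per broken rule on a /billing/* route."""
    problems = []
    for route in routes:
        if not route.path.startswith("/billing/"):
            continue  # the rule only applies to billing routes
        if "requireAuth" not in route.middleware:
            problems.append(f"{route.path}: missing requireAuth")
        if "orgId" not in route.required_claims:
            problems.append(f"{route.path}: missing orgId claim")
    return problems


# Sample data, invented for illustration.
ROUTES = [
    Route("/billing/invoices", ("requireAuth",), ("orgId",)),
    Route("/billing/refunds", ("requireAuth",)),
    Route("/health"),
]

if __name__ == "__main__":
    for problem in billing_violations(ROUTES):
        print(problem)  # prints "/billing/refunds: missing orgId claim"
```

A check like this only works when routes are declared in a uniform table. The appeal of a semantic reviewer is that it can apply the same sentence to code that isn't structured that neatly.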
*The bottleneck*
When you're shipping 10x more code, validation becomes the constraint, not generation speed.
The teams shipping AI at scale aren't waiting for better models. They're using AI to validate AI-generated code against their institutional knowledge.
The gap between "AI that generates code" and "AI you can trust in production" isn't about model capabilities. It's about bridging the institutional knowledge gap.
Some teams already run an entirely AI-driven pipeline where AI writes the code, reviews it, and humans just click Merge at the end.
Interestingly, the AI reviewer often finds errors in AI-generated code. For now, AIs can review AI-written code with OK results, probably better than a junior to mid-level dev.
It got me thinking of the future…
Qs:
1. Traditionally, code reviews are how junior developers learn. Have you noticed changes in how they learn since codegen has become a thing?
2. Society expects self-driving cars to be dramatically safer (e.g. 10x better) before accepting them. Do you expect similar standards from AI code reviewers, or is slightly better than human good enough?
3. As AI surpasses humans at writing and reviewing code (at least for technical correctness), how do you see the role of code reviews changing?
4. Philosophically, do you think code generation AIs will ever become so effective that specialized AI code reviewers won’t find any issues?
Here’s a demo video: https://www.youtube.com/watch?v=pglEoiv0BgY
We (Allis and Paul) are engineers who faced this problem when we worked together at our last startup. Code review quickly became our biggest bottleneck—especially as we started using AI to code more. We had more PRs to review, subtle AI-written bugs slipped through unnoticed, and we (humans) increasingly found ourselves rubber-stamping PRs without deeply understanding the changes.
We’re building mrge to help solve that. Here’s how it works:
1. Connect your GitHub repo via our GitHub app in two clicks (and optionally download our desktop app). GitLab support is on the roadmap!
2. AI Review: When you open a PR, our AI reviews your changes directly in an ephemeral and secure container. It has context into not just that PR, but your whole codebase, so it can pick up patterns and leave comments directly on changed lines. Once the review is done, the sandbox is torn down and your code deleted – we don’t store it for obvious reasons.
3. Human-friendly review workflow: Jump into our web app (it’s like Linear but for PRs). Changes are grouped logically (not alphabetically), with important diffs highlighted, visualized, and ready for faster human review.
The AI reviewer works a bit like Cursor in the sense that it navigates your codebase using the same tools a developer would—like jumping to definitions or grepping through code.
But a big challenge was that, unlike Cursor, mrge doesn’t run in your local IDE or editor. We had to recreate something similar entirely in the cloud.
Whenever you open a PR, mrge clones your repository and checks out your branch in a secure and isolated temporary sandbox. We provision this sandbox with shell access and a Language Server Protocol (LSP) server. The AI reviewer then reviews your code, navigating the codebase exactly as a human reviewer would—using shell commands and common editor features like "go to definition" or "find references". When the review finishes, we immediately tear down the sandbox and delete the code—we don’t want to permanently store it for obvious reasons.
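The sandbox lifecycle described above (clone, check out the branch, review, delete) could be sketched like this in Python. This is an illustration under assumptions, not mrge's implementation: the `ephemeral_checkout` helper and its `populate` hook are hypothetical, the real system also provisions a shell and an LSP server, and a plain temporary directory gives none of the isolation a real container does.

```python
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Iterator


def git_clone(repo_url: str, branch: str, dest: Path) -> None:
    """Shallow-clone a single branch into dest (requires git on PATH)."""
    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", branch, repo_url, str(dest)],
        check=True,
    )


@contextmanager
def ephemeral_checkout(
    repo_url: str,
    branch: str,
    populate: Callable[[str, str, Path], None] = git_clone,
) -> Iterator[Path]:
    """Yield a temporary checkout that is deleted when the block exits."""
    with tempfile.TemporaryDirectory(prefix="review-") as tmp:
        workdir = Path(tmp) / "repo"
        populate(repo_url, branch, workdir)
        yield workdir
    # TemporaryDirectory has removed the files by this point,
    # including when the review inside the block raised an exception.
```

A caller would wrap the whole review in `with ephemeral_checkout(url, branch) as repo:` so that cleanup doesn't depend on the review finishing successfully. The `populate` parameter exists only so the clone step can be swapped out, for example in tests.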
We know cloud-based review isn't for everyone, especially if security or compliance requires local deployments. But a cloud approach lets us run SOTA AI models without local GPU setups, and provide a consistent, single AI review per PR for an entire team.
The platform itself focuses entirely on making human code reviews easier. A big inspiration came from productivity-focused apps like Linear or Superhuman, products that show just how much thoughtful design can impact everyday workflows. We wanted to bring that same feeling into code review.
That’s one reason we built a desktop app. It allowed us to deliver a more polished experience, complete with keyboard shortcuts and a snappy interface.
Beyond performance, the main thing we care about is making it easier for humans to read and understand code. For example, traditional review tools sort changed files alphabetically—which forces reviewers to figure out the order in which they should review changes. In mrge, files are automatically grouped and ordered based on logical connections, letting reviewers immediately jump in.
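One way to picture "logical" ordering as opposed to alphabetical: sort the changed files so each one appears after the files it depends on. This Python sketch is only an assumption about what such ordering could look like, not a description of mrge's grouping. It uses a hand-written, hypothetical dependency map and the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter


def review_order(changed: list[str], depends_on: dict[str, set[str]]) -> list[str]:
    """Order changed files so dependencies come before their dependents."""
    changed_set = set(changed)
    graph = {
        # Keep only edges between changed files; unchanged files are ignored.
        path: depends_on.get(path, set()) & changed_set
        for path in sorted(changed)
    }
    return list(TopologicalSorter(graph).static_order())


# Hypothetical PR: alphabetical order would show the API handler first.
CHANGED = ["api/handlers.py", "db/models.py", "services/billing.py"]
DEPENDS_ON = {
    "api/handlers.py": {"services/billing.py"},
    "services/billing.py": {"db/models.py"},
}

if __name__ == "__main__":
    print(review_order(CHANGED, DEPENDS_ON))
    # ['db/models.py', 'services/billing.py', 'api/handlers.py']
```

The reviewer reads the model change first, then the service that uses it, then the handler that calls the service. Circular dependencies make `static_order()` raise `graphlib.CycleError`, so a real tool would need a fallback for those.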
We think the future of coding isn’t about AI replacing humans—it’s about giving us better tools to quickly understand high-level changes, abstracting more and more of the code itself. As code volume continues to increase, this shift is going to become increasingly important.
You can sign up now (https://www.mrge.io/home). mrge is currently free while we're still early. Our plan for later is to charge closed-source projects on a per-seat basis, and to continue giving mrge away for free to open source ones.
We’re very actively building and would love your honest feedback!