Agents Supersede the Reviewer, Not the Review

June 24, 2026


In June 2026, Martin Monperrus, a software-engineering researcher known for automated program repair, published a paper titled “The End of Code Review: Coding Agents Supersede Human Inspection.” The argument is that coding agents have crossed a capability threshold where having a human examine a diff before merge is no longer a necessary quality gate, and that the common setup where agents write code and humans stay the mandatory reviewers is a dead end.¹

The paper is right about more than its critics will admit, and wrong in one specific place that matters. Agents have superseded the reviewer: the human who reads a diff line by line looking for defects is doing a job an ensemble of agents now does better and on every commit. But the paper conflates that role with the review itself. When you actually run the agent pipeline it prescribes, the human work does not disappear. It relocates, from inspecting the code to owning the intent the code was supposed to satisfy. I run that pipeline. The reviewer is dying. The review is moving up the stack.

I want to take the paper seriously, because most responses to it will not. The reflexive reply is “but agents hallucinate,” and Monperrus already grants that. The honest engagement starts by conceding what he gets right.

TL;DR

Monperrus argues coding agents have ended the need for human code review, because every goal of review (defect detection, style, security, knowledge transfer) is served better and cheaper by agents, and human review capacity cannot scale with agent-driven throughput.¹
He is correct that the mandatory human-approval checkbox is finished, and correct that agents do systematic inspection better than a tired human skimming a large diff.
He is not naive about it: the paper concedes hallucination, prompt injection, and security blind-spot correlation, and it reserves humans for high-risk, novel, regulated, and ethical changes.¹
The gap is that he treats the residual human role as a small escalation set. In production it is the load-bearing center: the agent optimizes for the spec it was given, and writing and owning that spec is the irreducibly human act.
The reviewer role is being automated. The review, understood as judgment about whether the software is correct for its purpose, is relocating to where the agent cannot follow.

What the Paper Gets Right

Monperrus builds on Bacchelli and Bird’s enumeration of why teams review code: defect detection, style and standards enforcement, knowledge transfer, and team awareness, with security as a fifth dimension.¹ ² His move is to take each goal and argue an agent serves it better. Agents inspect every commit without fatigue or time-zone delay. They enumerate vulnerability classes more systematically than a human doing an ad-hoc pass. They generate architectural summaries and updated documentation at merge time. The paper marshals the SWE-bench capability curve to make the threshold case, from the best model resolving under 2 percent of real GitHub issues when the benchmark launched in 2023 to top agents exceeding 70 percent by late 2025.¹ ³

I have no quarrel with this part, because I watch it work daily. My autonomous build loop runs a three-reviewer gate: separate agents check correctness, conventions, and security before code merges, and a second loop sends the implementation to an independent model for an adversarial pass. Those agents catch real defects, and they catch them on every change, not on the changes a human had time for. The two posts that preceded this one on this site each passed an agent evaluator that scored them against a rubric and flagged specific factual problems I then had to fix. The paper’s claim that agents produce actionable, structured review output comparable to a trained reviewer is not speculative to me. It is my Tuesday.
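A minimal sketch of the shape such a gate could take, not the actual build loop. Every name here is hypothetical, and the three reviewers are plain string rules standing in for what would be separate agent calls. The point is only the structure: independent checks for correctness, conventions, and security, and a merge that is allowed only when none of them reports a finding.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Finding:
    reviewer: str  # which reviewer raised the finding
    message: str   # what it flagged


# A reviewer maps a diff to a list of findings; an empty list means "pass".
Reviewer = Callable[[str], List[Finding]]


def added_lines(diff: str) -> List[str]:
    """Return the lines a unified diff adds, without the leading '+'."""
    return [
        line[1:]
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def correctness_reviewer(diff: str) -> List[Finding]:
    # Placeholder rule standing in for an agent: flag bare excepts.
    return [
        Finding("correctness", "bare except swallows errors")
        for line in added_lines(diff)
        if line.strip() == "except:"
    ]


def conventions_reviewer(diff: str) -> List[Finding]:
    # Placeholder rule: flag added lines longer than 100 characters.
    return [
        Finding("conventions", f"line exceeds 100 characters ({len(line)})")
        for line in added_lines(diff)
        if len(line) > 100
    ]


def security_reviewer(diff: str) -> List[Finding]:
    # Placeholder rule: flag any use of eval().
    return [
        Finding("security", f"eval() on added line: {line.strip()}")
        for line in added_lines(diff)
        if "eval(" in line
    ]


REVIEWERS: Dict[str, Reviewer] = {
    "correctness": correctness_reviewer,
    "conventions": conventions_reviewer,
    "security": security_reviewer,
}


def run_gate(diff: str, reviewers: Dict[str, Reviewer] = REVIEWERS) -> Tuple[bool, List[Finding]]:
    """Run every reviewer; the change may merge only if none reports a finding."""
    findings: List[Finding] = []
    for review in reviewers.values():
        findings.extend(review(diff))
    return (len(findings) == 0, findings)


ok, findings = run_gate("+++ b/app.py\n+value = eval(user_input)\n")
print(ok, [f.reviewer for f in findings])
```

Because the gate runs on every change, coverage does not depend on which changes a person had time to read.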

The throughput argument is also correct, and it is the part people underrate. An agent-assisted developer produces more pull requests per day than human review capacity can absorb. When the writer is fast and the reviewer is a human, the review queue becomes the binding constraint, and review degrades into a formality performed under time pressure.¹ Monperrus is right that the naive arrangement, agents write and a human rubber-stamps, provides no real assurance. A human who approves because the code looks correct and the tests pass is not reviewing. He is signing.
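The throughput claim is simple arithmetic, sketched here with made-up numbers. The 12 pull requests opened per day, the capacity to review 8, and the 240 reviewer-minutes per day are illustrative assumptions, not figures from the paper.

```python
def review_backlog(opened_per_day: float, reviewed_per_day: float, days: int) -> float:
    """Unreviewed pull requests left after `days`, starting from an empty queue."""
    backlog = 0.0
    for _ in range(days):
        # The queue grows by the daily surplus and can never go below zero.
        backlog = max(0.0, backlog + opened_per_day - reviewed_per_day)
    return backlog


def minutes_per_review(reviewer_minutes_per_day: float, opened_per_day: float) -> float:
    """Time available per pull request if the queue must be cleared every day."""
    return reviewer_minutes_per_day / opened_per_day


print(review_backlog(12, 8, 10))    # 40.0: the queue grows by 4 every day
print(minutes_per_review(240, 12))  # 20.0 minutes per pull request
print(minutes_per_review(240, 48))  # 5.0 minutes once output quadruples
```

Either the backlog grows without bound or the time spent per review shrinks. The second outcome is the formality performed under time pressure.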

The Pipeline He Describes Is the One I Run

What the paper proposes to replace human review is not “trust one model.” It is an agent-in-the-loop verification pipeline: multiple independent agents, ideally different models, producing calibrated, structured sign-off (test coverage, security scans, reasoning traces as JSON or SARIF, the standard interchange format for static-analysis results) rather than informal comment threads, with agents instructed to abstain when uncertain and humans reserved for the hard cases.¹
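One way to picture the decision logic is the sketch below. It rests on several assumptions of mine rather than on anything the paper specifies: the 0.8 confidence threshold, the field names, the rule that any abstention or disagreement escalates to a person, and the use of plain JSON instead of SARIF to keep the example short.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    ABSTAIN = "abstain"


@dataclass(frozen=True)
class SignOff:
    model: str         # which reviewing model produced this sign-off
    verdict: Verdict
    confidence: float  # self-reported confidence in [0, 1]
    rationale: str


def calibrate(signoff: SignOff, min_confidence: float) -> SignOff:
    """Turn a low-confidence approve or reject into an explicit abstention."""
    if signoff.verdict is not Verdict.ABSTAIN and signoff.confidence < min_confidence:
        return SignOff(signoff.model, Verdict.ABSTAIN, signoff.confidence, signoff.rationale)
    return signoff


def decide(signoffs: List[SignOff], min_confidence: float = 0.8) -> Dict[str, object]:
    """Merge on unanimous approval, block on unanimous rejection, else escalate."""
    calibrated = [calibrate(s, min_confidence) for s in signoffs]
    verdicts = {s.verdict for s in calibrated}
    if not calibrated or Verdict.ABSTAIN in verdicts or len(verdicts) > 1:
        outcome = "escalate_to_human"  # no evidence, uncertainty, or dissent
    elif verdicts == {Verdict.APPROVE}:
        outcome = "merge"
    else:
        outcome = "block"
    return {
        "outcome": outcome,
        "signoffs": [
            {
                "model": s.model,
                "verdict": s.verdict.value,
                "confidence": s.confidence,
                "rationale": s.rationale,
            }
            for s in calibrated
        ],
    }


report = decide([
    SignOff("model-a", Verdict.APPROVE, 0.95, "tests cover the changed branch"),
    SignOff("model-b", Verdict.APPROVE, 0.55, "unsure about the migration path"),
])
print(json.dumps(report, indent=2))  # outcome is "escalate_to_human"
```

The second reviewer's approval falls below the threshold, so it is recorded as an abstention and the change goes to a person instead of merging on a single confident vote.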

That is, with different names, the architecture I have been building and writing about for a year. I have argued that agent pull requests need smaller review surfaces, that automated review needs dissent rather than a single confident judge, and that review packets of structured evidence are replacing the informal diff comment. So I am not arguing against the pipeline. I helped make the case for it. I am arguing about what is left for the human once the pipeline exists, because I have lived in the answer, and it is not the answer the paper gives.

Where the Argument Breaks: Review Was Never Only Inspection

Monperrus reserves the human for high-risk changes, novel architecture, regulated code paths, and ethical judgment, and he frames these as escalation: exceptions routed to a person when the agents flag them.¹ The framing makes the human role sound like a rare interrupt on an otherwise automated line.

Running the line teaches the opposite. The agent does not generate its own purpose. It optimizes for the specification it is handed, and on every change that matters, someone has to decide what correct means before the agents can check anything against it. The paper itself admits the boundary in its discussion section: agents optimize for technical quality metrics and are not reliably equipped to notice that a telemetry change violates a user’s reasonable privacy expectation, or that a ranking tweak amplifies bias.¹ That is presented as a limitation at the edges. It is not at the edges. The question “is this change correct for what we actually want” sits at the center of every non-trivial merge, and it is exactly the question an agent calibrated to a spec cannot ask about the spec.

I felt this concretely on the two posts I shipped before this one. The agent reviewer scored them and caught a factual overreach in each: an unverified institutional claim in one, a misattributed statistic in the other. The catch was the agent’s. The fix was not. Deciding how to correct an overreach truthfully, which source actually supported the claim, what the honest version of the sentence was, required judgment about intent that the rubric could flag but not resolve. The agent found that something was wrong. A human decided what right looked like. That division of labor is the relocation, and it happened on routine content, not a regulated edge case.

So the human does not leave the loop. The human moves from the end of it to the start. Review used to be the last checkpoint, a person inspecting finished code. In an agent pipeline the inspection is automated and the irreducible human work moves to the front: specifying intent precisely enough that the agents have something true to verify against, and owning the consequences when the shipped result meets the spec but misses the point. Accountability cannot be delegated to a system that optimizes for metrics, because accountability is the willingness to make a deliberate call, risk being wrong, and answer for it.
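A toy illustration of "meets the spec but misses the point," borrowing the telemetry case from the paper's discussion. The `IntentSpec` type, the owner name, the checks, and the event are all hypothetical. The sketch assumes a spec is a plain-language intent plus a few machine-checkable criteria, with a named human who answers for it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]


@dataclass(frozen=True)
class IntentSpec:
    owner: str   # the human accountable for the outcome
    intent: str  # what the change is for, in plain language
    # Named, machine-checkable criteria the agents can verify against.
    checks: Tuple[Tuple[str, Callable[[Event], bool]], ...]


def verify(spec: IntentSpec, event: Event) -> List[str]:
    """Return the names of failed checks; an empty list means correct-to-spec."""
    return [name for name, check in spec.checks if not check(event)]


spec = IntentSpec(
    owner="alice",
    intent="Measure feature usage without identifying users",
    checks=(
        ("no email field", lambda e: "email" not in e),
        ("has feature name", lambda e: bool(e.get("feature"))),
    ),
)

event = {"feature": "export", "ip_address": "203.0.113.7"}
print(verify(spec, event))  # []: every check passes, yet the event carries an IP address
```

The verifier is right about the checks and silent about the intent. Noticing that an IP address can identify a user, and adding the missing criterion, falls to the spec's owner.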

The Honest Version of the Claim

Strip the provocation from the title and the defensible claim is narrower than “the end of code review.” The defensible claim is the end of the human as diff inspector and mandatory approval checkbox. That role is genuinely finished, and pretending otherwise to protect a comfortable ritual is its own dishonesty. Teams that keep a human in the inspection seat as theater, approving agent code they cannot actually scrutinize, have already lost the assurance they think they have.

But “code review” was always a proxy word. It named a checkpoint and meant a judgment: does this change do what we need, safely, in a way we can stand behind. Automate the checkpoint and the judgment does not evaporate. It relocates to intent specification on the way in and accountability on the way out, and on a team moving at agent speed it becomes more important, not less, because the agents will faithfully and rapidly build whatever the spec says, including the wrong thing. The faster the writer, the more the bottleneck becomes knowing what to ask for. Monperrus is right that the reviewer is being superseded. He is wrong that the review is ending. It is moving to the one seat the agent cannot occupy.

Key Takeaways

For engineering leaders:
- Stop staffing human review as diff inspection. Agents do that better and continuously; a human approval checkbox on agent code is assurance theater.
- Reallocate that human capacity to intent specification and accountability, the parts of review that determine whether correct-to-spec is correct-in-fact.

For developer-tool builders:
- Build the ensemble-review pipeline the paper describes: multiple models, calibrated abstention, structured sign-off. The dissent between reviewers is the signal.
- Design the front of the pipeline, not only the gate. The highest-value surface is where a human turns intent into a spec the agents can verify against.

For engineers:
- Your review skill is not becoming worthless; it is changing address. The value moves from spotting the bug in the diff to defining what the code was supposed to do and owning the result.

FAQ

Does this paper mean human code review is over?

The human as line-by-line diff inspector and mandatory approver is over, which is the paper’s strongest point: agents do systematic inspection better and on every commit. What does not end is the judgment that code review was a proxy for, namely whether a change is correct for its actual purpose. That judgment relocates to specifying intent and owning consequences rather than disappearing.

What does Monperrus actually argue?

That coding agents now serve every stated goal of code review (defect detection, style, knowledge transfer, security) at lower cost and higher throughput, and that keeping humans as the mandatory reviewers of agent-written code is a dead end because it gives no real assurance and cannot scale. He proposes an agent ensemble producing structured sign-off, with humans reserved for high-risk and ethical cases. It is a position paper, not an empirical study.¹

Where is the argument weakest?

In treating the residual human role as a rare escalation. In practice the human role is load-bearing on every non-trivial change, because the agent optimizes for a spec it cannot author or question. Defining the spec and answering for the outcome is central work, not an edge case.

Should teams keep a human approval step on agent pull requests?

Not as inspection theater. If the human cannot genuinely scrutinize the change, the approval is a signature, not a review. Better to invest the human effort upstream, in specifying intent precisely, and downstream, in owning the shipped result, while letting an agent ensemble do the inspection.

Sources

Martin Monperrus, “The End of Code Review: Coding Agents Supersede Human Inspection,” arXiv, June 11, 2026: arxiv.org/abs/2606.13175. A position paper synthesizing existing capability evidence; it enumerates code-review goals from Bacchelli and Bird, cites the SWE-bench capability curve, and discusses limitations including hallucination, prompt injection, and ethical accountability.
Alberto Bacchelli and Christian Bird, “Expectations, Outcomes, and Challenges of Modern Code Review,” ICSE 2013, the empirical source for the review-goal taxonomy the paper builds on: Microsoft Research
Carlos E. Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?,” ICLR 2024, the benchmark behind the capability curve (best model resolved 1.96% at launch): arxiv.org/abs/2310.06770
Related writing on agent review from production experience: smaller review surfaces, review needs dissent, review packets, and the autonomous build loop whose three-reviewer gate is the pipeline this post describes running.

1. Martin Monperrus, “The End of Code Review: Coding Agents Supersede Human Inspection,” arXiv:2606.13175 (June 11, 2026). The paper enumerates code-review goals (defect detection, style and standards, knowledge transfer, team awareness, plus security), argues agents serve each at lower cost and higher throughput, and makes two claims against the agents-write/humans-review arrangement: it provides no genuine assurance because humans rubber-stamp plausible code, and it does not scale because review capacity becomes the bottleneck. It proposes an agent-in-the-loop pipeline (ensemble review, calibrated abstention, structured JSON/SARIF sign-off) with human escalation reserved for high-risk, novel, regulated, and ethical changes, and it explicitly identifies its own limitations including hallucination, security blind-spot correlation, prompt injection, and the inability of metric-optimizing agents to make ethical judgments. The author states it is a position paper, not a new empirical study.
2. Alberto Bacchelli and Christian Bird, “Expectations, Outcomes, and Challenges of Modern Code Review,” Proceedings of the 2013 International Conference on Software Engineering (ICSE 2013), 712-721. The empirical study, based on observation, interviews, and surveys of developers at Microsoft, found that the stated motivation of review (finding defects) is often outranked in practice by knowledge transfer and team awareness, the taxonomy Monperrus builds his goal-by-goal argument on.
3. Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan, “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?,” ICLR 2024, arXiv:2310.06770. At the benchmark’s introduction the best model (Claude 2) resolved 1.96% of the 2,294 real GitHub-issue tasks; by late 2025 top agents exceeded 70% on the public leaderboard, the capability curve the paper uses to argue the threshold has been crossed.