The Evolution of the Human Reviewer

For decades humans have ensured that complex software systems are built correctly, safely, and sustainably.

The arrival of AI coding tools has meant we can now generate working code at an unprecedented scale, but this raises a critical question for modern engineering organisations:

If we have strong specifications, automated testing, and AI implementation, to what extent do we still need human code review?

While AI can faithfully implement a specification, it cannot yet "feel" the weight of future maintenance.

Engineering the Intent

As AI continues to improve at astonishing rates, one thing remains constant: outputs are only as good as the specifications, and specifications are only as good as the human who wrote them. When we ask AI to build a solution, we are framing the problem. The quality of the solution is capped by the clarity of the human’s intent and their ability to provide the right context. If the human fails to frame the problem correctly, the AI will execute a flawed vision.

This introduces a new level of accountability:

  • Context: AI lacks the holistic knowledge of why a system exists. The human must bridge the gap between business goals and technical constraints.
  • Clarity: You cannot automate a solution to a problem you don't fully understand. AI doesn't solve problems; it implements solutions to the problems we think we’ve described.

Specifications have limits

A strong specification defines intended behaviour. It describes what the system should do. However, a specification is only a blueprint; the code is the actual building material. You cannot always judge the structural integrity of the wood or the quality of the poured concrete from the drawing alone.

Specifications rarely capture everything that matters in a real production system. They might struggle to convey:

  • Architectural boundaries and layering
  • Appropriate extension points in an existing codebase
  • Non-functional behaviours under failure
  • Operational characteristics
  • Future evolution and maintainability concerns

Even well-written specs leave room for interpretation. Implementation inevitably introduces new design decisions that were never explicitly described.

This means a system can be fully compliant with its specification while still introducing architectural risk.

Why Code Reveals What Design Cannot

When I wrote code by hand (pre-AI), something subtle but important always emerged during implementation: certain design realities only become visible once code exists.

Before code is written, system diagrams make architecture appear clean. Once implementation starts, AI code looks plausible and passes tests, but may hide "hallucinations of intent."

When reviewing, we should ask:

  • Did the AI interpret my "framing" of the problem correctly?
  • Are we creating new, hidden coupling across code areas / modules?
  • Are abstractions becoming harder to reason about because the prompt was too narrow?
  • Does the new feature extend the system in the right place?
  • Are we introducing complexity that future engineers will struggle to maintain?

These issues are difficult to predict purely through design documentation. They often become visible only when the concrete implementation is examined.

This is why code review exists: its purpose is not simply to verify correctness, but to evaluate how the system is evolving.

Review fatigue

Reviewing code is harder than people like to admit.

Writing code gives you a real-time model in your head where you know the intent, the tradeoffs, what you ruled out, and where the risky parts are. Reviewing code written by someone else strips away most of that context. You are trying to reconstruct intent from artifacts. That is tiring even with human teammates, and it gets worse when the code arrives in large, polished-looking chunks via AI.

In AI-first coding where the workflow is “AI generates, human reviews,” fatigue is almost guaranteed. It's exactly the kind of work we are bad at for long stretches.

The failure becomes subtle:

  • The code looks coherent
  • The reviewer assumes the machine was probably right
  • The reviewer checks style and surface logic
  • Deeper architectural or domain mistakes slip through

Review fatigue also gets worse when AI output is:

  • Too large
  • Too polished
  • Too fast
  • Poorly connected to explicit requirements

That combination makes people over-trust it. A 500-line, confident-looking diff is harder to review than code you watched emerge in 20-line steps.

The practical implication is that if a team adopts a spec-first workflow, it should redesign the review process, not just keep the old one with AI upstream. Good patterns include:

  • Smaller diffs
  • Explicit traceability from spec to code to tests
  • AI generated rationale next to non-obvious choices
  • Reviewer focus on edge cases and failure modes
  • Heavy reliance on automated checks, not human eyeballing alone

In other words, humans are not great at reviewing code they did not write, and AI can amplify that weakness unless the workflow becomes more interactive. Traditional techniques like pair programming were, and remain, effective because they share context during creation.
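One way to make the "explicit traceability from spec to code to tests" pattern concrete is to tag each test with the requirement it covers and flag any requirement with no covering test. A minimal sketch in Python (the requirement IDs, test names, and `covers` decorator are all hypothetical; a real setup might use pytest markers instead):

```python
# Minimal spec-to-test traceability sketch. All requirement IDs and
# test names are illustrative assumptions.

SPEC_REQUIREMENTS = {"REQ-101", "REQ-102", "REQ-103"}

def covers(req_id):
    """Tag a test function with the spec requirement it verifies."""
    def wrap(fn):
        fn.spec_id = req_id
        return fn
    return wrap

@covers("REQ-101")
def test_rejects_invalid_password():
    pass  # real assertions would live here

@covers("REQ-102")
def test_locks_account_after_retries():
    pass

def uncovered_requirements(requirements, tests):
    """Return spec requirements that no test claims to cover."""
    covered = {getattr(t, "spec_id", None) for t in tests}
    return sorted(requirements - covered)

gaps = uncovered_requirements(
    SPEC_REQUIREMENTS,
    [test_rejects_invalid_password, test_locks_account_after_retries],
)
print(gaps)  # REQ-103 has no covering test
```

Running a check like this in CI turns traceability from a convention into an enforced gate, which also gives the reviewer a map of what each diff claims to satisfy.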

Finding the "Right Size"

Whatever approach I take to AI-first implementation, I continue to find that the effectiveness of the AI-driven output is linked to the granularity of the task.

In a spec-driven model, failing to break down problems into good size increments can have unintended consequences. If an increment is too large, the AI produces a monolithic black box where design flaws are buried under sheer volume, making meaningful human review nearly impossible. If increments are too small, the system can become less cohesive, resulting in a fragmented architecture.

There is no universal "right size" for an increment of AI work. The optimal increment is determined by the situation: the complexity of the domain, the maturity of the codebase, the problem being solved, and the specific capabilities of the model being used.

Mastering this is now a core requirement for engineers navigating a spec-first workflow.

AI doesn't change the Fundamentals

AI coding systems can implement specifications quickly and often produce syntactically correct, testable code. However, AI systems are still probabilistic tools. Their outputs can be:

  • Highly effective in some contexts
  • Inconsistent in others
  • Occasionally subtly incorrect while appearing plausible

Because of this, human review takes on an additional function when AI is involved: trust calibration.

Reviewing AI generated code helps teams learn:

  • Where the model performs reliably
  • Where prompts or context are insufficient
  • Which domains require stronger constraints
  • Which tasks can safely be automated

Without this feedback loop, organisations risk developing misplaced confidence in AI output.
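That feedback loop can be made explicit by recording review outcomes per domain and watching the acceptance rates over time. A hypothetical sketch (the class, the domain labels, and the outcome categories are illustrative assumptions, not a real tool):

```python
from collections import defaultdict

class TrustCalibrator:
    """Records human review outcomes for AI-generated changes by domain,
    so a team can see where the model is reliable and where it is not."""

    def __init__(self):
        self._counts = defaultdict(lambda: {"accepted": 0, "rework": 0})

    def record(self, domain, accepted):
        """Log one reviewed change: accepted as-is, or sent back for rework."""
        self._counts[domain]["accepted" if accepted else "rework"] += 1

    def acceptance_rate(self, domain):
        """Fraction of changes accepted as-is, or None with no data yet."""
        c = self._counts[domain]
        total = c["accepted"] + c["rework"]
        return c["accepted"] / total if total else None

calibrator = TrustCalibrator()
calibrator.record("crud-endpoints", accepted=True)
calibrator.record("crud-endpoints", accepted=True)
calibrator.record("distributed-locking", accepted=False)

print(calibrator.acceptance_rate("crud-endpoints"))       # 1.0
print(calibrator.acceptance_rate("distributed-locking"))  # 0.0
```

Even a crude tally like this gives an evidence base for deciding which task types can safely move towards automation and which still warrant close human scrutiny.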

Phased Implementation Planning

An alternative to rigid spec-driven development (SDD) workflows is a DIY variant: phased implementation planning, where we collaborate with AI to create a staged implementation plan:

  1. Design and Define constraints: Set architecture design, constraints and the core rules the system must always preserve.
  2. Frame the Intent: Use the human's unique context to define the problem for the AI.
  3. Work Phases: Break the work into clearly bounded stages.
  4. Incremental AI implementation: AI creates incremental implementation plans which it then follows, while we review the code to ensure the implementation hasn't drifted from the high-level intent.

This model preserves intentional design while allowing implementation insights to inform the next stage of development.
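The four steps above can be sketched as plain data, so each phase carries its intent and constraints alongside the work and cannot silently advance without sign-off. A minimal sketch (the phase names, fields, and constraint wording are illustrative assumptions, not a prescribed schema):

```python
from dataclasses import dataclass, field

@dataclass
class Phase:
    name: str
    intent: str  # the human-framed problem for this stage
    constraints: list = field(default_factory=list)  # rules to preserve
    reviewed: bool = False  # human sign-off before the next phase starts

plan = [
    Phase("design", "Set architecture, constraints, and core invariants",
          constraints=["no service may bypass the API gateway"]),
    Phase("framing", "Describe the feature in domain terms for the AI"),
    Phase("implementation", "Generate code in bounded increments"),
]

def next_unreviewed(phases):
    """The next phase awaiting human review, or None once all are signed off."""
    return next((p for p in phases if not p.reviewed), None)

plan[0].reviewed = True
print(next_unreviewed(plan).name)  # framing
```

The point of writing the plan down as a structure rather than prose is that the review gate between phases becomes an explicit, checkable step instead of a habit.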

This approach isn't inherently better or worse, but it highlights a key point: the quality of AI output isn't simply a matter of tool, methodology, or framework. Regardless of the approach, no framework or tool, no matter how sophisticated, can automate away the core risks of architectural drift or misaligned intent. They merely change the surface area where these human responsibilities must be exercised.

The Real Risk of Skipping Human Review

Even with strong specifications, automated tests, and AI assistance, removing human review can introduce several categories of risk:

  • Architectural drift as new code subtly violates established system boundaries
  • Hidden complexity that increases long-term maintenance costs
  • Non-functional gaps in resilience, observability, and performance
  • Security and data handling mistakes that are difficult for automation to detect
  • Unidentified spec blind spots that only appear during implementation
  • Overconfidence in AI reliability without sufficient operational experience

None of these risks guarantee immediate failure. Their impact often accumulates slowly, making them particularly dangerous in large systems.

A Situational Approach to Review

This does not mean every change requires the same level of scrutiny. In mature engineering environments, review can be risk-based:

Low-risk changes may safely rely on automation and testing alone. Higher-risk areas such as security boundaries, distributed systems logic, and shared platform components still benefit from human oversight.

... At least they do today for me.
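One way to operationalise risk-based review is a routing rule over the paths a change touches. A minimal sketch (the path prefixes and tier names are hypothetical; a real team would derive the risk map from its own architecture):

```python
# Hypothetical high-risk areas: security boundaries, money movement,
# and shared platform components.
HIGH_RISK_PREFIXES = ("auth/", "billing/", "platform/shared/")

def review_tier(changed_paths):
    """Route a change: high-risk paths require human review; the rest rely
    on automated gates (tests, linting, progressive rollout)."""
    if any(path.startswith(HIGH_RISK_PREFIXES) for path in changed_paths):
        return "human-review"
    return "automated-gates"

print(review_tier(["auth/session.py", "docs/notes.md"]))  # human-review
print(review_tier(["docs/notes.md"]))                     # automated-gates
```

A rule this simple is deliberately conservative: any change that brushes a high-risk area escalates, which matches the idea that automation handles the bulk while humans guard the boundaries.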

A New Mental Model for AI-Assisted Development?

The future of enterprise software delivery will continue to evolve as the models and tooling improve, but at least in the short to medium term it is likely to combine several elements:

  • Clear specifications and architectural constraints
  • AI-assisted implementation
  • Strong automated quality gates
  • Progressive delivery and safe rollout mechanisms
  • Targeted human review in high-risk areas

Seen this way, code review is not merely a quality control step. It is part of a broader system for managing the evolution of complex software systems in an AI-augmented world.

For the foreseeable future we should still treat human judgment not as a bottleneck, but as a critical component of responsible AI-assisted development. However, as AI becomes more capable, the balance between automation and human oversight will continue to shift. The question for organisations, therefore, is what the right balance between human and AI is, and where and in what form human review should take place.