AI has reshaped PR productivity, size, and generation speed. The "small / frequent / line-by-line" review policy can't survive as-is. What to bundle as a PR (unit), what to review (target), and who blocks (gate) — three axes must be redesigned together.
1. The problem — why "good PRs" are breaking now
Primary research and industry reports since AI adoption show a consistent pattern: individual productivity rises, but org-level delivery performance stagnates or declines.
- Google DORA 2024 Report: a 25% increase in AI adoption is associated with +3.4% code quality, +3.1% review speed, and +7.5% documentation quality, yet delivery throughput falls 1.5% and stability falls 7.2%. About 75% use AI output daily, while 39% report little or no trust in AI-generated code [1][2].
- Google DORA 2025 (State of AI-assisted Software Development): AI acts as an amplifier; it scales whichever organizational strengths and weaknesses already exist. Strategic investment in foundational systems determines ROI more than tooling does [3].
- Faros AI "Productivity Paradox" 2025 (10,000+ developers, 1,255 teams): high-AI teams complete 21% more tasks and merge 98% more PRs, but PR review time rises 91%. By Amdahl's law, speeding up coding alone cannot raise overall throughput past what the un-accelerated stages allow, and review is now the slowest of them [4][5].
- Faros AI Engineering Report 2026 ("Acceleration Whiplash"): a "senior engineer tax" emerges. Median time-to-first-review +156.6%, average review time +199.6%, median linger time +441.5%. AI output looks plausible on the surface, hiding defects deeper and making review cognitively more expensive [6].
- GitClear 2025 AI Copilot Code Quality (211M LOC, 2021–2025): "moved" code (the marker of refactoring/reuse) dropped from 25% to under 10%, while copy/paste rose from 8% to ~18%. For the first time on record, copied lines exceeded moved lines. AI optimizes short-term output at the cost of long-term maintainability [7].
- Peng et al., arXiv 2302.06590 (GitHub/Microsoft): in a controlled trial the Copilot group completed identical tasks 55.8% faster. Individual coding velocity gains are real [8].
- SmartBear–Cisco Code Review Study (the largest industry code-review study): defect detection holds at 70–90% when a review covers 200–400 LOC, ≤500 LOC/hour, within 60–90 minutes. Past that envelope detection drops sharply, which means the 1,000-line PRs AI generates are structurally beyond human review [9][10].
The slowest stage (review) governs throughput — Amdahl's Law
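To make the Amdahl's-law point concrete, here is a minimal Python sketch. It assumes coding accounts for 30% of end-to-end delivery time (a hypothetical share, not a figure from the reports above) and reads the 55.8% result from Peng et al. as a 55.8% reduction in coding time. The function name and constants are illustrative only.

```python
def amdahl_speedup(accelerated_fraction: float, stage_speedup: float) -> float:
    """Overall speedup when only `accelerated_fraction` of total time gets faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / stage_speedup)


# Hypothetical: coding is 30% of lead time; review, CI, and deploy are the rest.
CODING_SHARE = 0.30

# Assumption: "55.8% faster" means coding time shrinks by 55.8%.
coding_speedup = 1.0 / (1.0 - 0.558)

overall = amdahl_speedup(CODING_SHARE, coding_speedup)

# Upper bound if coding became instantaneous while other stages stay unchanged.
ceiling = 1.0 / (1.0 - CODING_SHARE)

print(f"coding speedup: {coding_speedup:.2f}x")
print(f"end-to-end speedup: {overall:.2f}x (ceiling {ceiling:.2f}x)")
```

Under these assumed numbers the end-to-end gain is about 1.2x, and even an infinitely fast coding stage could not exceed about 1.43x while review and the other stages stay unchanged. If review time grows, as the Faros figures suggest, the real gain is smaller still.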
| Assumption behind today's policy | What's true after AI (source) |
|---|---|
| Keep PRs under 200–400 lines | The SmartBear–Cisco threshold reflects human cognition; AI exceeds it in a single generation |
| Reviewer reads line-by-line and LGTMs | Faros AI: +98% merges, +91% review time; 2026 follow-up shows median linger time +441.5% |
| The author understands intent best | Intent lives in the spec, not in AI-generated code — the rationale behind Spec-Driven Development (arXiv 2602.00180) |
| Quality is maintained by refactor/reuse | GitClear 2025: moved 25% → <10%, copy/paste 8% → ~18% |
| 1–2 reviewers gate the merge | DORA 2024: throughput −1.5%, stability −7.2% — individual speed ↑ vs system performance ↓ |
2. What to change — redesigning the three axes
Axis 1. PR unit policy — one intent, not a line cap
- Before: "PRs under 400 lines" (the SmartBear–Cisco-derived industry consensus) [10]
- After: "A PR is one verifiable intent" — the LOC cap is demoted to a secondary signal
Concrete rules:
- 1 PR = 1 acceptance-criteria unit. Even an 800-line PR is fine if it satisfies a single spec.
- AI-generated boilerplate (tests, migrations, generated code) gets a separate label so reviewers know what to read closely vs skim. (GitClear's "moved vs copy/paste" distinction becomes a policy signal.) [7]
- LOC caps become warning lines, not blocking: when the SmartBear–Cisco threshold (~400 LOC, ~500 LOC/hr) is exceeded, an automated comment asks "Can this intent be split further?" [9]
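The warning-line rule could be sketched roughly as follows. This is a minimal illustration, not a real bot integration: the function name, the constants, and the choice to subtract labeled boilerplate lines before comparing against the threshold are all assumptions made for the example.

```python
from typing import Optional

# Thresholds taken from the SmartBear-Cisco figures cited above.
WARN_LOC = 400
REVIEW_RATE_LOC_PER_HOUR = 500


def loc_warning(changed_lines: int, boilerplate_lines: int = 0) -> Optional[str]:
    """Return a non-blocking comment when close-read LOC exceeds the warning line.

    Assumption: lines labeled as AI-generated boilerplate are skimmed, so they
    are excluded from the count. Returns None when no comment is needed.
    """
    reviewable = max(changed_lines - boilerplate_lines, 0)
    if reviewable <= WARN_LOC:
        return None
    est_minutes = round(reviewable / REVIEW_RATE_LOC_PER_HOUR * 60)
    return (
        f"This PR has {reviewable} lines to read closely "
        f"(warning line: {WARN_LOC}). At {REVIEW_RATE_LOC_PER_HOUR} LOC/hour "
        f"that is roughly {est_minutes} minutes of review. "
        "Can this intent be split further?"
    )


if __name__ == "__main__":
    print(loc_warning(800))
    print(loc_warning(800, boilerplate_lines=500))
```

The function only ever returns a message or `None`; it never fails a check, which matches the intent of treating the cap as a signal rather than a gate.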
Axis 2. Review target policy — review intent, not code
Reviewer → Verifier. Review the spec, acceptance criteria, and constraints, not the diff. Spec-Driven Development (SDD) treats the spec as the primary artifact and code as the secondary one, which directly targets the AI-era code-review bottleneck [11].
SDD-adjacent findings:
- Constitutional Spec-Driven Development (arXiv 2602.02584): enforcing constraints as a "constitution" at the spec stage cut security defects by 73% in-domain, with no slowdown [12].
- Red Hat guide: compared to ad-hoc "vibe coding," SDD raises the consistency and verifiability of AI output [13].
What the new PR template should contain:
- Intent: what is changing and why (1–3 sentences)
- Acceptance Criteria: what must pass for "done" (Given/When/Then or a checklist; Gherkin recommended) [11]
- Constraints / Non-goals: what must not be touched, domain contracts
- Verification Evidence: test output, screenshots, logs, benchmarks (a human must be able to reproduce)
- AI Co-author ratio / Risk zone: annotate which parts AI generated and which a human verified
- Rollback Plan: especially required where instant rollback is impossible (e.g. mobile)
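A template like this is only useful if it is enforced. One possible check, sketched below, scans a PR description for the six sections and reports any that are missing or empty. It assumes the template is written with markdown headings (for example `## Intent`); that convention and the function name are assumptions of this example, not part of the policy above.

```python
import re
from typing import Dict, List

REQUIRED_SECTIONS = [
    "Intent",
    "Acceptance Criteria",
    "Constraints / Non-goals",
    "Verification Evidence",
    "AI Co-author ratio / Risk zone",
    "Rollback Plan",
]

# Assumed convention: each section starts with a markdown heading line.
HEADING = re.compile(r"^#{1,6}\s+(.+?)\s*$")


def missing_sections(pr_body: str) -> List[str]:
    """Return required section titles that are absent or have no content."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in pr_body.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1).lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    missing = []
    for title in REQUIRED_SECTIONS:
        content = sections.get(title.lower())
        if content is None or not any(text.strip() for text in content):
            missing.append(title)
    return missing


if __name__ == "__main__":
    body = "## Intent\nFix login timeout\n\n## Acceptance Criteria\n"
    print(missing_sections(body))
```

A check like this verifies presence only. Whether the acceptance criteria are meaningful still needs a human or a later gate to judge.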
Axis 3. Gate policy — single approval → multi-layer trust (Swiss Cheese)
A single LGTM gate can no longer hold. Google Engineering Practices still describes the primary purpose of code review as keeping the overall health of the codebase improving over time, but in the AI era that work has to spread across multiple gates to avoid the one-reviewer bottleneck [14].
A PR must clear L1 → L2 → L3 → L4 to merge; L5 is the post-merge safety net
Layer-by-layer rationale:
- L1 is the "small batch + robust testing" basics DORA 2024 emphasizes [1].
- L2 exploits the "cognitive load reduction from AI" reported by Ziegler et al. (GitHub Research) in CACM [15].
- L3 is the pre-code intent and acceptance-criteria review SDD recommends [11].
- L4 is bounded to tribal knowledge, regulated paths, and native critical paths.
- L5 is post-merge recovery (feature flags, canary, auto-rollback) that reinforces DORA's stability metrics [1].
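The layered gate can be modeled as an ordered list of checks where the first failure blocks the merge. The sketch below is one way to express that. The `PullRequest` fields and the function name are hypothetical, and it encodes the reading that L4 applies only when a PR touches a critical path. L5 is left out because it acts after merge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PullRequest:
    ci_passed: bool              # L1: automated tests and small-batch basics
    ai_review_passed: bool       # L2: AI first-pass review
    spec_approved: bool          # L3: intent and acceptance criteria reviewed
    touches_critical_path: bool  # decides whether L4 applies at all
    human_approved: bool = False # L4: human block for critical zones


def first_blocking_layer(pr: PullRequest) -> Optional[str]:
    """Return the first layer that blocks the merge, or None if all are clear."""
    checks = [
        ("L1", pr.ci_passed),
        ("L2", pr.ai_review_passed),
        ("L3", pr.spec_approved),
        # L4 is satisfied automatically when no critical path is touched.
        ("L4", pr.human_approved or not pr.touches_critical_path),
    ]
    for layer, passed in checks:
        if not passed:
            return layer
    return None


if __name__ == "__main__":
    routine = PullRequest(True, True, True, touches_critical_path=False)
    payment = PullRequest(True, True, True, touches_critical_path=True)
    print(first_blocking_layer(routine))  # nothing blocks
    print(first_blocking_layer(payment))  # waits on the human gate
```

Evaluating layers in order keeps the expensive human gate last, so reviewers only see PRs that have already cleared the automated layers.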
3. Mobile / app considerations
"Ship fast, revert faster" is comfortable on server / web, but mobile apps have deploy cadence and rollback constraints. The "small batch + robust testing" principle DORA 2024 highlighted needs heavier application on mobile.1
Same PR, different surface — the merge gates must adapt to deploy economics
- Make pre-deploy gates (L1–L4) heavier, and lean less on post-deploy observability (L5).
- Native critical paths (payments, signing, key management, WalletConnect) require an L4 human block. Exclude them from AI-auto-merge.
- Include UI snapshot tests and UI automation tests in L1. Visual regressions are hard to catch via observability.
- Split OTA/CodePush-able areas from native code via PR labels to reflect rollback-cost differences in policy.
- State the force-update policy and revert cost in every PR description.
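Label-based routing for mobile could look roughly like the sketch below, which derives PR labels from the changed file paths. Every path pattern and label name here is hypothetical and would need to match the actual repository layout. The sketch also assumes that anything outside the native directories can ship over the air, which will not hold for every app.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Set

# Hypothetical patterns; adjust to the real repository layout.
CRITICAL_PATTERNS = [
    "*/payments/*",
    "*/signing/*",
    "*/keys/*",
    "*/walletconnect/*",
]
NATIVE_PATTERNS = ["ios/*", "android/*"]


def matches_any(path: str, patterns: Iterable[str]) -> bool:
    """True if the path matches at least one glob pattern (case-sensitive)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def labels_for(changed_paths: Iterable[str]) -> Set[str]:
    """Derive PR labels that reflect review gate and rollback cost."""
    labels: Set[str] = set()
    for path in changed_paths:
        if matches_any(path, CRITICAL_PATTERNS):
            labels.add("human-block")    # must clear the L4 human gate
        if matches_any(path, NATIVE_PATTERNS):
            labels.add("native")         # needs a store release to roll back
        else:
            labels.add("ota-updatable")  # assumed shippable over the air
    return labels


if __name__ == "__main__":
    print(sorted(labels_for(["src/ui/Button.tsx"])))
    print(sorted(labels_for(["ios/payments/Signer.swift", "src/app.ts"])))
```

A PR that carries both `native` and `ota-updatable` touches two rollback regimes at once, which may itself be a useful signal that it should be split.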
4. Metrics — what to measure so the policy stays alive
To see whether the policy is working, layer AI-era-specific metrics on top of DORA's four keys (throughput / stability) [1][4].
| Metric | Meaning / source | Goal |
|---|---|---|
| PR Review Lead Time (median / mean) | Faros AI's most-degraded headline metric [6] | ↓ |
| Code Churn Rate (% of new code rewritten within 2 weeks) | Tracks GitClear's short-term churn at the team level [7] | ↓ |
| Copy/Paste vs Moved Code Ratio | GitClear's central signal: duplication vs reuse [7] | Moved ↑, Copy/Paste ↓ |
| Rubber Stamp Rate (% of PRs with 0–1 review comments) | Signal that review has become a formality; SmartBear's "active review" principle [16] | ↓ |
| Delivery Throughput / Change Failure Rate / MTTR | DORA 4 keys; confirms org-level effect [1] | Throughput ↑, Failure ↓, MTTR ↓ |
| Spec Review Coverage | % of PRs with explicit acceptance criteria; SDD adoption indicator [11] | ↑ |
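Three of these metrics can be computed directly from per-PR records. The following sketch assumes a simple record shape (`PRRecord` and its fields are hypothetical) and uses the table's own definitions: rubber stamps are PRs with 0–1 review comments, and spec coverage is the share of PRs with explicit acceptance criteria.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Dict, List


@dataclass(frozen=True)
class PRRecord:
    review_hours: float            # time from PR open to approval
    review_comments: int           # number of review comments received
    has_acceptance_criteria: bool  # PR description states explicit criteria


def review_metrics(prs: List[PRRecord]) -> Dict[str, float]:
    """Compute review lead time, rubber stamp rate, and spec review coverage."""
    if not prs:
        raise ValueError("at least one PR record is required")
    total = len(prs)
    hours = [pr.review_hours for pr in prs]
    rubber_stamps = sum(1 for pr in prs if pr.review_comments <= 1)
    with_spec = sum(1 for pr in prs if pr.has_acceptance_criteria)
    return {
        "review_lead_time_median_h": median(hours),
        "review_lead_time_mean_h": mean(hours),
        "rubber_stamp_rate": rubber_stamps / total,
        "spec_review_coverage": with_spec / total,
    }


if __name__ == "__main__":
    sample = [
        PRRecord(2.0, 0, True),
        PRRecord(4.0, 1, True),
        PRRecord(6.0, 3, False),
        PRRecord(20.0, 5, True),
    ]
    for name, value in review_metrics(sample).items():
        print(f"{name}: {value}")
```

Reporting both median and mean lead time is deliberate: a few PRs that linger for days pull the mean up while leaving the median almost unchanged, and the gap between the two is itself informative.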
5. A four-phase transition roadmap
- Phase 0 (Measure, 2 weeks): capture DORA 4 keys + Faros-style review metrics + GitClear churn as the baseline [17].
- Phase 1 (PR template + AI first-pass review, 1 month): enforce the new Intent/AC/Evidence template; add GitHub Copilot Code Review (or equivalent) as L2 [15].
- Phase 2 (Spec / Intent review pilot, 2 months): adopt pre-code acceptance-criteria review on 1–2 teams (SDD) [11].
- Phase 3 (Human Block Zone separation, ongoing): humans gate only critical paths; the rest auto-approves via L1+L2+L5 while preserving DORA's "basics" principle [1]. Goal: review wait time −50%, production defect rate stable.
6. Open questions — leaving them honest
- The spec-writing bottleneck. The code-review bottleneck may simply migrate into a spec-review bottleneck. The SDD literature names this limit explicitly [11].
- Junior growth paths. Traditional code review was also a learning mechanism (Google eng-practices stresses mentoring) [14]. What replaces it in a verifier-centric model?
- Legal accountability. Who owns failure when AI generates and AI reviews? DORA 2024's finding that 39% of respondents have little or no trust in AI-generated code points to this gap [1].
- Same-kind verification blind spot. When the same model family generates and verifies, systemic biases reproduce on both sides.
- Individual vs organization gap. Faros and DORA agree on "individual productivity ↑, org delivery ↓." How do we narrow it? [4][1]
7. Conclusion — in one line
A "good PR" is no longer a small diff. It's one verifiable intent plus the multi-layered evidence proving it.
Demote the LOC cap to a warning line (SmartBear–Cisco threshold); promote intent, acceptance criteria, and multi-layer gates to first-class policy.
References
Google DORA Research
- Accelerate State of DevOps Report 2024
- 2025 State of AI-assisted Software Development
- Announcing the 2024 DORA report — Google Cloud Blog
Faros AI Research
- The AI Productivity Paradox Report 2025
- Are AI coding assistants really saving time, money and effort?
- The AI Engineering Report 2026: The AI Acceleration Whiplash
GitClear
SmartBear / Cisco Code Review Study
- Code Review at Cisco Systems (full PDF)
- What Is Code Review? — 200–400 LOC / 500 LOC/hr threshold
- Best Practices for Peer Code Review
Google Engineering Practices
Academic papers (arXiv / CACM)
- Peng et al., The Impact of AI on Developer Productivity: Evidence from GitHub Copilot — arXiv 2302.06590
- Ziegler et al., Measuring GitHub Copilot's Impact on Productivity — Communications of the ACM
- Spec-Driven Development: From Code to Contract in the Age of AI Coding Assistants — arXiv 2602.00180
- Constitutional Spec-Driven Development: Enforcing Security by Construction — arXiv 2602.02584
Industry analysis
- McKinsey — Unleashing developer productivity with generative AI
- Red Hat — How spec-driven development improves AI coding quality
Footnotes
- Accelerate State of DevOps Report 2024 (DORA)
- Are AI coding assistants really saving time, money and effort? (Faros AI)
- The AI Engineering Report 2026: The AI Acceleration Whiplash (Faros AI)
- Peng et al., The Impact of AI on Developer Productivity (arXiv 2302.06590)
- How spec-driven development improves AI coding quality (Red Hat)
- Measuring GitHub Copilot's Impact on Productivity (CACM, Ziegler et al.)