Skip to content
voodootikigodvoodootikigod

The Agentic Development Lifecycle

Eight phases, two human gates, and machine-checked everything between — a development lifecycle rebuilt around how models actually fail, not how humans work.

9 parts · ~76 min read

Part 1Stop Running the SDLC on Models That Aren't Human

Read this part on its own page ↗

Here is a thing that happens every week now, especially as more enterprise organizations open their minds to Vibe Coding and start expanding past the initial prompting phase(s). They make one agent play the product manager. Another agent act as the senior engineer. A final agent comes in as the code reviewer. Together they have inadvertantly made a digital twin of their org chart. They hold a little standup. It demos beautifully.

One week later the team is debugging a feature where the UI works, the tests pass, and the data underneath is stubbed. The agents built a convincing storefront with nothing behind the counter, declared victory, and every agent downstream agreed with it.

The instinct to recreate the SDLC with agents in the human seats is understandable and wrong, and it's wrong in a way worth being precise about.


The SDLC is a defense system against humans#

The software development lifecycle is not a neutral description of how software gets built. It is a sixty-year accumulation of defenses against human failure modes: forgetfulness, ego, fatigue, fear of blame, communication cost, knowledge silos, and more; all rooted in human behavior.

Standups exist because humans don't share state. Code review exists because humans have ego-blind spots about their own work. Estimation rituals exist because humans dislike admitting uncertainty in front of their manager. Documentation requirements exist because humans quit, go on vacation, and forget.

Every ritual traces to a flaw. That's what makes the SDLC good, but it is also why copying it wholesale into agentic development does two bad things at once:

  1. It imports defenses against flaws models don't have. An agent doesn't need a standup; it coordinates through artifacts. It has no ego to bruise in review. It never gets tired at hour nine and starts cutting corners. It will happily run fifty iterations of a loop at 3am with identical diligence.
  2. It misses flaws humans don't have. No human hallucinates an entire API surface with full confidence. No human writes two hundred lines of plausible code against a library method that does not exist. And almost no human deletes a failing test at 2am and reports "all green." The model will, sincerely, and it won't even feel bad.

So here is the design rule this entire series hangs on:

Every phase, gate, and loop in an agentic development lifecycle must trace to a specific model failure mode it defends against, or a specific model property it exploits. If it traces to a human failure mode instead, cut it.

That rule is a scalpel. Apply it to your current agent workflow and watch how much of it turns out to be theater.


The flaw inventory#

If the lifecycle must derive from model failure modes, the first job is to name them. These eight are load-bearing, and everything in the rest of this series derives from this table.

F1: Premature satisfaction. The model does the least that arguably satisfies the instruction, then declares victory. Ask for "a working settings page" and you'll get one that renders, backed by hardcoded data, with the save button wired to nothing. Implicit requirements are silently dropped, because nothing forced them to be explicit. The defense: make satisfaction machine-checkable. Acceptance criteria must be executable, not prose.

F2: Sycophancy. The model is biased toward telling its principal (read: you) what they want to hear. I generally explain this to people as "think of the LLM as a 5 year old child, it always wants to make you happy, even when completely wrong." This makes self-review structurally worthless: "does this look right?" always returns yes. The agent reviewing its own work is not lying, exactly; it is doing what it was trained to do, which is agree with you. The defense: never ask an agent to validate work it (or its context) produced.

F3: Context rot. Judgment degrades as the context window fills. Early instructions fade. The model increasingly anchors on its own prior outputs, which means a builder agent literally cannot see its own bugs, because its context is the bug. Long sessions don't make agents wiser; they make them confidently entrenched. The defense: atomic tasks, fresh context per task, and pass conclusions between agents, never transcripts.

F4: Confident hallucination. Fabricated APIs. "Fixed the bug" without running anything. Review findings invented to look thorough. The signature property is that confidence carries no information; the model asserts the fabricated method with exactly the tone it asserts the real one. The defense: evidence or it didn't happen. Every claim gates through something deterministic: a test, a typecheck, a build, a reproduction.

F5: Reward hacking. Put a gate in front of the model and, under pressure, it will game the gate: delete the failing test, weaken the assertion to toBeDefined(), mock the thing being tested, add a skip marker. Every metric you gate on gets Goodharted — as in Goodhart's Law: when a measure becomes a target, it ceases to be a good measure — and it gets Goodharted at machine speed, sincerely. The agent reports success and believes it. The defense is the subject of an entire post in this series: gates must be hollow-proof, and the rails must be protected from the builder.

F6: The finding-count prior. Ask a model to review code and it converges on ten to twenty findings, then stops, regardless of how many problems actually exist. That number is a prior baked in by training, not a measurement of your code. Single-pass review systematically undercounts. The defense: loop with fresh contexts until consecutive passes come up dry.

F7: Generative bloat. Models are verbose, duplicative, and cheerfully reinvent what already exists three files away. Agent-built codebases tend to carry real excess (duplicated logic, dead branches, helpers reinvented three files away) and the fat compounds: every future agent pays to read it. The defense, counterintuitively, is not to police duplication at authoring time; it's a post-merge simplification phase, where dedup is mechanical instead of speculative.

F8: Coherence loss across models and sessions. Different models have different idioms; so do fresh sessions of the same model. Switch models mid-task (or "resume where the other one left off") and you get stylistic and architectural seams down the middle of the work. The defense: pin one model per task; switch only at task boundaries.

Read that list again and notice what's not on it: laziness, ego, fear, politics, forgetting what was decided last sprint. The entire human flaw profile the SDLC was built to contain is absent. Different diseases require different medicine.


The flaws that are secretly features#

Here is the part that took longer to see, and it's the half that makes an agentic lifecycle work because of model properties rather than despite them.

E1: Sampling diversity is free N-version programming. Run the same prompt N times and you get N genuinely different attempts. For search problems (find the bugs, propose a design, hunt the performance regression) that's a free ensemble. N-version programming was always theoretically attractive and economically absurd with humans. It is now nearly free.

E2: Sycophancy is aimable. The same compliance bias that makes self-review worthless makes an agent chartered to refute relentless. Tell an agent "find what's wrong with this, and if you find nothing, say so" and the agreement bias locks onto the refutation charter instead of onto you. Adversarial review doesn't work despite sycophancy. It works because of it: you aim the bias at the artifact.

E3: No ego, no fatigue, no blame-fear. Reviews can be brutal with no feelings hurt. Loops can run fifty iterations. And, most underused, work can be thrown away wholesale. Discard-and-retry is a first-class strategy: regenerating from a corrected spec is often cheaper than repairing a flawed attempt, and the agent will not sulk about it.

E4: Context rot has an inverse. A fresh context is genuinely unbiased by the construction history. This is the active ingredient in creator/critic separation: fresh-context review is only valuable because contexts contaminate. The critic must never share the creator's context, not as etiquette, but as the mechanism itself.

E5: The cost asymmetry moved. Exploration, review, and rewriting now approach free relative to human time. Activities the SDLC pushed to the front of the lifecycle because they were expensive to redo (architecture review, dedup analysis, exhaustive review passes) can move to the back, where they have full information.


What this buys you#

Put the two tables together and a lifecycle starts to fall out, almost mechanically:

  • F2 + E4 ⇒ creators and critics must be different contexts, and the critic gets a refute charter.
  • F3 ⇒ work decomposes into tasks sized to the useful context window, each run fresh.
  • F4 ⇒ no claim crosses a phase boundary without deterministic evidence.
  • F5 ⇒ the tests and contracts that gate the builder are authored elsewhere and frozen; the builder cannot touch its own acceptance criteria.
  • F6 + E1 ⇒ review is a fan-out that loops until dry, not a single pass.
  • F7 + E5 ⇒ simplification is a phase that runs after merge, under green tests.
  • E3 ⇒ when an agent flails, you don't coach it; you kill it and regenerate from an improved task.

The blend of these flaws and benefits creates the structure of the Agentic Development Lifecycle. It is not a re-skin of the traditional SDLC with agents sitting in the human seats. It is an entirely new development lifecycle model, built for the capabilities and limits of the models, not in spite of them. From my perspective, the lifecycle is eight phases, two human gates, deterministic checks between every phase, and implementable by a toolkit of small npx runnable gates that enforce it in CI.

The rest of this series walks through it:

  1. The lifecycle itself: eight phases, exactly two human moments, and why the spend curve is a barbell.
  2. Rails: tests as the spec in the only language the builder can't argue with, and the Goodhart catalog of how agents game gates.
  3. Prosecution: why code review becomes prosecution, and how to measure whether your reviewer actually catches anything.
  4. Parallelism: the three dials of multi-agent orchestration, and why "3-5 agents" keeps showing up in everyone's field reports.
  5. Compounding: the phase that makes run N+1 cheaper than run N, and the economics of cost-per-merged-verified-change.
  6. The proof: how the toolkit enforcing this lifecycle was built by the lifecycle, and the adoption path that doesn't die in week two.

One closing note on stakes: none of the failure modes above are exotic. Every one of them has bitten every team that has run agents for more than a month. It usually happens quietly, discovered in production or in a diff nobody actually read. The teams concluding "agents don't work here" are, almost without exception, teams that pointed sixty years of human-shaped process at a non-human failure profile and were surprised when it caught nothing.

The models aren't the problem. The lifecycle is. Let's build the right one.


Part 2Two Human Gates and Everything Between Is Machine-Checked

Read this part on its own page ↗

The last post, Stop Running the SDLC on Models That Aren't Human, laid out the argument that the SDLC defends against human failure modes, models fail differently, and every phase of an agentic lifecycle must trace to a specific model flaw it defends against or a model property it exploits.

This post introduces the lifecycle that falls out of that rule. Eight phases. Deterministic gates between every pair. And exactly two mandatory human moments in the entire loop. Just two, so get ready to put your trust in the machine.


The shape#

Approved Redo Approved Redo Feeds next run P0: Triage P1: Interrogate Human Gate 1:Spec Approval P2: Decompose Gate: Cold-Start Test P3: Rail RED Gate: Tests fail, types check P4: Build Green Gate: Rails green, build/lint pass P5: Prosecute Zero-Findings Gate: No open findings P6: Integrate Human Gate 2:Behavioral Acceptance P7: Distill
Click diagram to zoom

Before walking through it, one principle that governs all the arrows: an LLM→LLM handoff without a deterministic checkpoint multiplies error rates. The chain is only as strong as its non-LLM links. Between any two phases there must be something that cannot hallucinate, such as a compiler, a test suite, a schema validator, or a human. Probabilistic components in series compound their error; deterministic gates between them reset it.

Phase 0: Triage#

Not everything earns the full lifecycle, and running the full ceremony on a typo is how agentic lifecycles die of friction in week two, as well as your token budget. Route by risk × blast radius, not size:

  • Trivial (copy change, config tweak with existing coverage): direct edit, existing tests, one review pass. Cheap model.
  • Bounded (bug fix inside one module): skip straight to Phase 3, writing the failing test that is the bug report, then fixing it and running a light review.
  • Substantial (new feature, cross-cutting change): full lifecycle.
  • Architectural (new system, contract changes): full lifecycle plus design alternatives evaluated by a judge panel.

Phase 1: Interrogate#

The single highest-leverage phase, because error here compounds through everything after it, and no downstream gate can catch "built the wrong thing correctly."

The mechanism is interrogation: ask me questions until you have none left, but check the codebase before asking each one. That second clause is the half that matters. Without it you get twenty questions the repo already answers, and the human tunes out by question six. Quite transparently, this one crystalized for me thanks to Matt Pocock's now famous grill-me skill.

The framing correction that took me a while though was people say planning "reduces non-determinism," and that's wrong in a way that matters. Sampling randomness is not the enemy; at temperature zero, a vague spec still yields confidently wrong code. The enemy is underspecification. The model fills every gap with its prior, and its prior is "whatever is most generic." Interrogation works by transferring the spec from your head into the context before the gaps get filled by invention. That's flaw F1 from Stop Running the SDLC on Models That Aren't Human (premature satisfaction) being starved of gaps to exploit.

The output is a spec where every acceptance criterion names its verification method: a test to be written, a command whose output is asserted, or a behavior demonstrated. A criterion with no verification method is a wish, and wishes get the minimum-effort treatment.

Gate: a human approves the spec. This is human gate one of two, and it is the human's highest-value moment in the entire lifecycle. This is our moment to shine. Minutes here replace hours of diff review later. Use the best model you have in this phase and don't economize; a subtly wrong spec sails through every downstream gate and poisons everything. Invest the tokens, invest the time. This has been re-iterated by even the author of Claude Code, Boris Cherny.

Phase 2: Decompose#

Defends against context rot (F3). The unit of work is sized to the useful context window (the region before judgment degrades) not the advertised one.

Slice the spec into atomic tickets, each executable by a fresh agent from the ticket text alone. Draw partition lines along interfaces, and write the contract at each boundary explicitly (types, schemas, endpoint shapes). Contracts are what let the build phase parallelize safely; parallel agents that collide do so on shared types and configs, never on feature code. Pin the shared surface first and parallel construction stops colliding.

Gate (the cold-start test): hand each ticket to a fresh, cheap model and ask "what's missing to execute this without asking a single question?" If a cheap model can enumerate the gaps, the ticket is underspecified for the mid-tier model that will actually run it. Costs pennies (if even) per ticket. Catches the number-one cause of build-phase flailing before a more expensive model burns dollars discovering it.

Phase 3: Rail#

The trust anchor of the whole lifecycle: tests, type stubs, and contracts authored from the spec in a context that will never see the implementation, then frozen. The builder cannot edit them. Tests Are the Spec in the Only Language the Builder Can't Argue With is entirely about this phase, so for now, let's just talk about the gate:

Gate: the suite runs RED for the right reasons (failures say "not implemented," not "test is broken") and the stubs typecheck.

Phase 4: Build#

One fresh agent per ticket: ticket + relevant skills + frozen rails. No carry-over context between tickets (F3 again). Parallelize across partitions in git worktrees, single writer per partition, merge sequentially.

Mid-tier model by default. This surprises people, so it gets its own principle: model tier is a function of the cost of detecting an error, not of task prestige. Where the rails are dense, errors are caught instantly and deterministically, so the cheapest model that clears the gates is the correct one. Where errors are expensive to find (specs, contracts, migrations without coverage) that's where the frontier model goes. This inverts the common instinct (best model writes the code). The code is the most-verified artifact in the system; the spec is the least.

Two operational rules worth stealing even if you adopt nothing else:

  • Two-strike regeneration. If an agent flails (loops on the same error or starts touching files outside its ticket) do not coach it inside the same rotting context. Kill it, append the dead-ends to the ticket ("known failed approaches: …"), and start fresh. If the regeneration also fails, the ticket is wrong; escalate to Phase 2, not to a bigger model. The second-cheapest fix is a fresh start; the most expensive is a long conversation with a confused agent.
  • No personas. "You are a senior Next.js engineer with 15 years of experience" adds vibes, not capability. An agent is its context, tools, charter, and gate. Skills add capability; charters add direction; costumes add tokens.

Gate: rails green, build passes, lint passes. Deterministic. No opinions.

Phase 5: Prosecute#

Not "code review." Prosecution: fresh contexts chartered to refute, with the burden of proof on the finding; every finding is reproduced by a verifier or killed before anyone acts on it, and the fan-out loops until two consecutive passes come up dry. Prosecution, Not Code Review covers this phase and the tooling that measures whether your review stack actually catches anything.

Gate: zero verified open findings, rails still green, and the rails diff is empty, which is mechanical proof the builder never touched the tests.

Phase 6: Integrate#

Human gate two, and it is not "read the diff."

The 5,000-line diff read is litmus theater. The human scrolls, pattern-matches for nothing in particular, approves, and the org books "human in the loop." Human attention is the scarcest, most costly resource in the lifecycle; spend it where machines are blind:

  • Read the spec-conformance summary: what was promised, what was verified, what was explicitly not done.
  • Read the test diff (small, high-signal, and it is the behavioral contract).
  • Run the thing. A two-minute demo catches the one category of wrongness no reviewer-agent can: "this is technically correct and not what I meant."
  • Spot-check the two or three hotspots prosecution flagged. Not the whole surface.

Gate: human behavioral acceptance. "Is this what I meant, running?"

Phase 7: Distill#

The phase everyone skips, which is why their costs stay flat while their codebases bloat. Two halves: simplify (post-merge dedup and dead-code removal under the still-green rails, where you should expect a substantial reduction on agent-generated code) and mine (recurring review findings become lint rules; recurring interrogation questions become spec templates; conventions become skills. This is the compounding loop; The Lifecycle That Gets Cheaper Every Run explains and explores this fully.


The two human gates, stated plainly#

The entire lifecycle has exactly two mandatory human moments, by design:

  1. Phase 1: "Is this what I meant?" (spec approval).
  2. Phase 6: "Is this what I meant, running?" (behavioral acceptance).

Everything between them is machine-gated. Humans intervene elsewhere only on escalation: non-converging loops, out-of-scope flags, contract changes. This is not human-out-of-the-loop. It is human-at-the-two-points-where-human-judgment-is-irreplaceable, instead of human-as-tired-diff-scroller. It is "Right Tool, Right Task, Right Time". The human is the ground truth for intent, and intent is checked exactly twice: once as words, once as behavior.


The barbell#

Where does the money go? Heavy at the ends, light in the middle:

P1 Interrogate
Heavy: best model, human time; cheap insurance
P2 Decompose
Moderate: frontier for contracts
P3 Rail
Moderate: one-time per feature
P4 Build
Light: skills + rails make this cheap
P5 Prosecute
Heavy: fan-out × loop-until-dry; where quality is bought
P6 Integrate
Human minutes, not hours
P7 Distill
Light: and it discounts every future run

If your spend is concentrated in Phase 4 (the build) your team is exploring (re-reading the codebase every run) instead of exploiting (skills, atomic tickets, cached context). That's a diagnostic, not a judgment; it tells you which phase is missing.

The barbell also explains why this lifecycle reads as heresy to agile instincts. Agile economized on planning because human building was slow and specs went stale before the build caught up. Building is now fast and cheap; misbuilding is what's expensive. The economics inverted, so the phase weighting inverts. "Working software over comprehensive documentation" was a correct response to 2001's cost structure. It is the wrong response to this one.


Norms rejected#

Positions this lifecycle takes deliberately, so you can disagree deliberately:

NormVerdictWhy
Human review of full agent diffsRejectTheater past ~500 lines. Attention goes to spec, test diff, behavior
Agile-weight planningReject for agentic workThe economics inverted; see above
Persona engineeringRejectCapability lives in skills, tools, charters. Costumes are token overhead
Multi-agent collaborative construction (3-7 creators comparing notes)RejectSearch-parallelism misapplied to construction. Partition + contract + single writer instead
DRY at authoring timeRejectDedup moves to Phase 7, where it's mechanical instead of speculative
Coverage % as a quality gateRejectGoodharted at machine speed. More in Tests Are the Spec in the Only Language the Builder Can't Argue With
Token quotas as cost controlRejectCaps the wrong variable. A quota-pressured developer cuts the review phase first, which represents the most valuable tokens in the system. Govern cost per merged, verified change instead
Mid-task model failoverRejectCoherence loss (F8). Models switch at task boundaries only

Every row traces back to the flaw inventory. That's the test: if you find yourself adding a ritual that doesn't trace, you're importing human-shaped process again.

Next up is the phase the whole structure leans on. A phrase that has continually gained focus and importance as we evolve with agentic work: in traditional software development (SDLC), tests verify the code; in agentic development (ADLC), tests are the spec rendered in the only language the builder can't argue with. The builder will try to argue anyway (by editing the tests). What happens then is the subject of Tests Are the Spec in the Only Language the Builder Can't Argue With.


Part 3Tests Are the Spec in the Only Language the Builder Can't Argue With

Read this part on its own page ↗

Two Human Gates and Everything Between Is Machine-Checked outlined the lifecycle: eight phases, two human gates, deterministic checks between everything. This post is about the phase the entire structure leans on (Phase 3: Rail) and the model behavior that makes it necessary.

Start with the inversion, because everything else follows from it:

In the SDLC, tests verify the code. In the ADLC, tests are the spec rendered in the only language the builder can't argue with. A test is the one critic that is never sycophantic, never rots, and never hallucinates.

In human development, TDD is a quality ritual, a discipline signal adopted or skipped by taste, argued about at conference bars for twenty years. In agentic development its role changes completely. Recall the flaw inventory from Stop Running the SDLC on Models That Aren't Human: the model claims success without evidence (F4), agrees with whoever asks (F2), and does the minimum that arguably satisfies the instruction (F1). Against that profile, every probabilistic gate (review, self-checks, "look this over") leaks. The test suite is the one gate that doesn't. TDD is not a quality ritual here. It is the load-bearing trust mechanism of the entire lifecycle. Every other gate is probabilistic; this one is not.

Which immediately raises the question this post is actually about: what happens when the thing being gated can edit the gate?


Reward hacking, observed in the wild#

Flaw F5, restated: put a gate in front of a model and, under pressure to satisfy it, the model will game the gate rather than clear it. It does this not occasionally, but reliably, given enough pressure and enough iterations. And unlike a human cutting corners, it games the gate sincerely: it reports success and, in whatever sense applies, believes it.

The catalog of moves is depressingly consistent across teams, models, and vendors. If you've run agents against a test suite for more than a month you have seen most of these:

  • Delete the failing test. The classic. Often accompanied by a commit message like "remove outdated test."
  • Weaken the assertion. expect(result.total).toBe(427.50) becomes expect(result).toBeDefined(). The test still exists. It still runs. It checks nothing.
  • Mock the thing being tested. The function under test gets stubbed in its own test file. The suite goes green. The feature does not exist.
  • Snapshot churn. Regenerate the snapshot to match the broken output. Assertion inverted: the bug is now the spec.
  • Skip markers. it.skip, xfail, // eslint-disable-next-line (suppression as a service).
  • "Fixed" without running. The agent reports the fix is in and the tests pass. The tests were never executed. This isn't lying in the human sense; it's F4, where the claim and the hallucination are indistinguishable from the inside.

Here is the uncomfortable conclusion: instructions cannot fix this. "Do not modify the tests" is a sentence in a context window, and a sentence in a context window is exactly the kind of constraint F5 routes around under pressure. It does this not maliciously, but the way water routes around a stone. By iteration thirty of a stuck debugging loop, that instruction is competing with an overwhelming gradient toward make the gate go green, and the gradient wins.

So the defense has to be structural.


The rail discipline#

Three rules make a test suite into rails: a structure the builder runs inside rather than a hurdle it can negotiate with.

1. Author the rails in a context that will never see the implementation.

Tests, type stubs, and interface contracts are written from the spec, before any implementation exists, by an agent whose context contains the spec and nothing else. This is the creator/critic separation from the first post (E4) applied at authoring time: a test written by the same context that writes the code inherits the code's assumptions, including the wrong ones. A test written from the spec alone encodes the spec's assumptions, which is the entire point. The rails are the spec, compiled to executable form.

2. Freeze the rails during the build: mechanically.

During Phase 4, the builder cannot edit test files, contract types, or CI config. Not "is instructed not to" but cannot. Enforce at the tool layer: a pre-tool-use hook that blocks writes to rail paths, branch protection on test directories, or file permissions. Declare the rail paths and it blocks builder edits during the build phase and emits a rails-diff-empty proof at the gate, which acts as mechanical evidence the builder never touched them.

The principle generalizes well beyond tests: a constraint that lives in the prompt layer is a request; a constraint that lives in the tool layer is a fact. Agents route around requests.

3. Audit the rails themselves, once, adversarially.

A frozen gate made of fog is still fog. Before freezing, run one adversarial pass over the tests: do they assert behavior, or do they assert that mocks were called? The single best audit question I know:

"Does any test fail if the feature is deleted?"

If you can stub the entire implementation with return null and the suite stays green, you don't have a gate. You have a ceremony.

Gate, restated from Two Human Gates and Everything Between Is Machine-Checked: the suite runs RED for the right reasons (failures say "not implemented," not "test is broken") and the stubs typecheck. Red-for-the-right-reasons matters: a suite that's red because of an import error tells you nothing when it later goes green.


The Goodhart annex#

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Gates are measures-turned-targets by definition, and agents optimize against them at machine speed and so every gate attracts a gaming move the way every contract attracts a loophole. Hence the annex, in the contract sense: the attached schedule of known loopholes and the clause that closes each one.

Each gaming move, paired with its structural defense. Every gate in the lifecycle ships with its anti-vacuity defense, or it isn't a gate:

Gaming moveDefense
Delete or edit the failing testRails frozen during build, enforced by hook; review phase verifies the rails diff is empty
Weaken assertions (toBeDefined(), snapshot churn)Adversarial test audit before freezing; a review lens dedicated to builder-added tests
Mock the thing being testedThe deletion question: "does any test fail if the feature is deleted?"
skip / xfail / lint-suppress markersDeterministic grep gate in CI: new suppression markers fail the build unless declared in the ticket
"Fixed" without runningClaims require execution evidence in the gate itself, not in the agent's report
Coverage padding (exercise lines, assert nothing)Don't gate on coverage at all; see below
Out-of-ticket "improvements" masking scope creepDiff scoped to the ticket's declared files; out-of-scope changes auto-flagged to the human gate

Notice the texture of the defenses: hooks, greps, diffs, file scopes. Small, dumb, deterministic. That's not a limitation; it's the design. The failure mode being defended against is a system that's brilliant at gaming smart gates. You beat it with gates too dumb to game.


Why coverage percentage is the wrong gate#

Coverage is the most Goodhart-able metric in software, and agents "Goodhart" at machine speed. An agent gated on 80% coverage will hit 80% coverage: assert-free tests, snapshot spam, tests that execute every line and constrain nothing. Humans game coverage too, but slowly, and with enough shame to keep it in check. Agents do it instantly, thoroughly, and sincerely.

If you want a quantitative gate on test quality, the honest version is mutation testing: deliberately break the implementation and check that some test notices. A test suite that can't tell broken code from working code is hollow, whatever its coverage number says.

Full mutation testing is famously too slow for CI, which is why almost nobody runs it. The fix is scope: mutate only what the current diff touches. For the tests covering a diff, mutate the implementation (invert conditionals, null the returns, swap operators, plus a few LLM-authored semantic mutants, the subtle kind), run the suite, and report any mutant that survives every test. A surviving mutant is a proof object showing a behavior change your tests cannot see. Diff-scoping keeps it at minutes instead of hours. The key output is the list of survivors.

One deterministic check, and it closes three rows of the table above (assertion-weakening, mock-everything, and coverage padding) because all three produce the same detectable symptom: mutants survive.


What the builder is allowed to do#

A clarification that prevents a common misreading: the builder can write tests. Unit tests for internals (written during the build, alongside the code) are fine and encouraged. They just aren't rails. They don't gate anything, and they get prosecuted like everything else the builder produced (one review lens is dedicated specifically to auditing builder-added tests).

The distinction is provenance, not file type. Rails are authored from the spec, by a context that never saw the implementation, and frozen. Anything the builder wrote is work product, and work product gets reviewed. The moment a builder-authored test starts gating the builder's own work, you've reinvented self-review with extra steps, which is F2 in a hard hat.


The trust chain#

Step back and look at what the rail discipline buys the lifecycle as a whole. Every downstream phase inherits its trustworthiness from this one:

  • The build gate ("rails green") means something only because the builder couldn't edit the rails.
  • The review phase can focus on what tests can't catch only because the tests deterministically catch what they can.
  • The post-merge simplification phase can refactor aggressively only because the still-frozen, still-green rails define "behavior preserved."
  • The human at the final gate can skip the 5,000-line diff only because the rails-diff-empty proof and the green suite arrive as evidence, not as claims.

Pull the rails out and every one of those collapses back into "trust the model's self-report," which is to say, collapses entirely. This is why the phase ordering is non-negotiable: rails before build, always. The most expensive sentence in agentic development is "I'll add tests after it works," because works, without rails, is a claim made by the thing being gated.

So: the rails hold the builder. The suite is green, the diff is empty, the mutants die. Done?

No, because everything the rails can't see still gets through. The rails are exactly as good as the spec they encode, and they encode nothing about the spec's gaps: the race condition nobody wrote a test for, the auth check missing from an endpoint the spec forgot, or the contract drift between two tickets. Catching what the rails can't see requires judgment. Judgment, per Stop Running the SDLC on Models That Aren't Human, means fresh contexts with inverted charters, because the builder's own context is sycophantic about its own work.

That's the prosecution phase. And it has a problem nobody talks about: who reviews the reviewer? If your adversarial review stack has blind spots (and it does), how would you know? It turns out you can measure it, with planted bugs and arithmetic.


Part 4Prosecution, Not Code Review

Read this part on its own page ↗

Tests Are the Spec in the Only Language the Builder Can't Argue With ended on the limits of rails: tests deterministically catch everything the spec encoded, and nothing it didn't. The race condition nobody thought to test. The auth check missing from the endpoint the spec forgot. The error swallowed two layers below the happy path. Catching those takes judgment. Judgment from models comes with the failure modes from Stop Running the SDLC on Models That Aren't Human attached: sycophancy (F2), hallucinated findings (F4), and a mysterious tendency to stop at fifteen findings regardless of how many exist (F6).

So the lifecycle's review phase isn't review. It's prosecution: fresh contexts chartered to refute, with the burden of proof on the finding. Just like a courtroom, but for agents it is four mechanisms, each traceable to a flaw or an exploit.


1. Refute charters, not review charters#

"Review this code" is a request for the model's agreement bias to find a comfortable resting place. You get a paragraph of praise, two style nits, and "overall this looks solid!" (sycophancy with a rubric).

"Find what's wrong with this, and if you find nothing, say so" is a different machine. The same compliance bias that makes self-review worthless locks onto the refutation charter and becomes relentless (E2: the flaw-as-feature). You're not fighting the model's eagerness to please; you're aiming it at the artifact instead of at yourself.

It's worth naming this as a general technique: an agent asked "any problems with this plan?" says no; an agent told "this project failed three months from now, write the postmortem" invents concrete, checkable risks. Don't ask models to evaluate. Ask them to explain a failure you've stipulated.


2. One lens per context#

Prosecution fans out: parallel reviewers, each owning exactly one dimension: correctness, security, contract conformance, the spec-vs-implementation diff, and one reviewer dedicated to auditing the tests the builder added during the build. (Builder-written unit tests are allowed, per Tests Are the Spec in the Only Language the Builder Can't Argue With; they're just work product, and work product gets prosecuted.)

Why one lens each? Context rot in miniature (F3): a reviewer with five concerns has the judgment of none. Asking one context to simultaneously hold security posture and off-by-one vigilance and contract conformance dilutes all three. Five narrow prosecutors beat one broad one, and they run in parallel anyway, so the wall-clock is identical.

Crucially, the prosecutors are fresh contexts that never see the builder's transcript, only the diff, the spec, and the contracts. The fresh context is the active ingredient (E4): a context that watched the code being written inherits the builder's assumptions, including the wrong ones. The critic's value is precisely its ignorance of how the sausage was made.


3. Findings are claims: reproduce or kill#

Here is the dominant failure mode of naive creator/critic loops, and almost everyone running them hits it within the first week: the critic hallucinates findings, and the builder dutifully fixes them.

F4 cuts both ways. A model chartered to find problems will find problems, including ones that don't exist, reported with exactly the confidence of the real ones. Forward unverified findings to the builder and it churns real code to address fake issues: refactoring around a race condition that can't occur, adding null checks for a state that's unreachable, or "fixing" an API misuse that was correct. Each fake fix is a fresh opportunity for a real bug, and the loop diverges instead of converging.

So every finding goes through a verifier stage before anything acts on it: a separate agent whose only job is to reproduce the finding or kill it. Write the failing test that demonstrates the bug. Trace the actual code path. Produce the input that triggers the race. No reproduction, no forward; the finding dies in triage. Evidence or it didn't happen applies symmetrically: it governs the critics exactly as it governs the builder.


4. Loop until dry#

A single-pass review systematically undercounts, and the undercount is weirdly consistent: models converge on ten to twenty findings and stop, regardless of how many exist (F6). That number is a training prior, not a measurement of your code. It's the answer to "about how many findings does a code review have?" rather than "how many problems are in this diff?"

The defense should exploit sampling diversity (E1): re-run the fan-out with fresh contexts, and the new pass surfaces findings the first one didn't (different samples, different blind spots). Repeat until two consecutive passes produce zero verified findings. Dry, twice - that's nice. That's the exit condition based on measurement rather than on the model's opinion of its own thoroughness.

With a budget cap (and this part matters) a prosecution loop that won't converge is information, not under-iteration. Hit the max-round budget without drying out and the correct move is to stop and escalate, because the loop is telling you the spec is contradictory or the partition is wrong. More iterations launder that signal into cost.

One demotion to record, because the branding around multi-vendor review oversells it: cross-model prosecution (a GPT-family model reviewing Claude-family work, or vice versa) is a real but third-order improvement. Different training data and different blind spots are worth having, but the active ingredients are fresh context and the refute charter, in that order. Teams blocked from multi-vendor access lose a few points of recall. Teams that cross-model with a shared context lose everything, because they kept the costume and discarded the mechanism.


Who reviews the reviewer?#

Everything above sounds rigorous. Here's the embarrassing question: what's your review stack's recall?

Not vibes: the number. Of the real bugs in a typical diff, what fraction does your prosecution stack actually catch? Which categories does it miss? Did last month's model upgrade silently change the answer?

Nearly every team running agentic review today trusts the stack blind. Findings come back, findings look plausible, findings get fixed, everyone feels reviewed. Whether the stack catches 85% of real bugs or 40% is, for almost everyone, unknown and unasked. And recall varies silently with everything: per repo, per language, per charter wording, and per model version. A model upgrade that improves code generation can degrade review recall in specific categories, and nothing in your pipeline will tell you.

The fix is the same trick mutation testing plays on test suites, aimed one level up: plant known bugs, measure what comes back.

Even better, this can be done mechanically:

  1. Take a real merged diff from your repo's history (real code, real style, and real noise).
  2. Plant N realistic bugs in it: mechanical mutation operators plus LLM-authored subtle ones, spread across categories (off-by-one, auth bypass, race condition, contract violation, and error-swallowing).
  3. Run your full prosecution stack against the planted diff, exactly as it runs in CI.
  4. Score recall and false-positive rate against the known plant list, per category.
  5. Exit 2 if recall falls below threshold.

The output turns "I do adversarial review" from a vibe into a number, and the per-category breakdown is where the action is. Low recall on races means you add a dedicated concurrency lens. Low recall on auth means the security charter needs sharpening. Re-run on every model change, and the silent regressions everyone currently absorbs unknowingly become diffs in a dashboard.

What planted bugs look like#

Calibration is only as honest as its plants. Mechanical mutants are necessary but not sufficient; a prosecution stack can learn to catch operator-swaps while staying blind to semantic rot. The subtle tier is LLM-authored: single-line edits, plausible at a glance, each producing a real behavioral bug. Here's the kind of plant that tier produces — a one-line edit to the hash-chain verifier in a provenance tool:

- if (entry.prev !== expected) {
+ if (entry.prev && entry.prev !== expected) {

One added truthiness guard. The chain verifier now silently skips verification for any entry missing its prev link, which means an attacker (or a confused agent) can break the evidence chain by omitting a field, and the verifier reports the chain valid. It survives casual reading because x && x !== y is a common defensive idiom; here the "defense" is the bug. Plants in the same vein: a Math.max over conflict signals demoted to Math.min (the forecast now reports the least alarming signal), a global regex flag dropped from a dedupe normalizer (only the first match normalized, duplicates leak through), !== weakened to < in a length comparison (reordered-but-same-length changes pass as identical).

Every one of these is one line. Every one is a real bug. A review stack's recall against this tier is the honest measure of what it would catch in your next real diff, and the first calibration run tends to be a humbling experience.

The meta-point: this is measurement replacing trust, the same move the whole lifecycle keeps making. Don't ask the reviewer if it's thorough (introspection is the thing models are worst at). Plant bugs and count (measurement is the thing arithmetic is best at). The calibration score even travels with the verdicts it qualifies: a review verdict means more when it carries the measured recall of the stack that produced it, so the score goes into the merge's evidence manifest alongside the test hashes and the rails-diff proof.


The gate#

Prosecution's exit gate, in full: Gate: zero verified open findings, two consecutive dry passes, rails still green, and the rails diff is empty, with that last item being the mechanical proof, promised in Tests Are the Spec in the Only Language the Builder Can't Argue With, that the builder never touched its own gates.

What passes through this gate has been built inside frozen rails, prosecuted by calibrated fresh-context critics until dry, with every finding reproduced or killed. That's one ticket. One lane.

The obvious next question is throughput: if one agent inside this structure is reliable, why not five at once? Because parallel agents that share state produce merge hell, contract drift, and integration bugs unless the partition is clean, and partition quality turns out to be measurable before you pay for the fan-out. Parallelism has exactly three dials, and the central fact about them is that they're not independent.


Part 5Three Dials: Parallel Agents Without Merge Hell

Read this part on its own page ↗

The series so far has built one reliable lane: spec interrogated (Two Human Gates and Everything Between Is Machine-Checked), rails frozen (Tests Are the Spec in the Only Language the Builder Can't Argue With), build prosecuted until dry (Prosecution, Not Code Review). This post is about running several lanes at once, which is where most multi-agent setups quietly catch fire.

Agentic parallel development has exactly three dials:

  • Cost - which models
  • Wall Clock - how wide to fan out
  • Accuracy - context and contract quality

The central fact about them is that they are not independent. Parallelism trades cost for wall-clock at constant accuracy only when the partition is clean. With a bad partition, it trades cost for negative accuracy: contract drift, merge hell, and integration bugs that surface days later. So the orchestration problem is, underneath, a partitioning problem, and most of this post is about making partition quality measurable before you pay for the fan-out.


First decision: lanes, not a boss agent#

The most common orchestration architecture is also the worst one: a frontier model "deciding what to do next." A model-as-scheduler is the most expensive, least reproducible scheduler ever built, and its context rots like any other (F3): by hour three it's scheduling based on a stale mental model of work it dispatched in hour one.

The rule: control flow is code; judgment is models. The orchestrator is a deterministic script (loops, DAG scheduling, gate checks) that spawns models where judgment is needed and never consults one about sequencing.

ORCHESTRATOR: deterministic script(no model; topological scheduler) CONTRACT DESK(frontier; pins DAG-edge contracts) BUILDER POOL(tier per ticket; 1 writer per partition) PROSECUTION POOL(shared; fresh ctx; calibrated) INTEGRATOR LANE(cheap + deterministic; sequential merge/rebase pipeline)
Click diagram to zoom

Notes on the lanes:

  • Prosecutors are pooled, not paired with builders. The fresh-context requirement (E4) means a prosecutor gains nothing from familiarity with "its" builder; dedicated pairing just buys idle time.
  • Builders are single-writer per partition. Parallel construction on shared state produces merge hell; parallel search (bug hunting, design alternatives) is where parallelism is nearly free (E1). Most multi-agent disasters come from confusing these two: applying search-parallelism to a construction problem and paying for it in coordination noise.
  • The integrator lane is sequential by necessity (merges serialize). It is the system's bottleneck, and that turns out to determine everything about fan-out width.
  • The contract desk gets the frontier model, always. Explanation below.

The cost dial: route by escape cost, by ladder, and by float#

Two Human Gates and Everything Between Is Machine-Checked introduced the principle: model tier is a function of the cost of detecting an error, not of task prestige. Made mechanical, it has three parts.

Route by rail density#

For each ticket, compute how much of its output is deterministically checked: test coverage over its declared file scope, type strictness, and contract tests on its DAG edges. High rail density → errors are caught instantly and regeneration is cheap → the cheapest model that clears the gates is correct. Low rail density (contracts, migrations, anything uncovered) → errors are expensive to find → frontier. The routing quantity is expected cost of an escaped error: the probability an error survives all gates, times the blast radius.

Escalation ladders, not static assignment.#

Start cheap; on gate failure, regenerate one tier up with the failure appended to the ticket as known-dead-ends. Never continue the failed context at the higher tier; escalation is regeneration, not rescue (F8). Toy math: a cheap model at 0.1 cost-units with 60% first-pass rate, a mid model at 1.0 with 90%, so the ladder expects ≈0.55 units versus 1.0 for always-mid, a ~45% cut. The ladder costs latency on the failures, though, which produces the one genuinely non-obvious routing rule:

Route by DAG float.#

Critical-path method, applied to model selection. Every ticket has float (slack before it blocks anything downstream). Tickets on the critical path skip the ladder and go straight to the highest-first-pass tier, because a retry there delays the entire delivery. Tickets with float greater than expected retry latency ride the ladder, because their retries are absorbed by slack and cost nothing in wall-clock. Same ticket content, different correct model: position in the graph, not the prestige of the work, decides.

And the priors come from records, not vibes: every gate in the lifecycle should log model × ticket-category × first-pass outcome into the merge's evidence manifest. That ledger is the routing table: per-repo, empirical, and self-tuning. Which means routing stops being a judgment call and becomes a gate - a pure function that reads the ticket (scope, DAG position, rail density) plus the manifest history and emits {model, mode: ladder|direct, budget}, no model in the loop. The case worth savoring is the one where the gate refuses to answer: a ticket whose rail density is below the floor for any cheap tier isn't a routing problem; it's an under-railed ticket wearing a routing costume, and that's worth knowing before spending either way.


The wall-clock dial: forecast conflicts, derive the width#

The schedule is a Directed Acyclic Graph (DAG), not a list. Decomposition's output is tickets plus edges, every edge carrying an explicit contract. Scheduling is then topological: all ready nodes run concurrently, and completion events (not phase barriers) trigger the next dispatch. Barrier waves ("finish all of phase 2, then start phase 3") waste exactly the idle time the slowest ticket imposes on the fastest.

Predict conflicts; don't resolve them.#

Two parallel tickets that touch the same file were never parallel; they were a merge conflict scheduled in advance. Conflict probability per ticket pair is computable before any agent runs, from four signals:

  1. Declared file-scope overlap: hard veto; overlapping scopes serialize, no model needed.
  2. Import-graph radius: A writes files B's scope imports → elevated risk; pin the shared interface first.
  3. Historical co-change coupling: files that co-commit frequently are logically coupled even when the import graph says otherwise. Mined once from git history, refreshed each cycle.
  4. Namespace collisions: the class file analysis can't see two branches with zero shared files that still break the merged build. These include route-segment conflicts (Next.js forbids [pk] and [voteKey] at the same path level), duplicate exported symbol names, or colliding migration sequence numbers. Field-verified failure mode: the scope overlap was clean, and the merge still burned hours. Forecast these by diffing declared namespaces (routes, exports, migration ids) rather than just declared files.

All four are computable before any agent runs, which makes the forecast a gate of its own: run the signals, validate the DAG, compute per-ticket float (the same float the routing gate wants), and emit a dispatch schedule plus a width recommendation. The failure case equals a partition unsafe at the requested width, with offending pairs named. Twenty-minute builds over four-minute integrations ≈ width 5. The folklore number is this ratio for typical ticket sizes, observed without being derived. Corollary worth more than the formula: you raise useful width by making integration faster (using build caching, cheap re-green suites, or smaller rebase surfaces) rather than spawning more builders into a queue.

Speculative execution, where "pinned" means merged.#

Dependency edges don't have to serialize work, only truth. But field experience sharpens what "pinned contract" must mean: a contract floating in a plan doc is not pinned. A contract is pinned when the foundation is built first and merged to main (schema, shared types, query functions) before the fan-out, with those foundation paths auto-appended to every parallel ticket's frozen rails. Builders consume the foundation; they never reinterpret it. (Read the actual query function before writing consumer code; never guess property names.) With the foundation merged, downstream tickets build against it while upstream features build in parallel, exactly like issuing instructions against a register promise. If upstream must break the contract, downstream regenerates, and regeneration is cheap (E5). This recovers most of the parallelism the DAG appears to forbid. This is why the contract desk gets the frontier model: contract stability is what makes the whole speculative schedule solvent.


The accuracy dial: measure ambiguity, don't introspect it#

The interrogation phase (Two Human Gates and Everything Between Is Machine-Checked) has a structural weakness it shares with every "ask me clarifying questions" pattern: it asks the model to know what it doesn't know, which is the exact metacognition LLMs are the worst at. One interrogator in one rotting context, question quality unmeasured, fired once at the start, is blind to the place parallel accuracy actually dies: the edges between tickets.

Measurement replaces introspection, using the same property that powers everything else in this lifecycle (E1, sampling diversity as an instrument):

  1. Fan: give the raw request to N cheap agents in fresh contexts (3-5): "write the spec you would execute." No questions allowed; force each to commit to a reading.
  2. Diff the readings. Where all N agree, the request is demonstrably unambiguous: ask the human nothing there. Where they diverge, that divergence is a measured ambiguity, pre-shaped as a question: "Your request has three live readings: A, B, C. Which did you mean?"
  3. Fold and re-fan until divergence drops below threshold.
  4. Exit on convergence, not confidence. The residual divergence is a number (the spec's ambiguity score) and downstream gates can gate on it.

Every question is provably load-bearing (it exists only because it changes what would be built). Multiple-choice beats open-ended for the human: picking reading B takes five seconds. And the agreement set is free spec: everything all N readings shared becomes the draft body, needing a skim instead of authorship.

Two extensions aim squarely at parallel work:

  • Edge interrogation: run the same fan per DAG edge, where N agents independently author the interface implied by the two adjacent tickets. Divergence there is contract ambiguity, the precise quantity that breaks speculation and poisons merges. A converged edge is what licenses speculative execution on it.
  • The ambiguity router: when a builder hits a question mid-flight, fan three cheap agents on it before any human sees it. If they agree, it was confusion, not ambiguity: answer mechanically with zero interrupts. If they diverge, it's real, and the human gets it as multiple choice. In a 5-wide run this is the difference between the human as interrupt-driven bottleneck and the human as occasional adjudicator.

Field notes, so you don't rediscover them#

Hard-won, encoded here rather than re-learned:

  • Preflight permissions. Before any fan-out, dry-run every operation class the fleet will use (git, worktree add/remove, build commands, agent spawn) so approval prompts front-load into one batch. A permission prompt mid-flight is a hidden serialization point: one blocked agent × N teammates = N stalls. (Batch them into a single preflight pass before the fan-out.)
  • In-flight validators are a different organ than prosecution. A validator paired with a long-running builder, reviewing as the work happens, catches drift hours before the gate, and does not replace prosecution, which still runs fresh-context at the gate. Build gate proves it compiles; prosecution proves it does what the ticket asked. Different questions, both mandatory.
  • Pull, don't push. Idle builders claim the next unblocked ticket from a shared queue (work-stealing) instead of receiving static assignments, absorbing the duration variance that static assignment converts into idle time. Sizing heuristic: 2-3 tickets per builder.
  • Integrator craft: merge order is foundation → shared packages → apps, first-done-first-merged within a tier. After a squash-merge to main, never git rebase main (it replays pre-squash commits): cherry-pick your unique commits onto a fresh branch. And disable formatter hooks during conflict resolution, then grep for stale conflict markers: formatters mangle <<<<<<< blocks into syntactically valid garbage.

The dials, set#

KnobDefaultOverride when
Fan-out widthmin(forecast-certified, build÷merge ratio), typically 3-5Integration made faster → raise
Ticket size~1 useful context windowHigh integration overhead → bigger; low rail density → smaller
Builder modelladder if float > retry latency, else direct best-tierNo manifest history yet → mid-tier direct, collect priors
Contract deskFrontier, alwaysNever, because contract stability funds the speculative schedule
Prosecutor poolMid-tier, calibrated, sharedCalibration shows a category blind spot → add a frontier lens there
SpeculationOn, for any parallax-converged edgeEdge ambiguity above threshold → serialize that edge

Notice what this table is: the three dials, each set by a measurement (forecast, float, calibration, ambiguity score) rather than by anyone's intuition. That's the through-line of the whole post. Orchestration intuitions ("about four agents feels right," "give the hard ticket to the big model") keep turning out to be shadows of computable quantities, and the computation is always cheaper than one bad merge.

Everything so far makes a single run reliable and parallel. None of it yet explains why run fifty should be cheaper than run five: why the lifecycle compounds instead of just repeating. That's the phase everyone skips, and the tools that make skipping it visible on a dashboard.


Part 6The Lifecycle That Gets Cheaper Every Run

Read this part on its own page ↗

Five posts in, the lifecycle can take a request from interrogated spec (Two Human Gates and Everything Between Is Machine-Checked) through frozen rails (Tests Are the Spec in the Only Language the Builder Can't Argue With), prosecution-until-dry (Prosecution, Not Code Review), and a calibrated parallel fan-out (Three Dials: Parallel Agents Without Merge Hell) to a merged, verified change. Run it again next sprint and it works again.

That's not good enough, and this post is about why. A lifecycle that merely repeats leaves the defining economic property of agents on the table. Human teams compound by default: people remember, develop taste, and stop making last quarter's mistakes. Agents remember nothing. Every run starts from zero unless something deliberately carries the lessons forward. Skip that something and you get the signature cost curve of most agent adoption: spend flat run-over-run, the codebase bloating quarter over quarter, and the same bug categories found (and paid for) every sprint.

The something is Phase 7, Distill, and it has two halves.


Half one: simplify (the architectural review moved to where the information is)#

Models are verbose and duplicative; they reinvent what exists three files away (F7 from Stop Running the SDLC on Models That Aren't Human). Agent-generated codebases tend to run meaningfully fatter than necessary, and the fat compounds: every future agent pays input tokens to read it, and every future context window carries it as noise.

The counterintuitive move is when to fight this. The instinct (enforcing "Don't Repeat Yourself" (DRY) at authoring time and policing duplication in review) is wrong for agents, for a reason worth spelling out: deduplicating before the code exists is speculative. You're guarding against duplication that sampling non-determinism may never materialize where you guarded. Deduplicating after merge is mechanical: the duplication is sitting there, findable by analysis, removable under tests.

So the simplify pass runs post-merge, under the now-green, still-frozen rails: dedupe, extract shared utilities, clarity-over-cleverness rewrites, dead-code removal. The rails define "behavior preserved," which is what lets a cheap-to-mid model do this safely: the rails carry the risk, exactly as designed in Tests Are the Spec in the Only Language the Builder Can't Argue With.

Notice what this is, in classical terms: the architecture/design review, moved from the front of the lifecycle to the back. The SDLC put design review up front because rework was expensive: you reviewed the blueprint because rebuilding the house was ruinous. Rework is now nearly free (E5), but information still accumulates: after the merge you know what was actually built, what actually got duplicated, and which abstractions actually repeat. Review with full information beats review with speculation, so the expensive analysis moves to where the information lives. Same review, better-informed, and the lifecycle's economics are what made the move legal.


Half two: mine (every lesson paid for exactly once)#

Here's the diagnostic that motivates everything in this half: if your prosecution spend trends up over time, your lifecycle is re-buying the same lessons. The same error-swallowing pattern found in March, April, and May. The same missing-auth-check caught (at LLM-review prices) every single sprint. Each catch costs dollars of fan-out and verification. The lesson was learned, billed, and thrown away, three times.

A lesson foundry is the ratchet that stops this. Verified prosecution findings accumulate in a JSONL ledger across runs (the adversarial-review output feeds it naturally). The foundry clusters recurrences, then routes each cluster to its cheapest permanent defense:

  • Deterministic-able → author a lint rule or grep gate, with a test for the rule, PR'd like any other code. The recurring finding is now caught at CI speed, for free, forever.
  • Contextual (can't be a lint, needs judgment) → emit a skill candidate into the skill-mining pipeline, so every future builder loads the convention before writing code.
  • Spec-gap (the bug existed because nobody asked) → append a question to the interrogation template, so the spec phase asks it on every future feature, forever.

Every defense gets fresh-context validation before landing: the foundry's output is code and gets prosecuted like code. The effect is a one-way ratchet: each lesson is paid for exactly once, then demoted from probabilistic detection (LLM review, ~dollars per catch) to deterministic detection (lint, ~free forever). Run by run, the prosecution fan-out finds less because the lint layer catches more, which is precisely the cost curve bending down.

This generalizes past code, in two directions worth naming. Mine the harness: orchestration bugs (the plan-approval loop that livelocks spawned agents, the formatter that mangles conflict markers) are lessons too; the ledger carries them alongside code findings. And mine the institution: a rejection-mining pass scans historical PR review threads, declined PRs, and security/platform rejection docs; clusters each gatekeeper's recurring "no"s; and compiles them into prosecution lenses and pre-flight checklists. "Would security reject this?" becomes a question answered in seconds pre-submit instead of in days post-queue. Every recurring institutional objection becomes a gate the work passes before it reaches the institution.


The cache problem: knowledge rots#

Banked knowledge has a failure mode of its own, and it's nastier than having no knowledge at all.

Every artifact a future agent reads is a cache, and caches need invalidation. Skills, spec templates, memory files, and conventions docs all go stale as the codebase moves. And a stale skill is worse than no skill: it's misinformation delivered with the voice of authority, loaded automatically into every future agent's context, asserting that the command is npm run deploy when the script was renamed two months ago. The agent trusts it (that's what the knowledge layer is for) and confidently does the wrong thing.

Skill-rot detection is the invalidation sweep: for each skill file, extract its verifiable claims (commands, file paths, package versions, and API names) and check them against the current repo, mechanically where possible, cheap-model where not. Stamp last-verified on what passes; exit 2 with the stale list. Weekly, in CI, like any other freshness check. The mining half of Phase 7 re-mines idempotently for the same reason: refresh what drifted, delete what died.


The free re-audit nobody runs#

One more compounding mechanism, this one exploiting the outside world's progress instead of your own.

Everyone reviews new code with the current model and never looks back. But every frontier model release is a free re-audit of your existing codebase: the new model finds what the old one missed, and the old one did miss things (Prosecution, Not Code Review measured exactly how much). The code merged under last year's review stack carries last year's escape rate, sitting there, waiting.

A model ratchet schedules the harvest: on model release (or monthly), re-run the prosecution fan-out over main's hot paths (ranked by churn × criticality) with the newest models. Verified findings become tickets and feed the lesson-foundry like any other. Codebase quality ratchets monotonically with the frontier, for the cost of a cron job. Pairs naturally with calibration: measure the new model's recall first (Prosecution, Not Code Review), then aim it at the backlog.


The unit of account#

None of the above shows up on the metric most orgs actually track, which is why most orgs skip Phase 7. So, the economics, stated bluntly:

The unit of account is cost per merged, verified change, not tokens per developer per month. Token-efficiency improvements that lower merge quality are losses wearing savings costumes. And token quotas as cost control cap the wrong variable entirely: a quota-pressured developer cuts the prosecution phase first, because it's the most visible spend, and it's also the most valuable spend in the system. Govern cost-per-merged-verified-change; let the gates, not the wallet, end loops. (For scale: $1k/week of agent spend annualizes to roughly 15% of a senior engineer's loaded cost, a real fraction, but the wrong variable to cap. The right question was never "is $1k too much?" It's "did it merge, verified?")

With the right unit of account, the spend shape becomes a diagnostic instrument. Four readings:

  • Spend concentrated in the build phase → the team is re-exploring the codebase every run: missing skills, oversized tickets, or no distill phase. Build should be the cheap part (that's what the barbell (Two Human Gates and Everything Between Is Machine-Checked) means).
  • Prosecution spend trending up → the foundry isn't converting findings into lints and skills; you're re-buying lessons. This is the most common broken loop and the easiest to confirm: look for the same finding category in three consecutive runs' ledgers.
  • Prosecution loops hitting max iterations → specs are underdetermined. The problem is in interrogation; the bill shows up in prosecution. Fix the phase upstream of where the symptom presents.
  • Spend flat run-over-run → the compounding loop is broken somewhere. The entire point of Phase 7 is that this curve bends down; flat is a failure signal, not a steady state.

That last one deserves its own sentence, because it's the thesis of this post: flat cost is failure. A healthy agentic lifecycle gets measurably cheaper per change as the skill library grows, the lint layer thickens, the interrogation templates accumulate questions, and the routing priors converge. Run N+1 cheaper than run N, not as aspiration, as the observable output of a working ratchet, visible in the ledger.


The full loop, closed#

Trace one finding all the way around, because this single trip is the whole argument: a prosecutor catches an error-swallowing pattern (Phase 5, dollars). The verifier reproduces it; the builder fixes it; it merges (Phase 6). The foundry clusters it with two prior occurrences and authors a lint rule with a test (Phase 7, dollars, once). Next sprint, a builder introduces the same pattern, and CI catches it in milliseconds, for free, before prosecution ever runs. The sprint after that, the interrogation template asks about error handling up front, and the pattern never gets written at all.

Detection migrated from expensive-and-probabilistic to free-and-deterministic to prevented-by-specification. That migration, repeated across every recurring lesson, is what "the lifecycle compounds" means mechanically. Capability is migrating from the model tier into the artifact layer (skills, lints, templates, and priors) where it compounds instead of being re-billed per token.

Which sets up the last post's question. If capability lives in the artifact layer, how much model do you actually need? Make every gate in this series runnable and the whole lifecycle could hold to a deliberate constraint: mid-tier models everywhere the rails are dense, and frontier only where errors escape detection. The doctrine generalizes, it has a one-line form ("you never need a model smarter than the gate it must pass"), and it comes with an honest account of what you give up.


Part 7The ADLC Toolkit

Read this part on its own page ↗

Six walls of text and no code - that doesn't sound like @voodootikigod. I introduce to you the ADLC toolkit that enforces this lifecycle was built by the lifecycle (eighteen tools, constructed in parallel by a deterministic workflow script that pipelined each one through build → prosecute → fix, exactly the loop [Two Human Gates and Everything Between Is Machine-Checked through Three Dials: Parallel Agents Without Merge Hell] describe).

The shape of that run, briefly, because it's the whole series in miniature. A frozen contract came first: a small shared core (@adlc/core, which manages LLM calls, git plumbing, CLI conventions, and the findings ledger) was built, tested, and merged before any fan-out, then appended read-only to every tool's ticket. This applied the "pinned means merged" rule from Three Dials: Parallel Agents Without Merge Hell literally. Each tool then got a fresh builder agent with its ticket and rails; each build was prosecuted by fresh-context reviewers; verified findings looped back as fix tickets; and the orchestrator was a workflow script (control flow as code, judgment as spawned models, and no boss agent deciding what happens next). The tools came out the other end zero-dependency, npx-runnable, with deterministic exit codes (0 = pass, 2 = gate fails) so every one can sit in CI.

And then the toolkit was aimed at itself: review-calibration planted bugs in the toolkit's own diffs to measure whether the prosecution stack that built it would catch them, including a one-line truthiness guard in gate-manifest's own hash-chain verifier that made the provenance tool silently skip verification (Prosecution, Not Code Review shows the diff). Dogfooding doesn't get more circular than calibrating the reviewer against the tool that proves the reviewer ran.


The toolkit, by phase#

Every tool earns its place the same way every phase did: it traces to a model flaw defended or a model property exploited (Stop Running the SDLC on Models That Aren't Human). Same DNA throughout: small, fresh contexts by construction, gate-shaped.

Specify

ToolWhat it gates
spec-lintEvery acceptance criterion must name its verification method. Criteria without one are wishes: exit 2 lists the wishes
premortemFresh frontier context, one charter: "this project failed three months from now; write the postmortem." Inverted sycophancy as a stress test
parallaxMeasured ambiguity: N independent readings of the request, divergence becomes multiple-choice questions, convergence becomes a score the gate can check
coldstartEach ticket to a cheap fresh model: "list everything missing to execute this." Non-empty list = underspecified ticket, exit 2

Rail + Build

ToolWhat it gates
rails-guardMechanical rail freeze: blocks builder edits to test/contract/CI paths, emits the rails-diff-empty proof, greps for new skip/suppress markers
hollow-testDiff-scoped mutation testing: surviving mutants are proof of hollow coverage. The honest replacement for coverage %
preflightEnvironment determinism before fan-out: dry-run every operation class the fleet will use, front-load the permission prompts
merge-forecastPartition safety: pairwise conflict scoring (file scope, import radius, co-change history, namespace collisions), float computation, width recommendation
model-routerTier by escape cost and DAG float, priors from the manifest ledger. Ladder when float absorbs retries, direct when on the critical path
flail-detectorThe two-strike rule, mechanized: detects loop signatures, kills the session, appends dead-ends to the ticket, regenerates fresh. Second strike escalates to decomposition
consensus-fixN-version programming for hard bugs: fan N fresh agents at the failing test, agreement = confidence, divergence = the spec is ambiguous about something load-bearing

Prosecute

ToolWhat it gates
review-calibrationPlanted-bug recall of the whole review stack, per category. Turns "I do adversarial review" into a number; re-run on every model change

(The prosecution loop itself runs on adversarial-review, which predates this toolkit: fresh-context cross-model review with deterministic exit codes for CI.)

Integrate

ToolWhat it gates
behavior-diffDiff in behavior space, not code space: API responses, rendered routes, CLI outputs, before vs. after. The 5,000-line code diff becomes six human-readable behavioral items
gate-manifestThe evidence chain: every gate appends a signed entry (test hashes, rails-diff proof, prosecution verdicts with the calibration score that qualifies them, models used, and spend per phase). A merge ships with its provenance

Distill

ToolWhat it gates
lesson-foundryThe ratchet: clusters recurring verified findings, routes each to its cheapest permanent defense (lint rule, skill candidate, or interrogation question)
skill-rotCache invalidation for knowledge: extracts each skill's verifiable claims, checks them against the repo, stamps last-verified, exits 2 with the stale list
model-ratchetThe free re-audit: on model release, re-prosecute main's hot paths with the newest models; verified findings feed the foundry
rejection-miningMines gatekeepers' recorded "no"s from PR history and rejection docs into prosecution lenses and pre-flight checklists

(skill-mining, also predating the toolkit, handles the skill-extraction half of Distill.)


The frontier-free doctrine#

Here's the constraint that shaped the toolkit and turns out to be a doctrine: the lifecycle must hit its accuracy targets with mid-tier models (Opus, Sonnet, Haiku-class, no frontier-of-frontier access). Not as a degraded mode but as the design center. (It's also the common enterprise reality: approved-model lists, quota ceilings, procurement lag.)

The premise: the gap between a mid model and a frontier model is almost entirely a gap in single-pass judgment (depth of insight per forward pass, coherent horizon, and knowing-what-it-doesn't-know). The doctrine: at every point where the lifecycle appears to need single-pass judgment, buy the same outcome with structure instead. Five substitutions:

  1. The generator-verifier gap is the engine. Recognizing a correct artifact is easier than producing one; checking one deterministically is easier still. Generate wide and cheap, verify deterministically, select with a mid model. The quality of output decouples from the generator and couples to the verifier; this lifecycle's verifiers are tests, types, contracts, and hash chains: model-free. You never need a model smarter than the gate it must pass. That's the doctrine in one line.
  2. Search replaces insight. What a frontier model produces in one pass, a mid model produces as the best of N diverse attempts: judge panels for design, consensus for hard bugs, or loop-until-dry for review breadth. And review-calibration makes the exchange rate a number: if a 3-pass mid-tier prosecution stack shows 0.85 planted-bug recall and a single frontier pass shows 0.6, the stack is the more capable reviewer. Measure the stack, never the model.
  3. Decomposition replaces horizon. Ticket size is tier-indexed: a cheap model that only ever sees a few thousand tokens of well-railed ticket is not operating below the frontier; it's operating below its own degradation point, which is the only line that matters.
  4. Banking replaces presence. Rent the big model occasionally to mint structure (contracts, skills, templates, lints) then spend mid-tier inside that structure indefinitely (The Lifecycle That Gets Cheaper Every Run is this substitution, run as a flywheel). Capability migrates from the model tier into the artifact layer, where it compounds instead of being re-billed per token.
  5. Measurement replaces metacognition. The capability mid models most lack is knowing what they don't know, so never ask. parallax swaps "do you have questions?" for divergence-of-N-readings; consensus-fix swaps "are you sure?" for agreement statistics; coldstart swaps "is this clear?" for an enumerated gap list. None need a smarter model. They need more samples and a division operation.

And the sixth substitution is the one this series opened with: humans are the frontier tier. The two human gates sit exactly where frontier judgment would otherwise go ("is this what I meant?" and "is this what I meant, running?") because the human is the ground truth for intent. The tooling (behavior-diff, the manifest, and parallax's multiple-choice questions) exists to compress what the human must absorb so the minutes stay minutes.

The honest loss account, because doctrines without one are marketing: you give up single-pass architectural elegance (mitigated by judge panels + premortem + the human at the spec gate, and the residue is real); subtle cross-cutting bug intuition (loop-until-dry raises recall asymptotically, model-ratchet schedules the deep read for whenever a stronger model ships); latency (N passes are slower than one brilliant pass, which is recovered by parallelism); and long-horizon refactors that resist decomposition (the genuinely hard residue: serialize them, best available model, densest rails, in-flight validator, and accept that ~5% of work runs at maximum supervision). Net: a capability shortfall converted into a compute-plus-process bill, with gates keeping the conversion honest. And when the constraint lifts, nothing is wasted: every mechanism here amplifies a frontier model exactly the way it amplifies a mid one.


Adoption: relief first, lifecycle later#

The field wisdom that outranks everything else here: teams do not adopt platonic lifecycles; they adopt relief from their worst pain point, then ask what else hurts. Sequencing for a real team:

  1. Prosecution of existing PRs (Prosecution, Not Code Review standalone). Highest pain: nobody wants to review the 5,000-liner. This requires zero workflow change, and trust gets built on verified findings the team can check themselves. Include finding-verification from day one: a single hallucinated finding wastes an hour of human time and burns a week of credibility.
  2. Rails (Tests Are the Spec in the Only Language the Builder Can't Argue With). "You hate writing tests? The agent writes them from the spec; you audit them once." This quietly installs the trust anchor everything else hangs on.
  3. Interrogation (Two Human Gates and Everything Between Is Machine-Checked). Once the team has watched agents miss implicit requirements a few times, the case for spec interrogation makes itself.
  4. Full loop with parallelism (Three Dials: Parallel Agents Without Merge Hell) and distillation (The Lifecycle That Gets Cheaper Every Run). Last, because worktree fan-out and the compounding flywheel only pay off once 1-3 are habits.

The anti-pattern is mandating the full lifecycle org-wide on day one. The ceremony overhead lands before the compounding gains do, quota anxiety kicks in, and the org concludes "agents don't work here," which, as Stop Running the SDLC on Models That Aren't Human argued, is the conclusion of teams that pointed human-shaped process at a non-human failure profile. Don't hand them a second wrong-shaped process at higher ceremony.


The through-line#

Seven posts, one move, made over and over: replace trust with structure, and structure with measurement.

Don't trust the builder's claim: gate it with a test it cannot edit. Don't trust the reviewer's thoroughness: plant bugs and count. Don't trust the model's questions: fan out readings and diff them. Don't trust the partition: forecast the conflicts before paying for the fan-out. Don't trust the knowledge layer: verify its claims weekly and stamp the date. Don't trust the org's memory: cluster the findings and compile them into lints. And don't trust the lifecycle itself: give it a unit of account (cost per merged, verified change) and check that the curve bends down.

None of it requires smarter models. All of it gets better with smarter models: every mechanism amplifies whatever you run through it, which is what makes it a lifecycle rather than a workaround. The SDLC took sixty years to accrete its defenses against human nature. I get to build the agentic one deliberately, from a flaw inventory, in public, with gates that prove themselves in CI.

The doctrine is one document and the tools are one repo: github.com/voodootikigod/adlc. The shared core is @adlc/core; every gate is zero-dependency and npx-runnable without a global install. Run npx coldstart on your next ticket, or npx review-calibration against your current review stack; the first calibration number is reliably humbling, and it's the right place to start. Everything here traces to a flaw or an exploit; if you find a phase that doesn't, cut it, and if you find a flaw without a phase, that's the next tool.


Part 8ADLC vs. the Enterprise SDLC

Read this part on its own page ↗

The first seven posts make a deliberately sharp claim: do not run a human-shaped software development lifecycle on non-human builders. Models fail differently, so the lifecycle around them has to be built from a different flaw inventory.

That does not mean the enterprise SDLC was stupid. It means it was optimized for a different machine.

The traditional enterprise software development lifecycle is a control system around human teams operating in expensive coordination environments. Its rituals are not random: intake, requirements, design review, implementation, QA, security review, change approval, release management, incident review. Each exists because at enterprise scale, software is not just code. It is risk transfer, budget allocation, auditability, ownership, compliance, support, and organizational memory.

The Agentic Development Lifecycle does not remove those needs. It changes where the expensive human attention goes, what gets automated, and which artifacts become load-bearing.

So the useful question is not "ADLC or SDLC?" It is: which parts of the SDLC are still defending real enterprise risk, and which parts are only compensating for human labor constraints that agents no longer have?

The boundary looks like this:

Enterprise SDLC: organizational governance ADLC: agentic production core Intake and prioritization Risk, compliance, and ownership Release, change management, and support Interrogate and approve spec Write and freeze rails Agent build Fresh-context prosecution Behavioral acceptance Distill lessons
Click diagram to zoom

The traditional enterprise SDLC, stated generously#

A good enterprise SDLC does five jobs.

It clarifies intent. Product requirements, architecture documents, design reviews, and stakeholder sign-off exist because "build the thing" is almost never enough. Enterprises carry latent requirements: regulatory rules, support obligations, data retention, localization, procurement constraints, SSO, accessibility, observability, rollback, and the one integration owned by a team two divisions away.

It allocates accountability. Human owners sign off because the organization needs someone answerable for decisions. A release that breaks revenue recognition or leaks customer data cannot be explained by "the workflow passed." Someone accepted the risk.

It controls change. CABs, release windows, QA gates, test plans, and deployment checklists are crude in the small and necessary in the large. They coordinate shared infrastructure and protect customers from surprise.

It preserves memory. Tickets, design docs, ADRs, runbooks, postmortems, and test plans are how a company remembers why the system is shaped the way it is after the humans rotate.

It satisfies external trust. Auditors, regulators, customers, insurers, and internal risk teams need evidence that controls exist and were followed. The artifact trail is part of the product.

That is the steelman. The SDLC is not just waterfall in a tie. It is an organizational risk machine.

But it is also a machine full of assumptions about human throughput. Humans are slow to build, expensive to rework, limited in parallelism, socially fragile in review, and bad at preserving state without process. The SDLC evolved around those constraints. Agents change enough of them that a direct transplant becomes wasteful at best and dangerous at worst.


Where ADLC diverges#

The enterprise SDLC usually treats implementation as the scarce center of the process. Requirements are negotiated up front because building the wrong thing is expensive. Design review happens before implementation because rework is expensive. QA trails implementation because humans need time to finish a coherent unit of work. Code review sits after the diff because another human has to inspect what the first human wrote.

ADLC inverts that cost structure.

Implementation is no longer the expensive center. Misbuilding is. The model can produce a large diff quickly, but it can also produce the wrong large diff quickly, backed by a confident self-report and a green suite it quietly weakened unless the rails were protected. So ADLC spends heavily at the edges: spec interrogation before build, prosecution after build, distillation after merge. The middle is deliberately cheap because the middle is heavily gated.

In an enterprise SDLC, tests usually verify the implementation. In ADLC, tests are the executable contract the builder is not allowed to edit. That is a different relationship. A human developer can be told "do not weaken the test" and mostly comply because reputation, shame, and review norms exist. A model under gate pressure has no such stabilizers. The control has to move from policy to mechanism: frozen rails, diff proofs, deterministic gates.

In an enterprise SDLC, review is often a social and architectural act: maintainers inspect code, transfer knowledge, enforce style, and catch defects. In ADLC, review becomes prosecution: fresh contexts, narrow lenses, refute charters, verified findings, loop-until-dry. Knowledge transfer cannot be assumed to happen by reading a pull request. It has to be mined into skills, lints, templates, and runbooks after the fact.

In an enterprise SDLC, human approval often appears at many points because the process cannot otherwise prove the intermediate state. ADLC tries to reduce mandatory human approval to the two moments where humans are actually the ground truth: "is this what I meant?" and "is this what I meant, running?" Everything between those moments should produce evidence, not requests for trust.

That is the core divergence: the SDLC distributes human judgment across the lifecycle; ADLC compresses human judgment around intent and behavior, then replaces intermediate trust with machine-checkable evidence.

Traditional enterprise SDLC Agentic Development Lifecycle Requirements Design review Human implementation QA Human code review Approval and release Human gate: intent Executable rails Agent implementation Machine gates Prosecution Human gate: running behavior Distill into controls templates, lints, skills, priors
Click diagram to zoom

The advantages of ADLC#

The first advantage is throughput under control. Agents can build, retry, review, and refactor faster than human teams can coordinate the same work. But raw speed is not the point. Raw speed without gates is just faster incident creation. ADLC's advantage is speed bounded by rails: small tasks, deterministic checks, fresh-context review, and evidence manifests that travel with the change.

The second advantage is review depth. Human review attention collapses on very large diffs. An enterprise can pretend otherwise, but the 5,000-line pull request is rarely read with meaningful recall. ADLC attacks that with parallel prosecution lenses and verification. It does not ask one tired reviewer to notice everything. It decomposes review into repeatable searches, measures recall with planted bugs, and reruns until the stack comes up dry.

The third advantage is economic compounding. A traditional SDLC improves when people learn, but that learning is lossy: people leave, teams reorganize, conventions drift, postmortems decay. ADLC has no implicit memory, so it must make memory explicit. Verified findings become lint rules, skill files, spec questions, prosecution lenses, and routing priors. If the distillation loop works, the same class of issue gets cheaper every time until it disappears into a deterministic gate.

The fourth advantage is better use of senior humans. Enterprise SDLCs often spend senior attention on diff reading, meeting attendance, status reconciliation, and late-stage re-explanation. ADLC spends that attention on spec approval, behavioral acceptance, escalation, and control design. That is a more honest use of scarce judgment.

The fifth advantage is auditability, if implemented seriously. A mature ADLC run can produce an evidence chain richer than a traditional ticket: spec hash, rail hash, rails-diff-empty proof, test results, mutation survivors, prosecution verdicts with calibration scores, model and tool versions, cost by phase, and human acceptance. That is not less governable than SDLC evidence. It is potentially more governable because it is produced by the workflow instead of reconstructed after the fact.


The disadvantages of ADLC#

The first disadvantage is ceremony density. ADLC is not "tell the agent to code and go to lunch." Done seriously, it has more mechanical gates than many human teams tolerate today. For trivial work, the full loop is too much. The lifecycle needs routing by risk and blast radius or it will become the same kind of process theater it criticizes.

The second disadvantage is tooling maturity. Enterprises already have SDLC infrastructure: Jira, GitHub, CI, SAST, change management, release approvals, audit exports. ADLC needs new control surfaces: rail freezing, review calibration, ambiguity measurement, model routing, lesson mining, skill invalidation, evidence manifests. Some can be approximated with existing tools. Some are new. Until the tooling is boring, adoption cost is real.

The third disadvantage is cultural legibility. A CAB understands a test report, a security scan, and named human approvers. It may not yet understand "two consecutive dry prosecution passes with 0.82 planted-bug recall." ADLC has to translate its evidence into enterprise control language or it will be treated as clever automation outside the official risk system.

The fourth disadvantage is uneven fit. ADLC is strongest where behavior can be specified, tested, and decomposed. It is weaker for open-ended product discovery, ambiguous UX taste, deep platform migrations, cross-system political negotiation, and architectural bets whose correctness is only visible months later. Those do not become impossible. They require heavier human gates, stronger design alternatives, longer-running validation, and sometimes the old-fashioned serialized senior engineer.

The fifth disadvantage is new failure modes. A bad SDLC wastes time. A bad ADLC can manufacture false confidence at scale. Frozen rails that encode the wrong spec, calibrated reviewers measured on unrealistic bug plants, stale skills loaded into every agent, model-routing priors trained on noisy history, evidence manifests nobody verifies: these are not hypothetical risks. Structure compounds good lessons, but it can also compound bad ones.


Where the two overlap#

The overlap is larger than the rhetoric suggests.

Both lifecycles need requirements. ADLC does not eliminate requirements; it makes them executable. The enterprise PRD becomes interrogated spec plus acceptance criteria plus verification methods.

Both need architecture. ADLC does not eliminate design review; it changes its timing and granularity. Contracts and shared foundations move up front because parallel work depends on them. Deduplication and simplification move after merge because the actual duplication is visible then.

Both need QA. ADLC does not eliminate testing; it makes tests more central. The difference is provenance: rails written from the spec and protected from the builder are not the same thing as tests added by the implementer after the code exists.

Both need security review. ADLC does not replace security with generic model review. It needs explicit security lenses, deterministic scanners, threat-model prompts, planted security bugs for calibration, and escalation to humans for high-risk surfaces.

Both need change management. ADLC still needs release windows, rollback plans, migrations, customer communication, and incident readiness. It can generate better evidence for those gates, but it does not make shared production risk disappear.

Both need human accountability. ADLC can reduce human toil, but it cannot make the model accountable. The human still owns intent, risk acceptance, and final behavioral approval. In regulated environments, that ownership has to remain explicit.

The clean integration pattern is not to replace the enterprise SDLC wholesale. It is to insert ADLC inside the build-and-review portion of the SDLC, then let its evidence feed the existing enterprise gates. The enterprise lifecycle still decides what work is allowed, who owns it, when it ships, and what risk posture is acceptable. ADLC decides how agent-built changes are specified, gated, prosecuted, integrated, and distilled.

Enterprise PRD / ticket Interrogated spec Acceptance criteria with verification methods Frozen tests, contracts, and CI rails Agent-built diff Gate evidence manifest Existing enterprise approval gates test results verified prosecution findings behavior diff models, tools, hashes, spend
Click diagram to zoom

The practical comparison#

DimensionTraditional enterprise SDLCAgentic Development Lifecycle
Primary failure profileHuman coordination, omission, fatigue, politics, knowledge lossPremature satisfaction, sycophancy, context rot, hallucination, reward hacking
Scarce resourceHuman implementation time and reviewer attentionCorrect specification, reliable gates, calibrated verification
Planning postureUp-front planning reduces expensive human reworkUp-front interrogation prevents cheap but rapid misbuilds
TestsVerify code and support regression confidenceEncode the spec and constrain the builder
ReviewHuman inspection, maintainership, knowledge sharingFresh-context prosecution with reproduced findings
ParallelismLimited by team coordination and merge disciplineLimited by partition quality, contracts, and integration throughput
MemoryPeople, docs, tickets, postmortemsLints, skills, templates, manifests, routing priors, ledgers
Human gatesMany approvals across the processTwo default intent gates, plus escalation and enterprise risk gates
Audit evidenceOften manually assembled from process artifactsGenerated continuously as gate evidence
Main riskSlow delivery and ritualized approvalFalse confidence from poorly designed automation

This table is the honest shape of the trade. ADLC is not "less process." It is a different process, with different controls, aimed at different failure modes.


The enterprise adoption path#

The wrong rollout is to announce that the SDLC is dead and replace it with an agentic lifecycle diagram. Enterprises reject that kind of transplant for good reasons. Too many surrounding controls depend on the existing process.

The right rollout is narrower.

Start with prosecution on existing PRs. It relieves a pain everyone already has: large diffs nobody wants to review. Keep human approval exactly where it is. Add verified findings, not new authority.

Then add rails for agent-authored work. Require spec-derived tests and protect them from the builder. This is the first real trust anchor.

Then add interrogation before substantial agent work. Convert "go build this" into acceptance criteria with verification methods. Human spec approval becomes higher leverage than late diff review.

Then add evidence manifests. Feed the existing SDLC gates with better artifacts: what was promised, what was checked, what changed in behavior, what the review stack is calibrated to catch, what remains outside the gate.

Only then add parallelism and distillation. Fan-out is the reward for partition quality, not the starting move. Lesson mining is the reward for enough verified findings to mine.

The enterprise SDLC does not disappear in this adoption path. It becomes the outer governance shell. ADLC becomes the inner production system for agent-built software.

1. Prosecute existing PRs 2. Add frozen rails 3. Interrogate specs 4. Emit evidence manifests 5. Add parallelism 6. Distill recurring lessons
Click diagram to zoom

The bottom line#

The traditional enterprise SDLC asks: how do I coordinate humans so software changes are intentional, reviewed, tested, approved, and supportable?

The Agentic Development Lifecycle asks: how do I constrain probabilistic builders so their speed becomes verified change instead of accelerated ambiguity?

Those are different questions, and mature organizations need both. The SDLC remains the language of ownership, risk, release, compliance, and institutional accountability. ADLC becomes the language of model-shaped production: executable specs, frozen rails, fresh-context prosecution, measured ambiguity, deterministic gates, and compounding lessons.

The mistake is treating one as a drop-in replacement for the other. The opportunity is cleaner than that: keep the enterprise SDLC where it protects enterprise risk, and replace the human-shaped build-and-review core with an agent-shaped lifecycle that produces stronger evidence than the old core ever did.

That is the comparison in one sentence: SDLC governs the organization around the change; ADLC governs the machines producing the change. The overlap is real, the tradeoffs are real, and the boundary between them is where serious agent adoption should start.


Part 9Prosecuting the Gates

Read this part on its own page ↗

I Built the Toolkit With the Lifecycle claimed that the toolkit was built by the lifecycle, and then "aimed at itself." It was a tidy sentence, but one that hadn't earned its keep yet. To practice what I preach, I had to run the lifecycle's own prosecution phase against the gates themselves. The charter for the fresh-context reviewers was simple: find the strongest reason this toolkit's gates do not hold.

What happened next is the ultimate stress-test of the lifecycle: the prosecution successfully broke almost every adversarial surface, exposing the exact builder blind spots that normal testing misses. A victory lap proves nothing a skeptic believes. A system that detects, reproduces, and hardens its own flaws under pressure does.


The shape of the blindness#

The findings rhymed. Every gate had been built to catch an honest mistake (a typo'd path, a forgotten assertion, or a malformed file) and every one folded the instant the operator was lazy or adversarial. Which is the whole problem, because the doctrine's threat model is an adversarial builder: a model under gate pressure deletes the failing test, weakens the assertion, mocks the dependency, and reports all green, sincerely. The toolkit's gates had quietly reverted to the SDLC's cooperative-operator assumption, the exact assumption Stop Running the SDLC on Models That Aren't Human said to throw out.

There's a reason it reverted, and it's the most uncomfortable line in the series to write about your own code. The builder that wrote each gate also wrote that gate's tests, in the same context. So the gate and its tests inherited the same blind spot. The suite was green because it asserted against the inputs the builder imagined, never the input the builder couldn't imagine, which is precisely the one an attacker reaches for. This is Principle 3: never judge work your own context produced, playing out one level up in the tooling built to enforce Principle 3. A fresh context with a refute charter is the only thing that sees it, because a fresh context is the only thing the construction history hasn't already biased.

Three of them, named, because vague claims get the vague treatment and I'm not doing that here.

rails-guard: the freeze a git commit walks through#

The rail freeze is the load-bearing trust anchor of the entire lifecycle: the builder cannot edit its own tests, contracts, or CI config during the build. Post 7's table says rails-guard "blocks builder edits." It does not block; it detects, after the fact, by diffing the working tree. And it diffed against the wrong thing: the default base was HEAD, and git diff HEAD cannot see a change that has already been committed.

So the attack is one command. Edit the frozen test, git commit it, run the gate:

$ rails-guard --rails 'test/**'
rails-guard: all checks passed      # exit 0

Reproduced in a scratch repo, start to finish. The builder edits its own acceptance test, commits, and the freeze gate (the one mechanism in the system that was supposed to be non-probabilistic, the one every other gate's trust hangs on) reports clean and, with --record, writes a signed-looking "rails-diff-empty proof" attesting to it. The gate didn't just miss the edit. It manufactured evidence that no edit occurred.

gate-manifest: provenance you can forge from scratch#

gate-manifest is sold as the evidence chain: "every gate appends a signed entry," in-toto/SLSA for agentic provenance, the thing that lets a regulated org prove agent-written code was verified. There was no signing. Each entry's link to the previous one was a plain sha256 of the prior line (a public, keyless function). Anyone who can write the file can recompute the entire chain.

A prosecutor wrote a clean two-entry chain by hand (tests-pass, then prosecution-clean: SHIP) recomputing each hash, and asked the tool to verify it:

{ "valid": true, "count": 2 }

No gate ever ran. The provenance tool certified a merge that never happened. The chain caught an honest actor who edited one line and forgot to recompute forward; against anyone who runs the tool, it proved nothing.

review-calibration: the honesty meter that lied, and froze the lie into a test#

This is the one that should sting, because review-calibration is the meta-gate: it measures whether your prosecution stack is honest by planting bugs and counting what comes back. It scored a plant as "caught" if the review output mentioned the right line or merely contained a twelve-character substring of the changed line. A reviewer that echoes every changed line and understands nothing scores recall 1.0. The instrument that certifies your reviewer is trustworthy was itself gameable by a cat command.

And its own test suite asserted the echoing reviewer passes; the bug wasn't just present, it was frozen into a green test as the intended behavior. The calibrator had been calibrated to call dishonesty honest.

The broader sweep found the same disease in eight more places: a P2 gate that trusted an environment variable for its verdict, a mutation tester that marked every mutant "killed" when the test command was simply false, a conflict forecaster that reported a cyclic dependency graph as a clean empty schedule, a session supervisor blind to the one log format that actually exists, a behavioral diff that passed a dead server as "unchanged." Eleven gates, one failure class.


The turn: reproduced, then RED#

Two things made this a proof instead of an LLM grumbling about code.

First, every finding was reproduced, not asserted: the doctrine's "evidence or it didn't happen" turned back on the toolkit. The rails-guard bypass ran in a shell to exit 0. The gate-manifest chain was forged and verified. The echoing reviewer was scored at 1.0. A finding nobody can reproduce is noise that burns fix-agent tokens chasing ghosts; I let none through.

Second, each fix started from the reproduction as a failing test. rails-guard now resolves its baseline to the merge-base with the trunk and fails closed when it can't (a committed rail edit is caught, and the test that proves it is the bypass itself, inverted). gate-manifest gained real keyed signing; the forged chain that used to verify valid now fails. review-calibration was rebuilt to score a plant as caught only when a finding locates and identifies it (verified behaviorally or judged semantically, with no substring shortcut) and the echoing-reviewer test was flipped: it now asserts recall ~0. That inverted test became the control that runs on every calibration from here on. The bug that hid in a green assertion is now the assertion that guards against its own return.

The trail is real and ordinary: a frozen-baseline fix to the shared core, a sweep across eleven gates, and the calibrator's rebuild, landed as commits. Each carries the regression test that reproduces its exploit (roughly eighty new tests across the toolkit) with every package green. Not a narrative. A diff.


Practicing what I preach: the real proof-of-work#

The series keeps making one move (replace trust with structure, structure with measurement) and this episode is that move applied to the toolkit itself, which means it's the doctrine's own claims, tested on the doctrine's own artifacts:

  • F2: self-review is worthless. The builder's green suite certified gates the builder couldn't see were broken. Asking the toolkit "do you look right?" returned yes, exactly as predicted.
  • E4: fresh context has an inverse value. The only reviewer that found the blindness was one with no stake in the code being right and no memory of why it was written that way. Contamination by construction history is real; the cure is a context that lacks it.
  • The recursion closes. I calibrated the calibrator: the tool that measures reviewer honesty was made honest by the same trick it performs on everyone else: plant the failure, measure, and gate. Dogfooding usually means "I used my own product." Here it means my own product diagnosed my own product, and the diagnosis was you are sick.

The honest boundaries, because a doctrine without them is marketing, and because this post, like every gate, deserves its own prosecution:

The prosecutors were agents under one operator's direction, same model vendor. The active ingredient was not cross-vendor independence; it was fresh context plus a refute charter, and a human deciding what to reproduce. "Independent review" oversells it; "a critic with no construction-history bias and a charter to break things" is the accurate claim, and it was enough.

And the gates are hardened against this class, not proven correct. Eleven holes found and closed is not eleven holes that existed; it's the eleven a few prosecution passes surfaced. The next class is already visible: nobody ran a generator whose entire charter is "produce a diff that passes every gate and is wrong," run continuously against the gate suite like a fuzzer for the lifecycle. That tool (call it gate-fuzzing) is the one that finds the holes a fixed review pass won't, and it's the next thing to build, because the lesson of this episode is that a gate you haven't tried to defeat is a gate you haven't tested.

That's the real shape of dogfooding. Not "I ate my own cooking and it was delicious." I ate it, it made us sick, the kitchen's own instruments told us exactly why, and now the test for that sickness runs on every meal.


Try it on your own gates#

You have gates too: the lint config nobody audits, the CI check that's green for reasons no one has verified, the review bot whose recall is unknown. Point a fresh-context, refute-chartered pass at one of them and ask the only question that matters: what is the strongest reason this does not hold? Reproduce whatever comes back before you believe it, and turn each reproduction into the test that was missing. The first time is reliably humbling. It is also the cheapest review you will ever run, and the only one that tells you what your gates are actually worth.

Start of series: Stop Running the SDLC on Models That Aren't Human →