# Two Human Gates and **Everything Between Is Machine-Checked**

Published: 2026-06-12T11:00:08.000-04:00
Tags: agents, llm, ai-development, adlc
Canonical: https://www.voodootikigod.com/adlc-2-two-human-gates

> Eight phases, exactly two mandatory human moments, deterministic gates between everything, and a spend curve shaped like a barbell.

---

The last post, [Stop Running the SDLC on Models That Aren't Human](/adlc-1-models-arent-human), laid out the argument that the SDLC defends against human failure modes, models fail differently, and every phase of an agentic lifecycle must trace to a specific model flaw it defends against or a model property it exploits.

This post introduces the lifecycle that falls out of that rule. Eight phases. Deterministic gates between every pair. And exactly **two** mandatory human moments in the entire loop. Just two, so get ready to put your trust in the machine.

## The shape

```mermaid
graph TD
    P0[P0: Triage] --> P1[P1: Interrogate]
    P1 --> G1{Human Gate 1:<br/>Spec Approval}
    G1 -- Approved --> P2[P2: Decompose]
    G1 -- Redo --> P1
    P2 --> G2[Gate: Cold-Start Test]
    G2 --> P3[P3: Rail]
    P3 --> G3[RED Gate: Tests fail, types check]
    G3 --> P4[P4: Build]
    P4 --> G4[Green Gate: Rails green, build/lint pass]
    G4 --> P5[P5: Prosecute]
    P5 --> G5[Zero-Findings Gate: No open findings]
    G5 --> P6[P6: Integrate]
    P6 --> G6{Human Gate 2:<br/>Behavioral Acceptance}
    G6 -- Approved --> P7[P7: Distill]
    G6 -- Redo --> P4
    P7 -->|Feeds next run| P0
```

Before walking through it, one principle that governs all the arrows: **an LLM→LLM handoff without a deterministic checkpoint multiplies error rates.** The chain is only as strong as its non-LLM links. Between any two phases there must be something that cannot hallucinate, such as a compiler, a test suite, a schema validator, or a human. Probabilistic components in series compound their error; deterministic gates between them reset it.

### <span id="p0"></span>Phase 0: Triage

Not everything earns the full lifecycle, and running the full ceremony on a typo is how agentic lifecycles die of friction in week two, as well as your token budget. Route by **risk × blast radius**, not size:

- **Trivial** (copy change, config tweak with existing coverage): direct edit, existing tests, one review pass. Cheap model.
- **Bounded** (bug fix inside one module): skip straight to Phase 3, writing the failing test that *is* the bug report, then fixing it and running a [light review](/adversarial-review).
- **Substantial** (new feature, cross-cutting change): full lifecycle.
- **Architectural** (new system, contract changes): full lifecycle plus design alternatives evaluated by a judge panel.

### <span id="p1"></span>Phase 1: Interrogate

The single highest-leverage phase, because error here compounds through everything after it, and no downstream gate can catch "built the wrong thing correctly."

The mechanism is interrogation: *ask me questions until you have none left, but check the codebase before asking each one.* That second clause is the half that matters. Without it you get twenty questions the repo already answers, and the human tunes out by question six. Quite transparently, this one crystalized for me thanks to [Matt Pocock's now famous `grill-me` skill](https://www.skills.sh/mattpocock/skills/grill-me).

The framing correction that took me a while though was people say planning "reduces non-determinism," and that's wrong in a way that matters. Sampling randomness is not the enemy; at temperature zero, a vague spec still yields confidently wrong code. The enemy is **underspecification**. The model fills every gap with its prior, and its prior is "whatever is most generic." Interrogation works by transferring the spec from your head into the context *before* the gaps get filled by invention. That's flaw [F1](/adlc-1-models-arent-human#f1) from [Stop Running the SDLC on Models That Aren't Human](/adlc-1-models-arent-human#f1) (premature satisfaction) being starved of gaps to exploit.

The output is a spec where **every acceptance criterion names its verification method**: a test to be written, a command whose output is asserted, or a behavior demonstrated. A criterion with no verification method is a wish, and wishes get the minimum-effort treatment. 

**Gate: a human approves the spec.** This is human gate one of two, and it is the human's highest-value moment in the entire lifecycle. **This is our moment to shine.** Minutes here replace hours of diff review later. Use the best model you have in this phase and don't economize; a subtly wrong spec sails through every downstream gate and poisons everything. Invest the tokens, invest the time. This has been re-iterated by even the [author of Claude Code, Boris Cherny](https://x.com/bcherny/status/2007179845336527000).

### <span id="p2"></span>Phase 2: Decompose

Defends against context rot ([F3](/adlc-1-models-arent-human#f3)). The unit of work is sized to the *useful* context window (the region before judgment degrades) not the advertised one.

Slice the spec into atomic tickets, each executable by a fresh agent from the ticket text alone. Draw partition lines along interfaces, and write the **contract at each boundary explicitly** (types, schemas, endpoint shapes). Contracts are what let the build phase parallelize safely; parallel agents that collide do so on shared types and configs, never on feature code. Pin the shared surface first and parallel construction stops colliding.

**Gate (the cold-start test):** hand each ticket to a fresh, *cheap* model and ask "what's missing to execute this without asking a single question?" If a cheap model can enumerate the gaps, the ticket is underspecified for the mid-tier model that will actually run it. Costs pennies (if even) per ticket. Catches the number-one cause of build-phase flailing before a more expensive model burns dollars discovering it. 

### <span id="p3"></span>Phase 3: Rail

The trust anchor of the whole lifecycle: tests, type stubs, and contracts authored from the spec **in a context that will never see the implementation**, then frozen. The builder cannot edit them. [Tests Are the Spec in the Only Language the Builder Can't Argue With](/adlc-3-tests-are-the-spec) is entirely about this phase, so for now, let's just talk about the gate:

**Gate: the suite runs RED for the right reasons** (failures say "not implemented," not "test is broken") and the stubs typecheck.

### <span id="p4"></span>Phase 4: Build

One fresh agent per ticket: ticket + relevant skills + frozen rails. No carry-over context between tickets (F3 again). Parallelize across partitions in git worktrees, single writer per partition, merge sequentially.

Mid-tier model by default. This surprises people, so it gets its own principle: **model tier is a function of the cost of *detecting* an error, not of task prestige.** Where the rails are dense, errors are caught instantly and deterministically, so the cheapest model that clears the gates is the correct one. Where errors are expensive to *find* (specs, contracts, migrations without coverage) that's where the frontier model goes. This inverts the common instinct (best model writes the code). The code is the most-verified artifact in the system; the spec is the least.

Two operational rules worth stealing even if you adopt nothing else:

- **Two-strike regeneration.** If an agent flails (loops on the same error or starts touching files outside its ticket) do not coach it inside the same rotting context. Kill it, append the dead-ends to the ticket ("known failed approaches: …"), and start fresh. If the regeneration also fails, the *ticket* is wrong; escalate to Phase 2, not to a bigger model. The second-cheapest fix is a fresh start; the most expensive is a long conversation with a confused agent.
- **No personas.** "You are a senior Next.js engineer with 15 years of experience" adds vibes, not capability. An agent is its context, tools, charter, and gate. Skills add capability; charters add direction; costumes add tokens.

**Gate: rails green, build passes, lint passes.** Deterministic. No opinions.

### <span id="p5"></span>Phase 5: Prosecute

Not "code review." Prosecution: fresh contexts chartered to *refute*, with the burden of proof on the finding; every finding is reproduced by a verifier or killed before anyone acts on it, and the fan-out loops until two consecutive passes come up dry. [Prosecution, Not Code Review](/adlc-4-prosecution-not-code-review) covers this phase and the tooling that measures whether your review stack actually catches anything.

**Gate: zero verified open findings, rails still green, and the rails diff is empty**, which is mechanical proof the builder never touched the tests.

### <span id="p6"></span>Phase 6: Integrate

Human gate two, and it is *not* "read the diff."

The 5,000-line diff read is litmus theater. The human scrolls, pattern-matches for nothing in particular, approves, and the organization books "human in the loop." Human attention is the scarcest, most costly resource in the lifecycle; spend it where machines are blind:

- Read the **spec-conformance summary**: what was promised, what was verified, what was explicitly not done.
- Read the **test diff** (small, high-signal, and it *is* the behavioral contract).
- **Run the thing.** A two-minute demo catches the one category of wrongness no reviewer-agent can: "this is technically correct and not what I meant."
- Spot-check the two or three hotspots prosecution flagged. Not the whole surface.

**Gate: human behavioral acceptance.** "Is this what I meant, *running*?"

### <span id="p7"></span>Phase 7: Distill

The phase everyone skips, which is why their costs stay flat while their codebases bloat. Two halves: **simplify** (post-merge dedup and dead-code removal under the still-green rails, where you should expect a substantial reduction on agent-generated code) and **mine** (recurring review findings become lint rules; recurring interrogation questions become spec templates; [conventions become skills](https://www.voodootikigod.com/skill-mining). This is the compounding loop; [The Lifecycle That Gets Cheaper Every Run](/adlc-6-lifecycle-gets-cheaper) explains and explores this fully.

## The two human gates, stated plainly

The entire lifecycle has exactly two mandatory human moments, by design:

1. **Phase 1: "Is this what I meant?"** (spec approval).
2. **Phase 6: "Is this what I meant, running?"** (behavioral acceptance).

Everything between them is machine-gated. Humans intervene elsewhere only on escalation: non-converging loops, out-of-scope flags, contract changes. This is not human-out-of-the-loop. It is human-at-the-two-points-where-human-judgment-is-irreplaceable, instead of human-as-tired-diff-scroller. It is "Right Tool, Right Task, Right Time". The human is the ground truth for intent, and intent is checked exactly twice: once as words, once as behavior.

## <span id="barbell"></span>The barbell

Where does the money go? Heavy at the ends, light in the middle:

<BarbellChart />

If your spend is concentrated in Phase 4 (the build) your team is exploring (re-reading the codebase every run) instead of exploiting (skills, atomic tickets, cached context). That's a diagnostic, not a judgment; it tells you which phase is missing.

The barbell also explains why this lifecycle reads as heresy to agile instincts. Agile economized on planning because human building was slow and specs went stale before the build caught up. Building is now fast and cheap; **misbuilding is what's expensive**. The economics inverted, so the phase weighting inverts. "Working software over comprehensive documentation" was a correct response to 2001's cost structure. It is the wrong response to this one.

## Norms rejected

Positions this lifecycle takes deliberately, so you can disagree deliberately:

| Norm | Verdict | Why |
|------|---------|-----|
| Human review of full agent diffs | **Reject** | Theater past ~500 lines. Attention goes to spec, test diff, behavior |
| Agile-weight planning | **Reject for agentic work** | The economics inverted; see above |
| Persona engineering | **Reject** | Capability lives in skills, tools, charters. Costumes are token overhead |
| Multi-agent *collaborative construction* (3-7 creators comparing notes) | **Reject** | Search-parallelism misapplied to construction. Partition + contract + single writer instead |
| DRY at authoring time | **Reject** | Dedup moves to Phase 7, where it's mechanical instead of speculative |
| Coverage % as a quality gate | **Reject** | Goodharted at machine speed. More in [Tests Are the Spec in the Only Language the Builder Can't Argue With](/adlc-3-tests-are-the-spec) |
| Token quotas as cost control | **Reject** | Caps the wrong variable. A quota-pressured developer cuts the review phase first, which represents the most valuable tokens in the system. Govern cost per merged, verified change instead |
| Mid-task model failover | **Reject** | Coherence loss (F8). Models switch at task boundaries only |

Every row traces back to the flaw inventory. That's the test: if you find yourself adding a ritual that doesn't trace, you're importing human-shaped process again.

Next up is the phase the whole structure leans on. A phrase that has continually gained focus and importance as we evolve with agentic work: *in traditional software development (SDLC), tests verify the code; in agentic development (ADLC), tests are the spec rendered in the only language the builder can't argue with.* The builder will try to argue anyway (by editing the tests). What happens then is the subject of [Tests Are the Spec in the Only Language the Builder Can't Argue With](/adlc-3-tests-are-the-spec).