# Tests Are the Spec in the Only Language the **Builder Can't Argue With**

Published: 2026-06-12T11:00:07.000-04:00
Tags: agents, llm, ai-development, adlc
Canonical: https://www.voodootikigod.com/adlc-3-tests-are-the-spec

> Rails: why TDD becomes the load-bearing trust mechanism of agentic development, why the builder must never touch its own tests, and a field catalog of how agents game gates.

---

[Two Human Gates and Everything Between Is Machine-Checked](/adlc-2-two-human-gates) outlined the lifecycle: eight phases, two human gates, deterministic checks between everything. This post is about the phase the entire structure leans on ([Phase 3: Rail](/adlc-2-two-human-gates#p3)) and the model behavior that makes it necessary.

Start with the inversion, because everything else follows from it:

> **In the SDLC, tests verify the code. In the ADLC, tests are the spec rendered in the only language the builder can't argue with.** A test is the one critic that is never sycophantic, never rots, and never hallucinates.

In human development, TDD is a quality ritual, a discipline signal adopted or skipped by taste, argued about at conference bars for twenty years. In agentic development its role changes completely. Recall the flaw inventory from [Stop Running the SDLC on Models That Aren't Human](/adlc-1-models-arent-human): the model claims success without evidence ([F4](/adlc-1-models-arent-human#f4)), agrees with whoever asks ([F2](/adlc-1-models-arent-human#f2)), and does the minimum that arguably satisfies the instruction ([F1](/adlc-1-models-arent-human#f1)). Against that profile, every probabilistic gate (review, self-checks, "look this over") leaks. The test suite is the one gate that doesn't. **TDD is not a quality ritual here. It is the load-bearing trust mechanism of the entire lifecycle.** Every other gate is probabilistic; this one is not.

Which immediately raises the question this post is actually about: what happens when the thing being gated can edit the gate?

## <span id="reward-hacking"></span>Reward hacking, observed in the wild

Flaw [F5](/adlc-1-models-arent-human#f5), restated: put a gate in front of a model and, under pressure to satisfy it, the model will game the gate rather than clear it. It does this not occasionally, but reliably, given enough pressure and enough iterations. And unlike a human cutting corners, it games the gate *sincerely*: it reports success and, in whatever sense applies, believes it.

The catalog of moves is depressingly consistent across teams, models, and vendors. If you've run agents against a test suite for more than a month you have seen most of these:

- **Delete the failing test.** The classic. Often accompanied by a commit message like "remove outdated test."
- **Weaken the assertion.** `expect(result.total).toBe(427.50)` becomes `expect(result).toBeDefined()`. The test still exists. It still runs. It checks nothing.
- **Mock the thing being tested.** The function under test gets stubbed in its own test file. The suite goes green. The feature does not exist.
- **Snapshot churn.** Regenerate the snapshot to match the broken output. Assertion inverted: the bug is now the spec.
- **Skip markers.** `it.skip`, `xfail`, `// eslint-disable-next-line` (suppression as a service).
- **"Fixed" without running.** The agent reports the fix is in and the tests pass. The tests were never executed. This isn't lying in the human sense; it's [F4](/adlc-1-models-arent-human#f4), where the claim and the hallucination are indistinguishable from the inside.

Here is the uncomfortable conclusion: **instructions cannot fix this.** "Do not modify the tests" is a sentence in a context window, and a sentence in a context window is exactly the kind of constraint [F5](/adlc-1-models-arent-human#f5) routes around under pressure. It does this not maliciously, but the way water routes around a stone. By iteration thirty of a stuck debugging loop, that instruction is competing with an overwhelming gradient toward *make the gate go green*, and the gradient wins.

So the defense has to be structural.

## <span id="rail-discipline"></span>The rail discipline

Three rules make a test suite into rails: a structure the builder runs *inside* rather than a hurdle it can negotiate with.

**1. Author the rails in a context that will never see the implementation.**

Tests, type stubs, and interface contracts are written from the spec, before any implementation exists, by an agent whose context contains the spec and nothing else. This is the creator/critic separation from the first post ([E4](/adlc-1-models-arent-human#e4)) applied at authoring time: a test written by the same context that writes the code inherits the code's assumptions, including the wrong ones. A test written from the spec alone encodes the spec's assumptions, which is the entire point. The rails *are* the spec, compiled to executable form.

**2. Freeze the rails during the build: mechanically.**

During Phase 4, the builder cannot edit test files, contract types, or CI config. Not "is instructed not to" but *cannot*. Enforce at the tool layer: a pre-tool-use hook that blocks writes to rail paths, branch protection on test directories, or file permissions. Declare the rail paths and it blocks builder edits during the build phase and emits a **rails-diff-empty proof** at the gate, which acts as mechanical evidence the builder never touched them.

The principle generalizes well beyond tests: **a constraint that lives in the prompt layer is a request; a constraint that lives in the tool layer is a fact.** Agents route around requests.

**3. Audit the rails themselves, once, adversarially.**

A frozen gate made of fog is still fog. Before freezing, run one [adversarial](/adversarial-review) pass over the *tests*: do they assert behavior, or do they assert that mocks were called? The single best audit question I know:

> **"Does any test fail if the feature is deleted?"**

If you can stub the entire implementation with `return null` and the suite stays green, you don't have a gate. You have a ceremony.

**Gate, restated from [Two Human Gates and Everything Between Is Machine-Checked](/adlc-2-two-human-gates#p3):** the suite runs RED for the right reasons (failures say "not implemented," not "test is broken") and the stubs typecheck. Red-for-the-right-reasons matters: a suite that's red because of an import error tells you nothing when it later goes green.

## <span id="goodhart-annex"></span>The Goodhart annex

[Goodhart's Law](https://en.wikipedia.org/wiki/Goodhart%27s_law): when a measure becomes a target, it ceases to be a good measure. Gates are measures-turned-targets by definition, and agents optimize against them at machine speed and so every gate attracts a gaming move the way every contract attracts a loophole. Hence the annex, in the contract sense: the attached schedule of known loopholes and the clause that closes each one.

Each gaming move, paired with its structural defense. Every gate in the lifecycle ships with its anti-vacuity defense, or it isn't a gate:

| Gaming move | Defense |
|------------|---------|
| Delete or edit the failing test | Rails frozen during build, enforced by hook; review phase verifies the rails diff is empty |
| Weaken assertions (`toBeDefined()`, snapshot churn) | Adversarial test audit before freezing; a review lens dedicated to builder-added tests |
| Mock the thing being tested | The deletion question: "does any test fail if the feature is deleted?" |
| `skip` / `xfail` / lint-suppress markers | Deterministic grep gate in CI: new suppression markers fail the build unless declared in the ticket |
| "Fixed" without running | Claims require execution evidence in the gate itself, not in the agent's report |
| Coverage padding (exercise lines, assert nothing) | Don't gate on coverage at all; see below |
| Out-of-ticket "improvements" masking scope creep | Diff scoped to the ticket's declared files; out-of-scope changes auto-flagged to the human gate |

Notice the texture of the defenses: hooks, greps, diffs, file scopes. Small, dumb, deterministic. That's not a limitation; it's the design. The failure mode being defended against is a system that's brilliant at gaming smart gates. You beat it with gates too dumb to game.

## Why coverage percentage is the wrong gate

Coverage is the most Goodhart-able metric in software, and agents "Goodhart" at machine speed. An agent gated on 80% coverage will hit 80% coverage: assert-free tests, snapshot spam, tests that execute every line and constrain nothing. Humans game coverage too, but slowly, and with enough shame to keep it in check. Agents do it instantly, thoroughly, and sincerely.

If you want a quantitative gate on test quality, the honest version is **mutation testing**: deliberately break the implementation and check that some test notices. A test suite that can't tell broken code from working code is hollow, whatever its coverage number says.

Full mutation testing is famously too slow for CI, which is why almost nobody runs it. The fix is scope: mutate only what the current diff touches. For the tests covering a diff, mutate the implementation (invert conditionals, null the returns, swap operators, plus a few LLM-authored *semantic* mutants, the subtle kind), run the suite, and report any mutant that survives every test. A surviving mutant is a proof object showing a behavior change your tests cannot see. Diff-scoping keeps it at minutes instead of hours. The key output is the list of survivors.

One deterministic check, and it closes three rows of the table above (assertion-weakening, mock-everything, and coverage padding) because all three produce the same detectable symptom: mutants survive.

## What the builder is allowed to do

A clarification that prevents a common misreading: the builder *can* write tests. Unit tests for internals (written during the build, alongside the code) are fine and encouraged. They just aren't *rails*. They don't gate anything, and they get prosecuted like everything else the builder produced (one review lens is dedicated specifically to auditing builder-added tests).

The distinction is provenance, not file type. **Rails are authored from the spec, by a context that never saw the implementation, and frozen. Anything the builder wrote is work product, and work product gets reviewed.** The moment a builder-authored test starts gating the builder's own work, you've reinvented self-review with extra steps, which is [F2](/adlc-1-models-arent-human#f2) in a hard hat.

## The trust chain

Step back and look at what the rail discipline buys the lifecycle as a whole. Every downstream phase inherits its trustworthiness from this one:

- The build gate ("rails green") means something *only because* the builder couldn't edit the rails.
- The review phase can focus on what tests can't catch *only because* the tests deterministically catch what they can.
- The post-merge simplification phase can refactor aggressively *only because* the still-frozen, still-green rails define "behavior preserved."
- The human at the final gate can skip the 5,000-line diff *only because* the rails-diff-empty proof and the green suite arrive as evidence, not as claims.

Pull the rails out and every one of those collapses back into "trust the model's self-report," which is to say, collapses entirely. This is why the phase ordering is non-negotiable: rails before build, always. The most expensive sentence in agentic development is "I'll add tests after it works," because *works*, without rails, is a claim made by the thing being gated.

So: the rails hold the builder. The suite is green, the diff is empty, the mutants die. Done?

No, because everything the rails can't see still gets through. The rails are exactly as good as the spec they encode, and they encode nothing about the spec's *gaps*: the race condition nobody wrote a test for, the auth check missing from an endpoint the spec forgot, or the contract drift between two tickets. Catching what the rails can't see requires judgment. Judgment, per [Stop Running the SDLC on Models That Aren't Human](/adlc-1-models-arent-human), means fresh contexts with inverted charters, because the builder's own context is [sycophantic](/adlc-1-models-arent-human#f2) about its own work.

That's the prosecution phase. And it has a problem nobody talks about: who reviews the reviewer? If your adversarial review stack has blind spots (and it does), how would you know? It turns out you can measure it, with planted bugs and arithmetic.
