Mutation Testing

Nightly mutmut gate on generated app/ source plus a complementary stryker4s gate on Scala generator modules

The behavioral test suite is only as good as the bugs it catches. A test that exercises every line of app/ but never fails when that code is wrong is a liability. To keep the generated suite honest we run mutmut nightly: it rewrites the generated app/ source one mutation at a time (== to !=, < to <=, True to False, drop return value, etc.) and re-runs pytest tests/test_behavioral_*.py against each mutated copy. A mutation that survives means the suite did not detect the change: either the clause it should have caught is not actually being asserted, or the mutation is genuinely equivalent (rare, but real).

The CI gate fails if the kill rate drops below 90% on any of the canonical fixtures.

Where it runs

.github/workflows/mutation-testing.yml runs on schedule: '0 3 * * *' (3 AM UTC) and on workflow_dispatch. It is deliberately not part of the per-PR CI: a single mutmut run on url_shortener is hundreds of pytest invocations and easily exceeds the per-PR budget. The matrix covers safe_counter, todo_list, and url_shortener (each fixture has its own job with timeout-minutes: 90). Each fixture gets a Postgres 17 service container with POSTGRES_USER/PASSWORD/DB all set to the fixture name; the jobs are independent (fail-fast: false). auth_service is tracked for inclusion once its 15% testgen-translator skip rate closes.

How a single matrix entry works

1. sbt cli/run compile --with-tests --ignore-verify --out generated/<fixture> \
     --framework fastapi --db postgres ...
2. uv sync --all-extras                       # pulls mutmut>=3.3,<4 from generated pyproject
3. uv run alembic upgrade head                # creates the schema in the postgres service
4. uv run pytest tests/test_behavioral_*.py -x --tb=short -q  # sanity: must pass before mutating
5. cat >> pyproject.toml <<'TOML'             # paths_to_mutate / paths_to_exclude / runner
   [tool.mutmut]
   paths_to_mutate = ["app/"]
   paths_to_exclude = ["app/db/", "app/routers/test_admin.py"]
   runner = "uv run pytest tests/test_behavioral_*.py -x --tb=no -q"
   TOML
6. uv run mutmut run || true                  # populates .mutmut-cache
7. scripts/mutation_check.sh generated/<fixture> "$MUTATION_THRESHOLD"  # gate (90 today)

Step 4 is the safety net: if behavioral tests don't pass against the unmutated app, mutmut runs are noise. Step 5 writes the mutmut config inline because mutmut auto-discovers [tool.mutmut] in pyproject.toml; the runner uses --tb=no (less noise; mutmut just needs the exit code) while step 4's sanity check uses --tb=short (so a real failure is debuggable). Step 6 runs with || true so mutmut's nonzero exit on any survived mutation doesn't short-circuit the workflow; the gate decision belongs to the threshold script.

The TestClient mode (`SPEC_TEST_INPROC=1`)

mutmut mutates files on disk. For the runner to see those mutations the test has to import app/ fresh per invocation. The default conftest sends HTTP to an external uvicorn, which gets loaded once and pins the original bytecode; every mutation would survive trivially. Setting SPEC_TEST_INPROC=1 makes tests/conftest.py instantiate fastapi.testclient.TestClient(app) against from app.main import app, so each pytest process imports the (mutated) app fresh. Postgres is still in the loop; the FastAPI request handler runs in-process, but its DB calls go to the real database.

The workflow sets this env var globally; local users can opt in the same way.

The threshold script

scripts/mutation_check.sh parses mutmut results --all True and computes:

score = killed / (killed + survived + suspicious + timeout)

skipped and no_tests mutations are excluded from the denominator: they are not a quality signal (mutmut couldn't run the test for some infrastructure reason; a survived mutation by contrast is a real gap). On failure the script lists each survived/suspicious/timeout mutation with its diff (mutmut show <id>), so the workflow log alone is enough to start investigating.

Threshold defaults to 90. To run the gate locally:

sbt "cli/run compile --framework fastapi --db postgres --with-tests --ignore-verify --out /tmp/svc fixtures/spec/safe_counter.spec"
cd /tmp/svc && uv sync --all-extras

# Postgres on :5432 with user/pass/db = "safe_counter"
docker run -d --rm -p 5432:5432 \
  -e POSTGRES_USER=safe_counter \
  -e POSTGRES_PASSWORD=safe_counter \
  -e POSTGRES_DB=safe_counter \
  postgres:17-alpine

DATABASE_URL=postgresql+asyncpg://safe_counter:safe_counter@localhost:5432/safe_counter \
  uv run alembic upgrade head

cat >> pyproject.toml <<'TOML'

[tool.mutmut]
paths_to_mutate = ["app/"]
paths_to_exclude = ["app/db/", "app/routers/test_admin.py"]
runner = "uv run pytest tests/test_behavioral_*.py -x --tb=no -q"
TOML

SPEC_TEST_INPROC=1 ENABLE_TEST_ADMIN=1 \
DATABASE_URL=postgresql+asyncpg://safe_counter:safe_counter@localhost:5432/safe_counter \
  uv run mutmut run

# From the spec_to_rest checkout (adjust the path to your clone):
bash ~/Desktop/spec_to_rest/scripts/mutation_check.sh . 90

Reading a failing report

A typical failure block in the workflow log (mutmut_check.sh writes the score line to stdout, the failure breakdown and per-mutation diffs to stderr; CI logs concatenate both):

mutmut score: 86.45% (killed=45 evaluated=52; survived=5 suspicious=2 timeout=0; skipped=0 no_tests=3)

MUTATION SCORE 86.45% < threshold 90%
survived/suspicious/timeout mutations:

--- app.services.url_mapping.x_list_all__mutmut_2 ---
# app.services.url_mapping.x_list_all__mutmut_2: survived
--- app/services/url_mapping.py
+++ app/services/url_mapping.py
@@ -34,5 +34,5 @@
 async def list_all(self) -> list[UrlMappingRead]:
-    result = await self._session.execute(select(UrlMapping))
+    result = await self._session.execute(select(None))
     rows = result.scalars().all()
     return [UrlMappingRead.model_validate(row) for row in rows]

The example uses app/services/url_mapping.py because that's the entity-named service file the codegen actually emits for url_shortener.spec (entity UrlMapping snake-cased). Without --with-synthesis, codegen emits service stubs that raise NotImplementedError for any operation classified LLM_SYNTHESIS; the live mutation surface is therefore dominated by framework code (app/main.py, app/database.py, read-only CRUD shells like list_all above). Two shipped levers can widen that surface:

Phase 6 handler synthesis from ensures clauses. All seven M6 wedges are merged: operation classification (#31), Dafny signature generation (#32), LLM integration (#28), CEGIS feedback loop (#29), compile --with-synthesis Dafny-to-target translation (#27), graduated fallback (#30), and DafnyPro-style hint augmentation (#229). Pass --with-synthesis to the matrix command and mutmut will mutate verified Dafny bodies instead of NotImplementedError stubs. The current matrix does not enable synthesis (CEGIS runs are wall-clock heavy compared to a 90-minute mutation budget); flipping it on is a workflow tweak.
Runtime enforcement of invariants/temporals (#86). Generated services carry invariant / temporal property tests, OpenAPI x-invariant/x-temporal annotations, and a stateful-test temporal observer. The checks live around the bodies, so they raise the kill rate even without synthesis: a mutation that breaks an invariant gets caught by the observer rather than slipping through a per-clause behavioral test.

For each survived mutation, decide:

Genuine gap. The spec ensures clause that should have caught this mutation is not being translated to a property test, or the property test is too lax. Fix is in Behavioral.scala or its callers.
Equivalent mutation. The mutation produces semantically identical code (e.g., await session.commit() and await session.flush() agreeing on some paths, or a guard whose condition is implied by an earlier guard). Equivalents are inevitable; the 10% slack in the 90% threshold absorbs them.
Test infrastructure gap. The property test runs but its assertions don't cover this code path (e.g., a status-code assertion that's lax enough to accept the mutated 200 alongside the original 201). Tighten the ensures expression in the spec or the assertion in Behavioral.scala.

The mutation IDs are stable across runs as long as app/ source positions don't shift, so an investigation can pin a specific ID and apply it on disk to reproduce locally:

cd generated/url_shortener
uv run mutmut apply app.services.url_mapping.x_list_all__mutmut_2
git diff app/  # see the mutation
uv run pytest tests/test_behavioral_url_shortener.py -x   # observe what (didn't) catch it
git checkout app/  # revert

Threshold rationale

Per-fixture threshold is a single number passed to scripts/mutation_check.sh as the second argument. Today every matrix entry uses 90. If a fixture proves to have an unusually high equivalent- mutation rate (e.g., heavy SQLAlchemy session boilerplate that mutmut keeps mutating between morally-equivalent shapes), the per-fixture threshold can be lowered in the workflow without changing the script. We prefer that to a global lower bar.

Excluded paths

paths_to_exclude skips:

app/db/: Alembic migrations and base SQLAlchemy metadata. Mutating schema definitions yields runtime errors that aren't a test-quality signal.
app/routers/test_admin.py: the admin router only exists to support tests. Mutating it would change the test harness rather than the code under test.

Routers (app/routers/<entity-plural>.py), services (app/services/<entity>.py), schemas (app/schemas/<entity>.py), and models (app/models/<entity>.py) are all in scope, plus the framework glue (app/main.py, app/config.py, app/database.py, app/redaction.py).

Scala-side mutation gate (stryker4s)

The mutmut gate above is a consumer-side check: it tests whether the generated behavioral suite catches mutations to the generated FastAPI service. It does not test the generator. A bug in our Scala generator that consistently emits the same wrong code on every run would survive, because the consumer tests are written against whatever the generator produced.

.github/workflows/mutation-scala.yml runs stryker4s nightly (0 4 * * *) on the convention and lint modules, the two pure hand-written domain modules with the largest under-tested surface (the ir module is mostly Isabelle-extracted code already proven correct, and profile/testgen are out of scope for this round). Configuration lives in stryker4s.conf at the repo root; the first run is report-only (thresholds.break = 0, the no-fail default in stryker4s) so we can baseline the kill rate before setting a gate.

Non-goals

Per-PR mutation testing. Too slow; the PR-time gate is the structural
- behavioral + stateful conformance suite.
Mutating Isabelle-extracted Scala (modules/ir/.../generated/). That code is mechanically extracted from a verified Isabelle theory; surviving mutants there carry no test-quality signal, since the Isabelle proof is the correctness gate.
Alternative engines (cosmic-ray, MutPy). mutmut is the canonical choice today; switching engines is a much larger lift than tuning the current setup.

On this page