Mutation Testing
Edit on GitHubNightly mutmut gate on generated app/ source plus a complementary stryker4s gate on Scala generator modules
Last updated:
The behavioral test suite is only as good as the bugs it catches. A test that
exercises every line of app/ but never fails when that code is wrong is a
liability. To keep the generated suite honest we run mutmut
nightly: it rewrites the generated app/ source one mutation at a time
(== to !=, < to <=, True to False, drop return value, etc.) and
re-runs pytest tests/test_behavioral_*.py against each mutated copy. A
mutation that survives means the suite did not detect the change: either the
clause it should have caught is not actually being asserted, or the mutation
is genuinely equivalent (rare, but real).
The CI gate fails if the kill rate drops below 90% on any of the canonical fixtures.
Where it runs
.github/workflows/mutation-testing.yml
runs on schedule: '0 3 * * *' (3 AM UTC) and on workflow_dispatch. It is
deliberately not part of the per-PR CI: a single mutmut run on
url_shortener is hundreds of pytest invocations and easily exceeds the
per-PR budget. The matrix covers safe_counter, todo_list, and
url_shortener (each fixture has its own job with timeout-minutes: 90).
Each fixture gets a Postgres 17 service container with
POSTGRES_USER/PASSWORD/DB all set to the fixture name; the jobs are
independent (fail-fast: false).
auth_service is
tracked for inclusion once its 15% testgen-translator skip rate closes.
How a single matrix entry works
1. sbt cli/run compile --with-tests --ignore-verify --out generated/<fixture> \
--framework fastapi --db postgres ...
2. uv sync --all-extras # pulls mutmut>=3.3,<4 from generated pyproject
3. uv run alembic upgrade head # creates the schema in the postgres service
4. uv run pytest tests/test_behavioral_*.py -x --tb=short -q # sanity: must pass before mutating
5. cat >> pyproject.toml <<'TOML' # paths_to_mutate / paths_to_exclude / runner
[tool.mutmut]
paths_to_mutate = ["app/"]
paths_to_exclude = ["app/db/", "app/routers/test_admin.py"]
runner = "uv run pytest tests/test_behavioral_*.py -x --tb=no -q"
TOML
6. uv run mutmut run || true # populates .mutmut-cache
7. scripts/mutation_check.sh generated/<fixture> "$MUTATION_THRESHOLD" # gate (90 today)Step 4 is the safety net: if behavioral tests don't pass against the
unmutated app, mutmut runs are noise. Step 5 writes the mutmut config inline
because mutmut auto-discovers [tool.mutmut] in pyproject.toml; the runner
uses --tb=no (less noise; mutmut just needs the exit code) while step 4's
sanity check uses --tb=short (so a real failure is debuggable). Step 6
runs with || true so mutmut's nonzero exit on any survived mutation
doesn't short-circuit the workflow; the gate decision belongs to the
threshold script.
The TestClient mode (SPEC_TEST_INPROC=1)
mutmut mutates files on disk. For the runner to see those mutations
the test has to import app/ fresh per invocation. The default conftest
sends HTTP to an external uvicorn, which gets loaded once and pins the
original bytecode; every mutation would survive trivially. Setting
SPEC_TEST_INPROC=1 makes tests/conftest.py instantiate
fastapi.testclient.TestClient(app) against from app.main import app, so
each pytest process imports the (mutated) app fresh. Postgres is still in
the loop; the FastAPI request handler runs in-process, but its DB calls go
to the real database.
The workflow sets this env var globally; local users can opt in the same way.
The threshold script
scripts/mutation_check.sh
parses mutmut results --all True and computes:
score = killed / (killed + survived + suspicious + timeout)skipped and no_tests mutations are excluded from the denominator: they
are not a quality signal (mutmut couldn't run the test for some
infrastructure reason; a survived mutation by contrast is a real gap). On
failure the script lists each survived/suspicious/timeout mutation with its
diff (mutmut show <id>), so the workflow log alone is enough to start
investigating.
Threshold defaults to 90. To run the gate locally:
sbt "cli/run compile --framework fastapi --db postgres --with-tests --ignore-verify --out /tmp/svc fixtures/spec/safe_counter.spec"
cd /tmp/svc && uv sync --all-extras
# Postgres on :5432 with user/pass/db = "safe_counter"
docker run -d --rm -p 5432:5432 \
-e POSTGRES_USER=safe_counter \
-e POSTGRES_PASSWORD=safe_counter \
-e POSTGRES_DB=safe_counter \
postgres:17-alpine
DATABASE_URL=postgresql+asyncpg://safe_counter:safe_counter@localhost:5432/safe_counter \
uv run alembic upgrade head
cat >> pyproject.toml <<'TOML'
[tool.mutmut]
paths_to_mutate = ["app/"]
paths_to_exclude = ["app/db/", "app/routers/test_admin.py"]
runner = "uv run pytest tests/test_behavioral_*.py -x --tb=no -q"
TOML
SPEC_TEST_INPROC=1 ENABLE_TEST_ADMIN=1 \
DATABASE_URL=postgresql+asyncpg://safe_counter:safe_counter@localhost:5432/safe_counter \
uv run mutmut run
# From the spec_to_rest checkout (adjust the path to your clone):
bash ~/Desktop/spec_to_rest/scripts/mutation_check.sh . 90Reading a failing report
A typical failure block in the workflow log (mutmut_check.sh writes the
score line to stdout, the failure breakdown and per-mutation diffs to
stderr; CI logs concatenate both):
mutmut score: 86.45% (killed=45 evaluated=52; survived=5 suspicious=2 timeout=0; skipped=0 no_tests=3)
MUTATION SCORE 86.45% < threshold 90%
survived/suspicious/timeout mutations:
--- app.services.url_mapping.x_list_all__mutmut_2 ---
# app.services.url_mapping.x_list_all__mutmut_2: survived
--- app/services/url_mapping.py
+++ app/services/url_mapping.py
@@ -34,5 +34,5 @@
async def list_all(self) -> list[UrlMappingRead]:
- result = await self._session.execute(select(UrlMapping))
+ result = await self._session.execute(select(None))
rows = result.scalars().all()
return [UrlMappingRead.model_validate(row) for row in rows]The example uses app/services/url_mapping.py because that's the entity-named
service file the codegen actually emits for url_shortener.spec (entity
UrlMapping snake-cased). Without --with-synthesis, codegen emits service
stubs that raise NotImplementedError for any operation classified
LLM_SYNTHESIS; the live mutation surface is therefore dominated by framework
code (app/main.py, app/database.py, read-only CRUD shells like list_all
above). Two shipped levers can widen that surface:
- Phase 6 handler synthesis from
ensuresclauses. All seven M6 wedges are merged: operation classification (#31), Dafny signature generation (#32), LLM integration (#28), CEGIS feedback loop (#29),compile --with-synthesisDafny-to-target translation (#27), graduated fallback (#30), and DafnyPro-style hint augmentation (#229). Pass--with-synthesisto the matrix command and mutmut will mutate verified Dafny bodies instead ofNotImplementedErrorstubs. The current matrix does not enable synthesis (CEGIS runs are wall-clock heavy compared to a 90-minute mutation budget); flipping it on is a workflow tweak. - Runtime enforcement of invariants/temporals
(#86). Generated
services carry invariant / temporal property tests, OpenAPI
x-invariant/x-temporalannotations, and a stateful-test temporal observer. The checks live around the bodies, so they raise the kill rate even without synthesis: a mutation that breaks an invariant gets caught by the observer rather than slipping through a per-clause behavioral test.
For each survived mutation, decide:
- Genuine gap. The spec ensures clause that should have caught this
mutation is not being translated to a property test, or the property test
is too lax. Fix is in
Behavioral.scalaor its callers. - Equivalent mutation. The mutation produces semantically identical
code (e.g.,
await session.commit()andawait session.flush()agreeing on some paths, or a guard whose condition is implied by an earlier guard). Equivalents are inevitable; the 10% slack in the 90% threshold absorbs them. - Test infrastructure gap. The property test runs but its assertions
don't cover this code path (e.g., a status-code assertion that's lax
enough to accept the mutated 200 alongside the original 201). Tighten the
ensures expression in the spec or the assertion in
Behavioral.scala.
The mutation IDs are stable across runs as long as app/ source positions
don't shift, so an investigation can pin a specific ID and apply it on
disk to reproduce locally:
cd generated/url_shortener
uv run mutmut apply app.services.url_mapping.x_list_all__mutmut_2
git diff app/ # see the mutation
uv run pytest tests/test_behavioral_url_shortener.py -x # observe what (didn't) catch it
git checkout app/ # revertThreshold rationale
Per-fixture threshold is a single number passed to
scripts/mutation_check.sh as the second argument. Today every matrix
entry uses 90. If a fixture proves to have an unusually high equivalent-
mutation rate (e.g., heavy SQLAlchemy session boilerplate that mutmut keeps
mutating between morally-equivalent shapes), the per-fixture threshold can
be lowered in the workflow without changing the script. We prefer that to
a global lower bar.
Excluded paths
paths_to_exclude skips:
app/db/: Alembic migrations and base SQLAlchemy metadata. Mutating schema definitions yields runtime errors that aren't a test-quality signal.app/routers/test_admin.py: the admin router only exists to support tests. Mutating it would change the test harness rather than the code under test.
Routers (app/routers/<entity-plural>.py), services
(app/services/<entity>.py), schemas (app/schemas/<entity>.py), and
models (app/models/<entity>.py) are all in scope, plus the framework
glue (app/main.py, app/config.py, app/database.py, app/redaction.py).
Scala-side mutation gate (stryker4s)
The mutmut gate above is a consumer-side check: it tests whether the generated behavioral suite catches mutations to the generated FastAPI service. It does not test the generator. A bug in our Scala generator that consistently emits the same wrong code on every run would survive, because the consumer tests are written against whatever the generator produced.
.github/workflows/mutation-scala.yml
runs stryker4s
nightly (0 4 * * *) on the convention and lint modules, the two pure
hand-written domain modules with the largest under-tested surface (the
ir module is mostly Isabelle-extracted code already proven correct, and
profile/testgen are out of scope for this round). Configuration lives
in stryker4s.conf
at the repo root; the first run is report-only (thresholds.break = 0,
the no-fail default in stryker4s) so we can baseline the kill rate before
setting a gate.
Non-goals
- Per-PR mutation testing. Too slow; the PR-time gate is the structural
- behavioral + stateful conformance suite.
- Mutating Isabelle-extracted Scala (
modules/ir/.../generated/). That code is mechanically extracted from a verified Isabelle theory; surviving mutants there carry no test-quality signal, since the Isabelle proof is the correctness gate. - Alternative engines (cosmic-ray, MutPy). mutmut is the canonical choice today; switching engines is a much larger lift than tuning the current setup.
Test Generation
--with-tests emits a native behavioral/stateful/structural conformance suite (per target language) + a gated admin router from spec ensures/requires/invariants
Diff migrations for schema evolution
Snapshot-driven incremental migrations across Alembic (Python), golang-migrate (Go), and Prisma (TS)