Test Generation

--with-tests emits a native behavioral/stateful/structural conformance suite (per target language) + a gated admin router from spec ensures/requires/invariants

spec-to-rest compile --with-tests produces a property-test conformance suite alongside the generated service, emitted in the target's own language — pytest + Hypothesis + Schemathesis for python-fastapi-*, Vitest + fast-check for ts-express-*, go test + rapid for go-chi-* (all nine targets: each language × postgres / sqlite / mysql). Each ensures clause becomes a positive property test; each <input> in <state> requires clause becomes a negative test (asserting 4xx); each global invariant becomes a post-operation check. The examples on this page show the Python (fastapi) rendering; see Native multi-language emission for how the same spec-derived logic renders into TypeScript and Go.

Looking for the design rationale and the full vision (structural + behavioral + stateful layers, conformance runner, mutation testing)? See Test Generation Pipeline. This page documents what --with-tests actually delivers today; the nightly mutation-testing CI gate that keeps the suite honest is documented separately on the Mutation Testing page.

Quick start

sbt "cli/run compile --with-tests --out /tmp/my-service fixtures/spec/url_shortener.spec"
cd /tmp/my-service
cp .env.example .env

# Full docker-compose pipeline (boot, run all 3 layers, tear down):
make test-conformance-docker PROFILE=smoke

# Or against an already-running service:
ENABLE_TEST_ADMIN=1 uv run uvicorn app.main:app --host 0.0.0.0 --port 8000 &
make test-conformance PROFILE=smoke

The generated docker-compose.yml already forwards ENABLE_TEST_ADMIN from the host to the app container, so ENABLE_TEST_ADMIN=1 docker compose up works without manual edits.

The ENABLE_TEST_ADMIN=1 env var is required: tests use a small admin router (/__test_admin__/reset and /__test_admin__/state) to control state and read it back. Production deployments must leave this unset; the router returns 403 by default.

Native multi-language emission

The conformance logic (which clauses become tests, the skip algebra, the state/output scoping, the strategy derivation from IR types) is single-sourced and language-agnostic. Only the leaf rendering varies, behind the Backend.scala seam in modules/testgen:

ExprBackend — translates an IR expression to the target language (ExprToPython / TsExprBackend / GoExprBackend).
StrategyBackend — renders a spec type into a generator (PythonHypothesisStrategy st.* / TsFastCheckStrategy fc.* / GoRapidStrategy rapid).
HarnessTemplates — the language-native scaffold (pytest+conftest / Vitest+_client.ts / go test+conf_*.go), selected by TestBackend.

The Behavioral / Stateful / Structural decision predicates are shared by all three; the parallel Ts* / Go* emitters reuse them and only swap the rendered tokens. The Python (fastapi) path is held byte-identical through every slice, so it serves as the differential oracle: CI runs the Python suite and the native suite against the same contract, and any behavioral disagreement is a mechanically-caught translator bug.

Target	Behavioral / stateful	Structural	Run
`python-fastapi-*`	pytest + Hypothesis	Schemathesis (full OpenAPI fuzzing + spec-derived checks)	`python tests/run_conformance.py <profile>` / `make test-conformance`
`ts-express-*`	Vitest + fast-check	structural-lite (reuse the Strategies seam: type-valid input per non-stub op, assert no 5xx)	`node tests/run_conformance.mjs <profile>` / `npm test`
`go-chi-*`	`go test` + rapid	structural-lite (same)	`bash tests/run_conformance.sh <profile>` / `go test -tags conformance ./tests/...`

Structural-lite (ts/go): schemathesis has no TS/Go peer, so rather than reimplement an OpenAPI/JSON-Schema fuzzer the structural phase reuses the IR Strategies seam — for every non-stub operation it generates type-valid inputs, calls with a per-case /__test_admin__/reset, and asserts the server did not 5xx (schemathesis's core "documented, no server error" check, minus full schema validation). Fail-loud stubs are honest-skipped exactly as the Python schemathesis renderer schema.exclude(...)s them.

The /__test_admin__/* reset/state/seed contract is identical across targets; only its implementation is stack-specific (AdminRouter.scala → FastAPI, AdminRouterTs.scala → Express src/routes/testAdmin.ts, AdminRouterGo.scala → Go internal/testadmin/admin_conformance.go). Go links statically, so there is no dynamic-import gate: codegen always emits a !conformance no-op testadmin.Register + main.go wiring, and --with-tests emits the real conformance-tagged router — keep the test-admin code out of production binaries via the build tag, not a runtime branch.

Each backend's per-target page documents the exact file layout and run commands: python-fastapi · ts-express · go-chi.

What gets generated

The table below is the python-fastapi-* layout; the Vitest and go test backends emit the analogous set in their own conventions (see the per-target pages and Native multi-language emission above).

Generated files, full table (python-fastapi)

File	Source	What it is
`app/routers/test_admin.py`	`AdminRouter.scala`	`/__test_admin__/reset` truncates entity tables; `/__test_admin__/state` returns spec state as JSON; `/__test_admin__/seed/<entity>` (per entity that participates in a `TransitionDecl`) inserts a row directly so transition tests can drive entities into a chosen status. All endpoints 403 unless `ENABLE_TEST_ADMIN=1`.
`tests/__init__.py`	empty	makes `tests` a Python package
`tests/conftest.py`	static template	`httpx` client + session-scoped fixture that skips the whole suite with a clear message if the service is unreachable or the admin router isn't enabled
`tests/predicates.py`	rendered from `ir.predicates`	`_powerset` plus one Python helper per spec/preamble predicate (auto-derived from the body)
`tests/strategies.py`	`Strategies.scala`	one `def strategy_<type>(): return ...` per `TypeAliasDecl` and `EnumDecl`, plus one `def strategy_<entity>()` (returning `st.fixed_dictionaries`) per entity that participates in a `TransitionDecl`
`tests/strategies_user.py`	static template	empty stub for user-supplied strategy functions referenced from `<Type>.strategy = "module:symbol"` convention rules. Preserved across re-compile.
`tests/redaction.py`	static template	`redact(strategy)` wrapper + `_RedactedStr` mask class for sensitive Hypothesis values (M5.8)
`tests/test_behavioral_<service>.py`	`Behavioral.scala`	the property tests themselves
`tests/test_stateful_<service>.py`	`Stateful.scala`	a Hypothesis `RuleBasedStateMachine` exercising multi-step operation sequences
`tests/test_structural_<service>.py`	`Structural.scala`	Schemathesis-driven structural tests: schema fuzzing + spec-derived custom checks + OpenAPI-Links state machine
`tests/run_conformance.py`	static template	three-phase orchestrator (structural $\to$ behavioral $\to$ stateful) with JUnit XML output and a unified exit code
`tests/_testgen_skips.json`	testgen	machine-readable list of clauses that were not turned into tests, with reasons
`pytest.ini`	static template	disables `pytest-xdist` parallelism (admin-router state is global)

Three test kinds emitted per operation

1. Positive ensures tests

For each operation whose requires does not reference state, every ensures clause that the IR-to-Python translator can handle becomes a property test:

@given(url=strategy_long_url())
@settings(max_examples=20, suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_shorten_ensures_0(url):
    """ensures: (code not in pre(store))"""
    client.post("/__test_admin__/reset")
    pre_state = client.get("/__test_admin__/state").json()
    response = client.post("/shorten", json={"url": url})
    assume(response.status_code == 201)
    assert response.status_code == 201, response.text
    response_data = response.json() if response.content else {}
    post_state = client.get("/__test_admin__/state").json()
    assert ((response_data["code"]) not in (pre_state["store"])), "ensures violated: ensures: (code not in pre(store))"

State-dependent preconditions are partially covered. If requires references a state field (e.g., requires: code in store), positive ensures tests for the generic case are skipped; driving the system into the precondition-satisfying state requires either calls to other operations or direct seed-writes. For the transition case (operation referenced by a via clause in a TransitionDecl), M5.9 closes this gap with dedicated transition tests (see below). The clause still appears in _testgen_skips.json, but the reason text flips to state-dependent precondition; covered by transition tests (M5.9) for transition operations and falls back to ... covered by stateful tests (M5.2) for ops bundled into the state machine ... for the residual non-transition operations (GetTodo, UpdateTodo, DeleteTodo in todo_list). ExprToPython translates every IR Expr shape, so the translator coverage gap is closed.

2. Negative `<input> in <state>` tests

For exactly the <input> in <state> requires pattern, a negative test is emitted that generates a fresh-state input via Hypothesis, asserts the input isn't in the (empty) state, calls the operation, and asserts a 4xx response:

@given(code=strategy_short_code())
@settings(max_examples=10, suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_resolve_negative_code_not_in_store(code):
    """requires 'code in store' (negative): missing key returns 4xx."""
    client.post("/__test_admin__/reset")
    pre_state = client.get("/__test_admin__/state").json()
    assume(code not in pre_state.get("store", {}))
    response = client.get(f"/{code}")
    assert 400 <= response.status_code < 500, f"expected 4xx, got {response.status_code}: {response.text}"

A second negative pattern recognises status restrictions on existing entries: <state>[<input>].<enum-field> = <required-value> (the precondition that the entry, once it exists, must be in a particular status). When the underlying entity is in a TransitionDecl (so /__test_admin__/seed/<entity> exists), a negative test is emitted that seeds the entity in any other enum value of the field, picked by st.sampled_from(<all-other-values>), and asserts a 4xx response:

@given(row=strategy_order(), wrong_status=st.sampled_from(["DRAFT", "PAID", "SHIPPED", "DELIVERED", "CANCELLED", "RETURNED"]), amount=strategy_money())
@settings(max_examples=10, suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_record_payment_negative_orders_status_not_placed(row, wrong_status, amount):
    """requires 'orders[order_id].status = PLACED' (negative): wrong status returns 4xx."""
    client.post("/__test_admin__/reset")
    row = dict(row)
    row["status"] = wrong_status
    seed = client.post("/__test_admin__/seed/order", json=row)
    assume(seed.status_code == 201)
    seeded_id = seed.json()["id"]
    response = client.post(f"/orders/{seeded_id}/payments", json={"amount": amount})
    assert 400 <= response.status_code < 500, f"expected 4xx, got {response.status_code}: {response.text}"

This covers business-logic illegal-from preconditions on non-via operations (e.g., ecommerce AddLineItem requires orders[order_id].status = DRAFT but is not the via of any TransitionDecl). For via operations the same pattern emits an additional negative alongside the per-illegal-from transition negatives, they assert the same qualitative property from different angles.

Other requires shapes (predicate violation, ordering, set-content checks) are tracked as follow-ups and skipped with reason in _testgen_skips.json.

3. Post-operation invariant checks

For each global invariant, one test per operation that has a non-state-dep requires:

@given(url=strategy_long_url())
@settings(max_examples=20, suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_shorten_invariant_all_ur_ls_valid(url):
    """invariant allURLsValid: (all c in store | isValidURI(store[c]))"""
    client.post("/__test_admin__/reset")
    pre_state = client.get("/__test_admin__/state").json()
    response = client.post("/shorten", json={"url": url})
    assume(response.status_code == 201)
    response_data = response.json() if response.content else {}
    post_state = client.get("/__test_admin__/state").json()
    assert all(is_valid_uri(post_state["store"][c]) for c in (post_state["store"])), "invariant violated: invariant allURLsValid: (all c in store | isValidURI(store[c]))"

4. Temporal predicates (#86)

temporal X: always(P) and temporal X: eventually(P) declarations participate in test emission alongside invariants. The always(P) form is structurally an invariant (it asserts P after every step), so behavioral and stateful tests emit it the same way an invariant decl is emitted, distinguished only by the method name (temporal_always_<name> vs invariant_<name>).

eventually(P) is liveness-shaped: "at some point during the trace, P held." The build-time Alloy proof is authoritative (it shows reachability up to the configured scope), but at runtime we emit a per-trace witness. In the stateful state machine:

_reset() initializes a sticky flag self._eventually_seen_<name> = False
An @invariant() observer evaluates P after every step and ORs it into the flag if true
The state machine's teardown(self) asserts the flag was set

@invariant()
def temporal_eventually_observe_some_user_exists(self):
    """temporal eventually(someUserExists): (some u in users | u.is_active)"""
    post_state = client.get("/__test_admin__/state").json()
    if any(u["is_active"] for u in post_state["users"].values()):
        self._eventually_seen_some_user_exists = True

def teardown(self):
    assert self._eventually_seen_some_user_exists, \
        "temporal eventually never observed in trace: someUserExists: ..."

This is strictly weaker than the build-time Alloy proof: each Hypothesis case observes a single bounded trace, so a trace too short to reach P will fail the teardown even on a sound spec. If a fixture's state machine defaults (max_examples=25, stateful_step_count=20) don't reach P reliably, lengthen the trace via Hypothesis settings or accept the gap and lean on the verifier.

fairness(op) is parsed but emits a _testgen_skips.json entry only; the verifier already rejects it (v1 does not implement trace-based verification), so a runtime stub for an unverifiable construct creates an unreachable code path. Tracked alongside v2 fairness verification.

always / eventually decls also surface in the generated openapi.yaml as top-level x-invariant and x-temporal extensions, machine-readable for documentation generators and downstream tooling. Each x-temporal entry carries {kind: always|eventually|fairness, expr: <pretty-printed predicate>}.

x-invariant:
  usersAreValid: (all u in users | u.is_active)
x-temporal:
  someUserExists:
    kind: eventually
    expr: '(some u in users | u.is_active)'
  allUsersAlwaysValid:
    kind: always
    expr: (all u in users | u.is_active)

Transition tests (M5.9)

For specs that declare a state machine via transition X { entity: E; field: f; FROM -> TO via Op ... }, testgen synthesizes per-via-operation behavioral tests that drive the entity directly into each from status using a generated seed admin endpoint, then exercise the operation:

Positive test per legal (from, to) rule. Seed entity in from, call the via operation, assert 2xx and the entity's field post-state equals to.
Negative test per illegal from (every value of the field's enum that is not a from-state for via X). Seed entity in that state, call the via operation, assert 4xx.

Setup uses POST /__test_admin__/seed/<entity>, emitted by AdminRouter.scala for every entity referenced by a TransitionDecl. The endpoint accepts a JSON dict, parses any DateTime-typed columns from ISO strings, inserts the row, and returns the new primary key. Like /reset and /state, it 403s unless ENABLE_TEST_ADMIN=1.

Row strategies come from Strategies.scala's new strategy_<entity>() function, emitted only for transition entities; non-transition specs (safe_counter, url_shortener) generate byte-identical strategies.py to before.

@given(row=strategy_todo())
@settings(max_examples=20, suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_start_work_transition_todo_to_in_progress(row):
    """transition StartWork: TODO -> IN_PROGRESS (post-state status = IN_PROGRESS)"""
    client.post("/__test_admin__/reset")
    row = dict(row)
    row["status"] = "TODO"
    seed = client.post("/__test_admin__/seed/todo", json=row)
    assume(seed.status_code == 201)
    seeded_id = seed.json()["id"]
    response = client.post(f"/todos/{seeded_id}/start")
    assert response.status_code == 200, response.text
    post_state = client.get("/__test_admin__/state").json()
    bucket = post_state.get("todos", {})
    entity_view = bucket.get(str(seeded_id)) or bucket.get(seeded_id)
    actual = entity_view.get("status") if isinstance(entity_view, dict) else entity_view
    assert actual == "IN_PROGRESS", f"expected status=IN_PROGRESS, got {actual!r}"

@given(row=strategy_todo())
@settings(max_examples=10, suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_archive_transition_illegal_from_in_progress(row):
    """transition Archive: from=IN_PROGRESS is illegal (no rule); SUT must reject 4xx"""
    client.post("/__test_admin__/reset")
    row = dict(row)
    row["status"] = "IN_PROGRESS"
    seed = client.post("/__test_admin__/seed/todo", json=row)
    assume(seed.status_code == 201)
    seeded_id = seed.json()["id"]
    response = client.post(f"/todos/{seeded_id}/archive")
    assert 400 <= response.status_code < 500, f"expected 4xx, got {response.status_code}: {response.text}"

What gets generated for `todo_list.spec`

`via` operation	Legal `(from, to)`	Positive tests	Illegal `from` (no rule)	Negative tests
`StartWork`	`TODO -> IN_PROGRESS`	1	`IN_PROGRESS, DONE, ARCHIVED`	3
`Complete`	`IN_PROGRESS -> DONE`	1	`TODO, DONE, ARCHIVED`	3
`Reopen`	`DONE -> IN_PROGRESS` (guarded by `updated_at > completed_at`)	1	`TODO, IN_PROGRESS, ARCHIVED`	3
`Archive`	`TODO -> ARCHIVED`, `DONE -> ARCHIVED`	2	`IN_PROGRESS, ARCHIVED`	2

16 transition tests total, including the Reopen positive whose guard updated_at > completed_at is satisfied deterministically by the recognizer in #152 (see "Guarded positive transitions" below). PauseWork is referenced in the spec's transition graph but has no operation declaration in todo_list.spec, so it produces a transition[PauseWork] skip in _testgen_skips.json with reason no operation named 'PauseWork' for via clause.

Guarded positive transitions (#152)

Rules with a when <guard> clause are recognized when the guard reduces to a conjunction (and), or single instance, of these shapes:

Shape	Example	Emitted fix-up
ordered comparison on two entity fields, both DateTime-ish	`updated_at > completed_at`	`row["updated_at"] = (datetime.datetime.fromisoformat(row["completed_at"]) + datetime.timedelta(seconds=1)).isoformat()` (with optional `if row[..] is None` anchor when the RHS is `Option[DateTime]`)
ordered comparison on two entity fields, both numeric	`score > threshold`	`row["score"] = row["threshold"] + 1` (or `+ 0`/`- 1` for `>=`/`<=`/`<`)
ordered comparison vs. numeric literal	`score > 10`	`row["score"] = 10 + 1`
equality with enum or literal RHS	`tier = GOLD`, `count = 7`, `name = "x"`	`row["tier"] = "GOLD"`
equality with `none` on Option field	`closed_at = none`	`row["closed_at"] = None`
existence (`!= none`) on Option field	`closed_at != none`	`if row["closed_at"] is None: row["closed_at"] = <inner-default>`
cardinality on Set/Seq field	`#tags > 2`, `len(tags) >= 1`, `#tags = 0`	`row["tags"] = ["x0", "x1", "x2"]` (size and fillers depend on inner type)
literal membership in Set/Seq field	`"URGENT" in tags`	`row["tags"] = list(row["tags"]) + ["URGENT"]`
transition-field self-equality matching `from`	rule `LOW -> HIGH ... when phase = LOW`	no extra fix line, already satisfied by the `row[<field>] = LOW` step
top-level `not (...)` (De Morgan inversion)	`not (a > b)`	recurses on `a <= b`

Recognized clauses can be combined freely with and. The arithmetic-shift form on ordered comparisons composes with field-level where constraints already enforced by the strategy because we keep the strategy-generated RHS value and shift LHS off it.

Anything outside the table above keeps the skip with reason guard '<expr>' not representable in seed dict (see #152). Cases that fall back:

User function calls other than len(...) (e.g. is_valid_email(addr), paymentCaptured(order_id)): would need runtime predicate evaluation.
Set/map/seq arithmetic beyond cardinality and membership (e.g. tags - banned = {}, tags = other_tags).
Nested field access on either side (e.g. parent.status = ARCHIVED, priority.level > 3): would need multi-entity seed orchestration; the recognizer only accepts bare Identifier operands of the seeded entity.
Pre(...) / Prime(...). References system pre-/post-state, not the row being seeded; structurally not row-local.
Compound RHS expressions other than the literal forms in the table (e.g. a > b + 1, a > some_function()).
Mismatched operand kinds in ordered comparison (one DateTime-ish, one numeric); the recognizer requires the same kind on both sides.
Transition-field self-equality contradicting the rule's from (e.g. rule LOW -> HIGH ... when phase = HIGH; the guard can never hold for any row of phase=LOW).
Conjunctions where any branch fails recognition. All-or-nothing across an and tree; a partial pass would silently miss the unrecognized branch.
Conflicting fixes within a conjunction (e.g. a > b and a < b produces two writes to row[a] with different RHS); recognition fails fast rather than emitting an unsatisfiable test.
Top-level or / implies. Pick-a-branch ambiguity; reduce to a conjunction and re-spec the rule if you want a deterministic test.

Body and query inputs on `via` operations (#155)

When the via op takes parameters beyond the path id (e.g., RecordPayment(order_id: OrderId, amount: Money) mounted at POST /orders/{order_id}/payments), the body/query inputs are generated through the same Strategies.expressionFor(...) used in non-transition tests, and wired into the @given decorator alongside row=strategy_<entity>():

@given(row=strategy_order(), amount=strategy_money())
@settings(max_examples=10, suppress_health_check=[HealthCheck.function_scoped_fixture])
def test_record_payment_transition_illegal_from_draft(row, amount):
    """transition RecordPayment: from=DRAFT is illegal (no rule); SUT must reject 4xx"""
    client.post("/__test_admin__/reset")
    row = dict(row)
    row["status"] = "DRAFT"
    seed = client.post("/__test_admin__/seed/order", json=row)
    assume(seed.status_code == 201)
    seeded_id = seed.json()["id"]
    response = client.post(f"/orders/{seeded_id}/payments", json={"amount": amount})
    assert 400 <= response.status_code < 500, f"expected 4xx, got {response.status_code}: {response.text}"

Body inputs go to json={...}, query inputs to params={...}. Sensitive-field redaction and per-operation test_strategy overrides apply, since the same expressionFor path is reused. If any non-path input has no known strategy (e.g., Map[K, V], Relation[K, V], an unknown named type), the entire via is skipped with a reason naming the un-generable input.

If a non-path input shares a name with a generated test local, row, seed, seeded_id, response, client, pre_state, post_state, wrong_status, the input is aliased to _arg_<name> (with a numeric suffix on collision) for the function parameter and the @given keyword, while the JSON / query dict key keeps the original parameter name. So a spec input named seed becomes @given(..., _arg_seed=...) / def fn(row, _arg_seed) / json={"seed": _arg_seed}.

The positive transition still requires the via op's other requires clauses (besides the from-state pin) to be satisfied by the generated values; if the SUT enforces stricter preconditions (e.g., requires: amount = orders[order_id].total), the positive test will fail by design, that's a real spec-vs-SUT divergence and surfaces where it should. (Recognising such relational preconditions on body inputs is a separate concern from #155.)

Other skipped categories

Non-enum transition field. If field: <name> resolves to a type other than an enum (or alias of an enum), the entire TransitionDecl is skipped; illegal-from enumeration is undefined for non-enum status fields.
Unknown via operation. A via X whose X has no matching operation declaration is skipped. Spec lint should catch this; the skip is a defensive backstop.
Multi-path or zero-path via operation. Transition tests need exactly one path parameter to identify the seeded entity; nested-path shapes (e.g., /orders/{order_id}/items/{item_id}) and zero-path shapes are skipped pending multi-entity seed orchestration.
Non-generable body or query input. If Strategies.expressionFor(...) returns Skip for any non-path input (most commonly Map/Relation-typed bodies), the via is skipped with a reason naming the input and the un-generable type.

Stateful tests (M5.2)

Alongside the per-clause behavioral tests, testgen emits one tests/test_stateful_<service>.py containing a Hypothesis RuleBasedStateMachine that exercises multi-step operation sequences against the live SUT. Each spec operation becomes an @rule performing the real HTTP call. Entity ids returned from Create-classified operations flow into other rules through Bundles, so a Resolve rule receives a code that was just produced by a Shorten rule, and a Delete rule consumes the code from the bundle. Global invariants run after every step against /__test_admin__/state.

class UrlShortenerStateMachine(RuleBasedStateMachine):
    url_mapping_ids = Bundle("url_mapping_ids")

    @initialize()
    def _reset(self):
        client.post("/__test_admin__/reset")

    @rule(target=url_mapping_ids, url=strategy_long_url())
    def shorten(self, url):
        response = client.post("/shorten", json={"url": url})
        assert response.status_code == 201, response.text
        return response.json()["code"]

    @rule(code=url_mapping_ids)
    def resolve(self, code):
        response = client.get(f"/{code}")
        assert response.status_code == 302, response.text

    @rule(code=consumes(url_mapping_ids))
    def delete(self, code):
        response = client.delete(f"/{code}")
        assert response.status_code == 204, response.text

    @invariant()
    def invariant_all_ur_ls_valid(self):
        post_state = client.get("/__test_admin__/state").json()
        assert all(is_valid_uri(post_state["store"][c]) for c in post_state["store"])

TestStatefulUrlShortener = UrlShortenerStateMachine.TestCase

Bundle inference

Operation kind (convention layer)	Role in the state machine
`Create` / `CreateChild` (returns an entity)	`@rule(target=<entity>_ids, ...)`, pushes the response id into the bundle
`Delete` (id-typed input)	parameter via `consumes(<entity>_ids)`, pops the id
Read / Update / Replace / Transition / SideEffect (id-typed input)	parameter via `<entity>_ids` (non-consuming draw)
any operation with no entity-id input or output	parameter-less rule that just exercises the endpoint

The <entity>_ids bundle name follows the entity's snake-cased name (e.g., url_mapping_ids, todo_ids). The id projection from the response uses the entity's primary-key field (id if present, otherwise the first declared field, for url_shortener this is code).

Per-status bundles for `transition`-driven entities (#153)

When an entity has a TransitionDecl over an enum field AND every Create operation deterministically sets the initial status (e.g., todo.status = TODO in the ensures), testgen splits the single <entity>_ids bundle into one bundle per enum value:

Entity	Status enum value	Bundle name
`Todo`	`TODO`	`todo_todo_ids`
`Todo`	`IN_PROGRESS`	`todo_in_progress_ids`
`Todo`	`DONE`	`todo_done_ids`
`Todo`	`ARCHIVED`	`todo_archived_ids`

Rules then route ids through bundles by status:

Create: @rule(target=<initial-status-bundle>, ...), pushes the new id into the bundle for the status the ensures sets.
Unguarded transition (no when clause): one Python rule per TransitionRule, named <via>_from_<from>_to_<to>. Decorator is @rule(target=<to>_bundle, <id>=consumes(<from>_bundle)). The body asserts strict success and returns the id, Hypothesis transfers it from <from> into <to>. Including _to_<to> in the function name keeps each rule unique even when several rules share the same via and from but differ in to. An operation that fires from multiple from-states (e.g., Archive from both TODO and DONE) emits one rule per source bundle (archive_from_todo_to_archived, archive_from_done_to_archived).

Guarded transition (has a when clause that bundle membership cannot guarantee): same consumes+target shape, but the body branches:

if response.status_code == <success>:
    return id              # SUT accepted -> id moves from -> to
elif 400 <= response.status_code < 500:
    return multiple()      # guard failed -> id is consumed and dropped (no move)
else:
    assert False, ...

multiple() with no arguments tells Hypothesis "no result"; the id leaves the source bundle without being added to the target, keeping bundle/SUT bookkeeping consistent.

Read / Update / Replace / SideEffect with an id parameter:
- If requires is just <id> in <state> (membership only), the rule draws from st.one_of(<all-status-bundles>) and asserts strict success, bundle membership fully covers the precondition by construction.
- If requires includes <id> in <state> and <state>[id].<transition-field> (= X | in {X, Y, ...}), the rule draws from st.one_of(<bundle X>, <bundle Y>, ...) and is strict. Repeated conjunctive restrictions on the same field are intersected (AND semantics).
- If requires includes any unrecognized non-key conjunct (or a state[id].field access on a field that isn't the entity's transition field), the rule draws from st.one_of(<all-status-bundles>) non-consuming and falls back to loose assertion.
Delete:
- With a recognized status restriction that narrows to a single bundle, consumes(<bundle>), strict.
- With just <id> in <state> membership, a multi-bundle restriction, or any unrecognized requires conjunct, non-consuming st.one_of(...) + loose. (consumes(st.one_of(...)) is not supported by Hypothesis, and consuming an id that the SUT may reject would leak bundle/SUT state.)

Entities without a TransitionDecl over an enum field, or whose Create operation doesn't deterministically set an initial status, fall back to the legacy single-bundle shape. url_shortener and safe_counter outputs stay byte-identical.

Strict vs loose status assertions

For each rule, testgen decides whether to assert the success status strictly or to also accept any 4xx (the SUT rejecting because its own preconditions failed):

Strict (assert response.status_code == <success>): used for Create rules, unguarded transition rules (bundle membership guarantees the precondition by construction), and rules whose every requires clause is either trivially true or a <input> in <state> pattern that is satisfied by construction via bundle membership.
Loose (if-elif-else, accepting <success> or 400..499): used for rules with any other state-dependent precondition (e.g., transition guards like todos[id].status = TODO that aren't recognised by the per-status splitter, or when-clause time guards like updated_at > completed_at). A 4xx response there is the SUT correctly rejecting an unsatisfied precondition, not a bug.

Settings

Settings are attached to the state machine's TestCase (which the TestStateful<Service> alias refers to):

<Service>StateMachine.TestCase.settings = settings(
    max_examples=25,
    stateful_step_count=20,
    deadline=None,
    suppress_health_check=[HealthCheck.too_slow, HealthCheck.function_scoped_fixture],
)
TestStateful<Service> = <Service>StateMachine.TestCase

deadline=None is required because every step performs an HTTP round-trip; a per-step deadline trips on slow CI machines.

What stateful does not do (in M5.2)

It does not maintain a Python-side ghost model of state. The SUT, accessed via /__test_admin__/state, is the model. This trades shrink-output detail for emitter simplicity and avoids re-translating every ensures clause as a model update.
Per-rule ensures-asserts are handled by the behavioral tests, not duplicated here. Stateful surfaces multi-step bugs via the @invariant() checks that run after every step.

Structural tests (M5.3)

Alongside the per-clause behavioral tests and the hand-rolled stateful state machine, testgen emits a tests/test_structural_<service>.py that wires Schemathesis to the project's openapi.yaml. Schemathesis fuzzes every (method, path) declared in the schema and validates responses against the OpenAPI shape, catching wrong status codes, missing endpoints, content-type drift, and 5xx unhandled exceptions. The structural file is orthogonal to M5.2: the M5.2 stateful tests verify the spec contract; the M5.3 structural tests verify the schema contract.

The generated file is self-contained: it reads SPEC_TEST_BASE_URL and SPEC_TEST_PROFILE from the environment, declares the three profiles inline (see Profiles below), and validates SPEC_TEST_PROFILE against the allowed set with an explicit ValueError (no opaque KeyError):

import os
import schemathesis
from hypothesis import HealthCheck, settings
from tests.conftest import client

BASE_URL = os.environ.get("SPEC_TEST_BASE_URL", "http://localhost:8000")
PROFILE = os.environ.get("SPEC_TEST_PROFILE", "thorough")
PROFILES = {
    "smoke":      {"max_examples": 10,   "stateful_step_count": 3},
    "thorough":   {"max_examples": 100,  "stateful_step_count": 10},
    "exhaustive": {"max_examples": 1000, "stateful_step_count": 25},
}
if PROFILE not in PROFILES:
    raise ValueError(f"Invalid SPEC_TEST_PROFILE={PROFILE!r}. Expected one of: {', '.join(sorted(PROFILES))}")
_PROFILE = PROFILES[PROFILE]

schema = schemathesis.openapi.from_path("openapi.yaml")

def _check_invariant_metadata_consistent(response, case):
    """invariant metadataConsistent: dom(store) = dom(metadata)"""
    if response.status_code >= 500:
        return
    post_state = client.get("/__test_admin__/state").json()
    assert set(post_state["store"].keys()) == set(post_state["metadata"].keys()), \
        "invariant violated: metadataConsistent"

_ALL_CHECKS = (_check_invariant_all_ur_ls_valid,
               _check_invariant_metadata_consistent,
               _check_invariant_click_count_non_negative)

@schema.parametrize()
@settings(max_examples=_PROFILE["max_examples"], deadline=None,
          suppress_health_check=[HealthCheck.too_slow, HealthCheck.function_scoped_fixture])
def test_api_structural(case):
    client.post("/__test_admin__/reset")
    response = case.call(base_url=BASE_URL)
    if _ALL_CHECKS:
        case.validate_response(response, checks=_ALL_CHECKS)
    else:
        case.validate_response(response)

UrlShortenerLinksStateMachine = schema.as_state_machine()
TestStructuralLinksUrlShortener = UrlShortenerLinksStateMachine.TestCase

Profiles

A single env-var, SPEC_TEST_PROFILE, picks the depth tier:

Profile	`max_examples`	`stateful_step_count`	Use case
`smoke`	10	3	pre-commit hook, fast feedback
`thorough` (default)	100	10	CI on PRs
`exhaustive`	1000	25	nightly / release gate

SPEC_TEST_PROFILE=smoke pytest tests/test_structural_*.py -v

Spec-derived custom checks

For each global invariant that ExprToPython can translate, testgen emits one def _check_invariant_<name>(response, case) function. The check fetches /__test_admin__/state and asserts the translated predicate. Checks gate on response.status_code < 500 so 5xx unhandled exceptions surface via Schemathesis's own not_a_server_error check rather than a stale state read.

For Create operations whose ensures clauses reference only inputs and outputs (no pre(), no state, no '), testgen emits per-operation checks gated by (case.path, case.method, response.status_code), these run only when Schemathesis hits the matching endpoint with a successful status code. Clauses that don't fit this shape are recorded under the new structural_skipped array of tests/_testgen_skips.json.

Reset between cases

Before every parametrized case, client.post("/__test_admin__/reset") runs so global invariant checks stay meaningful across hundreds of fuzzed cases. The cost is one extra round-trip per case (~1 ms locally). The Links state machine, in contrast, fires Hypothesis's per-scenario __init__ on its own, no manual reset hook needed.

Two state machines, by design

The Schemathesis-built as_state_machine() infers links from response shapes and parameter-name matches against the OpenAPI document. The M5.2 hand-rolled <Service>StateMachine infers them from the spec's classified operations (Create $\to$ Bundle.target, Delete $\to$ consumes, etc.). Both run; Schemathesis fuzzes shapes the spec layer doesn't (e.g., body fields just outside the where constraint).

What structural does not do (in M5.3)

It does not emit explicit OpenAPI links: blocks; Schemathesis discovers links via Location headers and parameter-name heuristics. Explicit Links emission is tracked separately.
Entity-field where invariants (e.g., len(value) >= 6) are not translated to custom checks; Schemathesis already validates them as part of OpenAPI schema conformance at request and response time.
The structural skip rate on rich specs is high (~70-80% of ensures clauses reference state or pre()/primes); that is by design, behavioral and stateful layers cover those.

Conformance runner (M5.4)

tests/run_conformance.py is a static Python orchestrator that runs the three layers sequentially against an already-running service, emits JUnit XML per phase, and aggregates a single exit code. The Makefile wraps it with two targets:

# Service must already be reachable at SPEC_TEST_BASE_URL
make test-conformance PROFILE=smoke

# Brings up docker compose, waits for /health, runs, tears down
make test-conformance-docker PROFILE=thorough

Phase order is structural $\to$ behavioral $\to$ stateful. Each phase begins with POST /__test_admin__/reset (on top of the per-case resets in the test files themselves) so a wedged state from one phase cannot poison the next. The phase glob is expanded inside the runner; pytest receives concrete paths, so pytest's own collection-error path doesn't trigger when a phase has no matching files (a service with no entities legitimately has no stateful tests).

Profile	`max_examples`	Use case
`smoke`	10	pre-commit / fast feedback
`thorough` (default)	100	CI on PRs
`exhaustive`	1000	nightly cron

JUnit XML lands under results/<phase>-<profile>.xml. The generated GitHub Actions workflow (.github/workflows/ci.yml) calls make test-conformance after starting the service, picks the profile from the event type (exhaustive on schedule, thorough otherwise), and uploads results/ as an artifact. Cron is wired at 0 2 * * * for the nightly exhaustive run.

Exit codes:

0, all three phases passed
1, at least one phase failed
2, service unreachable, admin router disabled, or invalid PROFILE

The generated docker-compose.yml forwards ENABLE_TEST_ADMIN from the host environment to the app container, so ENABLE_TEST_ADMIN=1 docker compose up is sufficient, no manual edit required (the M5.1 caveat in the quick-start above is historical; M5.4 plumbs this for you).

Sensitive fields (M5.8)

Spec fields whose names match SensitiveFields.isSensitive (exact: password, password_hash, secret, token, api_key; suffix: _hash, _secret, _password, _api_key, _token) get sensitivity-aware handling at every stage of the generated suite:

Hypothesis behavioral tests. Sensitive operation inputs are wrapped in redact(...) from tests/redaction.py. The wrapper returns a _RedactedStr (a str subclass with __repr__ masking), so failing-example output shows password=<redacted len=N> instead of the raw value. JSON serialization is unaffected; the wire format is identical to a plain string.
Schemathesis structural tests. A @schemathesis.hook before_call registers itself with a _SENSITIVE_BODY_FIELDS frozenset scoped to the operations in the current spec; sensitive body fields are wrapped in _RedactedStr before the request is sent, so failing-case output is masked.
Service-side logs. app/redaction.py configures structlog with a redact_sensitive processor in the chain. Every log line (application code, uvicorn access logs, uvicorn errors) runs through the same processor; sensitive keys at any depth in the structured event dict become ***REDACTED***. The processor is a pure function; tests in tests/test_log_redaction.py cover it directly.

Per-field overrides

A test_strategy convention rule overrides the default redaction policy for a specific operation input or entity field:

service Demo {
  // ...
  conventions {
    Register.password.test_strategy = "live"      // send constraint-derived live values
    User.password_hash.test_strategy = "redacted" // force the placeholder

    // Equivalent string-qualifier syntax (mirrors http_header):
    // Register.test_strategy "password" = "live"
  }
}

Resolution: <Operation>.<input>.test_strategy wins over <Entity>.<field>.test_strategy, which broadcasts to every operation whose input shares the field name. The validator rejects unknown field qualifiers, missing qualifiers, values other than "live" / "redacted", conflicting entity-field overrides for the same field name across different entities (would resolve ambiguously), and duplicate target+qualifier+property rules.

live => unwrap (no redact, no placeholder; emits the regular constrained strategy). redacted => force st.just("***REDACTED***") regardless of constraints. Bare, unconstrained sensitive types under redacted may fail server-side input validation (the placeholder is short); pair with a constrained type alias if your invariants demand a length/regex.

Coverage on canonical fixtures

ExprToPython covers every Expr constructor in the IR, no translator coverage gaps remain. The remaining ExprPy.Skip outputs split into (i) user errors in the spec (unbound identifier, undeclared function, wrong arity, reserved-name binding), (ii) typed Quantifier bindings, and (iii) parser-side gaps (currently the multi-clause let ... in body in ecommerce). These numbers are locked in SkipRateProbeTest.scala; CI fails if they regress.

Fixture	Total clauses	Translated	Skipped	Skip rate	Remaining-skip reason
`safe_counter.spec`	5	5	0	0.0%	,
`url_shortener.spec`	21	21	0	0.0%	,
`todo_list.spec`	50	50	0	0.0%	,
`edge_cases.spec`	29	29	0	0.0%	,
`ecommerce.spec`	76	74	2	2.6%	parser scope leak, multi-clause `let ... in` body, follow-up
`auth_service.spec`	40	34	6	15.0%	spec calls `hash()` and `recentFailedAttempts()` without declaring them as `function`/`predicate`; treated as user errors

What `_testgen_skips.json` looks like

{
  "version": 1,
  "service": "UrlShortener",
  "strategies_skipped": [],
  "behavioral_skipped": [
    {"operation": "Shorten", "kind": "requires[0]", "reason": "M5.1: only `<input> in <state>` and `<state>[input].<enum-field> = <value>` requires patterns get negative tests"},
    {"operation": "Resolve", "kind": "ensures", "reason": "state-dependent precondition; positive ensures/invariant tests are covered by stateful tests (M5.2) for ops bundled into the state machine; the single-shot behavioral test would need explicit pre-seeding"},
    {"operation": "Resolve", "kind": "invariant[allURLsValid]", "reason": "state-dependent precondition; positive ensures/invariant tests are covered by stateful tests (M5.2) for ops bundled into the state machine; the single-shot behavioral test would need explicit pre-seeding"}
  ],
  "structural_skipped": [
    {"operation": "Shorten", "kind": "structural_ensures[0]", "reason": "ensures references pre()/prime(), covered by behavioral/stateful layers"},
    {"operation": "Shorten", "kind": "structural_ensures[2]", "reason": "ensures references state field, covered by stateful invariants"}
  ]
}

The behavioral_skipped array enumerates clauses that the test generator could not turn into a runnable test. Every entry signals one of: (a) state-machine setup is needed before the precondition is reachable for a non-transition operation (no via rule covers it; the remaining gap on GetTodo-shaped operations); (b) the requires shape is outside the recognized negative-test pattern; (c) a guarded transition rule whose guard sits outside the recognizer's recognized shapes (ordered comparison / enum or literal equality / not-none / conjunction of those); (d) a real user error in the spec.

Runtime requirements

The generated service must be running and reachable at SPEC_TEST_BASE_URL (default http://localhost:8000).
The service must have started with ENABLE_TEST_ADMIN=1. Without it, the conftest fixture skips the whole suite with a clear message rather than failing weirdly.
Setting SPEC_TEST_INPROC=1 flips conftest.py to use FastAPI's TestClient against an in-process import of app.main:app instead of going over HTTP to SPEC_TEST_BASE_URL. Each pytest invocation re-imports the app fresh, required for mutation testing (see Mutation Testing) where the runner has to observe edits to app/. The admin router is auto-enabled in this mode (the conftest unconditionally sets ENABLE_TEST_ADMIN=1 before importing app.main). Postgres (or whatever DATABASE_URL points at) is still required because the in-process app makes real DB calls.
By default, compile emits service stubs that raise NotImplementedError for any operation classified LLM_SYNTHESIS. Pass --with-synthesis to invoke the M6 CEGIS loop (#27) and substitute verified Dafny output into those stubs. Runtime invariant and temporal-property enforcement (#86) ships the @invariant() observers and x-temporal extensions the assertions rely on.

Custom strategies (M5.6)

Strategies.scala synthesizes a Hypothesis strategy from each TypeAliasDecl's where clause, recognised shapes are length bounds, regex (matches /.../), numeric range, and the is_valid_uri / is_valid_email predicate filters. Anything outside that set lands in _testgen_skips.json under strategies_skipped, with the synth falling back to an unconstrained base strategy (st.text(), st.integers()).

To override the synthesized strategy with a hand-written one, declare a strategy convention rule whose value is a module:symbol reference:

service AuthService {
  type Email = String where value matches /^[^@]+@[^@]+\.[^@]+$/
  type PasswordHash = String where len(value) = 64

  conventions {
    Email.strategy = "tests.strategies_user:valid_email"
    PasswordHash.strategy = "tests.strategies_user:strong_password_hash"
  }
}

Then implement the symbols in tests/strategies_user.py (stub emitted on first compile, preserved across re-compile):

from hypothesis import strategies as st

def valid_email():
    return st.from_regex(r"^[^@]+@[^@]+\.[^@]+$", fullmatch=True)

def strong_password_hash():
    return st.text(alphabet="0123456789abcdef", min_size=64, max_size=64)

The generated tests/strategies.py automatically imports the symbols and routes def strategy_<type>(): return <symbol>() to the override. Downstream test files (test_behavioral_*.py, test_stateful_*.py) keep calling strategy_<type>() unchanged, only the body of that function is swapped.

`--strict-strategies` CI gate

Pass --strict-strategies to compile --with-tests to fail the build if any type alias or enum has an incomplete synthesized strategy AND no convention override is registered. Use it as a CI guard against silently-broken strategies:

sbt "cli/run compile --with-tests --strict-strategies --out /tmp/svc fixtures/spec/auth_service.spec"
# Exit 1 if any strategy falls back to st.text() / st.integers() / st.nothing() with no override

Two synthesis gaps trip the gate:

Unhandled where constraint, e.g., type Code = String where custom_pred(value) produces st.text() plus a # testgen-skip: note. The synthesized strategy is too permissive (custom_pred is not enforced).
Unsupported base type, e.g., type Mapping = Map<String, Int> falls through to st.nothing(). The synthesized strategy is unusable.

Both cases land in _testgen_skips.json's strategies_skipped array and become compile errors under --strict-strategies.

Built-in predicates (M5.10)

isValidURI and isValidEmail are not hardcoded in Scala. They live in parser/resources/specrest/parser/preamble.spec as ordinary spec-language predicate declarations:

service Preamble {
  predicate isValidURI(s: String) = s matches /^https?:..[^\s]+/
  predicate isValidEmail(s: String) = s matches /^[^@\s]+@[^@\s]+\.[^@\s]+$/
}

The parser auto-merges this preamble's predicates into every parsed ServiceIR, so user specs can call isValidURI(x) or isValidEmail(x) without declaring them. A user-declared predicate with the same name takes precedence.

What this enables

Adding a new built-in predicate is one line in preamble.spec, no Scala edit, no JSON registry, no helper-import wiring.
The verifier sees the predicate body (a regex match), not an opaque function. Z3 still reasons about the matches_<hash> declaration as uninterpreted (same proving power as before), but the framework no longer special-cases predicate names.
Test-gen synthesises a tighter strategy for type aliases that use a preamble-style s matches /regex/ predicate: instead of st.text().filter(lambda v: is_valid_uri(v)), the generator inlines the regex and emits st.from_regex("^https?:..[^\s]+", fullmatch=True). Hypothesis can hit the property directly rather than rejection-sampling.
Adding a custom predicate in a user spec works identically: predicate isValidUUID(s: String) = s matches /^[0-9a-f-]{36}$/ and the same pipeline applies (Strategy inlines the regex, ExprToPython renders is_valid_uuid, the verifier treats it as a Bool predicate).

Limitations

The preamble's isValidURI is a regex approximation, not Python's urlparse-based check. For property-based testing this is symmetric (the strategy generates URI-like strings; the assertion checks the same regex). The spec's matches /regex/ literal cannot contain a literal /, character classes ([/]) and lexer escapes are also unsupported. Predicates that need non-regex semantics (e.g. opaque urlparse(s) strictness) would require a separate extern predicate mechanism, out of scope here.

Limitations and follow-up issues

For the live test-generation follow-up backlog (state-dependent preconditions on non-transition ops, GuardSatisfier extensions, M5.11 multi-target test gen, M5.12 default --with-tests), see Roadmap $\to$ Test generation follow-ups.

On this page