spec_to_rest
Research

Translator Soundness, Isabelle/HOL Proof Track

Edit on GitHub

Master doc for the spec_to_rest translator-soundness program. Scoping, governance, status, profile, runway, and milestone progress for proving the IR → Z3 verification path sound in Isabelle/HOL.

Last updated:

Status banner (2026-05-04): pivoted to Isabelle/HOL. The proof track migrated from Lean 4 to Isabelle/HOL via issue #193. The universal soundness theorem closes with zero sorry in Isabelle, Code_Target_Scala productionizes the extraction, and the Lean track at proofs/lean/ was retired. Per-run cert infrastructure was deleted post-pivot (vestigial after universal soundness, Option 2 of #193 review).

Most of this document is pre-pivot context, the comparative analysis in §7, the prior-art survey in §4, and the milestone decomposition in §8 still reflect the original Lean-era framing. They're kept for historical clarity. Sections updated for the post-pivot state are marked with a "post-#193" tag.

Master document for the translator-soundness program. Originally six docs (10/11/12/13/14/15); consolidated here in 2026-05-02 after the M_L.2-closure + M_L.4.a-g shipped batch. Anchors issue #88 (translator soundness), umbrella #170, and execution-chain #126#130.


Table of contents

  1. Status and Framing
  2. The Trust Chain Today
  3. Why "Reconstruct Z3 Proofs" Does Not Work in 2026
  4. Prior Art: 2024-2026 Snapshot
  5. Two Paths and Why We Recommend Both, Sequenced
  6. The Verified Subset
  7. Picking a Proof Assistant
  8. Milestone Decomposition
  9. Risks
  10. Non-Goals
  11. References
  12. Governance, proof-owned surfaces and change process
  13. Live status ledger
  14. Proof-Safe Profile and Backend Contract
  15. Runway and Stall Policy
  16. Activation Record and Kickoff Shape

1. Status and framing

Issue #88 asks for a mechanically checked correctness proof of spec_to_rest's spec → IR → Z3 verification path. The issue itself flags this as unscheduled, research-flavored, easily one person-year of work, filed primarily to give the capability a concrete home so it isn't absorbed into other milestones.

This doc updates that framing for 2026:

  • The Z3-proof-replay path that #77 partly assumed has not materialized. Z3 4.13 (the version this project pins via z3-turnkey) emits only its undocumented 2008-era natural-deduction term tree; :proof_format alethe was never shipped in any Z3 release, and quantifier instantiations remain opaque (see §3).
  • The closest published prior art (Cohen, Princeton PhD, 2025) is a 5-person-year effort that verifies five Why3 IVL transformations in Coq, and even Cohen left monomorphization, the SMT-LIB printer, and end-to-end SMT soundness in the trusted computing base (TCB).
  • A cheaper, more recent template exists: per-run translation validation à la Parthasarathy et al. (PLDI 2024 / POPL 2025) emits an Isabelle proof certificate for each verifier run showing "if the IVL output is correct then the source is correct." This is the realistic 2026 path for a project our size.

The recommendation in this doc is therefore: decompose #88 into a tractable five-milestone plan (M_L.0 → M_L.4) that ships translation validation first, then moves toward meta-soundness only for the verified subset (§6) once a contributor signs up for the M_L.2 commitment.

As of M_G.0, the first honest theorem target is the in-memory ServiceIR → Z3Script path used by Consistency.runConsistencyChecks, rather than the optional SmtLib.scala exporter. The exporter stays outside the first ship claim until a later milestone.

Status: execution track shipped. M_G.3 committed the runway, M_G.4 activated M_L.*, and the post-pivot Isabelle/HOL track delivered through Phase 7 (see STATUS.md). Issue #88 closed 2026-04-26 with all M_L sub-issues (#126#130) resolved. The universal soundness theorem closes with zero sorry, Code_Target_Scala produces the production translator at ir/generated/SpecRestGenerated.scala, and the A8RoundTripOracleTest exercises every in-subset shape end-to-end. Per-run certificate emission was deleted post-pivot (vestigial after universal soundness, Option 2 of #193 review).


2. The trust chain today

Today, "we proved this spec correct" is shorthand for the following chain:

English-prose semanticsdocs/content/docs/spec-language.mdx Parser CSTmodules/parser/.../Parse.scala IR Buildermodules/parser/.../Builder.scala Typed IRmodules/ir/.../Types.scala (25 Expr cases) Z3 Translatormodules/verify/.../z3/Translator.scala (1917 LOC) Alloy Translatormodules/verify/.../alloy/Translator.scala (429 LOC) Z3 4.13via z3-turnkey JNI sat / unsat / unknown

Each link is a potential silent-failure point:

LinkFailure mode
Prose semantics → IRSpec language has only English-prose semantics; nothing to refine the IR builder against.
IR builderHand-written; tested via fixtures, rather than proven.
Z3 Translator1917 LOC of Scala. 13 of 25 Expr cases are fully translated, 8 partial, 4 raise TranslatorError. The encoding choices for entities (uninterpreted sort + field functions), state (pre/post functions), and quantifier domains are defensible but unverified (see codebase-analysis appendix below).
Z3 itselfHas had soundness CVEs historically. Pinning at 4.13 deters silent verdict flips on upgrade but does not eliminate the trust assumption.

Mechanically verifying every link is far beyond a project our size. The smallest useful target is the IR → Z3 step: it's the largest hand-written piece, the one whose semantic behaviour we control, and the one closest to user-visible verdicts.


3. Why "reconstruct Z3 proofs" does not work in 2026

Issue #77 (closed) sketched a "verify-as-gate" path that gestured at proof export and Alethe-via-Z3 as a future direction. Research as of April 2026 says that path is blocked at the Z3 side:

3.1 Z3 has no alethe export

Direct search of the Z3 release notes finds no mention of "alethe" in any release. Issue search alethe repo:Z3Prover/z3 returns zero results. The release notes mention only:

  • 4.11.2: "change proof logging format for the new core to use SMTLIB commands. The format was so far an extension of DRAT used by SAT solvers"
  • 4.12.0: "sat.smt.proof.check_rup ... apply forward RUP proof checking"

Z3 4.13 (the version pinned by z3-turnkey % 4.13.0.1 in build.sbt) emits only its undocumented IWIL 2008 natural-deduction proof format. Quantifier instantiation steps (quant-inst) appear with no machine-readable witness justification, the very steps that dominate spec_to_rest's preservation checks (5 invariants × 10 ops = 50 quantifier scopes per service).

3.2 cvc5-Alethe does not cover datatypes

The cvc5 Alethe documentation says verbatim:

Currently, the theories of equality with uninterpreted functions, linear arithmetic, bit-vectors and parts of the theory of strings (with or without quantifiers) are supported in cvc5's Alethe proofs.

Datatypes are not in this list. spec_to_rest's IR translator emits declare-datatypes for entity records, sums, and option types (see codebase analysis). A cvc5-as-proof-certifier backend therefore requires re-engineering the SMT encoding to be datatype-free (records as parallel UF arrays, sums as tag+payload via UF), a non-trivial rewrite, not a flag flip.

3.3 Quantifier-instantiation proof bloat

E-matching with non-trivial trigger sets routinely yields 10³–10⁵ ground instances per quantifier (see Reynolds, SMT 2023 and DSLab "Conjecture Regarding SMT Instability"). For a typical preservation suite, expect Alethe proof files in the tens to hundreds of MB, with Carcara check times in minutes. This is workable for one-off audits, prohibitive for CI-on-every-PR.

3.4 What this means

The "Z3 emits a checkable proof + Carcara/ITP replays it" architecture from #77 is not viable in 2026. Three implications:

  1. Translator soundness must be proven as a meta-theorem about our translate function, rather than by replaying each Z3 run's proof.
  2. The Z3 verdict remains an oracle in our trust base. This is the same posture as F*, Dafny, Verus, and Why3-O, none of them check Z3 proofs; they verify the encoder.
  3. An "external solver agreement" CI job is still useful as cheap defense in depth: emit our SMT-LIB, run cvc5 in parallel, alert on disagreement. Doesn't need proof export. Belongs in #77 follow-up, rather than here.

4. Prior art: 2024-2026 Snapshot

A 12-month survey produced the following landscape. Each entry tagged with how applicable it is to spec_to_rest's IR→Z3 translator.

4.1 Cohen, "a foundationally verified intermediate verification language" (princeton phd, 2025)

Closest match to #88. Defines Why3's P-FOLDR (Polymorphic First-Order Logic with Datatypes and Recursion) in Coq, gives it a denotational semantics, and proves five Why3 transformations sound:

  • eliminate_definition (recursive funs → unfolding axioms)
  • elimination of inductive predicates
  • compile_match (the first machine-checked pattern-matching compiler)
  • ADT axiomatization
  • a few smaller passes

Five person-years, single PhD student plus advisor and Sandia mentor. Coq 8.20, extracted to OCaml as Why3-O (plug-compatible with the real Why3 OCaml API).

What stays in the TCB even after this PhD: monomorphization, the SMT-LIB printer, the SMT solver itself.

Pitfalls Cohen called out, all of which apply to us:

  • Well-typedness preservation under context-modifying transformations is harder than soundness.
  • Pattern-matching compilation termination needs a non-obvious well-founded measure.
  • Mixed record-inductive types fight Coq's positivity checker.
  • ADT axiomatization (with non-uniform constructors and metadata) is the single hardest transformation.
  • The semantics layer (the formal language definition) absorbed more time than the translator transformations.

Applicability to spec_to_rest: the same architectural skeleton transfers, we'd build a deep-embedded Expr ADT in Lean/Coq/Isabelle, a denotational semantics, and prove eval e = smtEval (translate e) for our subset. We would not attempt to match Cohen's depth (polymorphism, recursive funs, inductive preds); the verified subset (§6) is intentionally smaller.

4.2 Parthasarathy et al., "towards trustworthy automated program verifiers" (PLDI 2024)

Different design: instead of proving the language-to-IVL translator sound once and for all, emit an Isabelle proof certificate for each verifier run that shows "if the IVL program is correct then the source program is correct." arXiv 2404.03614.

POPL 2025 sequel (Formal Foundations for Translational Separation Logic Verifiers) extends to Viper-style separation logic.

Why this matters: per-run translation validation is dramatically cheaper than per-language meta-soundness, no need to formalize the entire source language semantics, only enough to express each instance. The certificate kernel can be small (hundreds of LOC), checkable quickly, and updates automatically as the translator evolves.

Trade-off: certificates cover only inputs we've actually translated. Meta- soundness covers the universe of well-typed inputs.

4.3 Lean-smt and Isabelle/HOL alethe reconstruction

lean-smt (CAV 2025): a Lean 4 tactic that translates a Lean goal to SMT-LIB, hands it to cvc5 (not Z3), and replays the resulting Alethe proof in the Lean kernel. ~71% of cvc5 proofs reconstruct (15,271 of 21,595 benchmarks); 98% of successful reconstructions complete in <1s.

Isabelle Alethe pipeline (ITP 2025): extends Sledgehammer's veriT-only reconstruction (Schurr/Fleury, 2019) to also support cvc5. Five years on, both veriT and cvc5 reconstruct in Isabelle/HOL.

Wrong direction for us. These tools answer: given a goal in Lean/Isabelle, can we discharge it via SMT and replay the proof? Our question is the inverse: given our hand-written translator from spec-IR to SMT-LIB, is it sound? lean-smt and Sledgehammer would only help us as authoring tools for the meta-soundness proof, we can use them to discharge sub-lemmas, rather than to validate the translator.

Z3 is also absent from both pipelines: Sledgehammer's Z3 oracle uses the legacy proof format which is unmaintained, and lean-smt is cvc5-only.

4.4 Verified VCG for Dafny (nezamabadi/myreen/tan, dec 2025)

arXiv 2512.05262. Big-step semantics for an imperative Dafny subset (mutually-recursive methods, while loops, arrays, no records, sets, quantifiers, or partial functions) plus a verified VCG plus a verified compiler to CakeML. HOL4. The VCG produces verification conditions in HOL4's logic, not SMT-LIB, they short-circuit the SMT step entirely.

Useful as a template for the VCG side (preservation obligations as HOL/Lean propositions), less useful for the SMT-encoding side because they don't generate SMT.

4.5 F* → SMT encoding (aguirre/hriţcu, 2016 → ongoing)

Negative result. Towards a Provably Correct Encoding from F* to SMT formalized a fragment in Coq; ten years later no completion. Reason: refinement types + monadic effects too rich for the formalization to remain tractable.

Lesson for us: keep the source language rigidly small. Resist scope creep into the rich corners of the IR (lambdas, set comprehension, the-operator) until the simple core works.

4.6 No mechanized Alloy or Dafny surface semantics

Search did not surface a Coq/Isabelle/Lean formalization of Alloy's relational semantics (Daniel Jackson, Software Abstractions) nor of Dafny's reference manual. Astra (UWaterloo, 2019) evaluated Alloy→SMT-LIB empirically but did not prove it. The most actively developed comparable formalization is TLA* in Isabelle/HOL by Grov & Merz (AFP, 2011-2025), shallow-embedded, covers the temporal layer, rather than a translator to SMT.

Lesson: any spec language semantics we mechanize is novel work.

4.7 Older prior art worth knowing

  • CompCert (~100 kLOC Coq, 6 person-years) sets the upper bound on this kind of work and is the architectural inspiration.
  • AliveInLean (CAV 2019) verifies a peephole-optimization checker for LLVM in Lean 4, closest to "verified SMT encoder" in the LLVM space, but limited to bit-vector and array peepholes.
  • HOL-Boogie (TPHOLs 2008) embeds Boogie's output into Isabelle/HOL for interactive VC discharge, does not verify the translator itself.
  • SMTCoq (smtcoq.github.io) verifies proofs returned by SMT solvers (veriT/CVC4/cvc5), rather than encodings into them. Quantifier-free. Different problem.

4.8 Closest-prior-art summary table

ProjectSourceProverWhat they provedPitfall
Cohen, Why3-in-Coq (2025)Why3 P-FOLDRCoq5 IVL transformations sound5 person-years; monomorphization & SMT-LIB printer in TCB
Nezamabadi et al., Dafny VCG (2025)Dafny subsetHOL4VCG sound; CakeML compiler correctBypasses SMT; no records/sets/quantifiers
Parthasarathy et al., Trustworthy Verifiers (2024-25)ViperIsabellePer-run forward-simulation certCert covers only translated inputs, rather than all inputs
Aguirre/Hriţcu, F*→SMT (2016-)F* fragmentCoqSoundness of fragment encodingStalled 10+ years; F* too rich
AliveInLean (2019)LLVM IR peepholesLean 4Encoder verifiedBV/array only
Grov/Merz TLA* in Isabelle (2011-25)TLA*IsabelleEmbedding + derived rulesNo translator to SMT

5. Two paths and why we recommend both, sequenced

Per-run certificates following Parthasarathy 2024. For each verifier run, emit a proof object (in Lean/Isabelle/Rocq) showing that for this specific IR, the SMT output is a correct encoding.

Cost: ~3-6 person-months for the certificate kernel + emission glue. The certificate kernel is small (hundreds of LOC) and stable; the per-run emission is mechanical (one proof step per translate case).

Coverage: only inputs we've actually translated.

Trust assumption: the certificate-checking kernel + its embedding of the semantic domain.

Why first: ships verifiable trust improvement in months, rather than years. Acts as a forcing function for documenting the IR's intended semantics (which doesn't exist in any machine-checkable form today).

5.2 Path b: Meta-soundness (expensive, optional follow-up)

Cohen-style. Deep-embed Expr and TypeExpr in a proof assistant; define a denotational semantics; define translate as a function in the proof assistant; prove denote_smt(translate(e)) = denote_ir(e) for every well-typed e in the verified subset.

Cost: 6-12 person-months for the verified subset (§6) at expert rates; double-to-triple at non-expert rates. Per A5's analysis: ~1,350 LOC for the semantics layer (M_L.1) and ~3-5× that for the soundness theorem itself (M_L.2).

Coverage: all well-typed inputs in the verified subset.

Trust assumption: the proof assistant's kernel + the embedding's accuracy (deep Expr, shallow semantic domain, see §7.2).

Why deferred: only worth doing if a contributor signs up for the multi-month commitment, and only after Path A has documented the IR semantics rigorously enough to enable it. Without Path A, M_L.1 is starting from scratch on the semantics; with Path A, M_L.1 reuses a Path-A-validated semantic skeleton.

5.3 Why both, sequenced

Path A's certificate kernel is Path B's meta-soundness skeleton, scoped to a single input. Building A first means:

  1. Faster initial trust improvement (months, rather than years).
  2. Forcing-function for writing the IR semantics in a machine-checkable form.
  3. The artifact built in A is reusable: M_L.2's "translation soundness" is the universal-quantifier version of A's per-input certificate.

The alternative, start with B, risks Cohen's outcome at smaller scale: 9 months into the semantics layer with no end-to-end deliverable.


6. The verified subset

Anchored to the codebase analysis. Per April 2026, Translator.scala covers 13/25 Expr cases fully, 8 partially, 4 with TranslatorError. The verified subset for M_L.1 picks the smallest set that exercises the four core SMT pillars:

  • Boolean reasoning (∧, ∨, ⇒, ¬)
  • Linear integer arithmetic (=, <)
  • Uninterpreted predicates (membership in a state relation)
  • Bounded quantification (over enums; over fixed-size entity collections)

6.1 Operators in scope (M_L.1)

Expr caseWhy included
BinaryOp(And, _, _)Boolean conjunction
BinaryOp(Or, _, _)Boolean disjunction
BinaryOp(Implies, _, _)Required for invariants and ensures
BinaryOp(Eq, _, _)On Int and on entity-typed values
BinaryOp(Lt, _, _)LIA comparison
BinaryOp(In, _, _)Membership in a state relation domain
UnaryOp(Not, _)Boolean negation
UnaryOp(Negate, _)LIA negation (0 - x)
Quantifier(All, _, _) over enumsUniversal binding
Let(_, _, _)Local binding
IntLit, BoolLit, IdentifierAtoms

Plus the IR top-level shells:

  • EntityDecl (flat, no inheritance, no per-field constraints)
  • EnumDecl
  • StateDecl with scalar fields and simple domain-typed relations only
  • OperationDecl with requires and ensures (no state mutation in M_L.1; M_L.2 adds Prime/Pre)
  • InvariantDecl (single conjunctive predicate)

6.2 Operators explicitly out of scope (deferred)

ExcludedRationale
BinaryOp(Add|Sub|Mul|Div, ...)Defer to M_L.2; need carrier-set proof for Int
BinaryOp(Subset, ...)State-relation identifier subset is in subset; arbitrary set subset remains deferred
BinaryOp(Union|Intersect|Diff, ...)Closed for set-valued expressions in issue #195
UnaryOp(Cardinality)Currently only on state relations; defer to state-mutation milestone
UnaryOp(Power)Translator already raises TranslatorError (undecidable in FOL), permanent exclusion
Quantifier(_, _, _) over entity collectionsDefer to M_L.2 (needs frame axioms)
SetComprehension, MapLiteral, SeqLiteralOut of scope until collections milestone
SetLiteralClosed for finite literals in issue #195
If, Lambda, Constructor, SomeWrap, The, NoneLitTranslator already raises TranslatorError for most
Index, Call, EnumAccess (dynamic), With, MatchesDefer; require advanced encoding
Prime, PreM_L.2 adds two-state coupling
FieldAccessM_L.2 (records subsumed by entity decls)
TransitionDecl, TemporalDecl, FunctionDecl, PredicateDeclOut of scope; lives in separate temporal/derived-logic milestones

This subset hits ~10 operators, ~20 LOC of Lean per operator for the denotation, and admits an automatic-decidability story (every quantifier in scope is over a finite domain, so eval returns Bool instead of Option Bool for many branches).

6.3 Sort encoding in M_L.1's translator

Mirroring Translator.scala choices, simplified:

  • Bool, Int: SMT-LIB native sorts.
  • Enums: uninterpreted sort + member constants + distinctness axiom. Cardinality finite and known.
  • Entities: uninterpreted sort + per-field accessor function (Entity_field : Entity → FieldSort). No datatype constructors in M_L.1 (avoids cvc5-Alethe blocker; matches Cohen's "ADT axiomatization" pattern).
  • State: tuple of scalar functions (no maps, no relations beyond domain membership predicates) in M_L.1. M_L.2 expands.

7. Picking a proof assistant

7.1 Three candidates compared

CriterionLean 4 + mathlib4Isabelle/HOL + AFPCoq/Rocq
Z3 reconstruction available?No (lean-smt is cvc5-only)No (Sledgehammer Z3 is legacy; ITP 2025 work targets cvc5/veriT)No (SMTCoq targets veriT/cvc5)
Closest prior-art languageAliveInLean (LLVM peepholes)TLA*, Z, Object-Z, Cohen's framework if portedCohen's Why3-O (the Coq variant)
Toolchain churnQuarterly (Lean 4.27→4.28→4.29→4.30 in 14 weeks Feb-Apr 2026)Yearly Isabelle releases; AFP push-throughYearly Coq → Rocq transition; stable post-2025
Long-lived single-author projectsFew (ecosystem <5 years)CryptHOL (9 yrs), TLA* (14 yrs), HOL-Z (25 yrs)CompCert (20 yrs)
Scala-team learning curveLowest (most syntactically similar to Scala 3)Medium (Isar prose-style proofs unfamiliar)Medium-high (tactic style, ssreflect)
Mathlib4 record / Finset supportExcellent (structure, Finset α, decidability)Excellent (record, Set, fset library)Good (Record, MSet, but more boilerplate)
Risk of mathlib churn breaking proofsHighLow (definitional shallow embeddings rarely break)Low

7.2 Decision: Isabelle/HOL (post-#193 pivot)

Active proof assistant: Isabelle/HOL (Isabelle2025-2). The original recommendation was Lean 4 (see §7.4 historical note); the project pivoted to Isabelle/HOL on 2026-05-04 after the M_L.0-M_L.5 first-ship gate landed. Tracking issue: #193.

Why Isabelle won the rematch:

  1. Production-grade Scala extraction: Code_Target_Scala (in HOL-Library) ships Lean's missing piece. The verified translate, eval, and smt_eval functions extract directly to ~1.4 kLoC of idiomatic Scala 3. The Scala layer's translator is no longer hand-written, it is the extracted Isabelle definition.
  2. Toolchain stability: Isabelle releases yearly with AFP push-through migration. Lean 4 ran 4.27 → 4.30 in 14 weeks Feb-Apr 2026, quarterly churn is expensive at single-author scale.
  3. Stronger automation on the proof side: the universal soundness theorem closes in ~600 LoC of Isabelle (apply (cases ...; auto) + per-case *_step lemmas) vs ~5400 LoC of Lean's case-bashing.
  4. Cleaner TCB: Isabelle kernel + Z3 driver + extracted-Scala output. No Lean.ofReduceBool axiom, no cert_decide per-run native compilation.

Active configuration:

  • lean-toolchain retired; proofs/isabelle/SpecRest/lean-toolchain doesn't exist. Isabelle pin lives at the host level (Isabelle2025-2 installed via the system's isabelle binary).
  • No mathlib analog: imports are restricted to Main + HOL-Library. AFP entries are only imported when load-bearing (none currently).
  • Records use [code]-marked defs to bypass the polymorphic-scheme codegen barrier observed during Phase 5 of #193.

7.4 Historical: Lean 4 was the original choice

The pre-pivot recommendation cited Lean 4's smallest learning-curve gap from Scala 3 and the AliveInLean precedent. Both arguments still hold; they were outweighed by the lack of a production Scala extractor in Lean's ecosystem and the quarterly toolchain churn. The Lean track at proofs/lean/ was retired with the pivot, see #193 for the full decision record and #202 for the related (deferred) IR-canonicalization follow-up.

7.3 Embedding shape (locked across all three candidates)

  • Source IR (Expr, TypeExpr, declarations): deep embedding as inductive types mirroring modules/ir/.../Types.scala 1:1. Required because the soundness theorem ∀ e. denote(translate(e)) = eval(e) quantifies over e syntactically.
  • Semantic domain: shallow. Int → Lean Int, Bool → Lean Bool, Set α → Mathlib Finset α (or core List if we sidestep mathlib), entity sorts as opaque type variables.
  • SMT-LIB target: shallow. Interpret SMT terms directly as Prop, no need for meta-reasoning over SMT syntax, only over IR syntax.

This is the standard hybrid Why3-in-Coq, AliveInLean, and Concrete Semantics IMP all use (Gibbons & Wu; Annenkov & Spitters).


8. Milestone decomposition

Five milestones. Each has a contributor sign-off gate before the next opens.

8.1 M_L.0, scope, scaffolding, contributor handoff

Effort: ~1 week part-time. Lands when a contributor signs up for M_L.1.

Deliverables:

  • This research doc (you're reading it).
  • A proofs/lean/ directory with a minimal lakefile.toml, an empty SpecRest namespace, and a one-page README.md linking back here.
  • A pinned lean-toolchain (recommend latest stable Lean 4 release as of M_L.0 start, e.g., leanprover/lean4:v4.30.0).
  • A .github/workflows/lean.yml job that runs lake build. Off the critical path (separate matrix); failure does not block PRs.
  • An initial PR-template note that anyone touching modules/ir/.../Types.scala's Expr ADT should add a TODO entry to proofs/lean/SpecRest/IR.lean.todo.

Acceptance: empty Lake project compiles green in CI.

8.2 M_L.1, IR denotational semantics for the verified subset

Effort: 6-10 person-weeks at expert rates; 12-20 at non-expert.

Deliverables:

  • proofs/lean/SpecRest/IR.lean: deep Expr, TypeExpr, EntityDecl, EnumDecl, StateDecl, OperationDecl, InvariantDecl, restricted to the §6.1 subset.
  • proofs/lean/SpecRest/Semantics.lean: Value ADT, Env, State, eval : Env → State → Expr → Option Value. Per-operator denotation lemmas.
  • proofs/lean/SpecRest/Examples.lean: round-trip tests for safe_counter.spec fragments (parsed by hand for now; M_L.4 wires the parser).

Acceptance:

  • All denotation lemmas closed (no sorry).
  • safe_counter.spec invariant count ≥ 0 evaluates to True under a hand-built initial state.

LOC estimate (per A5 calibration, anchored to Concrete Semantics IMP):

ComponentLean LOC
Inductive Expr + TypeExpr80
Value + Env, State80
eval core150
Quantifier + decidability120
OperationDecl + InvariantDecl80
Per-operator denotation lemmas (10 × ~30)300
Round-trip examples100
Total~900

(A5 reported ~1,350 with collections+records; M_L.1 trims those, dropping ~450 LOC.)

8.3 M_L.2, translator soundness theorem for the verified subset

Effort: 3-5× M_L.1, i.e. ~6-12 person-months.

Deliverables:

  • proofs/lean/SpecRest/Smt.lean: shallow embedding of the SMT-LIB fragment we emit (Bool ops, LIA, UF, bounded quantifiers).
  • proofs/lean/SpecRest/Translate.lean: Lean version of the Scala translator, function-by-function mirror of the §6.1 cases in z3.Translator.scala.
  • proofs/lean/SpecRest/Soundness.lean: theorem ∀ e, well_typed e → eval e = smtEval (translate e).
  • An audit appendix in proofs/lean/README.md mapping each Lean translation case to the corresponding Scala line range for human cross-checking.

Acceptance:

  • Soundness.lean closes with no sorry for the §6.1 subset.
  • A Scala test (modules/verify/.../TranslatorAuditTest.scala) checks that any Expr case marked "in the verified subset" still has a counterpart in Translate.lean.

Trust assumption (after M_L.2):

  • Lean's kernel.
  • The shallow SMT-LIB embedding's accuracy (auditable in <100 LOC).
  • The hand-written Scala translator matches the Lean translate function case-for-case (audited; if it diverges the Scala test fires).
  • Z3 itself.

8.4 M_L.3, translation validation (per-run certificate emission)

Effort: ~3-6 person-months, parallel to M_L.2.

Deliverables:

  • modules/verify/src/main/scala/specrest/verify/cert/Emit.scala: emits a Lean source file <spec>.cert.lean per verify run, containing one theorem cert_<id> : eval ir = smtEval smt_ir := by ... per check.
  • proofs/lean/SpecRest/Cert.lean: tactic library (or simp set) that closes a cert_<id> goal via a fixed proof script when the IR is in the verified subset.
  • verify --emit-cert <dir> CLI flag, mirroring the existing --dump-vc.

Acceptance:

  • For every fixture in fixtures/spec/ whose Expr is in the verified subset, verify --emit-cert /tmp/c && lake build /tmp/c succeeds.
  • Out-of-subset fixtures emit a cert_<id> := sorry placeholder with a comment pointing at the offending case.

Why parallel to M_L.2: M_L.3 only needs M_L.1's semantics, not M_L.2's universal-quantifier theorem. M_L.3 ships earlier trust gain; M_L.2 generalizes it.

8.5 M_L.4, subset expansion

Open-ended. Add Add/Sub/Mul/Div, then collections, then state mutation, then quantifiers over entity collections. Each expansion follows the M_L.1 + M_L.2 template: extend the deep IR, extend eval, extend translate, extend the soundness theorem.

Sequencing recommendation:

  1. LIA arithmetic (Add, Sub, Mul, Div), needed by ~70% of real specs.
  2. State mutation (Prime, Pre, OperationDecl with state updates).
  3. Quantifiers over fixed-size entity collections (with frame axioms).
  4. Finite collections (Set, Map, Seq with bounded ops).
  5. Records / With updates / FieldAccess.
  6. Constructor, EnumAccess (dynamic).

Each item is its own multi-month milestone; punt indefinitely if the §5.1 path-A certificates make it unnecessary for any real consumer.


9. Risks

9.1 Doc rot

This doc anchors to modules/ir/.../Types.scala's 25-case Expr ADT. If Expr evolves (new cases, renamed cases, restructured sums), the verified subset table in §6 drifts. Mitigation: a PR-template note (added in M_L.0) reminding contributors to update §6 when touching Expr. A lint pass in M_L.2 flags Expr cases without a corresponding proofs/lean/SpecRest/IR.lean mirror.

9.2 Z3 verdict drift across versions

Path B (meta-soundness) does not bind us to a Z3 version, the soundness theorem says "if Z3 returns unsat, the IR property holds." Path A (per-run certs) is similarly version-agnostic. However, an "external solver agreement" CI job (running cvc5 in parallel and alerting on disagreement) is cheap defense in depth and belongs in a #77 follow-up. Out of scope here.

9.3 Lean toolchain churn

Lean shipped 4.27, 4.28, 4.29, 4.30 in 14 weeks (Feb-Apr 2026). Each release can break dependent projects. Mitigation: pin lean-toolchain; bump only quarterly; avoid mathlib4 dependency (M_L.1 needs none). The "every 6-12 months, 1 day to bump" maintenance cost is acceptable; the "every 3 weeks, scramble because mathlib reorganized" cost is not.

9.4 Contributor abandonment mid-milestone

The work is part-time, opportunistic. A contributor disappearing mid-M_L.1 strands ~500 LOC of half-finished Lean. Mitigation:

  • M_L.0 only opens the scaffolding when a contributor commits to M_L.1.
  • M_L.1 is broken into 8 phases (per A5 §4) each independently mergeable.
  • A proofs/lean/STATUS.md file tracks per-operator completion so a successor can resume.

9.5 The semantics-layer trap (cohen)

Cohen's PhD spent more time on the semantics layer than on the translator transformations. We mitigate by deliberately picking the smallest non-trivial subset (§6.1) and resisting expansion until M_L.1 + M_L.2 ship. A "verified subset" at 10 operators is more valuable than a half-finished verified subset at 25 operators.

9.6 The f* trap

F*'s SMT encoder has been "open" for 10+ years. Their failure mode: refinement types + monadic effects too rich for the formalization to remain tractable. We avoid this by excluding rich corners of the IR upfront (§6.2 lists every deferred operator with a why) and by treating TranslatorError-raising operators as permanent exclusions (Power, Lambda, If, see §6.2).

9.7 Mismatch between lean translate and Scala translator.scala

Path B proves the Lean translate function sound. Production uses Scala Translator.scala. Mitigation: M_L.2 ships an audit appendix mapping each Lean case to the corresponding Scala line range; TranslatorAuditTest.scala machine-checks that the verified subset's case set in Lean equals that in Scala. If they diverge (Scala adds a case Lean lacks, or vice versa), the test fires.

The full "extract Scala from Lean" approach (Why3-O, Cohen 2025) is out of scope, it would more than triple the M_L.2 cost. We accept the audit-by-test mitigation as good-enough.


10. Non-goals

Inherits #88's non-goals, plus:

  • Verifying the Alloy translator. Same approach, separate effort. Alloy semantics in any prover does not exist, would be novel work twice over.
  • Verifying the parser. Treat the parser CST as given; this work targets IR → solver only.
  • Verifying Z3 itself. Z3 remains in the trust base. SMTCoq-style proof replay is blocked at the Z3 side (see §3) and a switch to cvc5 is a separate multi-month rewrite (§3.2).
  • Verifying the SMT-LIB serializer (SmtLib.scala). Trivially small, audited by inspection, in the TCB. Cohen made the same call.
  • Ahead-of-time correctness for every spec construct. Aim for a verified subset that grows; full coverage is M_L.4-and-beyond, possibly never.
  • Human-readable proof narration of soundness violations. That's #20's territory ("operation X violates invariant Y because…"). Soundness proof failures should not happen in production after M_L.2; if they do, the bug is in Lean, rather than in user output.

11. References

11.1 Closest prior art

11.2 SMT proof reconstruction in proof assistants

11.3 Z3 proof formats

11.4 Spec-language semantics formalizations

11.5 Embedding-style references

11.6 Lean 4 / mathlib4 ecosystem

11.7 Spec_to_rest cross-references


A. codebase translator coverage (april 2026)

Snapshot (April 2026) of verify/z3/Translator.scala (1917 LOC) measured against the 25-case Expr ADT that lived in the (now-deleted) hand-written modules/ir/src/main/scala/specrest/ir/Types.scala. Used to define the verified subset in §6. Post-#202 the canonical IR is the extracted expr_full (27 ctors) in ir/generated/SpecRestGenerated.scala, with the verified subset surfaced as expr in the same file; this appendix is preserved as a historical snapshot of the pre-canonicalization translator coverage.

Expr caseTranslator statusNotes
BinaryOp(And|Or|Implies|Iff)fulldirect mapping (lines 629-633)
BinaryOp(Eq|Neq)fulldom/set-comprehension equality special-cased (616-621), fallback Z3Expr.Cmp (643)
BinaryOp(Lt|Gt|Le|Ge)fullCmpOp mapping (634-643)
BinaryOp(In|NotIn)fullmembership via state relation dom, set literal, set membership (622-757)
BinaryOp(Subset|Union|Intersect|Diff)fullwith sort validation; fails on non-SetOf (661-693)
BinaryOp(Add|Sub|Mul|Div)partialintegers only; fails on string/set arithmetic (650-653)
UnaryOp(Not)fulldirect Z3Expr.Not (866)
UnaryOp(Negate)fullas 0 - operand (867-868)
UnaryOp(Cardinality)partialonly on state-relation identifiers (876-881)
UnaryOp(Power)errored"powerset operator is not decidable in first-order SMT" (870-874)
Quantifierfull∀/∃ supported; ∄ encoded as ¬∃ (905-930)
SomeWraperroredcatchall (597-601)
Theerroredcatchall (597-601)
FieldAccessfullvia entity field functions (971-995)
EnumAccessfullvia enum member constants (1132-1142)
Indexpartialonly on state-relation references (997-1009)
Callpartialonly identifier callee; hardcoded builtins (len, isValidURI); higher-order fails (1011-1032)
Prime / Prefullstate-mode switching for post/pre-state (589-592)
Withfullrecord update via Skolem constant + equality constraints (1061-1098)
SetComprehensionerroredonly allowed inside membership; standalone fails (1100-1105)
SetLiteralpartialnon-empty literals; empty literals require type context and are covered in membership contexts
MapLiteralerroredcatchall (597-601)
SeqLiteralerroredcatchall (597-601)
Matchesfullas uninterpreted predicate, mangled by pattern + arg sort (1037-1049)
IntLit, BoolLit, StringLitfulldirect literals or constants (577-579)
NoneLiterroredcatchall (597-601)
Constructorerroredcatchall (597-601)
Iferroredcatchall (597-601)
Lambdaerroredcatchall (597-601)
Letfullenvironment extension (1051-1059)

Summary: 13 fully handled, 8 partial, 4 errored. The verified subset (§6.1) draws exclusively from the "fully handled" column, restricted further to operators whose Lean denotation is <30 LOC each.

Sort-encoding decisions mirrored in M_L.1's IR.lean:

  • Entities: uninterpreted sort U:EntityName + accessor functions EntityName_fieldName : EntityName → FieldSort (319-328).
  • Enums: uninterpreted sort + member constants + distinctness axioms (278-283).
  • Options: flattened to inner type (OptionType(T)T, line 541).
  • Sets: SMT-LIB (Set elemSort) with store/select (SmtLib.scala:66-72).
  • Strings: uninterpreted sort + fresh constants per literal + distinctness (529-537).
  • State relations (RelationType/MapType): pair of functions <field>_dom : K → Bool and <field>_map : K → V (465-495).
  • Scalar state fields: single constant function state_<field> : () → T (497-501).
  • Post-state: same encoding with _post suffix; toggled via ctx.stateMode (39-40, 477-478, 503-510, 589-592, 603-607).

12. Governance, proof-owned surfaces and change process

Originally 11_global_proof_governance.md. Issues #170, #171, #172, #174, #175.

12.1 Why governance exists

#88 is only tractable if the proof target stops moving invisibly. The failure mode is not "someone merges a bad theorem", it is Types.scala, the Z3 translator, or the routing/orchestration contract changing while the proof effort still assumes the old shape. That is how a global-proof project turns into a long-lived side quest with no honest ship claim.

M_G.1 therefore governs the proof target before proofs/lean/ exists. The goal is not to freeze the whole repo, the goal is to make target movement explicit.

12.2 Governance mode

The current mode is controlled churn. The repo is not in a hard IR freeze yet. Changes to proof-governed surfaces are still allowed, but those changes must carry their proof impact in the same PR.

12.3 Proof-owned surfaces (master list)

ClassSurfaceWhy it is governed
Proof-owned coreproofs/isabelle/SpecRest/IR.thy (extracted to ir/generated/SpecRestGenerated.scala)Defines expr/expr_full/type_expr_full/service_ir_full, the AST shapes the proof mirrors. Since #202 the Scala IR is auto-extracted; do not hand-edit the generated file.
Proof-owned coreverify/z3/Translator.scalaMain translation function; prover-side mirror tracks case-for-case.
Proof-owned coreverify/z3/Types.scalaZ3Script, Z3Expr, artifact structures in the first theorem target.
Proof-owned coreproofs/isabelle/SpecRest/{IR,Semantics,Smt,Translate,Soundness,Codegen}.thyThe canonical proof track. Universal soundness theorem closes with zero sorries.
Drift-control artifactverify/test/cert/generated/A8RoundTripOracleTest.scalaRound-trip oracle: every canonical probe runs through extracted lower → translate. Since #202 close-out, the projection is the extracted lower itself, no hand-rolled mirror remains.
Obligation contractverify/Classifier.scalaDecides which checks are in the Z3 proof scope.
Obligation contractverify/Consistency.scalaDefines the operational meaning of global, requires, enabled, preservation.
TCB-sensitiveparser/Parse.scalaParser remains trusted; changes can narrow/widen the honest claim.
TCB-sensitiveparser/Builder.scalaIR construction remains trusted.
TCB-sensitiveverify/z3/Backend.scalaRuntime renderer from Z3Script to Z3 ASTs.
Proof-owned CI.github/workflows/lean-certs.ymlSidecar matrix: per fixture, verify --emit-cert + lake build. Six fixtures as of M_L.4.a-k.
Proof-state ledgerproofs/lean/STATUS.mdPer-Expr-case mirror; PR template enforces re-sync on Expr changes.
PR contract.github/PULL_REQUEST_TEMPLATE.mdCarries the Expr-touch reminder fanning out to all of the above.

Two non-members:

12.4 Required change process

Any PR touching a proof-governed surface must:

  1. Update §13 Live Status Ledger in the same PR.
  2. Classify the change: proof-target shape change / obligation-routing change / TCB-only change / program-commitment change.
  3. State whether the M_G.0 theorem statement or TCB summary changed.
  4. If proof-safe profile membership changed, update §14 profile.
  5. If owner/priority/paused-work/stall-rule changed, update §15 runway.
  6. If activation state or kickoff shape changed, update §16 activation.

12.5 Freeze states

  • Ungoverned, normal repo churn (no proof-governed surface touched).
  • Governed, current state for §12.3 surfaces. Move freely with logging.
  • Frozen, temporary state for the M_L.2 closure window only. Surface may only change with matching proof update.

13. Live status ledger

Originally 12_global_proof_status.md. Issues #172, #174, #175.

13.1 Current baseline (post-M_L.4.a-k), first-ship gate met

  • Governance mode: execution track active; M_L.2 universal soundness closed for the §6.1 verified subset (zero sorry). M_L.4.a/b/c/d/e/f/g/h all merged.

  • First-ship claim status: MET as of 2026-05-02. The single-state idiom of the spec language (atoms, identifiers, scalar reads, all logical/arithmetic/comparison operators, state-relation membership / cardinality / Index / forall, FieldAccess (any entity-valued base, bare ident, Index result, chained .f1.f2, quantifier-bound vars), single-state Prime/Pre collapse, enum quantifiers and their Some/No/Exists composition aliases, NotIn composition, Subset over rel-identifiers via composition) is mechanically validated against the Z3 translator. The deployable claim:

    For every ServiceIR whose invariants and operation requires clauses fall in the §6.1 verified subset (now extended through M_L.4.k), z3.Translator.scala's output matches the Lean translate function case-for-case, and the Lean meta-theorem SpecRest.soundness proves that translation correlates eval with smtEval under the correlated SmtModel and SmtEnv. UNSAT verdicts on translated obligations therefore reflect properties of the spec, rather than just artifacts of the translator.

    Trust closure: M_L.1 IR/Semantics axioms + the Lean.ofReduceBool axiom that native_decide (used by per-run M_L.3 certs) introduces.

  • First theorem target: in-memory ServiceIR → Z3Script path used by Consistency.runConsistencyChecks.

  • Active proof-safe profile: §14, see §14.4 for the per-Expr case ledger.

  • Per-run translation-validation certs (M_L.3): working in CI matrix (.github/workflows/lean-certs.yml × 6 fixtures). Six fixture certs lake build clean; safe_counter 3/3 cert_decide, set_ops 5/11, todo_list 4/17, edge_cases 8/15, url_shortener 1/7, auth_service 0/21 (z3 backend errors unrelated to subset). Stub reasons remaining: Call builtins (len), multi-binding quantifiers, strings and collection shapes outside finite set literals, operation-input env binding, two-state preservation. None are soundness gaps, they are out-of-subset shapes that emit theorem cert : True := trivial with a TODO[M_L.4] marker and zero sorry. Nested FieldAccess (users[uid].email, forall t in tasks, t.field, chained .f1.f2) is no longer in this list, M_L.4.k brought it into the verified subset via the entity-id-keyed carrier.

  • Proof owner: HardMax71.

  • Runway: initial six-week M_G.3 cycle consumed; subsequent cycles are unscheduled. See §15 for stall rule.

  • Outside the first-ship claim (still genuinely deferred): SmtLib.scala, dump/export paths, Alloy-routed checks, proof replay, full-source semantics refinement, true two-state semantics for Prime/Pre (single-state collapse only, M_L.4.b-ext gates real preservation reasoning), With record-update (bundled with M_L.4.b-ext), map/sequence literals, set comprehensions, strings, Call/Matches, multi-binding quantifiers.

13.2 Status labels

LabelMeaning
trackedSurface is governed; changes must be logged.
mirroredProver-side mirror exists; soundness theorem incomplete.
fragment-provedA bounded fragment is mechanically proved.
ship-claim-readySurface is covered by the first honest public claim.
reservedPlanned but not yet active.

13.3 Governed surface ledger

The full ledger is maintained in proofs/lean/STATUS.md (per-Expr case granularity). The high-level Scala side is in §12.3.

13.4 Update rules

When the ledger changes, the entry should say at least:

  1. which governed surface moved,
  2. whether the move changed syntax / semantics / routing / TCB,
  3. whether the M_G.0 theorem statement still reads honestly afterward.

14. Proof-safe profile and backend contract

Originally 13_global_proof_profile.md. Issue #173. The committed first scope the global-proof program ships against. Updated post-M_L.4.a-k.

14.1 Decision summary

The global-proof program does not start from "all current language features." The committed profile is Z3-Core, a Z3-only fragment of the IR, and within it the bootstrap implementation slice Z3-Core-1S (one-state).

14.2 Stage labels

StageMeaning
bootstrapIn scope for Z3-Core-1S.
first shipRequired before the first public claim is honest.
deferOut of first ship; can enter only by explicit profile expansion.
excludeOutside the Z3 theorem track entirely.

14.3 Declaration-level profile

ConstructStageRule
EnumDeclbootstrapFinite domains; first bounded-quantifier story.
EntityDeclbootstrapFlat entities only.
StateDeclbootstrapScalar fields and simple relation/map fields over profile-safe types.
InvariantDeclbootstrapBody must be in-profile.
OperationDeclbootstrapInputs and requires in scope. ensures partial under single-state collapse (M_L.4.b); full two-state is M_L.4.b-ext.
TypeAliasDecl, FactDecl, FunctionDecl, PredicateDecl, TransitionDecl, ConventionsDecldefer
TemporalDeclexcludeAlways Alloy-routed; outside Z3 theorem.

14.4 Expression-level profile (post-M_L.4.a-k)

Expr caseStageRule / reason
BinaryOp(And | Or | Implies | Iff)bootstrapCore propositional layer. Soundness: M_L.2 closure.
BinaryOp(Eq | Neq | Lt | Gt | Le | Ge)bootstrapCore comparison layer. Soundness: M_L.2 closure.
BinaryOp(In)bootstrapState-relation domain membership. Soundness: M_L.2 closure.
BinaryOp(NotIn)bootstrapM_L.4.e closed via emitter-side composition: NotIn(e, r) ≡ ¬In(e, r).
BinaryOp(Add | Sub | Mul | Div)bootstrapM_L.4.a closed. Div-by-zero policy: eval returns none.
BinaryOp(Subset) over state-relation identifiersbootstrapM_L.4.i closed via emitter-side composition: Subset(r1, r2) ≡ ∀ x ∈ r1, x ∈ r2.
BinaryOp(Union | Intersect | Diff)bootstrapIssue #195 closed. Set-valued algebra over Value.vSet / SmtVal.sSet; non-set operands reduce to none.
UnaryOp(Not | Negate)bootstrapM_L.2 closed.
UnaryOp(Cardinality)bootstrapM_L.4.c closed. Restricted to state-relation identifiers (mirrors Translator.scala:876-881).
UnaryOp(Power)excludeRouted to Alloy.
Quantifier(All) over enum identifierbootstrapM_L.2 closure: per-case + universal soundness. Single binding over enum-name identifier.
Quantifier(All) over state-relation identifierbootstrapM_L.4.f closed. New forallRel Lean constructor + soundness_forallRel_known per-case theorem; EvalIR.State.demo populates relations with empty domains so quantifier is vacuously true.
Quantifier(Some | No | Exists)bootstrapM_L.4.d (enums) + M_L.4.f (state-rels) closed via emitter-side composition: ∃ x, P ≡ ¬ ∀ x, ¬ P.
SomeWrap, ThedeferOption/choice semantics.
Index (state-relation pair lookup)bootstrapM_L.4.g closed. Strictly-additive State.lookups/SmtModel.predLookup pair table; new Expr.indexRel + SmtTerm.indexRel + lookupKey_correlated bridge. Restricted to identifier-base.
FieldAccess (entity-valued base, any shape)bootstrapM_L.4.h + M_L.4.k closed. Entity-id-keyed State.entityFields / SmtModel.predFields table. M_L.4.h covered the bare-Identifier base; M_L.4.k generalises the constructor to take an arbitrary Expr base (Index, chained FieldAccess, quantifier-bound vars). Mirrors Translator.scala:981-1005.
EnumAccessbootstrapClosed via SmtTerm.enumElemConst.
Calldeferlen/isValidURI/dom need per-builtin semantics.
Prime, Prebootstrap (single-state collapse)M_L.4.b closed as identity. True two-state semantics is M_L.4.b-ext.
WithdeferIdentity-collapse would emit false certs. Requires Skolem mirror + StatePair.
IfdeferNeeds product / decidable encoding.
LetbootstrapM_L.2 closure.
LambdadeferOutside FOL.
ConstructordeferConstructor semantics deferred.
SetLiteralbootstrapIssue #195 closed. Non-empty standalone literals and empty set-literal membership render as nested .setInsert over .setEmpty.
MapLiteral, SeqLiteral, SetComprehensiondeferCollections deferred.
MatchesdeferRegex/string semantics deferred.
IntLit, BoolLit, IdentifierbootstrapM_L.2 closed.
FloatLit, StringLit, NoneLitdeferNo committed solver semantics.

14.5 Backend contract

QuestionDecisionConsequence
Which solver remains trusted in the first ship?Z3First theorem is Z3-trusting; no proof replay.
Backend-agnostic theorem?NoTarget is the current Z3 path.
Alloy in scope?NoAnything routed to Alloy is outside Z3-Core.
SmtLib.scala inside the first theorem?NoRuntime path uses Z3Script + WasmBackend.
Proof export / replay in scope?NoSeparate translation-validation track (M_L.3, shipped).
cvc5 cross-checking?NoDefense in depth only.
Parse.parseSpec Builder.buildIR Consistency.runConsistencyChecks Classifier z3.Translator WasmBackend.check Z3

14.6 Actual coverage after M_L.4.a-k

The originally-targeted Z3-Core-1S slice was: global and requires checks only; no Prime/Pre/With/Cardinality; no collections, strings, regex; quantifiers over enums only.

The actual covered slice is wider: full propositional layer + cmp + LIA arithmetic (Add/Sub/Mul/Div with Div-by-zero none policy) + let + enumAccess + state-relation membership (In) + all four quantifier kinds (All/Some/No/Exists) over enum-name identifiers + Prime/Pre (single-state collapse) + cardRel. Universal soundness theorem closes for this whole slice with zero sorry.

Still deferred: arbitrary collection algebra outside issue #195 set-valued Union/Intersect/Diff, map/sequence literals, set comprehensions, strings, Call, Matches, If, Lambda, Constructor, two-state Prime/Pre, With.

This widened slice does not include real ensures-clause preservation reasoning, that needs the StatePair refactor (M_L.4.b-ext, deferred). But it covers single-state invariants and operation requires for the bulk of real specs.

14.7 Expansion rule

A feature may move from defer into a later profile only if all of the following are true:

  1. Source-level semantics are explicit enough to state the theorem honestly.
  2. Current Scala translator support is full, or the restriction is narrow enough to state precisely here.
  3. The feature stays on the Z3 path; otherwise it starts a separate theorem track.
  4. Both the prover-side semantics and the Scala mirror have at least one concrete fixture.
  5. This profile and proofs/lean/STATUS.md are updated in the same PR.

15. Runway and stall policy

Originally 14_global_proof_runway.md. Issues #170, #174.

15.1 Decision summary (M_G.3)

The global-proof program is an active, bounded priority, rather than background research.

  • Owner: HardMax71.
  • Runway: one uninterrupted six-week proof-priority cycle, consumed during M_G.4 activation through the M_L.4.a-d shipped batch.
  • Scoping rule: fixed time, variable scope.
  • Fallback: if a future cycle stalls, shrink to Z3-Core-1S; if still stuck, switch primary trust-improvement back to expanded M_L.3 cert work.

15.2 Reprioritized roadmap while active

Lanes paused while a runway cycle is active:

LaneIssuesRationale
LLM synthesis track#27-#32Orthogonal; too large to run in parallel solo.
New target expansion / distribution#33-#36, #56Widens surface area without helping proof ship.

Lanes secondary-only (allowed if narrowly urgent):

LaneIssuesRule
Auth/security#53-#55Move only if urgent or blocks external need.
Maintenance and experiments#149, #150, #161, #163Not allowed to displace proof time.

Always allowed: build-break/CI fixes; correctness or security regressions in shipped behavior; narrowly-scoped doc fixes; parser/IR/translator changes that directly support the proof track and update governed docs.

15.3 Interrupt policy

The runway provides uninterrupted time, rather than at the cost of leaving the repo broken. Interrupts are allowed for: red CI; correctness bugs in current verification; security issues; narrowly-bounded maintenance. Interrupts are not a license to resume paused roadmap themes "for a day."

15.4 Circuit breaker

  1. At the end of a six-week cycle, do not auto-renew.
  2. If the cycle produced active artifacts but scope was too broad, shrink to Z3-Core-1S.
  3. If Z3-Core-1S still fails, pause the universal-theorem push and switch primary trust goal to M_L.3-style cert work.
  4. Keep any reusable semantics kernel artifacts; do not discard formalization just because the full theorem stalls.

"Meaningful theorem progress" means at least one of: a merged prover-side semantics artifact connected to live Scala structures; a merged mirror artifact covering real translator cases; a checked proof fragment stronger than scaffolding. Empty scaffolding, pinned toolchains without proof code, "proof soon" status notes do not count.


16. Activation record and kickoff shape

Originally 15_global_proof_activation.md. Issue #175. Closes the gap between readiness planning (M_G.*) and active theorem work (M_L.*).

16.1 Activation decision

M_G.4 activated the M_L.* execution track on 2026-05-01. As of that date:

  • M_G.0 through M_G.3 are satisfied.
  • HardMax71 is the active owner.
  • #126 was unblocked as the first implementation issue.
  • #127-#130 remained gated only by predecessor milestones.

16.2 Gate review

GateArtifactSufficient because
M_G.0, theorem statement and TCBThis doc §1 + issue #171First honest theorem target written down.
M_G.1, governed proof surfacesThis doc §12Target movement is visible.
M_G.2, proof-safe profileThis doc §14First theorem scope is smaller than full language.
M_G.3, owner / runway / fallbackThis doc §15Named owner, bounded runway, paused lanes, circuit breaker.

16.3 What got unblocked

After M_G.4, the dependency chain is:

  • #126 (M_L.0) → active. Closed.
  • #127 (M_L.1), blocked on #126 only. Closed.
  • #128 (M_L.2), blocked on #127. Closed for §6.1 subset, zero sorry.
  • #129 (M_L.3), blocked on #127. Closed.
  • #130 (M_L.4), blocked on #128. Sub-slices a-k plus issue #195 set algebra closed (LIA arithmetic, single-state Prime/Pre, Cardinality, enum quantifier composition, NotIn composition, state-relation quantifier, Index, FieldAccess on state scalars, Subset over rel-identifiers via composition, nested FieldAccess via entity-id-keyed carrier, set literals and set-valued Union/Intersect/Diff/membership). Remainder (With, true two-state, Call, multi-binding quantifiers, strings, maps/sequences, set comprehensions) deferred to later slices or M_L.4.b-ext.

16.6 First-ship gate met

As of 2026-05-02, the activation umbrella's success conditions are satisfied:

ConditionArtifact
Stable theorem targetsoundness theorem in proofs/isabelle/SpecRest/Soundness.thy (zero sorries; closes universally over the verified subset).
Explicit TCBIsabelle/HOL kernel + IR.thy / Semantics.thy definitions + Z3 driver + extracted-Scala output. Per-run certs were deleted in Phase 6 (Option 2 of #193 review) so native_decide is no longer in the TCB.
Frozen / governed IR surfaceVerified subset is lower's codomain in proofs/isabelle/SpecRest/IR.thy; both expr (subset) and expr_full (input) are extracted from the same theory, so drift between Scala and proof-side IR is structurally impossible. A8RoundTripOracleTest smokes the round-trip on the canonical probe corpus.
Proof-safe first scope§14.4 verified-subset profile (post-M_L.4.a-k).
Active contributor commitmentM_L.0 → M_L.4.k shipped between PR #180 (2026-04-30) and #196 closure (2026-05-02).
Linked kickoffM_L.0 PR #180 (combined M_L.0 + M_L.1 first slice).

The proof program has moved from "research direction" to "shipped first-ship claim". Remaining M_L.4 slices (Subset composition, multi-binding quantifier, Call(len) builtin) are coverage uplifts, rather than theorem-program prerequisites. M_L.4.b-ext (true two-state preservation) remains a multi-week effort that should be scheduled deliberately rather than chipped at piecemeal.

16.4 First m_l PR shape (historical record)

The first M_L implementation PR was a combined M_L.0 + first M_L.1 slice rather than a standalone scaffolding PR. Minimum honest kickoff shape (delivered in PR #180):

  • proofs/lean/lean-toolchain pinned to leanprover/lean4:v4.29.1
  • proofs/lean/{lakefile.toml, README.md, STATUS.md}
  • proofs/lean/SpecRest/{IR.lean, Semantics.lean, Examples.lean, IR.lean.todo}
  • .github/workflows/lean.yml

Required substance: a real deep embedding for Z3-Core-1S; a real semantic domain; at least one checked example or lemma; a status table. Not acceptable: namespace-only scaffolding; toolchain pin plus empty files; CI-only Lean setup with no semantics.

16.5 Activation invariants

  • proofs/lean/ may not appear in main until it lands with active proof content. (Now satisfied, M_L.0 onward.)
  • Changes to proof-governed Scala surfaces still require status updates in the matching proof docs (now §13 of this doc).
  • If a future six-week runway fails to produce meaningful theorem artifacts, follow the circuit breaker in §15.4.

Activation is a commitment to start, rather than permission to drift.

17. IR canonicalization in Isabelle (issue #202, post-#193)

Status (2026-05-06): shipped. PR #204 (Phases 0–7 + 9 + 10a) merged 2026-05-06; this close-out PR extends lower v2 over QuantifierF (4 kinds, multi-binding) + multi-field WithF, deletes the hand-rolled VerifiedSubset.classify and A8RoundTripOracleTest.toExtracted walker, refreshes docs. The Scala IR is now canonically extracted from proofs/isabelle/SpecRest/IR.thy; the audit oracle calls extracted lower directly. Issue #202 closed.

17.1 Decision: C-hybrid

We are migrating to a single canonical IR ADT in Isabelle, with Scala-side types regenerated via Code_Target_Scala. The chosen architecture is C-hybrid rather than pure C ("regenerate Scala from existing Isabelle expr"):

  • Existing proofs/isabelle/SpecRest/IR.thy expr (23-ctor verified subset) stays unchanged at the proof level, the 94 lemmas in Soundness.thy are not re-opened.
  • A second ADT expr_full (27 ctors mirroring the full Scala input language including Float/String/None/Lambda/Call/Constructor/MapLiteral/SeqLiteral/Matches/SomeWrap/The/ SetComprehension and the broader bin_op_full/un_op_full enums) is added to the same IR.thy.
  • A lower function lower :: String.literal list ⇒ expr_full ⇒ expr option projects the full IR onto the verified subset; out-of-subset constructors become None. The first argument is the list of declared enum names, needed because a QuantifierBindingFull v (IdentifierF dom _) _ _ could resolve to either ForallEnum (enum domain) or ForallRel (relation domain), and lower cannot decide without schema context. lower is code-extracted; it replaced the hand-written Scala VerifiedSubset.classify and the test-side A8RoundTripOracleTest.toExtracted walker (both deleted in the #202 close-out).
  • expr_full is the canonical input-language ADT for all Scala consumers (parser, codegen, testgen, translator). They consume it via a thin Scala wrapper layer in modules/ir/.../Types.scala (type aliases + apply/unapply + extensions) that preserves the existing ergonomics.

17.2 Why not pure c

The verified-subset expr is not a renaming of the Scala input ADT, it is a strict subset (23 vs 27 ctors, op-specialized vs op-parametric, no Float/String/None/Lambda/Call/Constructor/MapLiteral/SeqLiteral/Matches/SomeWrap/The/ SetComprehension/In/NotIn/Subset). Replacing the Scala IR wholesale with the verified subset would delete input-language features. C-hybrid keeps the verified subset stable while introducing expr_full for the input language and lower as the proven projection.

17.3 Span handling, option (a) inline

Every ctor of both expr and expr_full carries a final option_span (= span_t option) field, where span_t is a 4-int datatype. Span is Prop-erased, soundness theorems do not inspect it; existing per-case lemmas absorb the new field as a wildcard. No proof content changes; ~94 mechanical wildcard updates in Soundness.thy (Phase 2).

Rejected alternatives:

  • Wrapper Spanned[A], flattens inner-subexpression spans, regressing diagnostics.
  • Parametric 'a expr, re-triggers #193 Phase-5 record-polymorphism crash (Illegal fixed variable 'a'), well-known landmine.

17.4 No-new-files constraint

  • All new Isabelle types and lower/lower_reason functions land in the existing proofs/isabelle/SpecRest/IR.thy (104 → ~750 LoC).
  • Soundness.thy ripple (option_span wildcards) edits in place across Semantics.thy, Smt.thy, Translate.thy, Soundness.thy.
  • Codegen.thy export_code list extends in place.
  • modules/verify/.../cert/generated/SpecRestGenerated.scala was moved (not created) to ir/generated/SpecRestGenerated.scala in #202 Phase 4, required to avoid an ir → verify dep cycle when ir consumers import the extracted types.
  • modules/ir/.../Types.scala is rewritten in place into a wrapper layer (~600-800 LoC: type aliases, apply/unapply, extensions, given CanEqual, hand pretty-printer).

17.5 Trigger

The 2026-05-02 trigger conditions in #202 §"Trigger conditions" (drift bug, second emitter, ADT growth, dedicated infra contributor) are not met. This is being executed as a user-directed architectural cleanup.

17.6 Phase status

PhaseDeliverableStatus
0Decision document (this §17)shipped (PR #204)
1expr_full + 20 records + span_t in IR.thy; extraction smokeshipped (PR #204)
2option_span ripple across IR/Semantics/Translate/Soundnessshipped (PR #204)
3lower v1 + lower_set_list + lower_soundness corollaryshipped (PR #204)
4Codegen.thy export extension; relocate SpecRestGenerated.scalashipped (PR #204)
5Scala wrapper layer in Types.scala (extracted types used directly; thin layer)shipped (PR #204)
6Parser migrationshipped (PR #204)
7Codegen + testgen migrationshipped (PR #204)
8Verify migrationshipped (PR #204); lower-backed Trust dim. + --strict-soundness flag landed under #205 (Phase C-flagged); strict-soundness flipped to default-on and extracted-translator routing landed under #192
9CLI migration; delete legacy Types.scala enumshipped (PR #204)
10Audit cleanup; doc/CI sweep10a doc sweep shipped (PR #204); 10b audit cleanup + lower v2 (QuantifierF 4 kinds, multi-binding, multi-field WithF) shipped in close-out PR

17.7 Trust surface (issues #205#192)

lower shipped as a runtime projection in #202; #205 wired it into the verifier surface as a per-check Trust dimension behind a --strict-soundness flag; #192 promoted strict soundness to the default and routed in-subset checks through the verified extracted translator on the production verify path.

  • CheckResult.trust: TrustLevel, Sound if every source expr_full for the check lowers to the verified subset; BestEffort otherwise. Visible in:
    • CLI line: [sound] / [best-effort] tag between the tool tag and the check id.
    • JSON report: "trust": "sound" | "best-effort" per check object. Schema bumped 1→2.
  • Best-effort checks always skip with category = soundness_limitation before backend dispatch (no flag, this is the default since #192). The --strict-soundness flag and its CLI surface were retired.
  • Sound checks route via extracted translator. Translator.translateExpr calls lower(enums, e); on Some it routes through SpecRestGenerated.translate (the Isabelle-extracted function) and then a small SmtTermToZ3 bridge (~250 LOC, hand-written) to reach Z3. On None the hand-written translateExprRaw runs, but only for declaration-level expressions (entity field constraints, type-alias where-clauses); out-of-subset check-body shapes are filtered to skip by the Trust classifier before ever reaching the translator.
  • Exit code 4 (ExitCodes.Trust): emitted when soundness skips are the only reason the run isn't clean. Subordinate to Backend (3), Violations (1), Translator (2).
  • TCB audit lives at docs/research/11_tcb_audit.md, the single ledger of what verify's verdicts actually depend on.
Coverage v3 (post-#210)lower returnsTrust
BoolLit / IntLit / Ident / EnumAccess / arithmetic / compare / boolean / Let / FieldAccess / Prime / Pre / cardinality on Ident / In/NotIn against IdentSomeSound
QuantifierF over enum or relation domain (4 kinds, multi-binding)SomeSound
Multi-field WithF (folded) over Identifier baseSomeSound
IndexF over IdentifierF base (e.g. users[uid])SomeSound
IndexF over PreF (IdentifierF rel) (e.g. pre(orders)[id]), M_L.4.l carrier wideningSomeSound
IndexF over PrimeF (IdentifierF rel) (e.g. orders'[id]), M_L.4.l carrier wideningSomeSound
IndexF over a non-relation-reference base (arithmetic, nested IndexF, etc.)NoneBestEffort
BSubset, CallF, IfF, TheF, MapLiteralF, ConstructorF, SetComprehensionF, etc.NoneBestEffort

Issue #210 (M_L.4.l) widened the verified-subset IndexRel carrier from String.literal × expr to expr × expr and added a syntactic peel_relation_ref recogniser for the three relation-reference shapes (Ident, Pre Ident, Prime Ident). The widening avoids the ir_value cascade that a value-level VRelation would have triggered: eval/smt_eval peel syntactically, the universal soundness theorem closes with one rewritten per-case lemma (index_rel_step) plus a peel_smt_relation_refpeel_relation_ref correlation lemma, and the production bridge resolves the relation name + state-mode at the TIndexRel site (rather than inheriting from the surrounding withStateMode for bare IndexRel shapes). auth_service and broken_url_shortener Tamper-style operations flip from skipped (category=soundness_limitation) to real Z3 verdicts.

17.8 Risks tracked

  • R1, Code-extraction quality on records. #193 Phase 5 hit Illegal fixed variable 'a' on polymorphic record-scheme. The fix was [code]-marked defs. ~20 new records in this issue re-test the workaround. Phase 1 smoke-build is the trip-wire.
  • R2, Wrapper-layer drift. ~600-800 LoC of apply/unapply in the Scala wrapper trades A1-A8 drift for wrapper-level drift. Mitigation: round-trip wrapper test exercising all 27 ctors.
  • R3, Span-noise volume. ~94 _ wildcard insertions across Soundness.thy. Buffer +1 day if cases ... simp_all patterns break.
  • R4, In/NotIn/Subset desugar moves from Scala to Isabelle. Phase 3's lower performs the rewrite Isabelle-side. Defer proof obligation to a follow-up; keep the rewrite as a definition only in v1.
  • R5, Diagnostic regression on e.toString. Hand pretty-printer in Phase 5 mitigates; snapshot-diff vs current parser-error fixtures.
  • R6, Native-image breakage. Wrapper layer + extracted types may need new reflect-config. Phase 10 sbt nativeImage smoke.

On this page

Table of contents1. Status and framing2. The trust chain today3. Why "reconstruct Z3 proofs" does not work in 20263.1 Z3 has no alethe export3.2 cvc5-Alethe does not cover datatypes3.3 Quantifier-instantiation proof bloat3.4 What this means4. Prior art: 2024-2026 Snapshot4.1 Cohen, "a foundationally verified intermediate verification language" (princeton phd, 2025)4.2 Parthasarathy et al., "towards trustworthy automated program verifiers" (PLDI 2024)4.3 Lean-smt and Isabelle/HOL alethe reconstruction4.4 Verified VCG for Dafny (nezamabadi/myreen/tan, dec 2025)4.5 F* → SMT encoding (aguirre/hriţcu, 2016 → ongoing)4.6 No mechanized Alloy or Dafny surface semantics4.7 Older prior art worth knowing4.8 Closest-prior-art summary table5. Two paths and why we recommend both, sequenced5.1 Path a: Translation validation (cheap, recommended first)5.2 Path b: Meta-soundness (expensive, optional follow-up)5.3 Why both, sequenced6. The verified subset6.1 Operators in scope (M_L.1)6.2 Operators explicitly out of scope (deferred)6.3 Sort encoding in M_L.1's translator7. Picking a proof assistant7.1 Three candidates compared7.2 Decision: Isabelle/HOL (post-#193 pivot)7.4 Historical: Lean 4 was the original choice7.3 Embedding shape (locked across all three candidates)8. Milestone decomposition8.1 M_L.0, scope, scaffolding, contributor handoff8.2 M_L.1, IR denotational semantics for the verified subset8.3 M_L.2, translator soundness theorem for the verified subset8.4 M_L.3, translation validation (per-run certificate emission)8.5 M_L.4, subset expansion9. Risks9.1 Doc rot9.2 Z3 verdict drift across versions9.3 Lean toolchain churn9.4 Contributor abandonment mid-milestone9.5 The semantics-layer trap (cohen)9.6 The f* trap9.7 Mismatch between lean translate and Scala translator.scala10. Non-goals11. References11.1 Closest prior art11.2 SMT proof reconstruction in proof assistants11.3 Z3 proof formats11.4 Spec-language semantics formalizations11.5 Embedding-style references11.6 Lean 4 / mathlib4 ecosystem11.7 Spec_to_rest cross-referencesA. codebase translator coverage (april 2026)12. Governance, proof-owned surfaces and change process12.1 Why governance exists12.2 Governance mode12.3 Proof-owned surfaces (master list)12.4 Required change process12.5 Freeze states13. Live status ledger13.1 Current baseline (post-M_L.4.a-k), first-ship gate met13.2 Status labels13.3 Governed surface ledger13.4 Update rules14. Proof-safe profile and backend contract14.1 Decision summary14.2 Stage labels14.3 Declaration-level profile14.4 Expression-level profile (post-M_L.4.a-k)14.5 Backend contract14.6 Actual coverage after M_L.4.a-k14.7 Expansion rule15. Runway and stall policy15.1 Decision summary (M_G.3)15.2 Reprioritized roadmap while active15.3 Interrupt policy15.4 Circuit breaker16. Activation record and kickoff shape16.1 Activation decision16.2 Gate review16.3 What got unblocked16.6 First-ship gate met16.4 First m_l PR shape (historical record)16.5 Activation invariants17. IR canonicalization in Isabelle (issue #202, post-#193)17.1 Decision: C-hybrid17.2 Why not pure c17.3 Span handling, option (a) inline17.4 No-new-files constraint17.5 Trigger17.6 Phase status17.7 Trust surface (issues #205#192)17.8 Risks tracked