Translator Soundness, Isabelle/HOL Proof Track
Edit on GitHubMaster doc for the spec_to_rest translator-soundness program. Scoping, governance, status, profile, runway, and milestone progress for proving the IR → Z3 verification path sound in Isabelle/HOL.
Last updated:
Status banner (2026-05-04): pivoted to Isabelle/HOL. The proof track migrated from Lean 4 to Isabelle/HOL via issue #193. The universal soundness theorem closes with zero
sorryin Isabelle,Code_Target_Scalaproductionizes the extraction, and the Lean track atproofs/lean/was retired. Per-run cert infrastructure was deleted post-pivot (vestigial after universal soundness, Option 2 of #193 review).Most of this document is pre-pivot context, the comparative analysis in §7, the prior-art survey in §4, and the milestone decomposition in §8 still reflect the original Lean-era framing. They're kept for historical clarity. Sections updated for the post-pivot state are marked with a "post-#193" tag.
Master document for the translator-soundness program. Originally six docs (10/11/12/13/14/15); consolidated here in 2026-05-02 after the M_L.2-closure + M_L.4.a-g shipped batch. Anchors issue #88 (translator soundness), umbrella #170, and execution-chain #126–#130.
Table of contents
- Status and Framing
- The Trust Chain Today
- Why "Reconstruct Z3 Proofs" Does Not Work in 2026
- Prior Art: 2024-2026 Snapshot
- Two Paths and Why We Recommend Both, Sequenced
- The Verified Subset
- Picking a Proof Assistant
- Milestone Decomposition
- Risks
- Non-Goals
- References
- Governance, proof-owned surfaces and change process
- Live status ledger
- Proof-Safe Profile and Backend Contract
- Runway and Stall Policy
- Activation Record and Kickoff Shape
1. Status and framing
Issue #88 asks for a mechanically checked correctness proof of spec_to_rest's
spec → IR → Z3 verification path. The issue itself flags this as unscheduled,
research-flavored, easily one person-year of work, filed primarily to give the
capability a concrete home so it isn't absorbed into other milestones.
This doc updates that framing for 2026:
- The Z3-proof-replay path that #77 partly assumed has not materialized. Z3
4.13 (the version this project pins via
z3-turnkey) emits only its undocumented 2008-era natural-deduction term tree;:proof_format alethewas never shipped in any Z3 release, and quantifier instantiations remain opaque (see §3). - The closest published prior art (Cohen, Princeton PhD, 2025) is a 5-person-year effort that verifies five Why3 IVL transformations in Coq, and even Cohen left monomorphization, the SMT-LIB printer, and end-to-end SMT soundness in the trusted computing base (TCB).
- A cheaper, more recent template exists: per-run translation validation à la Parthasarathy et al. (PLDI 2024 / POPL 2025) emits an Isabelle proof certificate for each verifier run showing "if the IVL output is correct then the source is correct." This is the realistic 2026 path for a project our size.
The recommendation in this doc is therefore: decompose #88 into a tractable five-milestone plan (M_L.0 → M_L.4) that ships translation validation first, then moves toward meta-soundness only for the verified subset (§6) once a contributor signs up for the M_L.2 commitment.
As of M_G.0, the first honest theorem target is the in-memory
ServiceIR → Z3Script path used by Consistency.runConsistencyChecks, rather than the
optional SmtLib.scala exporter. The exporter stays outside the first ship claim
until a later milestone.
Status: execution track shipped. M_G.3 committed the runway, M_G.4 activated
M_L.*, and the post-pivot Isabelle/HOL track delivered through Phase 7
(see STATUS.md).
Issue #88 closed 2026-04-26 with
all M_L sub-issues (#126–#130)
resolved. The universal soundness theorem closes with zero sorry,
Code_Target_Scala produces the production translator at
ir/generated/SpecRestGenerated.scala, and the
A8RoundTripOracleTest exercises every in-subset shape end-to-end. Per-run
certificate emission was deleted post-pivot (vestigial after universal soundness,
Option 2 of #193 review).
2. The trust chain today
Today, "we proved this spec correct" is shorthand for the following chain:
Each link is a potential silent-failure point:
| Link | Failure mode |
|---|---|
| Prose semantics → IR | Spec language has only English-prose semantics; nothing to refine the IR builder against. |
| IR builder | Hand-written; tested via fixtures, rather than proven. |
| Z3 Translator | 1917 LOC of Scala. 13 of 25 Expr cases are fully translated, 8 partial, 4 raise TranslatorError. The encoding choices for entities (uninterpreted sort + field functions), state (pre/post functions), and quantifier domains are defensible but unverified (see codebase-analysis appendix below). |
| Z3 itself | Has had soundness CVEs historically. Pinning at 4.13 deters silent verdict flips on upgrade but does not eliminate the trust assumption. |
Mechanically verifying every link is far beyond a project our size. The smallest useful target is the IR → Z3 step: it's the largest hand-written piece, the one whose semantic behaviour we control, and the one closest to user-visible verdicts.
3. Why "reconstruct Z3 proofs" does not work in 2026
Issue #77 (closed) sketched a "verify-as-gate" path that gestured at proof export and Alethe-via-Z3 as a future direction. Research as of April 2026 says that path is blocked at the Z3 side:
3.1 Z3 has no alethe export
Direct search of the Z3 release notes
finds no mention of "alethe" in any release. Issue search alethe repo:Z3Prover/z3
returns zero results. The release notes mention only:
- 4.11.2: "change proof logging format for the new core to use SMTLIB commands. The format was so far an extension of DRAT used by SAT solvers"
- 4.12.0: "sat.smt.proof.check_rup ... apply forward RUP proof checking"
Z3 4.13 (the version pinned by z3-turnkey % 4.13.0.1 in build.sbt) emits only its
undocumented IWIL 2008 natural-deduction proof
format. Quantifier instantiation steps (quant-inst) appear with no machine-readable
witness justification, the very steps that dominate spec_to_rest's preservation
checks (5 invariants × 10 ops = 50 quantifier scopes per service).
3.2 cvc5-Alethe does not cover datatypes
The cvc5 Alethe documentation says verbatim:
Currently, the theories of equality with uninterpreted functions, linear arithmetic, bit-vectors and parts of the theory of strings (with or without quantifiers) are supported in cvc5's Alethe proofs.
Datatypes are not in this list. spec_to_rest's IR translator emits
declare-datatypes for entity records, sums, and option types (see
codebase analysis). A cvc5-as-proof-certifier
backend therefore requires re-engineering the SMT encoding to be datatype-free
(records as parallel UF arrays, sums as tag+payload via UF), a non-trivial rewrite,
not a flag flip.
3.3 Quantifier-instantiation proof bloat
E-matching with non-trivial trigger sets routinely yields 10³–10⁵ ground instances per quantifier (see Reynolds, SMT 2023 and DSLab "Conjecture Regarding SMT Instability"). For a typical preservation suite, expect Alethe proof files in the tens to hundreds of MB, with Carcara check times in minutes. This is workable for one-off audits, prohibitive for CI-on-every-PR.
3.4 What this means
The "Z3 emits a checkable proof + Carcara/ITP replays it" architecture from #77 is not viable in 2026. Three implications:
- Translator soundness must be proven as a meta-theorem about our
translatefunction, rather than by replaying each Z3 run's proof. - The Z3 verdict remains an oracle in our trust base. This is the same posture as F*, Dafny, Verus, and Why3-O, none of them check Z3 proofs; they verify the encoder.
- An "external solver agreement" CI job is still useful as cheap defense in depth: emit our SMT-LIB, run cvc5 in parallel, alert on disagreement. Doesn't need proof export. Belongs in #77 follow-up, rather than here.
4. Prior art: 2024-2026 Snapshot
A 12-month survey produced the following landscape. Each entry tagged with how applicable it is to spec_to_rest's IR→Z3 translator.
4.1 Cohen, "a foundationally verified intermediate verification language" (princeton phd, 2025)
Closest match to #88. Defines Why3's P-FOLDR (Polymorphic First-Order Logic with Datatypes and Recursion) in Coq, gives it a denotational semantics, and proves five Why3 transformations sound:
eliminate_definition(recursive funs → unfolding axioms)- elimination of inductive predicates
compile_match(the first machine-checked pattern-matching compiler)- ADT axiomatization
- a few smaller passes
Five person-years, single PhD student plus advisor and Sandia mentor. Coq 8.20,
extracted to OCaml as Why3-O (plug-compatible with the real Why3 OCaml API).
What stays in the TCB even after this PhD: monomorphization, the SMT-LIB printer, the SMT solver itself.
Pitfalls Cohen called out, all of which apply to us:
- Well-typedness preservation under context-modifying transformations is harder than soundness.
- Pattern-matching compilation termination needs a non-obvious well-founded measure.
- Mixed record-inductive types fight Coq's positivity checker.
- ADT axiomatization (with non-uniform constructors and metadata) is the single hardest transformation.
- The semantics layer (the formal language definition) absorbed more time than the translator transformations.
Applicability to spec_to_rest: the same architectural skeleton transfers, we'd
build a deep-embedded Expr ADT in Lean/Coq/Isabelle, a denotational semantics, and
prove eval e = smtEval (translate e) for our subset. We would not attempt to
match Cohen's depth (polymorphism, recursive funs, inductive preds); the verified
subset (§6) is intentionally smaller.
4.2 Parthasarathy et al., "towards trustworthy automated program verifiers" (PLDI 2024)
Different design: instead of proving the language-to-IVL translator sound once and for all, emit an Isabelle proof certificate for each verifier run that shows "if the IVL program is correct then the source program is correct." arXiv 2404.03614.
POPL 2025 sequel (Formal Foundations for Translational Separation Logic Verifiers) extends to Viper-style separation logic.
Why this matters: per-run translation validation is dramatically cheaper than per-language meta-soundness, no need to formalize the entire source language semantics, only enough to express each instance. The certificate kernel can be small (hundreds of LOC), checkable quickly, and updates automatically as the translator evolves.
Trade-off: certificates cover only inputs we've actually translated. Meta- soundness covers the universe of well-typed inputs.
4.3 Lean-smt and Isabelle/HOL alethe reconstruction
lean-smt (CAV 2025): a Lean 4 tactic that translates a Lean goal to SMT-LIB, hands it to cvc5 (not Z3), and replays the resulting Alethe proof in the Lean kernel. ~71% of cvc5 proofs reconstruct (15,271 of 21,595 benchmarks); 98% of successful reconstructions complete in <1s.
Isabelle Alethe pipeline (ITP 2025): extends Sledgehammer's veriT-only reconstruction (Schurr/Fleury, 2019) to also support cvc5. Five years on, both veriT and cvc5 reconstruct in Isabelle/HOL.
Wrong direction for us. These tools answer: given a goal in Lean/Isabelle, can we discharge it via SMT and replay the proof? Our question is the inverse: given our hand-written translator from spec-IR to SMT-LIB, is it sound? lean-smt and Sledgehammer would only help us as authoring tools for the meta-soundness proof, we can use them to discharge sub-lemmas, rather than to validate the translator.
Z3 is also absent from both pipelines: Sledgehammer's Z3 oracle uses the legacy proof format which is unmaintained, and lean-smt is cvc5-only.
4.4 Verified VCG for Dafny (nezamabadi/myreen/tan, dec 2025)
arXiv 2512.05262. Big-step semantics for an imperative Dafny subset (mutually-recursive methods, while loops, arrays, no records, sets, quantifiers, or partial functions) plus a verified VCG plus a verified compiler to CakeML. HOL4. The VCG produces verification conditions in HOL4's logic, not SMT-LIB, they short-circuit the SMT step entirely.
Useful as a template for the VCG side (preservation obligations as HOL/Lean propositions), less useful for the SMT-encoding side because they don't generate SMT.
4.5 F* → SMT encoding (aguirre/hriţcu, 2016 → ongoing)
Negative result. Towards a Provably Correct Encoding from F* to SMT formalized a fragment in Coq; ten years later no completion. Reason: refinement types + monadic effects too rich for the formalization to remain tractable.
Lesson for us: keep the source language rigidly small. Resist scope creep into the rich corners of the IR (lambdas, set comprehension, the-operator) until the simple core works.
4.6 No mechanized Alloy or Dafny surface semantics
Search did not surface a Coq/Isabelle/Lean formalization of Alloy's relational semantics (Daniel Jackson, Software Abstractions) nor of Dafny's reference manual. Astra (UWaterloo, 2019) evaluated Alloy→SMT-LIB empirically but did not prove it. The most actively developed comparable formalization is TLA* in Isabelle/HOL by Grov & Merz (AFP, 2011-2025), shallow-embedded, covers the temporal layer, rather than a translator to SMT.
Lesson: any spec language semantics we mechanize is novel work.
4.7 Older prior art worth knowing
- CompCert (~100 kLOC Coq, 6 person-years) sets the upper bound on this kind of work and is the architectural inspiration.
- AliveInLean (CAV 2019) verifies a peephole-optimization checker for LLVM in Lean 4, closest to "verified SMT encoder" in the LLVM space, but limited to bit-vector and array peepholes.
- HOL-Boogie (TPHOLs 2008) embeds Boogie's output into Isabelle/HOL for interactive VC discharge, does not verify the translator itself.
- SMTCoq (smtcoq.github.io) verifies proofs returned by SMT solvers (veriT/CVC4/cvc5), rather than encodings into them. Quantifier-free. Different problem.
4.8 Closest-prior-art summary table
| Project | Source | Prover | What they proved | Pitfall |
|---|---|---|---|---|
| Cohen, Why3-in-Coq (2025) | Why3 P-FOLDR | Coq | 5 IVL transformations sound | 5 person-years; monomorphization & SMT-LIB printer in TCB |
| Nezamabadi et al., Dafny VCG (2025) | Dafny subset | HOL4 | VCG sound; CakeML compiler correct | Bypasses SMT; no records/sets/quantifiers |
| Parthasarathy et al., Trustworthy Verifiers (2024-25) | Viper | Isabelle | Per-run forward-simulation cert | Cert covers only translated inputs, rather than all inputs |
| Aguirre/Hriţcu, F*→SMT (2016-) | F* fragment | Coq | Soundness of fragment encoding | Stalled 10+ years; F* too rich |
| AliveInLean (2019) | LLVM IR peepholes | Lean 4 | Encoder verified | BV/array only |
| Grov/Merz TLA* in Isabelle (2011-25) | TLA* | Isabelle | Embedding + derived rules | No translator to SMT |
5. Two paths and why we recommend both, sequenced
5.1 Path a: Translation validation (cheap, recommended first)
Per-run certificates following Parthasarathy 2024. For each verifier run, emit a proof object (in Lean/Isabelle/Rocq) showing that for this specific IR, the SMT output is a correct encoding.
Cost: ~3-6 person-months for the certificate kernel + emission glue. The
certificate kernel is small (hundreds of LOC) and stable; the per-run emission is
mechanical (one proof step per translate case).
Coverage: only inputs we've actually translated.
Trust assumption: the certificate-checking kernel + its embedding of the semantic domain.
Why first: ships verifiable trust improvement in months, rather than years. Acts as a forcing function for documenting the IR's intended semantics (which doesn't exist in any machine-checkable form today).
5.2 Path b: Meta-soundness (expensive, optional follow-up)
Cohen-style. Deep-embed Expr and TypeExpr in a proof assistant; define a
denotational semantics; define translate as a function in the proof assistant;
prove denote_smt(translate(e)) = denote_ir(e) for every well-typed e in the
verified subset.
Cost: 6-12 person-months for the verified subset (§6) at expert rates; double-to-triple at non-expert rates. Per A5's analysis: ~1,350 LOC for the semantics layer (M_L.1) and ~3-5× that for the soundness theorem itself (M_L.2).
Coverage: all well-typed inputs in the verified subset.
Trust assumption: the proof assistant's kernel + the embedding's accuracy
(deep Expr, shallow semantic domain, see §7.2).
Why deferred: only worth doing if a contributor signs up for the multi-month commitment, and only after Path A has documented the IR semantics rigorously enough to enable it. Without Path A, M_L.1 is starting from scratch on the semantics; with Path A, M_L.1 reuses a Path-A-validated semantic skeleton.
5.3 Why both, sequenced
Path A's certificate kernel is Path B's meta-soundness skeleton, scoped to a single input. Building A first means:
- Faster initial trust improvement (months, rather than years).
- Forcing-function for writing the IR semantics in a machine-checkable form.
- The artifact built in A is reusable: M_L.2's "translation soundness" is the universal-quantifier version of A's per-input certificate.
The alternative, start with B, risks Cohen's outcome at smaller scale: 9 months into the semantics layer with no end-to-end deliverable.
6. The verified subset
Anchored to the codebase analysis.
Per April 2026, Translator.scala covers 13/25 Expr cases fully, 8 partially,
4 with TranslatorError. The verified subset for M_L.1 picks the smallest set
that exercises the four core SMT pillars:
- Boolean reasoning (∧, ∨, ⇒, ¬)
- Linear integer arithmetic (=, <)
- Uninterpreted predicates (membership in a state relation)
- Bounded quantification (over enums; over fixed-size entity collections)
6.1 Operators in scope (M_L.1)
| Expr case | Why included |
|---|---|
BinaryOp(And, _, _) | Boolean conjunction |
BinaryOp(Or, _, _) | Boolean disjunction |
BinaryOp(Implies, _, _) | Required for invariants and ensures |
BinaryOp(Eq, _, _) | On Int and on entity-typed values |
BinaryOp(Lt, _, _) | LIA comparison |
BinaryOp(In, _, _) | Membership in a state relation domain |
UnaryOp(Not, _) | Boolean negation |
UnaryOp(Negate, _) | LIA negation (0 - x) |
Quantifier(All, _, _) over enums | Universal binding |
Let(_, _, _) | Local binding |
IntLit, BoolLit, Identifier | Atoms |
Plus the IR top-level shells:
EntityDecl(flat, no inheritance, no per-field constraints)EnumDeclStateDeclwith scalar fields and simple domain-typed relations onlyOperationDeclwithrequiresandensures(no state mutation in M_L.1; M_L.2 addsPrime/Pre)InvariantDecl(single conjunctive predicate)
6.2 Operators explicitly out of scope (deferred)
| Excluded | Rationale |
|---|---|
BinaryOp(Add|Sub|Mul|Div, ...) | Defer to M_L.2; need carrier-set proof for Int |
BinaryOp(Subset, ...) | State-relation identifier subset is in subset; arbitrary set subset remains deferred |
BinaryOp(Union|Intersect|Diff, ...) | Closed for set-valued expressions in issue #195 |
UnaryOp(Cardinality) | Currently only on state relations; defer to state-mutation milestone |
UnaryOp(Power) | Translator already raises TranslatorError (undecidable in FOL), permanent exclusion |
Quantifier(_, _, _) over entity collections | Defer to M_L.2 (needs frame axioms) |
SetComprehension, MapLiteral, SeqLiteral | Out of scope until collections milestone |
SetLiteral | Closed for finite literals in issue #195 |
If, Lambda, Constructor, SomeWrap, The, NoneLit | Translator already raises TranslatorError for most |
Index, Call, EnumAccess (dynamic), With, Matches | Defer; require advanced encoding |
Prime, Pre | M_L.2 adds two-state coupling |
FieldAccess | M_L.2 (records subsumed by entity decls) |
TransitionDecl, TemporalDecl, FunctionDecl, PredicateDecl | Out of scope; lives in separate temporal/derived-logic milestones |
This subset hits ~10 operators, ~20 LOC of Lean per operator for the
denotation, and admits an automatic-decidability story (every quantifier in scope
is over a finite domain, so eval returns Bool instead of Option Bool for
many branches).
6.3 Sort encoding in M_L.1's translator
Mirroring Translator.scala choices, simplified:
- Bool, Int: SMT-LIB native sorts.
- Enums: uninterpreted sort + member constants + distinctness axiom. Cardinality finite and known.
- Entities: uninterpreted sort + per-field accessor function (
Entity_field : Entity → FieldSort). No datatype constructors in M_L.1 (avoids cvc5-Alethe blocker; matches Cohen's "ADT axiomatization" pattern). - State: tuple of scalar functions (no maps, no relations beyond domain membership predicates) in M_L.1. M_L.2 expands.
7. Picking a proof assistant
7.1 Three candidates compared
| Criterion | Lean 4 + mathlib4 | Isabelle/HOL + AFP | Coq/Rocq |
|---|---|---|---|
| Z3 reconstruction available? | No (lean-smt is cvc5-only) | No (Sledgehammer Z3 is legacy; ITP 2025 work targets cvc5/veriT) | No (SMTCoq targets veriT/cvc5) |
| Closest prior-art language | AliveInLean (LLVM peepholes) | TLA*, Z, Object-Z, Cohen's framework if ported | Cohen's Why3-O (the Coq variant) |
| Toolchain churn | Quarterly (Lean 4.27→4.28→4.29→4.30 in 14 weeks Feb-Apr 2026) | Yearly Isabelle releases; AFP push-through | Yearly Coq → Rocq transition; stable post-2025 |
| Long-lived single-author projects | Few (ecosystem <5 years) | CryptHOL (9 yrs), TLA* (14 yrs), HOL-Z (25 yrs) | CompCert (20 yrs) |
| Scala-team learning curve | Lowest (most syntactically similar to Scala 3) | Medium (Isar prose-style proofs unfamiliar) | Medium-high (tactic style, ssreflect) |
| Mathlib4 record / Finset support | Excellent (structure, Finset α, decidability) | Excellent (record, Set, fset library) | Good (Record, MSet, but more boilerplate) |
| Risk of mathlib churn breaking proofs | High | Low (definitional shallow embeddings rarely break) | Low |
7.2 Decision: Isabelle/HOL (post-#193 pivot)
Active proof assistant: Isabelle/HOL (Isabelle2025-2). The original recommendation was Lean 4 (see §7.4 historical note); the project pivoted to Isabelle/HOL on 2026-05-04 after the M_L.0-M_L.5 first-ship gate landed. Tracking issue: #193.
Why Isabelle won the rematch:
- Production-grade Scala extraction:
Code_Target_Scala(in HOL-Library) ships Lean's missing piece. The verifiedtranslate,eval, andsmt_evalfunctions extract directly to ~1.4 kLoC of idiomatic Scala 3. The Scala layer's translator is no longer hand-written, it is the extracted Isabelle definition. - Toolchain stability: Isabelle releases yearly with AFP push-through migration. Lean 4 ran 4.27 → 4.30 in 14 weeks Feb-Apr 2026, quarterly churn is expensive at single-author scale.
- Stronger automation on the proof side: the universal soundness theorem
closes in ~600 LoC of Isabelle (
apply (cases ...; auto)+ per-case*_steplemmas) vs ~5400 LoC of Lean's case-bashing. - Cleaner TCB: Isabelle kernel + Z3 driver + extracted-Scala output. No
Lean.ofReduceBoolaxiom, nocert_decideper-run native compilation.
Active configuration:
lean-toolchainretired;proofs/isabelle/SpecRest/lean-toolchaindoesn't exist. Isabelle pin lives at the host level (Isabelle2025-2 installed via the system'sisabellebinary).- No mathlib analog: imports are restricted to
Main+HOL-Library. AFP entries are only imported when load-bearing (none currently). - Records use
[code]-markeddefsto bypass the polymorphic-scheme codegen barrier observed during Phase 5 of #193.
7.4 Historical: Lean 4 was the original choice
The pre-pivot recommendation cited Lean 4's smallest learning-curve gap from
Scala 3 and the AliveInLean precedent. Both arguments still hold; they were
outweighed by the lack of a production Scala extractor in Lean's ecosystem
and the quarterly toolchain churn. The Lean track at proofs/lean/ was
retired with the pivot, see #193
for the full decision record and #202
for the related (deferred) IR-canonicalization follow-up.
7.3 Embedding shape (locked across all three candidates)
- Source IR (
Expr,TypeExpr, declarations): deep embedding as inductive types mirroringmodules/ir/.../Types.scala1:1. Required because the soundness theorem∀ e. denote(translate(e)) = eval(e)quantifies overesyntactically. - Semantic domain: shallow.
Int → Lean Int,Bool → Lean Bool,Set α → Mathlib Finset α(or coreListif we sidestep mathlib), entity sorts as opaque type variables. - SMT-LIB target: shallow. Interpret SMT terms directly as
Prop, no need for meta-reasoning over SMT syntax, only over IR syntax.
This is the standard hybrid Why3-in-Coq, AliveInLean, and Concrete Semantics IMP all use (Gibbons & Wu; Annenkov & Spitters).
8. Milestone decomposition
Five milestones. Each has a contributor sign-off gate before the next opens.
8.1 M_L.0, scope, scaffolding, contributor handoff
Effort: ~1 week part-time. Lands when a contributor signs up for M_L.1.
Deliverables:
- This research doc (you're reading it).
- A
proofs/lean/directory with a minimallakefile.toml, an emptySpecRestnamespace, and a one-pageREADME.mdlinking back here. - A pinned
lean-toolchain(recommend latest stable Lean 4 release as of M_L.0 start, e.g.,leanprover/lean4:v4.30.0). - A
.github/workflows/lean.ymljob that runslake build. Off the critical path (separate matrix); failure does not block PRs. - An initial PR-template note that anyone touching
modules/ir/.../Types.scala'sExprADT should add a TODO entry toproofs/lean/SpecRest/IR.lean.todo.
Acceptance: empty Lake project compiles green in CI.
8.2 M_L.1, IR denotational semantics for the verified subset
Effort: 6-10 person-weeks at expert rates; 12-20 at non-expert.
Deliverables:
proofs/lean/SpecRest/IR.lean: deepExpr,TypeExpr,EntityDecl,EnumDecl,StateDecl,OperationDecl,InvariantDecl, restricted to the §6.1 subset.proofs/lean/SpecRest/Semantics.lean:ValueADT,Env,State,eval : Env → State → Expr → Option Value. Per-operator denotation lemmas.proofs/lean/SpecRest/Examples.lean: round-trip tests forsafe_counter.specfragments (parsed by hand for now; M_L.4 wires the parser).
Acceptance:
- All denotation lemmas closed (no
sorry). safe_counter.specinvariantcount ≥ 0evaluates toTrueunder a hand-built initial state.
LOC estimate (per A5 calibration, anchored to Concrete Semantics IMP):
| Component | Lean LOC |
|---|---|
Inductive Expr + TypeExpr | 80 |
Value + Env, State | 80 |
eval core | 150 |
| Quantifier + decidability | 120 |
| OperationDecl + InvariantDecl | 80 |
| Per-operator denotation lemmas (10 × ~30) | 300 |
| Round-trip examples | 100 |
| Total | ~900 |
(A5 reported ~1,350 with collections+records; M_L.1 trims those, dropping ~450 LOC.)
8.3 M_L.2, translator soundness theorem for the verified subset
Effort: 3-5× M_L.1, i.e. ~6-12 person-months.
Deliverables:
proofs/lean/SpecRest/Smt.lean: shallow embedding of the SMT-LIB fragment we emit (Bool ops, LIA, UF, bounded quantifiers).proofs/lean/SpecRest/Translate.lean: Lean version of the Scala translator, function-by-function mirror of the §6.1 cases inz3.Translator.scala.proofs/lean/SpecRest/Soundness.lean: theorem∀ e, well_typed e → eval e = smtEval (translate e).- An audit appendix in
proofs/lean/README.mdmapping each Lean translation case to the corresponding Scala line range for human cross-checking.
Acceptance:
Soundness.leancloses with nosorryfor the §6.1 subset.- A Scala test (
modules/verify/.../TranslatorAuditTest.scala) checks that anyExprcase marked "in the verified subset" still has a counterpart inTranslate.lean.
Trust assumption (after M_L.2):
- Lean's kernel.
- The shallow SMT-LIB embedding's accuracy (auditable in <100 LOC).
- The hand-written Scala translator matches the Lean
translatefunction case-for-case (audited; if it diverges the Scala test fires). - Z3 itself.
8.4 M_L.3, translation validation (per-run certificate emission)
Effort: ~3-6 person-months, parallel to M_L.2.
Deliverables:
modules/verify/src/main/scala/specrest/verify/cert/Emit.scala: emits a Lean source file<spec>.cert.leanperverifyrun, containing onetheorem cert_<id> : eval ir = smtEval smt_ir := by ...per check.proofs/lean/SpecRest/Cert.lean: tactic library (or simp set) that closes acert_<id>goal via a fixed proof script when the IR is in the verified subset.verify --emit-cert <dir>CLI flag, mirroring the existing--dump-vc.
Acceptance:
- For every fixture in
fixtures/spec/whoseExpris in the verified subset,verify --emit-cert /tmp/c && lake build /tmp/csucceeds. - Out-of-subset fixtures emit a
cert_<id> := sorryplaceholder with a comment pointing at the offending case.
Why parallel to M_L.2: M_L.3 only needs M_L.1's semantics, not M_L.2's universal-quantifier theorem. M_L.3 ships earlier trust gain; M_L.2 generalizes it.
8.5 M_L.4, subset expansion
Open-ended. Add Add/Sub/Mul/Div, then collections, then state mutation,
then quantifiers over entity collections. Each expansion follows the M_L.1 + M_L.2
template: extend the deep IR, extend eval, extend translate, extend the
soundness theorem.
Sequencing recommendation:
- LIA arithmetic (
Add,Sub,Mul,Div), needed by ~70% of real specs. - State mutation (
Prime,Pre,OperationDeclwith state updates). - Quantifiers over fixed-size entity collections (with frame axioms).
- Finite collections (
Set,Map,Seqwith bounded ops). - Records /
Withupdates /FieldAccess. Constructor,EnumAccess(dynamic).
Each item is its own multi-month milestone; punt indefinitely if the §5.1 path-A certificates make it unnecessary for any real consumer.
9. Risks
9.1 Doc rot
This doc anchors to modules/ir/.../Types.scala's 25-case Expr ADT. If Expr
evolves (new cases, renamed cases, restructured sums), the verified subset table
in §6 drifts. Mitigation: a PR-template note (added in M_L.0) reminding
contributors to update §6 when touching Expr. A lint pass in M_L.2 flags
Expr cases without a corresponding proofs/lean/SpecRest/IR.lean mirror.
9.2 Z3 verdict drift across versions
Path B (meta-soundness) does not bind us to a Z3 version, the soundness theorem
says "if Z3 returns unsat, the IR property holds." Path A (per-run certs) is
similarly version-agnostic. However, an "external solver agreement" CI job
(running cvc5 in parallel and alerting on disagreement) is cheap defense in depth
and belongs in a #77 follow-up. Out of scope here.
9.3 Lean toolchain churn
Lean shipped 4.27, 4.28, 4.29, 4.30 in 14 weeks (Feb-Apr 2026). Each release can
break dependent projects. Mitigation: pin lean-toolchain; bump only
quarterly; avoid mathlib4 dependency (M_L.1 needs none). The "every 6-12 months,
1 day to bump" maintenance cost is acceptable; the "every 3 weeks, scramble
because mathlib reorganized" cost is not.
9.4 Contributor abandonment mid-milestone
The work is part-time, opportunistic. A contributor disappearing mid-M_L.1 strands ~500 LOC of half-finished Lean. Mitigation:
- M_L.0 only opens the scaffolding when a contributor commits to M_L.1.
- M_L.1 is broken into 8 phases (per A5 §4) each independently mergeable.
- A
proofs/lean/STATUS.mdfile tracks per-operator completion so a successor can resume.
9.5 The semantics-layer trap (cohen)
Cohen's PhD spent more time on the semantics layer than on the translator transformations. We mitigate by deliberately picking the smallest non-trivial subset (§6.1) and resisting expansion until M_L.1 + M_L.2 ship. A "verified subset" at 10 operators is more valuable than a half-finished verified subset at 25 operators.
9.6 The f* trap
F*'s SMT encoder has been "open" for 10+ years. Their failure mode: refinement
types + monadic effects too rich for the formalization to remain tractable. We
avoid this by excluding rich corners of the IR upfront (§6.2 lists every
deferred operator with a why) and by treating TranslatorError-raising operators
as permanent exclusions (Power, Lambda, If, see §6.2).
9.7 Mismatch between lean translate and Scala translator.scala
Path B proves the Lean translate function sound. Production uses Scala
Translator.scala. Mitigation: M_L.2 ships an audit appendix mapping each
Lean case to the corresponding Scala line range; TranslatorAuditTest.scala
machine-checks that the verified subset's case set in Lean equals that in Scala.
If they diverge (Scala adds a case Lean lacks, or vice versa), the test fires.
The full "extract Scala from Lean" approach (Why3-O, Cohen 2025) is out of scope, it would more than triple the M_L.2 cost. We accept the audit-by-test mitigation as good-enough.
10. Non-goals
Inherits #88's non-goals, plus:
- Verifying the Alloy translator. Same approach, separate effort. Alloy semantics in any prover does not exist, would be novel work twice over.
- Verifying the parser. Treat the parser CST as given; this work targets IR → solver only.
- Verifying Z3 itself. Z3 remains in the trust base. SMTCoq-style proof replay is blocked at the Z3 side (see §3) and a switch to cvc5 is a separate multi-month rewrite (§3.2).
- Verifying the SMT-LIB serializer (
SmtLib.scala). Trivially small, audited by inspection, in the TCB. Cohen made the same call. - Ahead-of-time correctness for every spec construct. Aim for a verified subset that grows; full coverage is M_L.4-and-beyond, possibly never.
- Human-readable proof narration of soundness violations. That's #20's territory ("operation X violates invariant Y because…"). Soundness proof failures should not happen in production after M_L.2; if they do, the bug is in Lean, rather than in user output.
11. References
11.1 Closest prior art
- Cohen, A Foundationally Verified Intermediate Verification Language (Princeton PhD, May 2025), code at github.com/joscoh/why3-semantics
- Cohen, A Formalization of Core Why3 in Coq (POPL 2024)
- Parthasarathy et al., Towards Trustworthy Automated Program Verifiers (PLDI 2024)
- Parthasarathy et al., Formal Foundations for Translational Separation Logic Verifiers (POPL 2025)
- Nezamabadi/Myreen/Tan, Verified VCG and Verified Compiler for Dafny (Dec 2025)
11.2 SMT proof reconstruction in proof assistants
- Lachnitt et al., Improving the SMT Proof Reconstruction Pipeline in Isabelle/HOL (ITP 2025)
- Schurr/Fleury, Reconstructing veriT Proofs in Isabelle/HOL (PxTP 2019)
- lean-smt: An SMT Tactic for Discharging Proof Goals in Lean (CAV 2025)
- cvc5 → Isabelle reconstruction blog (2024)
- Lachnitt et al., IsaRare: Automatic Verification of SMT Rewrites in Isabelle/HOL (TACAS 2024)
- Böhme, Proof Reconstruction for Z3 in Isabelle/HOL (PhD)
- SMTCoq
11.3 Z3 proof formats
- de Moura, Bjørner, Proofs and Refutations, and Z3 (IWIL 2008)
- Z3 release notes
- Z3 discussion #4881, On proof generation and proof checking
- Carcara: Proof Checker for Alethe (TACAS 2023) and github.com/ufmg-smite/carcara
- Alethe specification and arXiv 2107.02354
- cvc5 Alethe proof output
- Reynolds, Selecting Quantifiers for Instantiation in SMT (SMT 2023)
11.4 Spec-language semantics formalizations
- Grov & Merz, TLA in Isabelle/HOL (AFP, 2011-2025)
- Brucker/Rittinger/Wolff, HOL-Z (Z notation in Isabelle)
- Lamport/Merz/Newcombe, The Future of TLA+ (2024)
- Aguirre/Hriţcu, Towards a Provably Correct Encoding from F* to SMT (MS thesis)
- Astra, Evaluating Translations from Alloy to SMT-LIB (UWaterloo, 2019)
- Pierce et al., Software Foundations, Hoare
- Nipkow & Klein, Concrete Semantics
11.5 Embedding-style references
- Gibbons & Wu, Folding Domain-Specific Languages: Deep and Shallow Embeddings
- Annenkov & Spitters, Deep and Shallow Embeddings in Coq
- Verifying Programs with Logic and Extended Proof Rules: Deep vs. Shallow Embedding
11.6 Lean 4 / mathlib4 ecosystem
11.7 Spec_to_rest cross-references
docs/content/docs/research/06_spec_verification.md, verification pipeline designdocs/content/docs/spec-language.mdx, current English-prose semantics- Issue #88, Mechanically verified translator soundness
- Issue #77, Proof-certificate / unsat-core export (closed)
- Issue #20, M4.4 error reporting + spans
A. codebase translator coverage (april 2026)
Snapshot (April 2026) of verify/z3/Translator.scala
(1917 LOC) measured against the 25-case Expr ADT that lived in the (now-deleted) hand-written
modules/ir/src/main/scala/specrest/ir/Types.scala. Used to define the verified subset in §6.
Post-#202 the canonical IR is the extracted expr_full (27 ctors) in
ir/generated/SpecRestGenerated.scala, with the verified
subset surfaced as expr in the same file; this appendix is preserved as a historical
snapshot of the pre-canonicalization translator coverage.
Expr case | Translator status | Notes |
|---|---|---|
BinaryOp(And|Or|Implies|Iff) | full | direct mapping (lines 629-633) |
BinaryOp(Eq|Neq) | full | dom/set-comprehension equality special-cased (616-621), fallback Z3Expr.Cmp (643) |
BinaryOp(Lt|Gt|Le|Ge) | full | CmpOp mapping (634-643) |
BinaryOp(In|NotIn) | full | membership via state relation dom, set literal, set membership (622-757) |
BinaryOp(Subset|Union|Intersect|Diff) | full | with sort validation; fails on non-SetOf (661-693) |
BinaryOp(Add|Sub|Mul|Div) | partial | integers only; fails on string/set arithmetic (650-653) |
UnaryOp(Not) | full | direct Z3Expr.Not (866) |
UnaryOp(Negate) | full | as 0 - operand (867-868) |
UnaryOp(Cardinality) | partial | only on state-relation identifiers (876-881) |
UnaryOp(Power) | errored | "powerset operator is not decidable in first-order SMT" (870-874) |
Quantifier | full | ∀/∃ supported; ∄ encoded as ¬∃ (905-930) |
SomeWrap | errored | catchall (597-601) |
The | errored | catchall (597-601) |
FieldAccess | full | via entity field functions (971-995) |
EnumAccess | full | via enum member constants (1132-1142) |
Index | partial | only on state-relation references (997-1009) |
Call | partial | only identifier callee; hardcoded builtins (len, isValidURI); higher-order fails (1011-1032) |
Prime / Pre | full | state-mode switching for post/pre-state (589-592) |
With | full | record update via Skolem constant + equality constraints (1061-1098) |
SetComprehension | errored | only allowed inside membership; standalone fails (1100-1105) |
SetLiteral | partial | non-empty literals; empty literals require type context and are covered in membership contexts |
MapLiteral | errored | catchall (597-601) |
SeqLiteral | errored | catchall (597-601) |
Matches | full | as uninterpreted predicate, mangled by pattern + arg sort (1037-1049) |
IntLit, BoolLit, StringLit | full | direct literals or constants (577-579) |
NoneLit | errored | catchall (597-601) |
Constructor | errored | catchall (597-601) |
If | errored | catchall (597-601) |
Lambda | errored | catchall (597-601) |
Let | full | environment extension (1051-1059) |
Summary: 13 fully handled, 8 partial, 4 errored. The verified subset (§6.1) draws exclusively from the "fully handled" column, restricted further to operators whose Lean denotation is <30 LOC each.
Sort-encoding decisions mirrored in M_L.1's IR.lean:
- Entities: uninterpreted sort
U:EntityName+ accessor functionsEntityName_fieldName : EntityName → FieldSort(319-328). - Enums: uninterpreted sort + member constants + distinctness axioms (278-283).
- Options: flattened to inner type (
OptionType(T)→T, line 541). - Sets: SMT-LIB
(Set elemSort)with store/select (SmtLib.scala:66-72). - Strings: uninterpreted sort + fresh constants per literal + distinctness (529-537).
- State relations (RelationType/MapType): pair of functions
<field>_dom : K → Booland<field>_map : K → V(465-495). - Scalar state fields: single constant function
state_<field> : () → T(497-501). - Post-state: same encoding with
_postsuffix; toggled viactx.stateMode(39-40, 477-478, 503-510, 589-592, 603-607).
12. Governance, proof-owned surfaces and change process
Originally
11_global_proof_governance.md. Issues #170, #171, #172, #174, #175.
12.1 Why governance exists
#88 is only tractable if the proof target stops moving invisibly. The failure mode is not
"someone merges a bad theorem", it is Types.scala, the Z3 translator, or the
routing/orchestration contract changing while the proof effort still assumes the old shape.
That is how a global-proof project turns into a long-lived side quest with no honest ship
claim.
M_G.1 therefore governs the proof target before proofs/lean/ exists. The goal is not to
freeze the whole repo, the goal is to make target movement explicit.
12.2 Governance mode
The current mode is controlled churn. The repo is not in a hard IR freeze yet. Changes to proof-governed surfaces are still allowed, but those changes must carry their proof impact in the same PR.
12.3 Proof-owned surfaces (master list)
| Class | Surface | Why it is governed |
|---|---|---|
| Proof-owned core | proofs/isabelle/SpecRest/IR.thy (extracted to ir/generated/SpecRestGenerated.scala) | Defines expr/expr_full/type_expr_full/service_ir_full, the AST shapes the proof mirrors. Since #202 the Scala IR is auto-extracted; do not hand-edit the generated file. |
| Proof-owned core | verify/z3/Translator.scala | Main translation function; prover-side mirror tracks case-for-case. |
| Proof-owned core | verify/z3/Types.scala | Z3Script, Z3Expr, artifact structures in the first theorem target. |
| Proof-owned core | proofs/isabelle/SpecRest/{IR,Semantics,Smt,Translate,Soundness,Codegen}.thy | The canonical proof track. Universal soundness theorem closes with zero sorries. |
| Drift-control artifact | verify/test/cert/generated/A8RoundTripOracleTest.scala | Round-trip oracle: every canonical probe runs through extracted lower → translate. Since #202 close-out, the projection is the extracted lower itself, no hand-rolled mirror remains. |
| Obligation contract | verify/Classifier.scala | Decides which checks are in the Z3 proof scope. |
| Obligation contract | verify/Consistency.scala | Defines the operational meaning of global, requires, enabled, preservation. |
| TCB-sensitive | parser/Parse.scala | Parser remains trusted; changes can narrow/widen the honest claim. |
| TCB-sensitive | parser/Builder.scala | IR construction remains trusted. |
| TCB-sensitive | verify/z3/Backend.scala | Runtime renderer from Z3Script to Z3 ASTs. |
| Proof-owned CI | .github/workflows/lean-certs.yml | Sidecar matrix: per fixture, verify --emit-cert + lake build. Six fixtures as of M_L.4.a-k. |
| Proof-state ledger | proofs/lean/STATUS.md | Per-Expr-case mirror; PR template enforces re-sync on Expr changes. |
| PR contract | .github/PULL_REQUEST_TEMPLATE.md | Carries the Expr-touch reminder fanning out to all of the above. |
Two non-members:
verify/z3/SmtLib.scala, export-only, rather than on the runtime path.verify/certificates/Dump.scala, affects artifacts, rather than the theorem target.
12.4 Required change process
Any PR touching a proof-governed surface must:
- Update §13 Live Status Ledger in the same PR.
- Classify the change: proof-target shape change / obligation-routing change / TCB-only change / program-commitment change.
- State whether the M_G.0 theorem statement or TCB summary changed.
- If proof-safe profile membership changed, update §14 profile.
- If owner/priority/paused-work/stall-rule changed, update §15 runway.
- If activation state or kickoff shape changed, update §16 activation.
12.5 Freeze states
- Ungoverned, normal repo churn (no proof-governed surface touched).
- Governed, current state for §12.3 surfaces. Move freely with logging.
- Frozen, temporary state for the M_L.2 closure window only. Surface may only change with matching proof update.
13. Live status ledger
Originally
12_global_proof_status.md. Issues #172, #174, #175.
13.1 Current baseline (post-M_L.4.a-k), first-ship gate met
-
Governance mode: execution track active; M_L.2 universal soundness closed for the §6.1 verified subset (zero
sorry). M_L.4.a/b/c/d/e/f/g/h all merged. -
First-ship claim status: MET as of 2026-05-02. The single-state idiom of the spec language (atoms, identifiers, scalar reads, all logical/arithmetic/comparison operators, state-relation membership / cardinality / Index / forall, FieldAccess (any entity-valued base, bare ident, Index result, chained
.f1.f2, quantifier-bound vars), single-statePrime/Precollapse, enum quantifiers and theirSome/No/Existscomposition aliases, NotIn composition, Subset over rel-identifiers via composition) is mechanically validated against the Z3 translator. The deployable claim:For every
ServiceIRwhose invariants and operationrequiresclauses fall in the §6.1 verified subset (now extended through M_L.4.k),z3.Translator.scala's output matches the Leantranslatefunction case-for-case, and the Lean meta-theoremSpecRest.soundnessproves that translation correlatesevalwithsmtEvalunder the correlatedSmtModelandSmtEnv. UNSAT verdicts on translated obligations therefore reflect properties of the spec, rather than just artifacts of the translator.Trust closure: M_L.1 IR/Semantics axioms + the
Lean.ofReduceBoolaxiom thatnative_decide(used by per-run M_L.3 certs) introduces. -
First theorem target: in-memory
ServiceIR → Z3Scriptpath used byConsistency.runConsistencyChecks. -
Active proof-safe profile: §14, see §14.4 for the per-
Exprcase ledger. -
Per-run translation-validation certs (M_L.3): working in CI matrix (
.github/workflows/lean-certs.yml× 6 fixtures). Six fixture certslake buildclean;safe_counter3/3 cert_decide,set_ops5/11,todo_list4/17,edge_cases8/15,url_shortener1/7,auth_service0/21 (z3 backend errors unrelated to subset). Stub reasons remaining:Callbuiltins (len), multi-binding quantifiers, strings and collection shapes outside finite set literals, operation-input env binding, two-state preservation. None are soundness gaps, they are out-of-subset shapes that emittheorem cert : True := trivialwith aTODO[M_L.4]marker and zerosorry. Nested FieldAccess (users[uid].email,forall t in tasks, t.field, chained.f1.f2) is no longer in this list, M_L.4.k brought it into the verified subset via the entity-id-keyed carrier. -
Proof owner: HardMax71.
-
Runway: initial six-week M_G.3 cycle consumed; subsequent cycles are unscheduled. See §15 for stall rule.
-
Outside the first-ship claim (still genuinely deferred):
SmtLib.scala, dump/export paths, Alloy-routed checks, proof replay, full-source semantics refinement, true two-state semantics forPrime/Pre(single-state collapse only, M_L.4.b-ext gates real preservation reasoning),Withrecord-update (bundled with M_L.4.b-ext), map/sequence literals, set comprehensions, strings,Call/Matches, multi-binding quantifiers.
13.2 Status labels
| Label | Meaning |
|---|---|
tracked | Surface is governed; changes must be logged. |
mirrored | Prover-side mirror exists; soundness theorem incomplete. |
fragment-proved | A bounded fragment is mechanically proved. |
ship-claim-ready | Surface is covered by the first honest public claim. |
reserved | Planned but not yet active. |
13.3 Governed surface ledger
The full ledger is maintained in proofs/lean/STATUS.md (per-Expr case granularity). The
high-level Scala side is in §12.3.
13.4 Update rules
When the ledger changes, the entry should say at least:
- which governed surface moved,
- whether the move changed syntax / semantics / routing / TCB,
- whether the M_G.0 theorem statement still reads honestly afterward.
14. Proof-safe profile and backend contract
Originally
13_global_proof_profile.md. Issue #173. The committed first scope the global-proof program ships against. Updated post-M_L.4.a-k.
14.1 Decision summary
The global-proof program does not start from "all current language features." The committed
profile is Z3-Core, a Z3-only fragment of the IR, and within it the bootstrap
implementation slice Z3-Core-1S (one-state).
14.2 Stage labels
| Stage | Meaning |
|---|---|
bootstrap | In scope for Z3-Core-1S. |
first ship | Required before the first public claim is honest. |
defer | Out of first ship; can enter only by explicit profile expansion. |
exclude | Outside the Z3 theorem track entirely. |
14.3 Declaration-level profile
| Construct | Stage | Rule |
|---|---|---|
EnumDecl | bootstrap | Finite domains; first bounded-quantifier story. |
EntityDecl | bootstrap | Flat entities only. |
StateDecl | bootstrap | Scalar fields and simple relation/map fields over profile-safe types. |
InvariantDecl | bootstrap | Body must be in-profile. |
OperationDecl | bootstrap | Inputs and requires in scope. ensures partial under single-state collapse (M_L.4.b); full two-state is M_L.4.b-ext. |
TypeAliasDecl, FactDecl, FunctionDecl, PredicateDecl, TransitionDecl, ConventionsDecl | defer | |
TemporalDecl | exclude | Always Alloy-routed; outside Z3 theorem. |
14.4 Expression-level profile (post-M_L.4.a-k)
Expr case | Stage | Rule / reason |
|---|---|---|
BinaryOp(And | Or | Implies | Iff) | bootstrap | Core propositional layer. Soundness: M_L.2 closure. |
BinaryOp(Eq | Neq | Lt | Gt | Le | Ge) | bootstrap | Core comparison layer. Soundness: M_L.2 closure. |
BinaryOp(In) | bootstrap | State-relation domain membership. Soundness: M_L.2 closure. |
BinaryOp(NotIn) | bootstrap | M_L.4.e closed via emitter-side composition: NotIn(e, r) ≡ ¬In(e, r). |
BinaryOp(Add | Sub | Mul | Div) | bootstrap | M_L.4.a closed. Div-by-zero policy: eval returns none. |
BinaryOp(Subset) over state-relation identifiers | bootstrap | M_L.4.i closed via emitter-side composition: Subset(r1, r2) ≡ ∀ x ∈ r1, x ∈ r2. |
BinaryOp(Union | Intersect | Diff) | bootstrap | Issue #195 closed. Set-valued algebra over Value.vSet / SmtVal.sSet; non-set operands reduce to none. |
UnaryOp(Not | Negate) | bootstrap | M_L.2 closed. |
UnaryOp(Cardinality) | bootstrap | M_L.4.c closed. Restricted to state-relation identifiers (mirrors Translator.scala:876-881). |
UnaryOp(Power) | exclude | Routed to Alloy. |
Quantifier(All) over enum identifier | bootstrap | M_L.2 closure: per-case + universal soundness. Single binding over enum-name identifier. |
Quantifier(All) over state-relation identifier | bootstrap | M_L.4.f closed. New forallRel Lean constructor + soundness_forallRel_known per-case theorem; EvalIR.State.demo populates relations with empty domains so quantifier is vacuously true. |
Quantifier(Some | No | Exists) | bootstrap | M_L.4.d (enums) + M_L.4.f (state-rels) closed via emitter-side composition: ∃ x, P ≡ ¬ ∀ x, ¬ P. |
SomeWrap, The | defer | Option/choice semantics. |
Index (state-relation pair lookup) | bootstrap | M_L.4.g closed. Strictly-additive State.lookups/SmtModel.predLookup pair table; new Expr.indexRel + SmtTerm.indexRel + lookupKey_correlated bridge. Restricted to identifier-base. |
FieldAccess (entity-valued base, any shape) | bootstrap | M_L.4.h + M_L.4.k closed. Entity-id-keyed State.entityFields / SmtModel.predFields table. M_L.4.h covered the bare-Identifier base; M_L.4.k generalises the constructor to take an arbitrary Expr base (Index, chained FieldAccess, quantifier-bound vars). Mirrors Translator.scala:981-1005. |
EnumAccess | bootstrap | Closed via SmtTerm.enumElemConst. |
Call | defer | len/isValidURI/dom need per-builtin semantics. |
Prime, Pre | bootstrap (single-state collapse) | M_L.4.b closed as identity. True two-state semantics is M_L.4.b-ext. |
With | defer | Identity-collapse would emit false certs. Requires Skolem mirror + StatePair. |
If | defer | Needs product / decidable encoding. |
Let | bootstrap | M_L.2 closure. |
Lambda | defer | Outside FOL. |
Constructor | defer | Constructor semantics deferred. |
SetLiteral | bootstrap | Issue #195 closed. Non-empty standalone literals and empty set-literal membership render as nested .setInsert over .setEmpty. |
MapLiteral, SeqLiteral, SetComprehension | defer | Collections deferred. |
Matches | defer | Regex/string semantics deferred. |
IntLit, BoolLit, Identifier | bootstrap | M_L.2 closed. |
FloatLit, StringLit, NoneLit | defer | No committed solver semantics. |
14.5 Backend contract
| Question | Decision | Consequence |
|---|---|---|
| Which solver remains trusted in the first ship? | Z3 | First theorem is Z3-trusting; no proof replay. |
| Backend-agnostic theorem? | No | Target is the current Z3 path. |
| Alloy in scope? | No | Anything routed to Alloy is outside Z3-Core. |
SmtLib.scala inside the first theorem? | No | Runtime path uses Z3Script + WasmBackend. |
| Proof export / replay in scope? | No | Separate translation-validation track (M_L.3, shipped). |
| cvc5 cross-checking? | No | Defense in depth only. |
14.6 Actual coverage after M_L.4.a-k
The originally-targeted Z3-Core-1S slice was: global and requires checks only; no
Prime/Pre/With/Cardinality; no collections, strings, regex; quantifiers over
enums only.
The actual covered slice is wider: full propositional layer + cmp + LIA arithmetic
(Add/Sub/Mul/Div with Div-by-zero none policy) + let + enumAccess + state-relation
membership (In) + all four quantifier kinds (All/Some/No/Exists) over enum-name
identifiers + Prime/Pre (single-state collapse) + cardRel. Universal soundness theorem
closes for this whole slice with zero sorry.
Still deferred: arbitrary collection algebra outside issue #195 set-valued
Union/Intersect/Diff, map/sequence literals, set comprehensions, strings, Call, Matches,
If, Lambda, Constructor, two-state Prime/Pre, With.
This widened slice does not include real ensures-clause preservation reasoning, that
needs the StatePair refactor (M_L.4.b-ext, deferred). But it covers single-state invariants
and operation requires for the bulk of real specs.
14.7 Expansion rule
A feature may move from defer into a later profile only if all of the following are true:
- Source-level semantics are explicit enough to state the theorem honestly.
- Current Scala translator support is full, or the restriction is narrow enough to state precisely here.
- The feature stays on the Z3 path; otherwise it starts a separate theorem track.
- Both the prover-side semantics and the Scala mirror have at least one concrete fixture.
- This profile and
proofs/lean/STATUS.mdare updated in the same PR.
15. Runway and stall policy
15.1 Decision summary (M_G.3)
The global-proof program is an active, bounded priority, rather than background research.
- Owner: HardMax71.
- Runway: one uninterrupted six-week proof-priority cycle, consumed during M_G.4 activation through the M_L.4.a-d shipped batch.
- Scoping rule: fixed time, variable scope.
- Fallback: if a future cycle stalls, shrink to
Z3-Core-1S; if still stuck, switch primary trust-improvement back to expanded M_L.3 cert work.
15.2 Reprioritized roadmap while active
Lanes paused while a runway cycle is active:
| Lane | Issues | Rationale |
|---|---|---|
| LLM synthesis track | #27-#32 | Orthogonal; too large to run in parallel solo. |
| New target expansion / distribution | #33-#36, #56 | Widens surface area without helping proof ship. |
Lanes secondary-only (allowed if narrowly urgent):
| Lane | Issues | Rule |
|---|---|---|
| Auth/security | #53-#55 | Move only if urgent or blocks external need. |
| Maintenance and experiments | #149, #150, #161, #163 | Not allowed to displace proof time. |
Always allowed: build-break/CI fixes; correctness or security regressions in shipped behavior; narrowly-scoped doc fixes; parser/IR/translator changes that directly support the proof track and update governed docs.
15.3 Interrupt policy
The runway provides uninterrupted time, rather than at the cost of leaving the repo broken. Interrupts are allowed for: red CI; correctness bugs in current verification; security issues; narrowly-bounded maintenance. Interrupts are not a license to resume paused roadmap themes "for a day."
15.4 Circuit breaker
- At the end of a six-week cycle, do not auto-renew.
- If the cycle produced active artifacts but scope was too broad, shrink to
Z3-Core-1S. - If
Z3-Core-1Sstill fails, pause the universal-theorem push and switch primary trust goal to M_L.3-style cert work. - Keep any reusable semantics kernel artifacts; do not discard formalization just because the full theorem stalls.
"Meaningful theorem progress" means at least one of: a merged prover-side semantics artifact connected to live Scala structures; a merged mirror artifact covering real translator cases; a checked proof fragment stronger than scaffolding. Empty scaffolding, pinned toolchains without proof code, "proof soon" status notes do not count.
16. Activation record and kickoff shape
Originally
15_global_proof_activation.md. Issue #175. Closes the gap between readiness planning (M_G.*) and active theorem work (M_L.*).
16.1 Activation decision
M_G.4 activated the M_L.* execution track on 2026-05-01. As of that date:
- M_G.0 through M_G.3 are satisfied.
- HardMax71 is the active owner.
#126was unblocked as the first implementation issue.#127-#130remained gated only by predecessor milestones.
16.2 Gate review
| Gate | Artifact | Sufficient because |
|---|---|---|
| M_G.0, theorem statement and TCB | This doc §1 + issue #171 | First honest theorem target written down. |
| M_G.1, governed proof surfaces | This doc §12 | Target movement is visible. |
| M_G.2, proof-safe profile | This doc §14 | First theorem scope is smaller than full language. |
| M_G.3, owner / runway / fallback | This doc §15 | Named owner, bounded runway, paused lanes, circuit breaker. |
16.3 What got unblocked
After M_G.4, the dependency chain is:
#126(M_L.0) → active. Closed.#127(M_L.1), blocked on#126only. Closed.#128(M_L.2), blocked on#127. Closed for §6.1 subset, zero sorry.#129(M_L.3), blocked on#127. Closed.#130(M_L.4), blocked on#128. Sub-slices a-k plus issue #195 set algebra closed (LIA arithmetic, single-state Prime/Pre, Cardinality, enum quantifier composition, NotIn composition, state-relation quantifier, Index, FieldAccess on state scalars, Subset over rel-identifiers via composition, nested FieldAccess via entity-id-keyed carrier, set literals and set-valued Union/Intersect/Diff/membership). Remainder (With, true two-state,Call, multi-binding quantifiers, strings, maps/sequences, set comprehensions) deferred to later slices or M_L.4.b-ext.
16.6 First-ship gate met
As of 2026-05-02, the activation umbrella's success conditions are satisfied:
| Condition | Artifact |
|---|---|
| Stable theorem target | soundness theorem in proofs/isabelle/SpecRest/Soundness.thy (zero sorries; closes universally over the verified subset). |
| Explicit TCB | Isabelle/HOL kernel + IR.thy / Semantics.thy definitions + Z3 driver + extracted-Scala output. Per-run certs were deleted in Phase 6 (Option 2 of #193 review) so native_decide is no longer in the TCB. |
| Frozen / governed IR surface | Verified subset is lower's codomain in proofs/isabelle/SpecRest/IR.thy; both expr (subset) and expr_full (input) are extracted from the same theory, so drift between Scala and proof-side IR is structurally impossible. A8RoundTripOracleTest smokes the round-trip on the canonical probe corpus. |
| Proof-safe first scope | §14.4 verified-subset profile (post-M_L.4.a-k). |
| Active contributor commitment | M_L.0 → M_L.4.k shipped between PR #180 (2026-04-30) and #196 closure (2026-05-02). |
| Linked kickoff | M_L.0 PR #180 (combined M_L.0 + M_L.1 first slice). |
The proof program has moved from "research direction" to "shipped first-ship claim".
Remaining M_L.4 slices (Subset composition, multi-binding quantifier, Call(len) builtin)
are coverage uplifts, rather than theorem-program prerequisites. M_L.4.b-ext (true two-state
preservation) remains a multi-week effort that should be scheduled deliberately rather
than chipped at piecemeal.
16.4 First m_l PR shape (historical record)
The first M_L implementation PR was a combined M_L.0 + first M_L.1 slice rather than a standalone scaffolding PR. Minimum honest kickoff shape (delivered in PR #180):
proofs/lean/lean-toolchainpinned toleanprover/lean4:v4.29.1proofs/lean/{lakefile.toml, README.md, STATUS.md}proofs/lean/SpecRest/{IR.lean, Semantics.lean, Examples.lean, IR.lean.todo}.github/workflows/lean.yml
Required substance: a real deep embedding for Z3-Core-1S; a real semantic domain; at
least one checked example or lemma; a status table. Not acceptable: namespace-only
scaffolding; toolchain pin plus empty files; CI-only Lean setup with no semantics.
16.5 Activation invariants
proofs/lean/may not appear inmainuntil it lands with active proof content. (Now satisfied, M_L.0 onward.)- Changes to proof-governed Scala surfaces still require status updates in the matching proof docs (now §13 of this doc).
- If a future six-week runway fails to produce meaningful theorem artifacts, follow the circuit breaker in §15.4.
Activation is a commitment to start, rather than permission to drift.
17. IR canonicalization in Isabelle (issue #202, post-#193)
Status (2026-05-06): shipped. PR #204 (Phases 0–7 + 9 + 10a) merged
2026-05-06; this close-out PR extends lower v2 over QuantifierF (4
kinds, multi-binding) + multi-field WithF, deletes the hand-rolled
VerifiedSubset.classify and A8RoundTripOracleTest.toExtracted walker,
refreshes docs. The Scala IR is now canonically extracted from
proofs/isabelle/SpecRest/IR.thy; the audit oracle calls extracted lower
directly. Issue #202 closed.
17.1 Decision: C-hybrid
We are migrating to a single canonical IR ADT in Isabelle, with Scala-side types
regenerated via Code_Target_Scala. The chosen architecture is C-hybrid rather than
pure C ("regenerate Scala from existing Isabelle expr"):
- Existing
proofs/isabelle/SpecRest/IR.thyexpr(23-ctor verified subset) stays unchanged at the proof level, the 94 lemmas inSoundness.thyare not re-opened. - A second ADT
expr_full(27 ctors mirroring the full Scala input language including Float/String/None/Lambda/Call/Constructor/MapLiteral/SeqLiteral/Matches/SomeWrap/The/ SetComprehension and the broaderbin_op_full/un_op_fullenums) is added to the sameIR.thy. - A lower function
lower :: String.literal list ⇒ expr_full ⇒ expr optionprojects the full IR onto the verified subset; out-of-subset constructors becomeNone. The first argument is the list of declared enum names, needed because aQuantifierBindingFull v (IdentifierF dom _) _ _could resolve to eitherForallEnum(enum domain) orForallRel(relation domain), andlowercannot decide without schema context.loweris code-extracted; it replaced the hand-written ScalaVerifiedSubset.classifyand the test-sideA8RoundTripOracleTest.toExtractedwalker (both deleted in the #202 close-out). expr_fullis the canonical input-language ADT for all Scala consumers (parser, codegen, testgen, translator). They consume it via a thin Scala wrapper layer inmodules/ir/.../Types.scala(type aliases +apply/unapply+ extensions) that preserves the existing ergonomics.
17.2 Why not pure c
The verified-subset expr is not a renaming of the Scala input ADT, it is a strict
subset (23 vs 27 ctors, op-specialized vs op-parametric, no
Float/String/None/Lambda/Call/Constructor/MapLiteral/SeqLiteral/Matches/SomeWrap/The/
SetComprehension/In/NotIn/Subset). Replacing the Scala IR wholesale with the verified
subset would delete input-language features. C-hybrid keeps the verified subset stable
while introducing expr_full for the input language and lower as the proven
projection.
17.3 Span handling, option (a) inline
Every ctor of both expr and expr_full carries a final option_span (= span_t option) field, where span_t is a 4-int datatype. Span is Prop-erased, soundness
theorems do not inspect it; existing per-case lemmas absorb the new field as a wildcard.
No proof content changes; ~94 mechanical wildcard updates in Soundness.thy (Phase 2).
Rejected alternatives:
- Wrapper
Spanned[A], flattens inner-subexpression spans, regressing diagnostics. - Parametric
'a expr, re-triggers #193 Phase-5 record-polymorphism crash (Illegal fixed variable 'a'), well-known landmine.
17.4 No-new-files constraint
- All new Isabelle types and
lower/lower_reasonfunctions land in the existingproofs/isabelle/SpecRest/IR.thy(104 → ~750 LoC). Soundness.thyripple (option_span wildcards) edits in place acrossSemantics.thy,Smt.thy,Translate.thy,Soundness.thy.Codegen.thyexport_codelist extends in place.modules/verify/.../cert/generated/SpecRestGenerated.scalawas moved (not created) toir/generated/SpecRestGenerated.scalain #202 Phase 4, required to avoid anir → verifydep cycle whenirconsumers import the extracted types.modules/ir/.../Types.scalais rewritten in place into a wrapper layer (~600-800 LoC: type aliases,apply/unapply, extensions, givenCanEqual, hand pretty-printer).
17.5 Trigger
The 2026-05-02 trigger conditions in #202 §"Trigger conditions" (drift bug, second emitter, ADT growth, dedicated infra contributor) are not met. This is being executed as a user-directed architectural cleanup.
17.6 Phase status
| Phase | Deliverable | Status |
|---|---|---|
| 0 | Decision document (this §17) | shipped (PR #204) |
| 1 | expr_full + 20 records + span_t in IR.thy; extraction smoke | shipped (PR #204) |
| 2 | option_span ripple across IR/Semantics/Translate/Soundness | shipped (PR #204) |
| 3 | lower v1 + lower_set_list + lower_soundness corollary | shipped (PR #204) |
| 4 | Codegen.thy export extension; relocate SpecRestGenerated.scala | shipped (PR #204) |
| 5 | Scala wrapper layer in Types.scala (extracted types used directly; thin layer) | shipped (PR #204) |
| 6 | Parser migration | shipped (PR #204) |
| 7 | Codegen + testgen migration | shipped (PR #204) |
| 8 | Verify migration | shipped (PR #204); lower-backed Trust dim. + --strict-soundness flag landed under #205 (Phase C-flagged); strict-soundness flipped to default-on and extracted-translator routing landed under #192 |
| 9 | CLI migration; delete legacy Types.scala enum | shipped (PR #204) |
| 10 | Audit cleanup; doc/CI sweep | 10a doc sweep shipped (PR #204); 10b audit cleanup + lower v2 (QuantifierF 4 kinds, multi-binding, multi-field WithF) shipped in close-out PR |
17.7 Trust surface (issues #205 → #192)
lower shipped as a runtime projection in #202; #205 wired it into the verifier surface
as a per-check Trust dimension behind a --strict-soundness flag; #192 promoted strict
soundness to the default and routed in-subset checks through the verified extracted
translator on the production verify path.
CheckResult.trust: TrustLevel,Soundif every sourceexpr_fullfor the check lowers to the verified subset;BestEffortotherwise. Visible in:- CLI line:
[sound]/[best-effort]tag between the tool tag and the check id. - JSON report:
"trust": "sound" | "best-effort"per check object. Schema bumped 1→2.
- CLI line:
- Best-effort checks always skip with
category = soundness_limitationbefore backend dispatch (no flag, this is the default since #192). The--strict-soundnessflag and its CLI surface were retired. - Sound checks route via extracted translator.
Translator.translateExprcallslower(enums, e); onSomeit routes throughSpecRestGenerated.translate(the Isabelle-extracted function) and then a smallSmtTermToZ3bridge (~250 LOC, hand-written) to reach Z3. OnNonethe hand-writtentranslateExprRawruns, but only for declaration-level expressions (entity field constraints, type-aliaswhere-clauses); out-of-subset check-body shapes are filtered to skip by the Trust classifier before ever reaching the translator. - Exit code 4 (
ExitCodes.Trust): emitted when soundness skips are the only reason the run isn't clean. Subordinate to Backend (3), Violations (1), Translator (2). - TCB audit lives at
docs/research/11_tcb_audit.md, the single ledger of whatverify's verdicts actually depend on.
| Coverage v3 (post-#210) | lower returns | Trust |
|---|---|---|
| BoolLit / IntLit / Ident / EnumAccess / arithmetic / compare / boolean / Let / FieldAccess / Prime / Pre / cardinality on Ident / In/NotIn against Ident | Some | Sound |
QuantifierF over enum or relation domain (4 kinds, multi-binding) | Some | Sound |
Multi-field WithF (folded) over Identifier base | Some | Sound |
IndexF over IdentifierF base (e.g. users[uid]) | Some | Sound |
IndexF over PreF (IdentifierF rel) (e.g. pre(orders)[id]), M_L.4.l carrier widening | Some | Sound |
IndexF over PrimeF (IdentifierF rel) (e.g. orders'[id]), M_L.4.l carrier widening | Some | Sound |
IndexF over a non-relation-reference base (arithmetic, nested IndexF, etc.) | None | BestEffort |
BSubset, CallF, IfF, TheF, MapLiteralF, ConstructorF, SetComprehensionF, etc. | None | BestEffort |
Issue #210 (M_L.4.l) widened the
verified-subset IndexRel carrier from String.literal × expr to expr × expr and
added a syntactic peel_relation_ref recogniser for the three
relation-reference shapes (Ident, Pre Ident, Prime Ident). The widening avoids the
ir_value cascade that a value-level VRelation would have triggered: eval/smt_eval
peel syntactically, the universal soundness theorem closes with one rewritten per-case
lemma (index_rel_step) plus a peel_smt_relation_ref ↔ peel_relation_ref correlation
lemma, and the production bridge resolves the relation name + state-mode at the
TIndexRel site (rather than inheriting from the surrounding withStateMode for bare
IndexRel shapes). auth_service and broken_url_shortener Tamper-style operations
flip from skipped (category=soundness_limitation) to real Z3 verdicts.
17.8 Risks tracked
- R1, Code-extraction quality on records. #193 Phase 5 hit
Illegal fixed variable 'a'on polymorphic record-scheme. The fix was[code]-markeddefs. ~20 new records in this issue re-test the workaround. Phase 1 smoke-build is the trip-wire. - R2, Wrapper-layer drift. ~600-800 LoC of
apply/unapplyin the Scala wrapper trades A1-A8 drift for wrapper-level drift. Mitigation: round-trip wrapper test exercising all 27 ctors. - R3, Span-noise volume. ~94
_wildcard insertions acrossSoundness.thy. Buffer +1 day ifcases ... simp_allpatterns break. - R4,
In/NotIn/Subsetdesugar moves from Scala to Isabelle. Phase 3'slowerperforms the rewrite Isabelle-side. Defer proof obligation to a follow-up; keep the rewrite as a definition only in v1. - R5, Diagnostic regression on
e.toString. Hand pretty-printer in Phase 5 mitigates; snapshot-diff vs current parser-error fixtures. - R6, Native-image breakage. Wrapper layer + extracted types may need new reflect-config. Phase 10 sbt nativeImage smoke.