Parser Implementation Notes
Edit on GitHubANTLR4 grammar design decisions and deviations from the EBNF spec
Last updated:
This page documents where the ANTLR4 grammar (parser/Spec.g4) diverges from the
EBNF specification and the reasoning behind each
deviation.
Keyword-as-identifier problem
EBNF assumption: Identifiers and keywords are context-free, output in output.url is just
an identifier.
ANTLR4 reality: The lexer is context-independent. Once output is defined as the OUTPUT
keyword token, it is always lexed as OUTPUT, never as LOWER_IDENT. This means
Resolve.http_header "Location" = output.url fails because the parser sees OUTPUT where it
expects an expression identifier.
Solution: A lowerIdent parser rule that accepts LOWER_IDENT and all keywords that may
appear in identifier positions:
lowerIdent
: LOWER_IDENT
| INPUT | OUTPUT | FIELD | STATE
| ONE | LONE | SET_MULT
| VIA | WHEN | WHERE | WITH | EXTENDS
| REQUIRES | ENSURES
| ENTITY | OPERATION | TRANSITION | INVARIANT | FACT
| CONVENTIONS | FUNCTION | PREDICATE | TYPE | ENUM | IMPORT
| SERVICE
;This rule is used everywhere the EBNF says LOWER_IDENT: field declarations, parameter names,
state fields, convention properties, expression identifiers, quantifier bindings, let/lambda
variable names, and field access (.field).
Trade-off: Every keyword becomes a valid identifier, which is overly permissive at the grammar
level. Semantic analysis (a later phase) should reject nonsensical uses like naming a field
service.
set multiplicity vs Set[T] type
EBNF assumption: set (lowercase) is a multiplicity keyword, Set (uppercase) is the
parameterized collection type. The EBNF treats them as separate tokens implicitly.
ANTLR4 reality: The SET keyword is defined as 'Set' (uppercase S). Lowercase set in
Int -> set String gets lexed as LOWER_IDENT, which doesn't match the multiplicity rule.
Solution: A separate lexer token SET_MULT: 'set' for the lowercase multiplicity, distinct
from SET: 'Set' for the type constructor. The multiplicity rule uses SET_MULT:
multiplicity: ONE | LONE | SOME | SET_MULT ;Newline handling in requires/ensures
EBNF spec: expr_list = expr { NEWLINE expr }, newlines are significant separators between
top-level expressions in requires: and ensures: blocks.
ANTLR4 implementation: All whitespace (including newlines) goes to channel(HIDDEN). The
blocks use expr+ instead of newline-delimited lists.
Why this works: Expressions in this DSL are self-delimiting; there is no juxtaposition or implicit application operator. After parsing one complete expression, the next token is either:
- A binary operator (
and,+,=, etc.) -> continues the current expression - An identifier, keyword, or prefix operator (
#,-,not,all, etc.) -> starts a new expression
Since two expressions side-by-side without an operator between them cannot form a valid larger
expression, ANTLR4's greedy expr+ naturally splits them at the right boundaries.
Example parse:
ensures:
code not in pre(store) <- expr 1: comparison
store' = pre(store) + {code -> url} <- expr 2: equalityAfter pre(store), the next token store is an identifier with no infix operator preceding it,
so the parser terminates expr 1 and starts expr 2.
Risk: If the DSL ever adds a juxtaposition operator (e.g., implicit function application),
this approach breaks. The fallback would be reintroducing a NEWLINE token on a separate channel
and using it as an explicit separator.
Regex literals vs division operator
EBNF spec: Both REGEX_LIT (/pattern/) and division (/) are defined. This creates a
classic lexer ambiguity: a / b / c could be tokenized as division or as a regex literal.
Solution: A lexer predicate restricts REGEX_LIT to only match immediately after the
MATCHES keyword. All other / tokens are lexed as SLASH (division).
The lexer tracks the most recent non-hidden token type via an overridden nextToken():
@lexer::members {
private lastNonWsType: number = -1;
public override nextToken(): antlr.Token {
const token = super.nextToken();
if (token.channel === 0) {
this.lastNonWsType = token.type;
}
return token;
}
}
REGEX_LIT
: {this.lastNonWsType === SpecLexer.MATCHES}? '/' ~[/\r\n]+ '/'
;This means value matches /^[a-z]+$/ correctly lexes the regex, while a / b / c produces
three tokens: a SLASH b SLASH c.
Error listener wiring
ANTLR's default listeners print to stderr and don't fail the parse. We collect parse errors
into List[ParseError] and surface them as VerifyError.Parse by overriding BaseErrorListener
on both the lexer and the parser, after first removing the defaults:
val listener: BaseErrorListener = new BaseErrorListener:
override def syntaxError(
recognizer: Recognizer[?, ?],
offendingSymbol: Any,
line: Int,
column: Int,
msg: String,
e: RecognitionException
): Unit =
errors += ParseError(line, column, msg)
lexer.removeErrorListeners()
parser.removeErrorListeners()
lexer.addErrorListener(listener)
parser.addErrorListener(listener)The Scala 3 wildcard Recognizer[?, ?] matches the Java runtime's Recognizer<?, ?> raw
generic; offendingSymbol: Any covers the Object parameter that may be a Token (parser)
or null (lexer). See parser/Parse.scala.
Grammar compilation
Parser generation is driven by the sbt-antlr4 plugin: placing the grammar at
parser/Spec.g4 is enough. sbt parser/compile generates Java lexer /
parser / visitor classes into src_managed/main/antlr4/specrest/parser/generated/ at the package
configured in build.sbt:
Antlr4 / antlr4Version := "4.13.2"
Antlr4 / antlr4PackageName := Some("specrest.parser.generated")
Antlr4 / antlr4GenVisitor := trueNo hand-run CLI is needed; the generated Java sources are invisible to developers and refreshed
automatically whenever Spec.g4 changes.
Expression precedence strategy
EBNF approach: 10+ separate named rules (or_expr, and_expr, not_expr, etc.) forming a
precedence chain.
ANTLR4 approach: A single left-recursive expr rule with labeled alternatives. ANTLR4
assigns implicit precedence by alternative order: first alternative = highest precedence
(tightest binding), last = lowest. The alternatives are ordered accordingly:
Alternative order in grammar (first = tightest):
1. postfix (', ., [], ())
2. with (record update)
3. unary prefix (#, -, ^)
4. multiplicative (*, /)
5. additive (+, -)
6. set operations (union, intersect, minus)
7. comparisons (=, !=, <, >, <=, >=, in, not in, subset, matches)
8. implies, iff
9. not (unary)
10. and
11. or
12. primaries (literals, identifiers, quantifiers, not left-recursive, no precedence)The two top-level structural rules render as:
Regenerate the SVGs with node docs/scripts/build-railroad.mjs whenever the grammar changes.
Initial mistake: The first version of the grammar had the order inverted (or first, postfix
last). This produced correct parse acceptance (all tests passed) but incorrect parse trees:
a + b * c would have been parsed as (a + b) * c. The review caught this because the tests
only asserted zero errors, not tree structure.
This is more concise than the EBNF chain and leverages ANTLR4's native left-recursion elimination.
some disambiguation
The ambiguity: some(expr) (Option wrapper) vs some x in S | P (existential quantifier).
EBNF solution: Separate rules some_wrap_expr and quantifier_expr.
ANTLR4 solution: The someWrapExpr sub-rule and the quantifierExpr sub-rule are both
referenced from the expr rule's primary alternatives. ANTLR4's ALL(*) adaptive prediction
resolves the ambiguity by lookahead:
some (-> matchessomeWrapExpr: SOME LPAREN expr RPARENsome identifier in-> matchesquantifierExprwithSOMEas the quantifier
No semantic predicates are needed; the syntactic structure is sufficient.