Implement dynamic disambiguation of left-recursive rules during closure #1398
Fixes antlr#1398 Fixes antlr#994
Adding my own interpretation after Sam and I simplified the idea in our minds. Consider a slightly simpler grammar: (grammar omitted)
The (annotated) ATNs are: (diagrams and legend omitted)
The closure of a `StarLoopEntryState` (state 23) in a rewritten LR rule includes a config that pursues the loop entry (orange) path and a config that pursues the (green) loop bypass (rule exit) path. The stack context is the same for both configs added for the entry and exit paths. The optimization is to avoid adding the loop entry config when the exit path can only lead back to the same `StarLoopEntryState` after popping context at the rule end state (traversing only epsilon edges, so we're still in closure, in this same rule). In the ATN above, if the stack(s) for the current config are only the (implicit) green paths from rule end state 3 to block end state 22 (i.e., no red paths), then we can drop the config for the orange loop entry path.

@sharwell just pointed out there are more nodes we have to detect. It looks like we need to detect any state that can reach the loop entry on epsilon without exiting the rule. We don't have to look at FOLLOW links; we just have to ensure that all stack tops for the config refer to key states in the LR rule. We have to check for the right edges of prefix operator alts and the right edges of ternary ops, but not the middle. (The grammar snippets showing another version of the prefix op, and the resulting ATN, are omitted here.)

To verify we are in the right situation, we must first check that closure is at a `StarLoopEntryState` generated during LR removal. Then we check that each stack top of the context is a return state from one of the qualifying positions in the LR rule: the ends of the binary operator alts, the right edges of the prefix operator alts, and the right edges of the ternary alts.

If this holds for each stack top, then closure does not add a config to the current config set for edge[0], the loop entry branch. The conditions fail if any context for the current config is empty (a global follow could lead anywhere, even back into the rule at some odd spot) or lies outside the LR rule.

Do we ever need to evaluate predicates in closure for this case? No. Predicates, including precedence predicates, are only evaluated when computing a DFA start state, i.e., only before the lookahead (but not the parser) consumes a token. There are no epsilon edges allowed in LR rule alt blocks or in the "primary" part of the LR rule.

How do we know this always gives the same prediction answer? Without predicates, the loop entry and exit paths are ambiguous upon the remaining input. The lookahead language has not changed because closure chooses one path over the other; both paths lead to consuming the same remaining input during a lookahead operation. If the next token is an operator, lookahead will enter the choice block with the operators; if it is not, lookahead will exit the loop. Also, closure is examining one config (some loop entry state, some alt, some context), which means it is considering exactly one alt, and closure always copies that same alt to any derived configs.

How do we know this optimization doesn't mess up precedence in our parse trees? The optimization affects only the lookahead computation, not the parse itself. (The worked example that followed here is omitted.)
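The stack-top check described above can be sketched as follows. This is a minimal illustration, not ANTLR's actual implementation: the class and method names (`LoopEntryCheck`, `canSkipLoopEntryBranch`) and the representation of contexts as lists of return-state numbers are my own simplifications. The idea is that the loop-entry config may be dropped only when the context is non-empty and every stack top is one of the "key" return states inside the LR rule.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the check described above. Names and
// types are illustrative only; ANTLR's real check walks PredictionContext
// objects, not plain integer lists.
public class LoopEntryCheck {
    /**
     * @param stackTops       the return states on top of each stack chasing
     *                        this config's context
     * @param keyReturnStates return states inside the LR rule that can only
     *                        lead back to the loop entry state on epsilon
     *                        edges (block ends, prefix/ternary right edges)
     */
    public static boolean canSkipLoopEntryBranch(List<Integer> stackTops,
                                                 Set<Integer> keyReturnStates) {
        if (stackTops.isEmpty()) {
            return false; // empty context: a global follow could lead anywhere
        }
        for (int returnState : stackTops) {
            if (!keyReturnStates.contains(returnState)) {
                return false; // some context escapes the LR rule (a "red" path)
            }
        }
        return true; // every stack top loops back: drop the loop-entry config
    }
}
```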
Background
Since the early days of ANTLR 4, we've known that long lookahead inside left-recursive (LR) rules poses "interesting challenges" for performance. The earliest approaches to this problem were attempts to inline the evaluation of special "precedence predicates" during prediction, rather than treat them as standard predicates that are only evaluated after an SLL conflict is reached. Unfortunately, after attempting to implement this I realized a major flaw in the design: LR rules are fundamentally ambiguous, and rely on ANTLR's "first alternative wins" decision-making strategy to choose the expected alternative.
Eventually we realized that much of the performance problem associated with LR rules could be solved by disambiguating the ATN prior to making a decision. This is accomplished by eliminating known ambiguous configurations before running the `adaptivePredict` algorithm. The initial implementation appeared in #401, with notable bugs identified in #509 and #679 (now fixed).
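The elimination step can be sketched in miniature. This is not ANTLR's actual `applyPrecedenceFilter` (which operates on `ATNConfigSet` and `PredictionContext` objects); it is a toy model, with an invented `Config` record, showing the idea: because "first alternative wins", a configuration for a left-recursive alternative whose context is already covered by a configuration for the primary (first) alternative is a known-ambiguous duplicate and can be dropped before prediction begins.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of precedence-based disambiguation before prediction.
// Config and its String "context" field are illustrative only.
public class PrecedenceFilterSketch {
    public record Config(int alt, String context) {}

    public static List<Config> filter(List<Config> configs) {
        // Collect the contexts reachable via the primary (first) alternative.
        Set<String> contextsFromAlt1 = new HashSet<>();
        for (Config c : configs) {
            if (c.alt() == 1) contextsFromAlt1.add(c.context());
        }
        List<Config> kept = new ArrayList<>();
        for (Config c : configs) {
            // Keep alt-1 configs; keep other alts only for contexts that
            // alt 1 cannot reach (those are not ambiguous duplicates).
            if (c.alt() == 1 || !contextsFromAlt1.contains(c.context())) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```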
Remaining problem
While the above work dramatically improved performance in real-world scenarios where LR rules were used, it failed to meet the performance expectations of users working with grammars and inputs that require "looking through" an LR rule to make a prediction. As a simple example, consider the following grammar fragment for a made-up language:
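The original grammar fragment did not survive extraction. A fragment along these lines (a hypothetical reconstruction, not the author's exact grammar) shows the shape of the problem: deciding between the two `if` alternatives requires consuming an entire left-recursive `expression`.

```antlr
statement
    : 'if' expression 'then' statement 'else' statement
    | 'if' expression 'then' statement
    ;

expression
    : expression '*' expression
    | expression '+' expression
    | ID
    ;
```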
In this example, the expression following an `if` keyword in the language must be fully consumed by the `adaptivePredict` algorithm only to determine whether or not it is followed by an `else` keyword. I refer to this situation as "looking through `expression` to make a decision in `statement`". The performance impact of grammars with this characteristic is tremendous: the `adaptivePredict` algorithm will explore all possible paths through `expression` during the prediction! This is clearly not what the user intended when `expression` was written as an LR rule.

For reference, here is the ATN for `statement`: (diagram omitted)

And here is the ATN for `expression`: (diagram omitted)

Several issues have been filed which appear to be manifestations of this issue in some form: (list omitted)
Dynamic disambiguation
Early work on generalizing the `applyPrecedenceFilter` algorithm to "look through" situations focused on ways to apply the precedence filter to decisions traversed deep inside the `adaptivePredict` algorithm. The implementation plan looked something like this:

1. Update the `PredictionContext` class to keep track of precedence levels when a precedence rule is invoked
2. During `closure` operations, identify precedence decisions using the same algorithm we already use to create precedence DFAs
3. Apply logic along the lines of the `applyPrecedenceFilter` algorithm to the `closure` operation being performed

However, the subtleties of this approach kept me from ever getting it implemented (even as a prototype).
The real breakthrough came today when I took a much closer look at the syntactic ambiguity produced by LR rules. Remember that aside from computing the DFA start state, `adaptivePredict` asks only one thing: which alternative provides some viable path to consume more input symbols than any other alternative? For a true ambiguity, for some input(s) there is more than one such alternative, whether we are in SLL or LL prediction mode. More importantly, `adaptivePredict` doesn't care at all how that longest sequence is reached. For example, consider the following input with the grammar listed above: (example omitted)

Clearly the intended final parse tree for this expression would look like the following: (tree diagram omitted)

However, when `adaptivePredict` is attempting to choose the first or second alternative of the `statement` rule, the precedence within the expression is totally irrelevant. A parse looking like the following would lead to exactly the same final result as the desired expression: (tree diagram omitted)

This property can be generalized by the following: (statement omitted)
Leveraging this property, we can redefine the implementation plan to eliminate the opposite set of configurations from those eliminated by `applyPrecedenceFilter`. It also turns out this is a wonderful idea because it is much simpler from an implementation standpoint. The `StarLoopEntryState` that forms a precedence decision has two outgoing edges: the first (edge[0]) enters the loop to match more operators, and the second bypasses the loop toward the rule exit.

If any return edge from the rule stop state leads to the right edge of a left-recursive invocation of the rule, then anything reachable by following the first edge is also reachable by following the second edge. Knowing this, we can implement dynamic disambiguation by intentionally failing to include the target of the first edge in the closure operation. With that edge eliminated, one specific class of ambiguities, namely those created by the use of LR rules in a grammar, is reduced to an unambiguous single alternative during `adaptivePredict`. The fact that the path taken through the grammar during `adaptivePredict` doesn't match the path taken through the grammar to produce the final parse tree does not change the overall outcome.
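The edge-dropping decision can be sketched as follows. Again, this is an illustration with invented names (`ClosureSketch`, `closureTargets`), not ANTLR's `closure` implementation: when closure sits at the loop entry state of a precedence decision and every stack context loops back into the rule, only the bypass edge is followed; otherwise both edges are followed as usual.

```java
import java.util.List;

// Illustrative sketch of dynamic disambiguation during closure.
// Real ANTLR closure works on ATN states and configs, not strings.
public class ClosureSketch {
    public static List<String> closureTargets(boolean isPrecedenceLoopEntry,
                                              boolean allContextsLoopBack,
                                              String loopEntryTarget,
                                              String bypassTarget) {
        if (isPrecedenceLoopEntry && allContextsLoopBack) {
            // Drop edge[0]: everything reachable through the loop is also
            // reachable through the bypass, so the LR ambiguity collapses
            // to a single alternative during adaptivePredict.
            return List.of(bypassTarget);
        }
        // Otherwise follow both outgoing edges as closure normally would.
        return List.of(loopEntryTarget, bypassTarget);
    }
}
```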