Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Worse expected token output in ANTLR 4.7 #1922

Closed
alexet opened this issue Jun 21, 2017 · 15 comments
Closed

Worse expected token output in ANTLR 4.7 #1922

alexet opened this issue Jun 21, 2017 · 15 comments

Comments

@alexet
Copy link

alexet commented Jun 21, 2017

Using the following grammar

member : 'a';
body: member*;
file : body EOF;
B: 'b';/* ensure we can lex b */

on input baa

we get mismatched input 'b' expecting <EOF> with antlr 4.7 but extraneous input 'b' expecting {<EOF>, 'a'} with antlr 4.6

The output of 4.6 is preferable as 'a' was a valid input at that position.

The recovered parse tree is also worse with 4.6 recovering the later members but 4.7 generating an empty body and error nodes for the later members.

This happens in general when there is a rule that is optionally empty. In this case we can get better results by inlineing body into file but that significantly increases the size of the grammar in the non reduced example and makes the java code for converting the tree more complex.

@michaelpj
Copy link
Contributor

This is serious problem for us, since it degrades almost every parse error message.

@michaelpj
Copy link
Contributor

git bisect pegs 8b21cc3 as the commit where this starts.

@michaelpj
Copy link
Contributor

I'm not sure what's going on here - it looks like either DefaultErrorStrategy.sync should return if it gets a Token.EOF, or LL1Analyzer._LOOK should return Token.EPSILON in more cases.

I note that LL1Analyzer._LOOK does return Token.EPSILON instead of Token.EOF if the prediction context is null. We pass it as null initially, but it looks like it could become non-null via some of the recursive calls.

cc @sharwell as the author of 8b21cc3

@sharwell
Copy link
Member

sharwell commented Jul 11, 2017

I would need to see some additional examples here. The error message in this case is less than optimal, but it's definitely the intended outcome of that change. Since body parses without error for this input, by the time the error is detected code is no longer inside that rule and a is no longer valid as the next symbol.

@sharwell
Copy link
Member

💭 It may be possible to remember the ATNConfigSet following the last matched character, and use that position to calculate the expected set in deferred error handling scenarios. In other words, while we don't change the location where the error is detected, we can change the message that gets calculated there.

@michaelpj
Copy link
Contributor

Good to know that it's intentional. That makes this more of a feature request, then.

Consider the following slightly more realistic grammar:

member : 'function';
body: '{' member* '}';
file : body EOF;
Identifier: [a-z]+;

On input {functio} this gives extraneous input 'functio' expecting {'function', '}'}, which is helpful.

However, if we uninline member* like so:

member : 'function';
members: member* ;
body: '{' members '}';
file : body EOF;
Identifier: [a-z]+;

then we get the worse extraneous input 'functio' expecting '}'.

There are two reasons this feels undesirable:

  1. It's a bit sad that inlining the rule gives a better error message. I appreciate that this may be inevitable for implementation reasons, but it would be nice if it wasn't the case.
  2. This is a very common pattern when writing parsers for programming languages. You have some delimiters around a scope, and within that you have a (possibly empty) list of members. In that case it would be very helpful to give the user the potentially acceptable tokens if they make a typo.

Conceptually, I would expect error recovery to operate before leaving members, since there are two possible "next steps": one to another parse of a member, and one to }.

@michaelpj
Copy link
Contributor

(The point of the different grammar above is that the problem is indeed not specific to EOF, as I had assumed earlier)

@sharwell
Copy link
Member

sharwell commented Jul 11, 2017

As a general observation, the deeper in the tree you are when recovery is attempted, the worse recovery is in practice. On the contrary, the deeper in the tree you are when recovery is attempted, the better error messaging is (both locality and accuracy). The preference to step out of a rule before attempting recovery was originally implemented as part of the work for #529. I still think that was the right move for quality of both the parse tree and recovery, but we could improve upon the messaging that is produced for cases like you've described.

The simple improvement is keeping track of ATNConfigSet at or after decisions so they can be used in error reporting. For even better error reporting, I suggested to @RobertvanderHulst that two parsing passes could be used - in the first pass the default error strategy is used, but in the second pass subtrees containing parse errors could be re-parsed using a "bail after first error" strategy combined with a prediction algorithm that attempts to take the first viable alternative until an error appears as the LL(1) symbol. The goal was to provide precise and accurate error messages as well as great recovery behavior, but I haven't yet heard back about the results of that in practice.

@michaelpj
Copy link
Contributor

I defer to your expertise on whether it's better to step out of a rule before recovery - I'm spitballing here 😁

However, I do want to reiterate that this is a real problem for us, since for whatever reason a lot of the formations in our grammar have this problem - it seems to hit to pretty much any rule that ends with a ?, + or * section.

Here's another similar case reduced from our grammar that has the same problem:

function : 'function' '(' args? ')';
args: arg (',' arg)*;
arg: Identifier ;
file : function EOF;

Identifier: [a-z]+;
Semi: ';' ; 

On the input function(a;b), without inlining args this gives mismatched input ';' expecting ')'; whereas after inlining it gives mismatched input ';' expecting {',', ')'}.

In this case args is non-trivial enough that inlining it everywhere would be a pain.

@RobertvanderHulst
Copy link
Contributor

@sharwell We are still working on this. In our case, in our X# language, we see most problems in the statementBlock rule which is declared as
statementBlock : (Stmts+=statement)*;
Our statement rule consists of many labeled alternatives, such as forStmt, ifStmt and of course expressionStatement.
expressionStatement is the last stament rule, declared as
Exprs+=expression (COMMA Exprs+=expression)* eos
The expression rule consists of the usual list of alternatives such as postfixExpression, prefixExpression, binaryExpression, assignmentExpression, primaryExpression etc of which most are recursive and use the expression rule again.

What we see is that if an error occurs in an expression rule, then the parser leaves the expression rule, leaves the expression statement rule and synchronizes at the end of the statementBlock. This usually gives very cryptic error messages and many lines of code are not even checked by the parser. So when the user fixes one typo (such as a missing comma or closing parenthesis) then the parser is likely to detect another and another.

The antlr reference suggests that the parser should either 'eat' unexpected tokens or add missing tokens and continue, but in the current version Antlr does not seem to do that.
We almost never see error messages such as 'missing Something' but usually see error messages such as 'unexpected input ...' after which sometimes many lines of input are skipped.

We have looked at the method you suggested but we do not have a working solution yet.

@marcospassos
Copy link
Contributor

I can confirm that Antlr 4.7 introduced BC breaks. We just updated from 4.6 to 4.7 and 300+ unit tests are now broken. We have hundreds of error cases that were _ "missing token" in 4.6 and become "mismatched input" in 4.7.

@marcospassos
Copy link
Contributor

Probably related to #1967

@sharwell
Copy link
Member

🔗 For my own reference, the new behavior here was introduced in #1546.

@parrt
Copy link
Member

parrt commented Jul 26, 2017

Hi all. The fix put in by @sharwell was important if I remember correctly. It looks like Sam is working on fixing the error messages. I wonder if this isn't something to do with the "sync" check before any EBNF construct. I remember we messed with that as well. I haven't looked deeply into this. I'm neck deep teaching a boot camp for the next three weeks.

sharwell added a commit to sharwell/antlr4 that referenced this issue Jul 26, 2017
sharwell added a commit to sharwell/antlr4 that referenced this issue Jul 27, 2017
@sharwell
Copy link
Member

I wonder if this isn't something to do with the "sync" check before any EBNF construct.

My solution requires that a call to Sync appear before every decision. This condition currently holds after we applied the template changes.

ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 10, 2018
ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 10, 2018
ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 10, 2018
Port 0803c74 from the Java runtime to Swift.  This was issue antlr#1922.

This also changes DefaultErrorStrategy.reportInputMismatch to handle
when values are missing.
ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 10, 2018
Port 0803c74 from the Java runtime to Swift.  This was issue antlr#1922.

This also changes DefaultErrorStrategy.reportInputMismatch to handle
when values are missing.
ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 11, 2018
Port 0803c74 from the Java runtime to Swift.  This was issue antlr#1922.
Enable the corresponding tests for Swift.
ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 15, 2018
Port 0803c74 from the Java runtime to Swift.  This was issue antlr#1922.
Enable the corresponding tests for Swift.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants