Worse expected token output in ANTLR 4.7 #1922

alexet · 2017-06-21T11:12:01Z

Using the following grammar

member : 'a';
body: member*;
file : body EOF;
B: 'b';/* ensure we can lex b */

on input baa

we get mismatched input 'b' expecting <EOF> with antlr 4.7 but extraneous input 'b' expecting {<EOF>, 'a'} with antlr 4.6

The output of 4.6 is preferable as 'a' was a valid input at that position.

The recovered parse tree is also worse with 4.6 recovering the later members but 4.7 generating an empty body and error nodes for the later members.

This happens in general when there is a rule that is optionally empty. In this case we can get better results by inlineing body into file but that significantly increases the size of the grammar in the non reduced example and makes the java code for converting the tree more complex.

The text was updated successfully, but these errors were encountered:

michaelpj · 2017-07-11T11:06:36Z

This is serious problem for us, since it degrades almost every parse error message.

michaelpj · 2017-07-11T11:53:35Z

git bisect pegs 8b21cc3 as the commit where this starts.

michaelpj · 2017-07-11T12:15:53Z

I'm not sure what's going on here - it looks like either DefaultErrorStrategy.sync should return if it gets a Token.EOF, or LL1Analyzer._LOOK should return Token.EPSILON in more cases.

I note that LL1Analyzer._LOOK does return Token.EPSILON instead of Token.EOF if the prediction context is null. We pass it as null initially, but it looks like it could become non-null via some of the recursive calls.

cc @sharwell as the author of 8b21cc3

sharwell · 2017-07-11T13:48:08Z

I would need to see some additional examples here. The error message in this case is less than optimal, but it's definitely the intended outcome of that change. Since body parses without error for this input, by the time the error is detected code is no longer inside that rule and a is no longer valid as the next symbol.

sharwell · 2017-07-11T13:54:35Z

💭 It may be possible to remember the ATNConfigSet following the last matched character, and use that position to calculate the expected set in deferred error handling scenarios. In other words, while we don't change the location where the error is detected, we can change the message that gets calculated there.

michaelpj · 2017-07-11T14:14:21Z

Good to know that it's intentional. That makes this more of a feature request, then.

Consider the following slightly more realistic grammar:

member : 'function';
body: '{' member* '}';
file : body EOF;
Identifier: [a-z]+;

On input {functio} this gives extraneous input 'functio' expecting {'function', '}'}, which is helpful.

However, if we uninline member* like so:

member : 'function';
members: member* ;
body: '{' members '}';
file : body EOF;
Identifier: [a-z]+;

then we get the worse extraneous input 'functio' expecting '}'.

There are two reasons this feels undesirable:

It's a bit sad that inlining the rule gives a better error message. I appreciate that this may be inevitable for implementation reasons, but it would be nice if it wasn't the case.
This is a very common pattern when writing parsers for programming languages. You have some delimiters around a scope, and within that you have a (possibly empty) list of members. In that case it would be very helpful to give the user the potentially acceptable tokens if they make a typo.

Conceptually, I would expect error recovery to operate before leaving members, since there are two possible "next steps": one to another parse of a member, and one to }.

michaelpj · 2017-07-11T14:16:29Z

(The point of the different grammar above is that the problem is indeed not specific to EOF, as I had assumed earlier)

sharwell · 2017-07-11T14:44:26Z

As a general observation, the deeper in the tree you are when recovery is attempted, the worse recovery is in practice. On the contrary, the deeper in the tree you are when recovery is attempted, the better error messaging is (both locality and accuracy). The preference to step out of a rule before attempting recovery was originally implemented as part of the work for #529. I still think that was the right move for quality of both the parse tree and recovery, but we could improve upon the messaging that is produced for cases like you've described.

The simple improvement is keeping track of ATNConfigSet at or after decisions so they can be used in error reporting. For even better error reporting, I suggested to @RobertvanderHulst that two parsing passes could be used - in the first pass the default error strategy is used, but in the second pass subtrees containing parse errors could be re-parsed using a "bail after first error" strategy combined with a prediction algorithm that attempts to take the first viable alternative until an error appears as the LL(1) symbol. The goal was to provide precise and accurate error messages as well as great recovery behavior, but I haven't yet heard back about the results of that in practice.

michaelpj · 2017-07-11T16:15:58Z

I defer to your expertise on whether it's better to step out of a rule before recovery - I'm spitballing here 😁

However, I do want to reiterate that this is a real problem for us, since for whatever reason a lot of the formations in our grammar have this problem - it seems to hit to pretty much any rule that ends with a ?, + or * section.

Here's another similar case reduced from our grammar that has the same problem:

function : 'function' '(' args? ')';
args: arg (',' arg)*;
arg: Identifier ;
file : function EOF;

Identifier: [a-z]+;
Semi: ';' ;

On the input function(a;b), without inlining args this gives mismatched input ';' expecting ')'; whereas after inlining it gives mismatched input ';' expecting {',', ')'}.

In this case args is non-trivial enough that inlining it everywhere would be a pain.

RobertvanderHulst · 2017-07-11T19:15:28Z

@sharwell We are still working on this. In our case, in our X# language, we see most problems in the statementBlock rule which is declared as
statementBlock : (Stmts+=statement)*;
Our statement rule consists of many labeled alternatives, such as forStmt, ifStmt and of course expressionStatement.
expressionStatement is the last stament rule, declared as
Exprs+=expression (COMMA Exprs+=expression)* eos
The expression rule consists of the usual list of alternatives such as postfixExpression, prefixExpression, binaryExpression, assignmentExpression, primaryExpression etc of which most are recursive and use the expression rule again.

What we see is that if an error occurs in an expression rule, then the parser leaves the expression rule, leaves the expression statement rule and synchronizes at the end of the statementBlock. This usually gives very cryptic error messages and many lines of code are not even checked by the parser. So when the user fixes one typo (such as a missing comma or closing parenthesis) then the parser is likely to detect another and another.

The antlr reference suggests that the parser should either 'eat' unexpected tokens or add missing tokens and continue, but in the current version Antlr does not seem to do that.
We almost never see error messages such as 'missing Something' but usually see error messages such as 'unexpected input ...' after which sometimes many lines of input are skipped.

We have looked at the method you suggested but we do not have a working solution yet.

marcospassos · 2017-07-26T00:15:56Z

I can confirm that Antlr 4.7 introduced BC breaks. We just updated from 4.6 to 4.7 and 300+ unit tests are now broken. We have hundreds of error cases that were _ "missing token" in 4.6 and become "mismatched input" in 4.7.

marcospassos · 2017-07-26T00:16:21Z

Probably related to #1967

sharwell · 2017-07-26T11:33:17Z

🔗 For my own reference, the new behavior here was introduced in #1546.

Fixes antlr#1922

parrt · 2017-07-26T16:20:37Z

Hi all. The fix put in by @sharwell was important if I remember correctly. It looks like Sam is working on fixing the error messages. I wonder if this isn't something to do with the "sync" check before any EBNF construct. I remember we messed with that as well. I haven't looked deeply into this. I'm neck deep teaching a boot camp for the next three weeks.

Fixes antlr#1922

sharwell · 2017-07-28T17:45:32Z

I wonder if this isn't something to do with the "sync" check before any EBNF construct.

My solution requires that a call to Sync appear before every decision. This condition currently holds after we applied the template changes.

Port 0803c74 from the Java runtime to Swift. This was issue antlr#1922.

Port 0803c74 from the Java runtime to Swift. This was issue antlr#1922. This also changes DefaultErrorStrategy.reportInputMismatch to handle when values are missing.

Port 0803c74 from the Java runtime to Swift. This was issue antlr#1922. Enable the corresponding tests for Swift.

sharwell mentioned this issue Jul 26, 2017

Is ANTLR 4.7's exception handling method different from ANTLR 4.5? #1967

Open

sharwell added a commit to sharwell/antlr4 that referenced this issue Jul 26, 2017

Report InputMismatchException with original context information

278983f

Fixes antlr#1922

sharwell mentioned this issue Jul 26, 2017

Report InputMismatchException with original context information #1969

Merged

sharwell added a commit to sharwell/antlr4 that referenced this issue Jul 26, 2017

Report InputMismatchException with original context information

9a04d09

Fixes antlr#1922

sharwell added a commit to sharwell/antlr4 that referenced this issue Jul 27, 2017

Report InputMismatchException with original context information

0803c74

Fixes antlr#1922

parrt closed this as completed in #1969 Aug 1, 2017

sharwell mentioned this issue Oct 4, 2018

Generate RuleStack tunnelvisionlabs/antlr4cs#309

Open

ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 10, 2018

[Swift] Report InputMismatchException with original context information

ae1036d

Port 0803c74 from the Java runtime to Swift. This was issue antlr#1922.

ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 10, 2018

[Swift] Report InputMismatchException with original context information

2d4bb9a

Port 0803c74 from the Java runtime to Swift. This was issue antlr#1922.

ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 11, 2018

[Swift] Report InputMismatchException with original context information

e42f534

Port 0803c74 from the Java runtime to Swift. This was issue antlr#1922. Enable the corresponding tests for Swift.

ewanmellor added a commit to ewanmellor/antlr4 that referenced this issue Nov 15, 2018

[Swift] Report InputMismatchException with original context information

842d3d7

Port 0803c74 from the Java runtime to Swift. This was issue antlr#1922. Enable the corresponding tests for Swift.

This was referenced Nov 15, 2018

[Swift] Report InputMismatchException with original context information #2402

Merged

Enable some performance tests on Swift. #2403

Merged

kaby76 mentioned this issue May 9, 2024

CSharp and Java producing different error messages for expected tokens #3700

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Worse expected token output in ANTLR 4.7 #1922

Worse expected token output in ANTLR 4.7 #1922

alexet commented Jun 21, 2017

michaelpj commented Jul 11, 2017

michaelpj commented Jul 11, 2017

michaelpj commented Jul 11, 2017

sharwell commented Jul 11, 2017 •

edited

Loading

sharwell commented Jul 11, 2017

michaelpj commented Jul 11, 2017

michaelpj commented Jul 11, 2017

sharwell commented Jul 11, 2017 •

edited

Loading

michaelpj commented Jul 11, 2017

RobertvanderHulst commented Jul 11, 2017

marcospassos commented Jul 26, 2017

marcospassos commented Jul 26, 2017

sharwell commented Jul 26, 2017

parrt commented Jul 26, 2017

sharwell commented Jul 28, 2017

Worse expected token output in ANTLR 4.7 #1922

Worse expected token output in ANTLR 4.7 #1922

Comments

alexet commented Jun 21, 2017

michaelpj commented Jul 11, 2017

michaelpj commented Jul 11, 2017

michaelpj commented Jul 11, 2017

sharwell commented Jul 11, 2017 • edited Loading

sharwell commented Jul 11, 2017

michaelpj commented Jul 11, 2017

michaelpj commented Jul 11, 2017

sharwell commented Jul 11, 2017 • edited Loading

michaelpj commented Jul 11, 2017

RobertvanderHulst commented Jul 11, 2017

marcospassos commented Jul 26, 2017

marcospassos commented Jul 26, 2017

sharwell commented Jul 26, 2017

parrt commented Jul 26, 2017

sharwell commented Jul 28, 2017

sharwell commented Jul 11, 2017 •

edited

Loading

sharwell commented Jul 11, 2017 •

edited

Loading