Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another poor performance about expression parsing #994

Closed
qwerppoo opened this issue Sep 15, 2015 · 21 comments
Closed

Another poor performance about expression parsing #994

qwerppoo opened this issue Sep 15, 2015 · 21 comments

Comments

@qwerppoo
Copy link

I'm using ANTLR-4.4 with a grammar file modified from sqlite.g4,and find some complex expressions cost too much time to parse.And then I found issue#192 .

That grammar file works fast with the antlr version I'm using now.But when I add some rules I used in my grammar file,it get slow down.

To make the issue simple,the grammar I use is just like this:

grammar expr;

expr: ID
    | 'not' expr
    | expr 'and' expr
    | expr 'or' expr
    | expr 'between' expr 'and' expr
    ;

ID: [a-zA-Z_][a-zA-Z_0-9]*;
WS: [ \t\n\r\f]+ -> skip;
ERROR: .;

and the sample file is just same like

not X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
    X1 and not X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12 or
not X1 and     X2 and not X3 and not X4 and not X5 and not X6 and not X7 and not X8 and not X9 and not X10 and not X11 and not X12

I tried this with the antlr plugin in IDEA,just 3 lines of the expressions will make the parsing time unacceptable.

@qwerppoo qwerppoo changed the title another Poor performance about expression parsing Another poor performance about expression parsing Sep 15, 2015
@KvanTTT
Copy link
Member

KvanTTT commented Sep 18, 2015

Your should use two-stage strategy as decribed in pointed issue: #192 (comment)
It is solve your problem with slow parsing.

@sharwell
Copy link
Member

Two stage parsing does resolve the issue when the input is correct, but error recovery results in a fallback to the second stage where it goes slow again.

The solution to this is making every precedence rule transition in the prediction algorithm follow a precedence-dependent edge, as opposed to the current behavior where precedence is only considered for the DFA start state. It would be a very complicated change to the way the DFA cache works, but obviously be a big win in some real scenarios.

@KvanTTT
Copy link
Member

KvanTTT commented Sep 25, 2015

Anyway, for this grammar performance is good with two-stage parsing even for input file with errors (Maybe EOF or some input market missing in this grammar).

@WolfgangFahl
Copy link

A JUnit test with a similar grammar shows the following behavior for adding "AND" clauses in a loop and parsing ever growing AND expressions:

@Test
  public void testNestedAndCode() throws Exception {
    // RuleLanguageParser.ONE_STAGE=false;
    String andClause="VALUE=0";
    for (int i=1;i<=12;i++) {
      long start = System.currentTimeMillis();
      andClause+="\n AND VALUE"+i+"=0\n";
      doTestRule("if\n" + andClause+
       "\nendif");     
      long stop = System.currentTimeMillis();
       System.out.println(String.format("%3d %4d msecs",i,stop - start));
    }
  }

With One Stage Parsing:

  1   418 msecs
  2    33 msecs
  3    39 msecs
  4    75 msecs
  5   189 msecs
  6   424 msecs
  7   943 msecs
  8   869 msecs
  9  2714 msecs
 10  6346 msecs
 11 14929 msecs
 12 39246 msecs

With Two Stage Parsing:

  1   362 msecs
  2    16 msecs
  3    37 msecs
  4    63 msecs
  5   151 msecs
  6   396 msecs
  7   623 msecs
  8  1039 msecs
  9  3154 msecs
 10  6924 msecs
 11 18117 msecs
 12 47501 msecs

In both cases the parsing time roughly doubles with every additional "AND". Since this is obviously inacceptable it would be good to have a solution or workaround.

I have now added a JUnit test to my environment that fails if the average performance ratio for adding an additional "AND" is higher than 1.2. I'd rather see a linear behavior, though.

@Test
  public void testNestedAndCode() throws Exception {
    // RuleLanguageParser.ONE_STAGE=false;
    String andClause="VALUE=0";
    long prevduration=0;
    double ratiosum=0;
    int ratiocount=0;
    for (int i=1;i<=6;i++) {
      long start = System.currentTimeMillis();
      andClause+="\n AND VALUE"+i+"=0\n";
      doTestRule("if\n" + andClause+
       "\nendif");     
      long stop = System.currentTimeMillis();
      long duration=stop-start;
      System.out.println(String.format("%3d %5d msecs",i,duration));
      if (i>2) {
        double ratio=duration/prevduration;
        ratiosum+=ratio;
        ratiocount++;
      }
      prevduration=duration;
    }
    double averageRatio=ratiosum/ratiocount;
    System.out.println(String.format("ratio: %3.2f",averageRatio));
    assertTrue(averageRatio<1.2);
  }

@WolfgangFahl
Copy link

Here is a Junit Test for the grammar originally specified for this issue:

import static org.junit.Assert.*;

import java.io.IOException;

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.junit.Test;

/**
 * Test the Issue 994 performance 
 * @author wf
 *
 */
public class TestIssue994 {
  boolean debug=false;


  /**
   * see https://github.com/antlr/antlr4/issues/994
   * @throws Exception
   */
    @Test
    public void testIssue994() throws Exception {
    String andClause="not X0";
    long prevduration=0;
    double ratiosum=0;
    int ratiocount=0;
    for (int i=1;i<=20;i++) {
      long start = System.currentTimeMillis();
      if (i%4==1) {
        andClause+=" or X"+i;
      } else {
        andClause+=" and not X"+i;
      }
      doTestExpr(andClause);     
      long stop = System.currentTimeMillis();
      long duration=stop-start;
      if (i>=2) {
        double ratio=duration*1.0/(prevduration*1.0);
        System.out.println(String.format("%3d %5d msecs %5.1f",i,duration,ratio));
        if (duration>25) {
          ratiosum+=ratio;
          ratiocount++;
        }
      }
      prevduration=duration;
    }
    double averageRatio=ratiosum/ratiocount;
    System.out.println(String.format("ratio: %3.2f",averageRatio));
    assertTrue("Performance issue https://github.com/antlr/antlr4/issues/994",averageRatio<1.2);

    }

    /**
     * do parse the given expression
     * @param andClause
     * @throws IOException
     */
  private void doTestExpr(String andClause) throws IOException {
    if (debug)
      System.out.println(andClause);
    ANTLRInputStream in = RuleLanguageParser.streamForText(andClause);
    exprLexer lexer = new exprLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    exprParser parser = new exprParser(tokens);
    parser.expr();
  }

}

it has the following time behaviour:

len   time           ratio
  2    12 msecs   0,1
  3     4 msecs   0,3
  4     8 msecs   2,0
  5     8 msecs   1,0
  6     9 msecs   1,1
  7    11 msecs   1,2
  8    20 msecs   1,8
  9    31 msecs   1,6
 10    47 msecs   1,5
 11    73 msecs   1,6
 12   135 msecs   1,8
 13   230 msecs   1,7
 14   189 msecs   0,8
 15   203 msecs   1,1
 16   246 msecs   1,2
 17   452 msecs   1,8
 18   934 msecs   2,1
 19  1723 msecs   1,8
 20  2702 msecs   1,6
ratio: 1,55

@jimidle
Copy link
Collaborator

jimidle commented Apr 6, 2016

Performance tests that measure just a single iteration of anything are
usually not a good idea, especially in Java. You should re-jig this to
create the parser, then run say 2000 iterations and divide to get an
average. Really you should also measure the loop time itself and take that
off, but an empty loop may optimize out.

I don't dispute your thoughts but you may find that the performance
difference is not as much as you think.

Jim

On Mon, Apr 4, 2016 at 5:03 PM, Wolfgang Fahl notifications@github.com
wrote:

Here is a Junit Test for the grammar originally specified for this issue:

import static org.junit.Assert.;
import java.io.IOException;
import org.antlr.v4.runtime.ANTLRInputStream;import org.antlr.v4.runtime.CommonTokenStream;import org.junit.Test;
/
* * Test the Issue 994 performance * @author wf * */public class TestIssue994 {
boolean debug=false;

/* * see #994 * @throws Exception _/
@test
public void testIssue994() throws Exception {
String andClause="not X0";
long prevduration=0;
double ratiosum=0;
int ratiocount=0;
for (int i=1;i<=20;i++) {
long start = System.currentTimeMillis();
if (i%4==1) {
andClause+=" or X"+i;
} else {
andClause+=" and not X"+i;
}
doTestExpr(andClause);
long stop = System.currentTimeMillis();
long duration=stop-start;
if (i>=2) {
double ratio=duration_1.0/(prevduration
1.0);
System.out.println(String.format("%3d %5d msecs %5.1f",i,duration,ratio));
if (duration>25) {
ratiosum+=ratio;
ratiocount++;
}
}
prevduration=duration;
}
double averageRatio=ratiosum/ratiocount;
System.out.println(String.format("ratio: %3.2f",averageRatio));
assertTrue("Performance issue https://github.com/antlr/antlr4/issues/994",averageRatio<1.2);

}

/**     * do parse the given expression     * @param andClause     * @throws IOException     */

private void doTestExpr(String andClause) throws IOException {
if (debug)
System.out.println(andClause);
ANTLRInputStream in = RuleLanguageParser.streamForText(andClause);
exprLexer lexer = new exprLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
exprParser parser = new exprParser(tokens);
parser.expr();
}

}

it has the following time behaviour:

2 12 msecs 0,1
3 4 msecs 0,3
4 8 msecs 2,0
5 8 msecs 1,0
6 9 msecs 1,1
7 11 msecs 1,2
8 20 msecs 1,8
9 31 msecs 1,6
10 47 msecs 1,5
11 73 msecs 1,6
12 135 msecs 1,8
13 230 msecs 1,7
14 189 msecs 0,8
15 203 msecs 1,1
16 246 msecs 1,2
17 452 msecs 1,8
18 934 msecs 2,1
19 1723 msecs 1,8
20 2702 msecs 1,6
ratio: 1,55


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#994 (comment)

@WolfgangFahl
Copy link

@jimidle - thank you for your comment. Please note that I stopped the loop at 20 on purpose. If you keep almost doubling the timing from there you will end up with 20 minutes parsing time after some 10 more loops and some 300 hours parsing time after additional 20 loops. Keep adding and you'll basically wait for ever. This is called exponential growth and should not happen.

Running this a few thousand iterations is prohibitive since I do not need a more precise result I need a different time behaviour see https://en.wikipedia.org/wiki/Big_O_notation

In the ANTLR techreport http://www.antlr.org/papers/allstar-techreport.pdf
you'll find the statement

ALL() is O(n_*4) in theory but consistently per- forms linearly on grammars used in practice, outperforming general strategies such as GLL and GLR by orders of magnitude._

The basis of this statement seems to be that the the two stage parsing approach as outlined in paragraph 3.2 Two-stage ALL(*) parsing is used.

This means the SLL and LL strategies have to be used one after another. It also means that you might have to adapt your grammar. The grammar of this Issue e.g. is fit for the two stage strategy. Unfortunately a bigger grammar I am currently analyzing is not and I still have to find out how to make it SLL - this is the reason why I posted my results here in the first place. One way to get my grammar more SLL like was to change the order of grammar rules putting possibly "longer" matches before shorter matches. E.g.

in_clause:
    field (IN field) * |
    field
   ;

is better than

in_clause:
    field |
    field (IN field) * 
   ;

@jimidle
Copy link
Collaborator

jimidle commented Apr 6, 2016

Youre rule can be just:

sharwell added a commit to sharwell/antlr4 that referenced this issue Nov 23, 2016
sharwell added a commit to sharwell/antlr4 that referenced this issue Nov 23, 2016
sharwell added a commit to sharwell/antlr4 that referenced this issue Nov 23, 2016
@parrt parrt added this to the 4.6 milestone Nov 24, 2016
parrt added a commit to parrt/antlr4 that referenced this issue Nov 24, 2016
…ery large expression input phrases; builds off of @sharwell solution that explicitly checks for key return states in expr rules
@parrt parrt closed this as completed in fca5e45 Nov 25, 2016
parrt added a commit that referenced this issue Nov 25, 2016
Fixes #994; poor left-recursive rule performance
@parrt
Copy link
Member

parrt commented Nov 25, 2016

Ok, @sharwell came up with an optimization that gives like 10,000x performance on big expression input such as this. Just merged a slight variation on Sam's implementation. @sufail @danielsun1106 @WolfgangFahl @KvanTTT can you try out the latest antlr/antlr4 master?

@sharwell
Copy link
Member

@danielsun1106 I've merged this into my optimized branch (which you are using now) as well in sharwell/antlr4@f5776dd4, so if you have the ability to build from source you could try that already. Otherwise I'll be pushing a patch release soon (next couple of days) and you can try that.

@daniellansun
Copy link

@sharwell I'd like to have a try in a day or two via building from source. As you know, groovy-parser project contains about 13000 test cases. We should see massive performance improvement after using the recently optimized version. That would be really great :)
Thanks for your nice work!

@WolfgangFahl
Copy link

Thanx - this looks good. My checking will have to wait until end of december. I am definitely curious to see the effects and see that the other testcases in my project are not effected.

@sufail
Copy link

sufail commented Dec 11, 2016

@parrt , @sharwell
Is there any antlr-4.6-Snapshot.jar or maven dependency for the latest jar which includes the fixes for this available? I am finding it's difficult to build the sources myself for the master branch in my environment.
And the Antlr-maven plugin.jar as well?

@daniellansun
Copy link

daniellansun commented Dec 11, 2016

@sufail
If you are using Sam's optimized version of antlr4, I have the latest jar that you want. BTW, I'm looking forward for the new release too :)
https://github.com/danielsun1106/groovy-parser/blob/master/lib/antlr4.jar

@parrt
Copy link
Member

parrt commented Dec 11, 2016 via email

@parrt
Copy link
Member

parrt commented Dec 11, 2016

@danielsun1106 @sufail Can you try mvn for 4.6-SNAPSHOT? I did a deploy which looks like it put stuff here: https://oss.sonatype.org/content/repositories/snapshots/org/antlr/antlr4/4.6-SNAPSHOT/ etc...

@sufail
Copy link

sufail commented Dec 12, 2016

@parrt @danielsun1106 @sharwell
I was able to build the project in Windows , which added the jars to my local maven repository. The Jars I am interested in are antlr-runtime-4.6-SNAPSHOT.jar and antlr4-maven-plugin-4.6-SNAPSHOT.jar. My Project was in Unix VM and I was able to install these two jars over there successfully. But the antlr-4-maven-plugin is not working there (saying unsatisfied dependency) .

Write now I am generating code for the grammars using antlr-maven-4.5 plugin for a runtime of antlr-runtime-4.6-SNAPSHOT version. Hope this okay , I am getting a warning message for this though which is coming from the RuntimeMetaData class.

The performance is really good with the new jar in place :) . So no more need to worry about the left recursive rules span over all expression rules? Still I need to break the left recursive rules to more primitive rules for files with insane String manipulations?

@sufail
Copy link

sufail commented Dec 13, 2016

@parrt @danielsun1106 @sharwell

The new builds are only targetted for java >=7 ?? Cause it's not getting compiled in java 6. The tool and antlr-4 -runtime sources have something like this
In Grammar.java Set visited = new HashSet<>();
In ParserRuleContext.java this.children = new ArrayList<>();

It's saying Diamond operator not supported in java 6. I had to fill in the appropriate types there to get this compiled.
Ignore this if Java 6 being deliberately getting avoided.

@parrt
Copy link
Member

parrt commented Dec 13, 2016

@sufail I have moved the project to java 7 now.

@sharwell
Copy link
Member

@sufail The reference release of ANTLR now targets Java 7+, since it makes things easier for everyone trying to work on this repository. My fork has more enterprise users that care more about compatibility, so it will continue to support Java 6 throughout the life of ANTLR 4.

janyou added a commit to janyou/antlr4 that referenced this issue Dec 14, 2016
parrt added a commit that referenced this issue Dec 14, 2016
wangyum pushed a commit to 1haodian/spark that referenced this issue Aug 24, 2017
## What changes were proposed in this pull request?
This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.

The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
```sql
SELECT *
FROM RANGE(1000)
WHERE
TRUE
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
```

This is caused by a know bug in ANTLR (antlr/antlr4#994), which was fixed in version 4.6.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes apache#19042 from hvanhovell/SPARK-21830.
@WolfgangFahl
Copy link

There is now an open source project with some helper function that also includes the Unit Test for this issue see

curtishoward pushed a commit to curtishoward/spark that referenced this issue Mar 23, 2018
## What changes were proposed in this pull request?
This PR bumps the ANTLR version to 4.7, and fixes a number of small parser related issues uncovered by the bump.

The main reason for upgrading is that in some cases the current version of ANTLR (4.5) can exhibit exponential slowdowns if it needs to parse boolean predicates. For example the following query will take forever to parse:
```sql
SELECT *
FROM RANGE(1000)
WHERE
TRUE
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
AND NOT upper(DESCRIPTION) LIKE '%FOO%'
```

This is caused by a know bug in ANTLR (antlr/antlr4#994), which was fixed in version 4.6.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes apache#19042 from hvanhovell/SPARK-21830.

(cherry picked from commit 05af2de)
(cherry picked from commit 3d4892f1d4fe8556bdd010056df645f2f1aafe56)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants