Random ArrayIndexOutOfBounds exceptions coming from ParserATNSimulator #804

parrt · 2015-01-24T19:27:19Z

It looks like a transient problem, even in the environment where we saw the issue originally it hasn't been reproducible. Only happens in our automation environment (rehat 5.7 running on open stack running on I dunno what) and only occasionally.

It happens in here:

synchronized (from) {
    if ( from.edges==null ) { 
        from.edges = new DFAState[atn.maxTokenType+1+1];
    }   

    from.edges[t+1] = to; // connect
}

so in some case atn.maxTokenType+1 > t must be true.

I checked and it looks like the field "edges" is always protected by its parent DFAState object except for in one suspicious place where it appears to be guarded by an instance of DFA in DFA.java's setPrecedenceDfa method. (all in 4.3)

It would seem on the face of it that you need to synchronize inside setPrecedenceDfa on the DFAState rather than the DFA (or perhaps in addition too?).

Maybe we should just add an assert there.

assert t +1 < from.edges.length : "array index out of bounds!, t = " + t + ", edges.length=" + from.edges.length;

And perhaps add something about which threads are active and their stacks if it fails?

The text was updated successfully, but these errors were encountered:

sharwell · 2015-01-25T00:10:06Z

Since it's always possible for users to assign types to a token which fall outside the normal bounds defined by [EOF, atn.maxTokenType], the solution is to guard the assignment of the DFA edge. using the length of the edges array.

int index = t + 1;
if (index >= 0 && index < from.edges.length) {
    // add the edge to the DFA
    from.edges[index] = to;
}

I had not observed the issue myself because one of the memory optimizations I needed required the inclusion of this range check quite a while back. Which brings us to the actual cause....

When we added code to generate rule bypass transitions during ATN deserialization, we allowed the deserializer to define tokens which are greater than atn.maxTokenType. When one of these edges is added to the DFA, the exception reported here gets thrown.

One final note - the use of edges in setPrecedenceDfa does not require additional synchronization because the DFAState that owns it is a local variable that has not been assigned to any location visible from other threads.

parrt · 2015-01-25T00:14:12Z

Wow! You found it. excellent. I'll explore when I finish class prep.

ericvergnaud · 2015-01-25T00:19:51Z

Not sure that’s the explanation:

such a guard is already in place just before the allocation and exception:
if (from == null || t < -1 || t > atn.maxTokenType) {
return to;
}
the exception would occur deterministically

Le 25 janv. 2015 à 08:14, Terence Parr notifications@github.com a écrit :

Wow! You found it. excellent. I'll explore when I finish class prep.

—
Reply to this email directly or view it on GitHub #804 (comment).

sharwell · 2015-01-25T00:55:29Z

@ericvergnaud makes a good point.

I looked at a few other possibilities but haven't found any clear problems so far. The two areas of focus for me are:

The s0 state for precedence DFAs, which uses an edges array indexed by precedence instead of atn.maxTokenType. However, this state is never added to the states map for the DFA so it will never be the target of a transition within the DFA.
The ERROR state. If you ever fail to stop when the ERROR sentinel DFA state is reached, it will cause a big problem (it's never supposed to have outgoing edges, so it's shared across all DFAs). You could add the following to addDFAEdge, but so far I couldn't find any way for this to occur.
```
assert from != ERROR;
```

ericvergnaud · 2015-01-25T01:39:17Z

Given the context in which it occurs, I don’t think the issue lies in the algorithm itself.

I’m more suspicious with the 2 different allocation strategies for edges depending on whether the DFA is a precedence one or not, especially given the below:

if (!dfa.isPrecedenceDfa() && dfa.atnStartState instanceof StarLoopEntryState) {
    if (((StarLoopEntryState)dfa.atnStartState).precedenceRuleDecision) {
        dfa.setPrecedenceDfa(true);
    }
}

This clearly changes an existing DFA from being a non precedence one to being a precedence one, which leads to change the allocation strategy.

In a multi threaded context, you can therefore simultaneously have 1 thread executing the below in addDFAEdge:

if ( from.edges==null ) {
    from.edges = new DFAState[atn.maxTokenType+1+1];
}

from.edges[t+1] = to; // connect

and another thread executing the below in dfa.setPrecedenceStartState:

s0.edges = Arrays.copyOf(s0.edges, precedence + 1);

if the latter occurs between the former allocation and assignment, we could observe the behaviour.

This could happen as a consequence of: dfa.s0 = s0; in adaptivePredict, which is done after checking for the precedence nature of the DFA, but might have been changed in between by another thread.

In brief, I would delegate s0 assignment to DFA so it can be reliably protected and see if it improves.
I would also delegate edges allocation to DFA.

Le 25 janv. 2015 à 08:55, Sam Harwell notifications@github.com a écrit :

@ericvergnaud https://github.com/ericvergnaud makes a good point.

I looked at a few other possibilities but haven't found any clear problems so far. The two areas of focus for me are:

The s0 state for precedence DFAs, which uses an edges array indexed by precedence instead of atn.maxTokenType. However, this state is never added to the states map for the DFA so it will never be the target of a transition within the DFA.
The ERROR state. If you ever fail to stop when the ERROR sentinel DFA state is reached, it will cause a big problem. You could add the following to addDFAEdge, but so far I couldn't find any way for this to occur.

assert from != ERROR;
—
Reply to this email directly or view it on GitHub #804 (comment).

parrt · 2015-01-25T02:14:10Z

Great job, guys! Thanks. sounds suspicious as you say.

sharwell · 2015-01-25T02:19:12Z

This is one of the more... "interesting" portions of the algorithm for sure. The above situation can only be a problem if from variable in the first section and s0 variable of the second section refer to the same instance. By design, the DFAState created for a precedence DFA is not added to the DFA.states collection, which means the only way to reach this particular DFAState instance is by accessing the DFA.s0 field directly. The code path in adaptivePredict which uses the DFA.s0 field for a precedence DFA always immediately follows an outgoing edge according to the current precedence level in the parser, so there is no way for that instance to end up in addDFAEdge as the from state.

You'll find that many of these sections are not synchronized, but rather rely on deterministic algorithms to produce usable results. In other words, in places where the algorithm could read a stale value and then write a new one over the top of it, the new value is equivalent to the old value for purposes of this algorithm.

sharwell · 2015-01-25T02:24:25Z

Note that while adaptivePredict can call dfa.setPrecedenceDfa(true), there is no case where an existing DFA instance will transition from being a precedence DFA back to a normal DFA. This is a vital condition for the code flow around the block in adaptivePredict which assigns dfa.s0 = s0.

ericvergnaud · 2015-01-25T04:35:57Z

An I correct in observing that adaptivePredict is only called during testing/profiling?
If so, this may be an almost non problem i.e. we could prevent concurrent execution of adaptivePredict without impacting performance.

Le 25 janv. 2015 à 10:24, Sam Harwell notifications@github.com a écrit :

Note that while adaptivePredict can call dfa.setPrecedenceDfa(true), there is no case where an existing DFA instance will transition from being a precedence DFA back to a normal DFA. This is a vital condition for the code flow around the block in adaptivePredict which assigns dfa.s0 = s0.

—
Reply to this email directly or view it on GitHub #804 (comment).

sharwell · 2015-01-25T05:38:25Z

No, adaptivePredict is the entry point to the prediction algorithm and (along with the methods it calls) the primary focus point of almost every performance-related issue in ANTLR 4.

sharwell · 2015-01-25T05:45:43Z

adaptivePredict has two key "modes" of operation. The "warmup" path is when DFA states are getting calculated and edges are being added to the dynamically-generated DFA. The "fast" path is when future input follows edges of the DFA which were previously computed and cached. During triage for a large customer¹ working with ANTLR, it became clear that it's extremely important that no use of synchronized appear anywhere on the "fast" path.

Unlike some systems where "warmup" is strictly before the "fast" path, ANTLR 4 is generally observed to continue adding DFA states as new input is received throughout the parsing process. However, the bulk of this occurs during the "first few files", after which point the majority of decisions end up going through the DFA.

¹ The customer was implementing search functionality and using over 1,000 threads for parsing on a large multi-processor machine. With this scale, a single synchronized instance on the fast path can dominate the observed performance characteristics even if it's restricted to fine-grain locking.

ericvergnaud · 2015-01-25T05:51:50Z

Yeah it’s all over the place in generated code, but my IDE of course only sees it in non generated code.
It would still be useful to temporarily mark it as synchronized and see if the problem disappears.
This would increase our confidence on where the issue lies.

Le 25 janv. 2015 à 13:38, Sam Harwell notifications@github.com a écrit :

No, adaptivePredict is the entry point to the prediction algorithm and the primary focus point of almost every performance-related issue in ANTLR 4.

—
Reply to this email directly or view it on GitHub #804 (comment).

martint · 2015-01-26T21:13:45Z

FWIW, we're seeing this issue intermittently in our system, too:

java.lang.ArrayIndexOutOfBoundsException: 0
    at org.antlr.v4.runtime.atn.ParserATNSimulator.addDFAEdge(ParserATNSimulator.java:1838)
    at org.antlr.v4.runtime.atn.ParserATNSimulator.computeTargetState(ParserATNSimulator.java:609)
    at org.antlr.v4.runtime.atn.ParserATNSimulator.execATN(ParserATNSimulator.java:483)
    at org.antlr.v4.runtime.atn.ParserATNSimulator.adaptivePredict(ParserATNSimulator.java:423)
    at com.facebook.presto.sql.parser.SqlBaseParser.queryTerm(SqlBaseParser.java:1319)
    at com.facebook.presto.sql.parser.SqlBaseParser.queryNoWith(SqlBaseParser.java:1173)
    at com.facebook.presto.sql.parser.SqlBaseParser.query(SqlBaseParser.java:1039)
    at com.facebook.presto.sql.parser.SqlBaseParser.statement(SqlBaseParser.java:743)
    at com.facebook.presto.sql.parser.SqlBaseParser.singleStatement(SqlBaseParser.java:147)
    at com.facebook.presto.sql.parser.SqlParser$$Lambda$3/926727210.apply(Unknown Source)
    at com.facebook.presto.sql.parser.SqlParser.invokeParser(SqlParser.java:93)
    at com.facebook.presto.sql.parser.SqlParser.createStatement(SqlParser.java:66)
    ...

parrt · 2015-01-26T21:15:22Z

Hi Martin. wow, it's a party! Anything you can do to provide a (smallest possible) test that intermittently causes this issue would be much appreciated and would facilitate a quick fix. This bug concerns me greatly. Sam and Eric have already thought about this quite a bit I will jump into it as soon as I catch up after this first week of school, hopefully.

sharwell · 2015-01-26T21:28:03Z

Thanks @martint. The specific index included in the exception message made me realize what the problem is. The failure sequence is:

Thread 1 checks dfa.isPrecedenceDfa()
Thread 2 calls dfa.setPrecedenceDfa(true)
Thread 1 reads s0 as the start state, but it should have instead followed a precedence edge to find the start state

I'll send a fix later today.

parrt · 2015-01-26T21:29:44Z

Hooray! Crowd-sourcing at it's finest! :)

…ion from multiple threads Fixes antlr#804

Fixes #804 potential misuse of the DFA start state when initializing a decision

electrum · 2015-02-04T01:24:17Z

Thanks for fixing this! Any chance of getting a new release with this fix? Or if that's a big deal, getting a point release with just this fix? We hit this error quite frequently.

parrt · 2015-02-04T01:25:09Z

well it’s kind of a major transaction cost to get out a new release. how about if I provide a jar? ;)
Ter
On Feb 3, 2015, at 5:24 PM, David Phillips notifications@github.com wrote:

Thanks for fixing this! Any chance of getting a new release with this fix? Or if that's a big deal, getting a point release with just this fix? We hit this error quite frequently.

—
Reply to this email directly or view it on GitHub.

parrt added atn-analysis comp:runtime type:bug labels Jan 24, 2015

parrt assigned sharwell Jan 24, 2015

sharwell added a commit to sharwell/antlr4 that referenced this issue Jan 26, 2015

Fix potential misuse of the DFA start state when initializing a decis…

84fb456

…ion from multiple threads Fixes antlr#804

sharwell mentioned this issue Jan 26, 2015

Fix potential misuse of the DFA start state when initializing a decision #805

Merged

parrt closed this as completed in #805 Jan 31, 2015

parrt added a commit that referenced this issue Jan 31, 2015

Merge pull request #805 from sharwell/fix-804

e8c6dc4

Fixes #804 potential misuse of the DFA start state when initializing a decision

yuananf mentioned this issue Apr 10, 2015

Antlr4 error prestodb/presto#2677

Closed

parrt added this to the 4.5.1 milestone May 19, 2015

parrt added the status:fixed label May 19, 2015

rmuir mentioned this issue Oct 5, 2015

lucene-expressions fail at compiling a simple script elastic/elasticsearch#13949

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Random ArrayIndexOutOfBounds exceptions coming from ParserATNSimulator #804

Random ArrayIndexOutOfBounds exceptions coming from ParserATNSimulator #804

parrt commented Jan 24, 2015

sharwell commented Jan 25, 2015

parrt commented Jan 25, 2015

ericvergnaud commented Jan 25, 2015

sharwell commented Jan 25, 2015

ericvergnaud commented Jan 25, 2015

parrt commented Jan 25, 2015

sharwell commented Jan 25, 2015

sharwell commented Jan 25, 2015

ericvergnaud commented Jan 25, 2015

sharwell commented Jan 25, 2015

sharwell commented Jan 25, 2015

ericvergnaud commented Jan 25, 2015

martint commented Jan 26, 2015

parrt commented Jan 26, 2015

sharwell commented Jan 26, 2015

parrt commented Jan 26, 2015

electrum commented Feb 4, 2015

parrt commented Feb 4, 2015

Random ArrayIndexOutOfBounds exceptions coming from ParserATNSimulator #804

Random ArrayIndexOutOfBounds exceptions coming from ParserATNSimulator #804

Comments

parrt commented Jan 24, 2015

sharwell commented Jan 25, 2015

parrt commented Jan 25, 2015

ericvergnaud commented Jan 25, 2015

sharwell commented Jan 25, 2015

ericvergnaud commented Jan 25, 2015

parrt commented Jan 25, 2015

sharwell commented Jan 25, 2015

sharwell commented Jan 25, 2015

ericvergnaud commented Jan 25, 2015

sharwell commented Jan 25, 2015

sharwell commented Jan 25, 2015

ericvergnaud commented Jan 25, 2015

martint commented Jan 26, 2015

parrt commented Jan 26, 2015

sharwell commented Jan 26, 2015

parrt commented Jan 26, 2015

electrum commented Feb 4, 2015

parrt commented Feb 4, 2015