
Go runtime performance fixup #2916

Closed

Conversation

dmitrys99

The main performance issue in the Go runtime, compared to the Java runtime, is the absence of a hashCode() method in the Go world. It is emulated with a hash() method, which is called, directly or indirectly, a significant number of times. But instead of returning a precalculated hash, it recomputes the actual hash on every call. This leads to significant performance degradation.
This PR reduces the time spent on hash calculation by roughly a factor of 7.

@dmitrys99
Author

Further investigation showed the real problem is the GC and pointers: https://syslog.ravelin.com/further-dangers-of-large-heaps-in-go-7a267b57d487
So, to eliminate the performance issues we have to rework the Go target runtime and decrease (or remove, if possible) the use of pointers in the runtime data structures.

@exander77

@dmitrys99 Is it in a viable state for testing? I see Golang performance issues as well.

@dmitrys99
Author

> @dmitrys99 Is it in a viable state for testing? I see Golang performance issues as well.

I added several commits; compilation now goes significantly faster.
The issue is that ATNConfig has 3 pointers and the GC goes crazy checking those pointers.
I eliminated one of the 3 pointers, but have no clue how to deal with the other two.

You can try testing with this PR; it does things faster, but not as fast as the Java runtime does.

@exander77

exander77 commented Nov 12, 2020

@dmitrys99 I have tested it and it seems to cut 30% from the parsing time (in my case).

@exander77

@parrt Can this be pulled into master? I can confirm a significant speed-up, and maybe other people can join the effort.

@parrt
Member

parrt commented Nov 12, 2020

@davesisson I think you're the Go guy. Can you take a look?

@exander77

exander77 commented Nov 12, 2020

This pull makes it much better, but if there is a Golang expert, I would like an opinion on why it is spending so much time in the garbage collector. Can the garbage collector be disabled during parsing and the memory cleared once after parsing?

runtime.scanobject itself is 11.02%.

      flat  flat%   sum%        cum   cum%
    14.33s 11.02% 11.02%     27.54s 21.18%  runtime.scanobject
     9.18s  7.06% 18.08%     60.66s 46.65%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).closureWork
     7.13s  5.48% 23.56%     23.98s 18.44%  runtime.mallocgc
     4.84s  3.72% 27.28%     25.85s 19.88%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).getEpsilonTarget
     4.69s  3.61% 30.89%      6.54s  5.03%  runtime.findObject
     4.06s  3.12% 34.01%      5.41s  4.16%  runtime.heapBitsSetType
     3.61s  2.78% 36.79%      6.89s  5.30%  runtime.cgocall
     3.25s  2.50% 39.29%      3.25s  2.50%  runtime.memclrNoHeapPointers
     2.37s  1.82% 41.11%      2.40s  1.85%  syscall.Syscall
     2.33s  1.79% 42.90%      2.33s  1.79%  runtime.nanotime (inline)
     2.31s  1.78% 44.68%      2.31s  1.78%  github.com/antlr/antlr4/runtime/Go/antlr.(*BaseATNConfig).GetState
     2.04s  1.57% 46.25%      2.04s  1.57%  runtime.nextFreeFast
     1.94s  1.49% 47.74%      1.94s  1.49%  runtime.memmove
     1.92s  1.48% 49.22%        13s 10.00%  github.com/antlr/antlr4/runtime/Go/antlr.NewBaseATNConfig
     1.80s  1.38% 50.60%      5.56s  4.28%  runtime.greyobject
     1.75s  1.35% 51.95%      1.75s  1.35%  github.com/antlr/antlr4/runtime/Go/antlr.(*BaseATNState).GetTransitions
     1.63s  1.25% 53.20%     28.95s 22.26%  runtime.gcDrain
     1.60s  1.23% 54.43%      7.73s  5.94%  runtime.mapassign_fast64
     1.55s  1.19% 55.62%      1.79s  1.38%  runtime.heapBitsForAddr (inline)
     1.39s  1.07% 56.69%      1.52s  1.17%  runtime.(*pallocBits).summarize
     1.33s  1.02% 57.71%      5.15s  3.96%  bufio.(*Reader).ReadSlice
     1.33s  1.02% 58.74%      1.33s  1.02%  runtime.markBits.isMarked (inline)
     1.32s  1.02% 59.75%      1.80s  1.38%  runtime.spanOf
     1.31s  1.01% 60.76%     60.66s 46.65%  github.com/antlr/antlr4/runtime/Go/antlr.(*ParserATNSimulator).closureCheckingStopState
     1.31s  1.01% 61.77%      1.77s  1.36%  runtime.mapaccess1_fast64

@dmitrys99
Author

I tried turning off the GC; unfortunately, it does not help. It consumes about 7 GB of memory on the test case.

@exander77

exander77 commented Nov 12, 2020

@dmitrys99 I think the problem is not the pointers but the large amount of heap allocation.

Each of these calls allocates memory and thus creates another item for the garbage collector to track:

      flat  flat%   sum%        cum   cum%
 8704.53kB 36.09% 36.09%  8704.53kB 36.09%  github.com/antlr/antlr4/runtime/Go/antlr.NewBaseATNConfig
 4608.14kB 19.11% 55.20%  4608.14kB 19.11%  github.com/antlr/antlr4/runtime/Go/antlr.NewBaseSingletonPredictionContext
 3100.11kB 12.85% 68.06%  3100.11kB 12.85%  github.com/antlr/antlr4/runtime/Go/antlr.(*BaseATNConfigSet).Add

I made a modification like this (basically allocating memory in bulk):

+var prealocatedATNConfigCount = 512
+var prealocatedATNConfigMax = prealocatedATNConfigCount*2*2*2*2
+var prealocatedATNConfigIndex = prealocatedATNConfigCount
+var prealocatedATNConfigArray []BaseATNConfig
+
 func NewBaseATNConfig(c ATNConfig, state int, context PredictionContext, semanticContext SemanticContext) *BaseATNConfig {
 	if semanticContext == nil {
 		panic("semanticContext cannot be nil")
 	}
 
-	return &BaseATNConfig{
+	prealocatedATNConfigIndex++
+
+	if prealocatedATNConfigIndex >= prealocatedATNConfigCount {
+		if prealocatedATNConfigCount < prealocatedATNConfigMax {
+			prealocatedATNConfigCount *= 2
+		}
+		prealocatedATNConfigArray = make([]BaseATNConfig, prealocatedATNConfigCount)
+		prealocatedATNConfigIndex = 0
+	}
+
+	prealocatedATNConfigArray[prealocatedATNConfigIndex] = BaseATNConfig{
 		state:                      state,
 		alt:                        c.GetAlt(),
 		context:                    context,
@@ -101,6 +116,8 @@ func NewBaseATNConfig(c ATNConfig, state int, context PredictionContext, semanti
 		reachesIntoOuterContext:    c.GetReachesIntoOuterContext(),
 		precedenceFilterSuppressed: c.getPrecedenceFilterSuppressed(),
 	}
+
+	return &prealocatedATNConfigArray[prealocatedATNConfigIndex]
 }

And it cut another 25% off the time on top of your patch, but now I leak memory for some reason. It seems prealocatedATNConfigArray does not get garbage collected. Any ideas?

@exander77

Hm, maybe it does not leak but just has a slightly larger memory footprint. You can check whether it works better for you.

@dmitrys99
Author

> Hm, maybe it does not leak but just has a slightly larger memory footprint. You can check whether it works better for you.

Thank you for your attempt to fix the issue!

It might be a solution, yet I see a couple of drawbacks.

1. Such a fix will not work if you try to run several parsers simultaneously, which is true in my case.
2. You get a speedup because, in fact, you do not produce garbage: you store links to all objects, which is fine, but it leads to additional memory consumption.

The real solution should be based on knowledge of the Antlr internals. I have questions which I have no idea how to answer.

1. When we enter the AdaptivePredict procedure, which is the high-level consumer of memory here, can we cache ATNConfigs at this level? We can definitely try, but I do not know whether some of the ATNConfigs "leak" into the ATN network. If that is true, then the whole scheme will fail, or we will have to rework the ATN internals to deal with an additional cache.

2. When is cleanup actually called, and how can it be modelled in Go? C++ uses smart pointers (i.e. destructors); Java, C#, JS and PHP use a generational GC, which does not have the "pointer" issue I mentioned above.

3. What is the scope of cleanup? What exact objects can be dropped during cleanup?

Knowing this will allow us to model a cleanup procedure in Go in a way that is performant yet different from Java's. Since @parrt is on the thread, maybe these questions are for you.

@exander77

> > Hm, maybe it does not leak but just has a slightly larger memory footprint. You can check whether it works better for you.
>
> Thank you for your attempt to fix the issue!
>
> It might be a solution, yet I see a couple of drawbacks.
>
> 1. Such a fix will not work if you try to run several parsers simultaneously, which is true in my case.
> 2. You get a speedup because, in fact, you do not produce garbage: you store links to all objects, which is fine, but it leads to additional memory consumption.

I think it would work with several parsers running in parallel, but nodes from all of them would mix into a single allocated block, and that memory would be released only once every node from every parser in it had been released. This could be reworked so that the memory is allocated per parser and all nodes in a block belong to the same parser. I was basically testing different approaches, so it is not a perfect solution.

> The real solution should be based on knowledge of the Antlr internals. I have questions which I have no idea how to answer.
>
> 1. When we enter the AdaptivePredict procedure, which is the high-level consumer of memory here, can we cache ATNConfigs at this level? We can definitely try, but I do not know whether some of the ATNConfigs "leak" into the ATN network. If that is true, then the whole scheme will fail, or we will have to rework the ATN internals to deal with an additional cache.
> 2. When is cleanup actually called, and how can it be modelled in Go? C++ uses smart pointers (i.e. destructors); Java, C#, JS and PHP use a generational GC, which does not have the "pointer" issue I mentioned above.
> 3. What is the scope of cleanup? What exact objects can be dropped during cleanup?
>
> Knowing this will allow us to model a cleanup procedure in Go in a way that is performant yet different from Java's. Since @parrt is on the thread, maybe these questions are for you.

I agree that knowledge of the Antlr internals and of the Go garbage collector's behaviour is the key here.

@KvanTTT
Member

KvanTTT commented Nov 7, 2021

@jjeffcaii could you take a look at whether this has already been fixed by your improvements, or whether it's a different problem?

@jjeffcaii

> @jjeffcaii could you take a look at whether this has already been fixed by your improvements, or whether it's a different problem?

They look similar, but I'm not sure. I think they both aim to resolve the hash problem; the modification of the murmur hash is the same.

@dmitrys99
Author

I have tested it with 4.9.3. On my side the performance increase is about 2x (21 sec vs 49 sec).

This is a good improvement.

But my previous experience shows it could be 7x. I'll try to combine the two approaches and see.

@dmitrys99
Author

dmitrys99 commented Nov 12, 2021

I have tested the test case from #2888. Here are the times:

| Item | Time, sec |
| --- | --- |
| Original issue (Antlr 4.8) | 37.2 |
| This MR (Antlr 4.8) | 17.8 |
| Antlr 4.9.3 | 22.3 |

So, while there is an improvement, I still think the problem with the Golang target is fundamental. Please take a look at #2888 (comment); there is an explanation there.

Moreover, the performance is unstable (previously I got better timings), because it depends on the Go GC implementation.

I see no option other than a Go target redesign.

@KvanTTT
Member

KvanTTT commented Nov 12, 2021

It seems like this MR should have been merged instead of the newest one.

@parrt
Member

parrt commented Dec 27, 2021

@jcking Could you take a look at this and see how we might combine efforts on performance? @dmitrys99, could you provide the test rig that checked performance?

@jimidle
Collaborator

jimidle commented Aug 22, 2022

@parrt It might be worth someone checking the performance against the current dev branch, after my performance fixes. I suspect that this is now fixed, as the need for garbage collection has been reduced drastically. We can then close this issue.

@dmitrys99
Author

> @parrt It might be worth someone checking the performance against the current dev branch, after my performance fixes. I suspect that this is now fixed, as the need for garbage collection has been reduced drastically. We can then close this issue.

I did. I can confirm the issue's test case executes in 1.6 s on the dev branch of 4.10.2, against 25.7 s on version 4.8.

@dmitrys99
Author

> @parrt It might be worth someone checking the performance against the current dev branch, after my performance fixes. I suspect that this is now fixed, as the need for garbage collection has been reduced drastically. We can then close this issue.

It would be very interesting to hear how you did that!

@jimidle
Collaborator

jimidle commented Aug 22, 2022 via email

@parrt
Member

parrt commented Aug 23, 2022

Should we close as @jimidle's PRs fixed performance or is this still needed?

@jimidle
Collaborator

jimidle commented Sep 12, 2022

#2888

@parrt I think we can close this now. Any further performance-related work will go into new issues, although I have some tweaks to do that will gradually improve both the code and the performance.

@dmitrys99
Author

Closed. Development continues in #2888.

@dmitrys99 dmitrys99 closed this Sep 13, 2022