.NET 8 crash with "Fatal error. Failed to create RW mapping for RX memory" #97316
Comments
I know this isn't much info to go on. I've installed …
@loop-evgeny this error occurs when …
@janvorli I see, thanks for the explanation. Correct, we're not running in a container. This is a rather cryptic error message, so it would be nice to include in the message what you wrote here, like "Failed to create RW mapping for RX memory. This may be caused by running out of memory or out of memory mappings - check the vm.max_map_count setting on Linux or (whatever) on Windows". But it seems like we're only the second ones to run into it. We haven't changed max_map_count, so it's at the default value of 65530. But is there a downside to increasing this? Why is there even a limit?
I checked the …
There is no downside. The setting has no effect on applications that use fewer mappings and enables proper execution of applications that use more.
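For reference, on Linux the current limit can be inspected and raised with sysctl; a minimal sketch (the target value here is illustrative, pick anything comfortably above your observed mapping count):

```bash
# Show the current limit (65530 by default on most distributions)
sysctl vm.max_map_count

# Raise it for the running kernel; this does not persist across reboots
sudo sysctl -w vm.max_map_count=1048576
```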
I've been monitoring the count of maps for that process for a week now and while it hasn't gone over 65K yet, it is steadily increasing. RAM usage goes up during the daily data loading (smaller than usual last week), then down again, but the number of memory map areas does not go down significantly. It started around 19K a week ago and is now at 32K. Can there be a "leak" in that somehow - without an obvious memory leak?
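One way to do this kind of monitoring (a minimal sketch; the PID value is a placeholder for the service's process id):

```bash
# Each line of /proc/<pid>/maps is one mapping, so the line count shows
# how close the process is getting to vm.max_map_count.
PID=12345   # placeholder: substitute the real process id
while sleep 3600; do
    echo "$(date -Is) $(wc -l < /proc/$PID/maps)"
done
```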
We finally had a few days where memory usage went > 300 GB and the number of maps went over 80K. It then went down together with RAM usage, to ~34K. So it seems like there is no "leak" and increasing vm.max_map_count was the right fix. It would be good if the error message explained what the likely problem is, though. There is no way the average developer troubleshooting a crash will know that "Failed to create RW mapping for RX memory" means "Either you're out of memory or you need to increase vm.max_map_count".
@loop-evgeny thank you for the suggestion, that makes sense. When I added that error message, I didn't realize that the max map count could also be causing the issue. I'll update the message along the lines of what you've suggested.
We do think there is a change in how .NET 8 allocates/uses process memory. This could be "by design": in that case the change in behavior was a breaking change that required a notice. Reproducing should be fairly easy: run 2 processes (one on .NET 7 and another on .NET 8) in a stress test environment allocating managed and unmanaged memory, record the /proc/self/maps (VAS) counts, then compare whether there is a significant difference. Our environment: …
To illustrate the crash pattern after the .NET 8 upgrade and before the vm.max_map_count increase: … We will provide more information as deemed necessary and are also going to do a comparison stress test run.
@baal2000 the growth in the number of memory mappings between .NET 7 and 8 is significant only if there is a large amount of managed code being generated on the fly. We have never hit this in our internal testing, as it requires quite specific application behavior. I agree that we should document somewhere that when you experience an issue like this, vm.max_map_count needs to be updated. The value can be set to any large value; enlarging it doesn't result in any additional growth in memory consumption other than the one related to the needed growth in the number of mappings required by .NET. So you don't really need to figure out some optimal value; you can e.g. set it to 100 times the default and be good.
Has the team profiled /proc/self/maps counts? We do not necessarily need to "hit" an issue.
Could you elaborate and point to a specific area inside the framework that now allocates differently than under the old framework?
We have not, but a high count is not necessarily a problem per se, so we had no reason to do that. And we were not aware of the relatively low default limit value, which would probably have made us consider this a problem, until people reported it in this issue.
The write xor execute feature, which prevents code memory from being executable and writeable at the same time, is what caused the difference in the memory mapping pattern. There are several kinds of small stubs: some are created for methods that are called by managed code but were not compiled / resolved yet, and some for call counting, which lets us dynamically re-jit methods on hot paths with more optimizations (this is called tiered compilation). The memory for these stubs is allocated as pairs of blocks of memory, one read-execute block for the code of the stubs and one read-write block for the writeable data of the stubs. This is what causes the large number of mappings when there are a lot of methods, because these blocks are effectively interleaved in memory, so each of them requires a separate memory mapping. These blocks are 16kB long. So e.g. for FixupPrecode stubs, each stub is 24 bytes long, so stubs for 682 methods fit into one pair of blocks. The call counting stubs are 32 bytes long, so 512 method stubs fit into each pair of blocks.
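The per-pair stub counts quoted above follow directly from the 16kB block size; a quick sanity check:

```bash
echo $((16 * 1024 / 24))   # FixupPrecode stubs per block pair -> 682
echo $((16 * 1024 / 32))   # call counting stubs per block pair -> 512
```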
@janvorli …
@baal2000 the write xor execute can be turned off by setting the env var DOTNET_EnableWriteXorExecute=0.
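For a service managed by systemd, as in the original report, one way to try this is a drop-in override (a sketch; the unit name is a placeholder):

```ini
# /etc/systemd/system/myapp.service.d/override.conf (hypothetical unit name)
[Service]
Environment=DOTNET_EnableWriteXorExecute=0
```

followed by `sudo systemctl daemon-reload && sudo systemctl restart myapp`.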
Then it is probably not related to the write xor execute feature and the stubs I was talking about. Could you share an smaps dump of a process with a large number of mappings? That could shed some light on the problem.
/proc/PID/smaps file?
Yes, please. Feel free to trim any filenames from it in case they are sensitive.
On the other hand, as mentioned in my first message, due to a .NET 7 GC segfault issue we had configured .NET 7 with … Update: …
In the meantime, @baal2000 has shared with me some smaps / maps logs. I was surprised to see that there are many mappings that are adjacent in the virtual address space and have the same protection and flags, and yet the kernel has not merged them. All of them were multiples of 4MB in size.
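A rough way to quantify that from a maps snapshot is to count consecutive entries whose address ranges touch and whose permissions match, i.e. candidates the kernel could in principle have merged (a sketch; the kernel also compares other flags that this ignores):

```bash
# Count adjacent mappings with identical permissions that were not merged
# (the end address of one entry equals the start address of the next).
awk -F'[- ]' '
    prev_end == $1 && prev_perms == $3 { unmerged++ }
    { prev_end = $2; prev_perms = $3 }
    END { print unmerged + 0, "adjacent unmerged mappings" }
' /proc/$PID/maps
```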
I forgot to mention how to set the GC region size to 16MB. Setting the …
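The variable name is cut off above. Assuming it is the DOTNET_GCRegionSize environment variable (an assumption on my part; verify the exact name before relying on it), 16MB would be written in hex:

```bash
# Assumed variable name - 0x1000000 bytes = 16 * 1024 * 1024 = 16MB
export DOTNET_GCRegionSize=0x1000000
```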
Is increasing the GC region size worth doing for us as well? We haven't hit the issue since increasing vm.max_map_count.
@loop-evgeny it would be best if you tried that with your app and based your decision on the real-world perf results. We don't have any data on performance differences with different GC region sizes. My expectation would be that there is no measurable difference; however, it is always better to measure using your specific scenario to see what works best for you. And if you try it, please let us know about the results, as many people would benefit from that.
@loop-evgeny Update: 1 week later …
At the end of the day, the Linux kernel is responsible for creating and merging these virtual memory mappings. As such, I did a study of the Linux kernel to see how it works, and as part of the study I wrote up some notes on what I found. In short, I think there is a missed opportunity there, such that the Linux kernel could do a better job in our scenario. https://cshung.github.io/posts/linux-virtual-memory-mapping-debugging/
In the foreseeable future we still need segments for 32-bit platforms, so we will continue to release …
After thinking a bit about your findings: I feel that we should not lay all the blame at Linux's feet. This issue should be followed up with a more practical step of "what we could do", not "what they (Linux) should do". For instance, deciding on a different default region size (currently standing at 4MB), or changing some other parts of the GC implementation.
I agree that we should try to do as much as we can, since even if a fix got into the Linux kernel now, it would take time to become mainstream. And since there was a Linux kernel patch that tried to fix this problem in the past and was rejected for reasonable reasons, I am a bit skeptical about it being changed in the foreseeable future. We already have plans to run some performance benchmarks with different region sizes to see if they have any perf impact. I have also investigated the possibility of a different pattern of accessing the memory, to lower the probability that the kernel will fail to merge the blocks. Once we know the influence of region sizes on perf, we could decide to pick the region size automatically, based e.g. on the total amount of available memory. However, I still haven't seen evidence that the large number of mappings that can result from the current state causes performance problems. The mapping count limit can be raised to a larger value to mitigate the OOMs; the default limit is quite conservative. I don't actually feel like we are aggressively committing / decommitting memory. We are just doing that on demand from the application - the more it allocates, the more we need to commit.
@janvorli thanks for the work and the progress your team is making to address this. I saw the updated error message in #102458, but that could come too late in the app development cycle. Other times the process crashes with no message at all, other than a "General Protection Fault" in the kernel logs or a simple "out of memory" message; not sure why. These are all related, though, because none of them happens once the limit is raised in the OS.
Could you propose the change in the Linux kernel repo? If there are no downsides it is going to be merged in no time, and this issue could be closed.
Thanks for following up!
Just to reiterate: I said "I think there is a missed opportunity there for Linux to improve"; I am not blaming them. This is probably just a scenario they never envisioned, and we should enlighten them on it.
The region-based GC isn't aggressive in terms of volume (we are not committing much memory beyond what the application needs), nor is it aggressive in terms of frequency (we have optimizations in place to avoid frequent commit/decommit calls). It is the random sequence of committing that is choking the underlying OS. A typical stack grows one way. A typical … To allow various optimizations, regions work by allowing multiple regions to grow independently, so think of that as many streams going on, all going one way towards the right. Think of it like an old-school parallel download app. At the end of the download you expect one region, but no: on Linux these streams don't merge, and you end up with as many as the number of streams ever created. This story starts with 5N streams, where N is the number of cores and 5 is the number of generations. Once any of these streams ends, we create new ones. But then, by the time every one of the initial 5N streams ends, all the earlier gaps should have been filled, and therefore we expect the number of streamed parts to stay roughly 5N to 10N. But the number we observed from the maps is a hundredfold more than that, and that is because when the ends meet, Linux cannot merge them. So it ends up being the total number of streams created so far; these accumulate over time, and therefore we have this issue. All of this comes with a caveat: it is based on my research of the Linux code base over just a few days, and it might very well be wrong.
I agree with that, but retrofitting regions to fit the Linux way of one-way committing is simply not an option. There are already various workarounds proposed in this thread; here they are (in preference order): …
And all of these do not require a change in the runtime code; they are all configurable. Except for 3 (which actually grows the memory one way), 1 and 2 don't address the underlying issue that the ends of streams don't merge; all they do is either: …
Eventually, with bigger apps, these limits will be hit again. IMO this is just fixing the symptoms. What we really want to understand is the consequences of 1 and 2. We suspect 1 might impact memory access latencies, but do we have data? The dependency of memory access latency on the number of memory mappings should be logarithmic, so I expect that even if we 4x the number of memory mappings, we will have at most two more memory accesses for each page fault. In the grand scheme of things, does that matter at all? We experimented with 2, and there is a GC behavioral change. As of now this is still mysterious to us; do we have data we can use to analyze what is going on?
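For the record, the "at most two more memory accesses" figure is just the logarithm arithmetic: looking up the mapping for a faulting address in a balanced tree of $N$ entries takes about $\log_2 N$ steps, so quadrupling the mapping count adds

$$\log_2(4N) - \log_2(N) = \log_2 4 = 2$$

extra steps per page-fault lookup.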
This nails it, thank you.
Interesting comparison, with the only difference being that with partial download streams "we" create the streams and "we" do the re-assembly at the end. In the regions GC scenario "we" do the distributed allocations to achieve better throughput, yet have no control over the vm maps re-assembly. Not saying this is wrong: this is to agree with the statement that this hasn't been expected and modeled.
The sysctl documentation does not say much about the purpose of max_map_count other than stating what the value is and the assumption that it is sufficient for standard use scenarios. RedHat claims that the limit has something to do with letting the kernel keep more of its lowmem: … Found this explanation in the original Linux kernel repo: it refers to the … Nowhere can I find a word about memory access performance, though. Maybe a prudent thing to do for now would be documenting this, similarly to the Elasticsearch recommendation of a max_map_count of at least 262144 to prevent out-of-memory exceptions: elastic.co. This is the value we also picked for our servers to stop the incidents. Not pretending this is the best wording, but the spirit of the message could be: …
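A persistent form of that recommendation could look like the following sketch (the file name under /etc/sysctl.d is arbitrary; 262144 is the value quoted above):

```ini
# /etc/sysctl.d/99-map-count.conf
vm.max_map_count = 262144
```

applied with `sudo sysctl --system` (or a reboot).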
Just as an FYI - this might have to do with the non-paged pool. Just like user mode, kernel mode can also use virtual memory. However, with virtual memory you can have a block of memory that is contiguous in virtual addresses but not contiguous in physical addresses, and that upsets direct memory access (DMA) for devices like secondary storage or network cards. Therefore the kernel keeps a particular pool of memory that is restricted so that it can guarantee contiguous physical memory; this makes DMA possible. Because the virtual memory areas are used to handle page faults, they probably need to be stored in that pool too, since you don't want to handle a page fault with memory that can itself incur a page fault. That non-paged pool is probably a precious resource of its own, which is probably why we have the conservative limit.
Also an FYI, we are not alone: a similar issue was reported for ZGC in the JVM on an SO forum. They do provide an early warning about the imminent max maps value overflow: …
FYI it's best to avoid any references to the JVM and such here due to its copyleft licensing.
... can't apply to one forum user quoting another forum user's error message. But thanks for the reminder. Now steering back to the main topic here, i.e. unexpected application crashes under the region-based GC.
I am moving this issue to "future" as there is nothing we can do for .NET 9, and the main topic of this issue has become a discussion on memory mapping merging.
Description
We have many instances of our ASP.NET Core application for different customers, each running as a systemd service. One specific instance has crashed twice with "Fatal error. Failed to create RW mapping for RX memory".
The last time was on 2023-Nov-30, so this doesn't happen often. It may be because this instance of the application is particularly large, using ~300 GB RAM while loading a lot of data (once per day) and ~50 GB at other times.
Reproduction Steps
Not reproducible. Happened 2 times so far.
Expected behavior
No crash
Actual behavior
Crash with the systemd journal containing: …
(I don't have the stack trace from the first crash, unfortunately.)
Regression?
No response
Known Workarounds
No response
Configuration
.NET 8.0.0 x64, self-contained application
Ubuntu 22.04.3 this (second) time, Ubuntu 18.04.6 the first time
Running as a systemd service
The machine has 1TB RAM, but only ~36% of it was used at the time of the crash.
Other information
The only reference I can find to this error is in #80580
The last comment there includes code to reproduce it by creating many dynamic assemblies, but that code runs successfully for me on the same machine (as well as other machines), both on .NET 8 and .NET 6. We do create some dynamic assemblies via CSharpCompilation.Emit(), but I don't think we create thousands of them (and even if we did, I would not expect .NET to crash like that).