[Experiment] Exception handling performance #77568

Closed · 9 tasks done
janvorli opened this issue Oct 27, 2022 · 8 comments

Comments

@janvorli
Member

janvorli commented Oct 27, 2022

Historically, we have considered exception handling performance not to be important. The point was that exceptions should be used for handling exceptional cases and thus should not have a significant impact on well behaved applications. However, a new scenario where this is not true has surfaced recently. When a service depends on a resource that is inherently unreliable, for example a computer network, random temporary failures in such a resource can trigger a failure storm in the service and potentially in other services that depend on it. In this case, the performance of exception handling is very likely an important factor for fast recovery of these services.
Moreover, since such cases would most likely involve async patterns, the exception handling performance issues are amplified by the fact that exceptions are rethrown on async state transitions.
A customer has reported that Java exception handling is about 10 times faster on the same hardware (#12892). We also know that native AOT has much better exception handling performance than CoreCLR.
Goals

  • Measure performance of exception handling in various scenarios
    • Software / hardware exceptions
    • Sync / async code
    • Windows / Unix
    • CoreCLR / native AOT
  • Profile the CoreCLR scenarios and determine if there are opportunities for improvements within the confines of the current implementation.
  • Investigate what it would take to leverage the native AOT exception handling architecture in CoreCLR.
  • Estimate the cost of potential changes.
  • Write and publish a learning document on the found details and suggested improvements.
@janvorli janvorli added this to the 8.0.0 milestone Oct 27, 2022
@janvorli janvorli self-assigned this Oct 27, 2022
@todor-dk

todor-dk commented Feb 13, 2023

Very good initiative.

Quite some years ago I tried to implement a Smalltalk dialect on the .NET DLR. One of the issues is that control flow in a language like Smalltalk requires many non-local returns and stack unwinds. The way to implement this is with exceptions, but performance dies. The IronRuby implementors worked around this by implementing lightweight exceptions in their language, but those are not really exceptions, just return values.
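
To illustrate the pattern (a minimal C# sketch of the general technique, not the actual DLR or IronRuby mechanism): a non-local return can be implemented by throwing a dedicated exception that carries the return value and is caught at the method boundary, which makes every such return pay the full cost of a .NET throw.

using System;

class NonLocalReturn : Exception
{
    public readonly object Value;
    public NonLocalReturn(object value) => Value = value;
}

static class BlockRunner
{
    // Runs a Smalltalk-style block; a NonLocalReturn thrown anywhere inside
    // the block unwinds any intermediate frames back to this boundary.
    public static object Run(Func<object> block)
    {
        try
        {
            return block();
        }
        catch (NonLocalReturn ret)
        {
            return ret.Value; // the "return value" smuggled via the exception
        }
    }
}

// Usage: BlockRunner.Run(() => throw new NonLocalReturn(42)) yields 42,
// but only after paying for stack trace capture and stack unwinding.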

Anyway, if I may put my two cents in, there should be a way to throw exceptions without:

  • Overhead of capturing the stack trace.
  • Overhead of reporting Windows Events.
  • Other diagnostic or unnecessary allocations.

One option would be to categorize exceptions into two categories:

  1. Real exceptions.
  2. Operational exceptions.

The first are real errors - nothing can be done there. The second are just the abnormal way for a piece of code to end execution, e.g. a ParseException or EndOfDataException.

Another option is to let the caller decide how it wants to handle the exception. For example:

try
{
    throw new NoDataException();
}
quickcatch (NoDataException ex) // hypothetical syntax for a lightweight catch
{
    // No stack trace in ex.StackTrace
    MessageBox.Show("No data.");
}
catch(Exception)
{
     // ...
}

During first-chance exception handling, the runtime could examine whether the handler cares about the full exception details or, as in the example above, just wants to do a quick catch.

And if we are on the subject of exceptions, I know this is a major thing, but please consider resumable exceptions. Resumable exceptions can be thrown, but a first-chance exception handler can examine the exception and decide to continue with the next statement without unwinding the stack.

@janvorli
Member Author

janvorli commented Mar 3, 2023

Exception handling performance

Hypothesis

Historically, we have considered exception handling performance not to be important. The point was that exceptions should be used for handling exceptional cases and thus should not have a significant impact on well behaved applications. However, a new scenario where this is not true has surfaced recently. When a service depends on a resource that is inherently unreliable, for example a computer network, random temporary failures in such a resource can trigger a failure storm in the service and potentially in other services that depend on it. In this case, the performance of exception handling is very likely an important factor for fast recovery of these services.
Moreover, since such cases would most likely involve async patterns, the exception handling performance issues are amplified by the fact that exceptions are rethrown on async state transitions.
We know that NativeAOT has much better exception handling performance than CoreCLR.

Goals

  1. Measure performance of exception handling in various scenarios
    • Software / hardware exceptions
    • Sync / async code
    • Windows / Unix
    • CoreCLR / NativeAOT
  2. Profile the CoreCLR scenarios and determine if there are opportunities for improvements within the confines of the current implementation.
  3. Investigate what it would take to leverage the NativeAOT exception handling architecture in CoreCLR.
  4. Estimate the cost of these changes.

Testing methodology

To test performance in all the scenarios and enable easy profiling, a standalone application was created. This application is parametrized by the various execution modes: software / hardware exceptions, sync / async code, single and multi-threaded execution of the exception handling, and the depth of the stack the exception is propagated through. A C++ version was also created to allow a broader comparison. The C++ version supports almost all the modes mentioned above, except for async code and hardware exceptions.
The testing application executes 1,000,000 iterations of an exception being thrown and caught over the specified number of stack frames. The time to execute those iterations is measured and reported. In async mode, we consider a stack “frame” to be the application’s code frame, not taking into account the internal implementation of the async machinery. For the multi-threaded mode, a number of threads equal to the number of physical CPU cores is used to run the same exception handling in parallel. The execution time was measured per thread and then averaged.
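The core of the measurement is just a timed throw / catch loop. A minimal single-threaded sketch of the idea follows (this is not the actual test application, which also covers the hardware, async and multi-threaded modes):

using System;
using System.Diagnostics;

static class EhBenchmark
{
    static void Thrower(int depth)
    {
        if (depth == 0)
            throw new Exception("test");
        Thrower(depth - 1); // add one stack frame per level of recursion
    }

    static long Measure(int iterations, int frameDepth)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            try
            {
                Thrower(frameDepth);
            }
            catch (Exception)
            {
                // Intentionally empty: we measure throw + unwind + catch.
            }
        }
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        // 1,000,000 iterations propagated over a 10-frame-deep stack.
        Console.WriteLine(Measure(1_000_000, 10));
    }
}
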
The application was executed in all combinations of the modes mentioned above for 0, 1, 2, 10 and 100 frames of stack depth between the throw and the catch. The hardware used was a Core i9-9900X CPU running at 3.5GHz. 10 threads were used to execute the multi-threaded scenarios, since the processor has 10 physical cores. For testing on Linux, Ubuntu 20.04 installed in WSL2 on the same machine was originally used, to enable a 1:1 comparison of the performance results. Early in the process, it was discovered by accident that the performance of exception handling in the multi-threaded scenarios had improved significantly on Ubuntu 22.04, so the testing switched to that version instead.
After all the measurements were done, it became obvious that async performance is roughly equal to the sync performance of exception handling over a single frame multiplied by the number of frames. That is not surprising, because in async code the exception is caught and rethrown at every async call boundary. So it was decided to leave the async results out of the further comparisons and reports.
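For illustration, the per-boundary rethrow can be seen in a trivial async chain (a minimal sketch; the method names are made up for the example):

using System;
using System.Threading.Tasks;

static class AsyncRethrowDemo
{
    static async Task Level3()
    {
        await Task.Yield();          // force a real async transition
        throw new Exception("test");
    }

    static async Task Level2() => await Level3(); // caught and rethrown here...
    static async Task Level1() => await Level2(); // ...and again here

    static async Task Main()
    {
        try { await Level1(); }
        catch (Exception)
        {
            // Total cost is roughly (single-frame sync cost) x (async frames).
        }
    }
}
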
To get additional support for reasoning about the measured values, a sampling profiler was used to measure the .NET scenarios. The details are captured in the profiling section of this document.

Performance measurements

Absolute performance

The measured values are captured in the graphs below. The horizontal axis shows the number of frames of the stack the exception was propagated through, the vertical axis shows the time in milliseconds it took to execute the 1,000,000 iterations.
Please note that the Windows graphs use hollow “dots” on the graph so that they can be easily distinguished from the Linux ones.
The two graphs show just the measured results, the first one for the single-threaded scenario, the other for the multi-threaded one.

Single-threaded

From the single-threaded scenario graph, it is obvious that on Linux, both CoreCLR and NativeAOT perform much worse than on Windows. Also, Windows NativeAOT seems to scale much better with the stack depth than Windows CoreCLR, but on Linux, the NativeAOT and CoreCLR curves are more or less parallel, with a slope similar to that of Windows CoreCLR.
SingleThreadedSync

Multi-threaded

The multi-threaded scenario is quite similar to the single-threaded one, except for Linux NativeAOT. Except for the case of throwing and catching in the same frame (depth 0), CoreCLR is actually better than NativeAOT and also scales much better. Windows and Linux CoreCLR perform in about the same way.
MultiThreadedSync

Relative performance

This section compares the measured results shown in the previous section for various scenarios, like multi-threaded vs single-threaded, CoreCLR vs NativeAOT, Linux vs Windows and hardware vs software exceptions.

Multi vs single-threaded

The graph below shows a comparison between the multi-threaded and single-threaded scenarios. The vertical axis shows the ratio between the multi-threaded and single-threaded measured times. Just as a reminder, the multi-threaded times are measured per thread and averaged. In an ideal world, the single-threaded and multi-threaded numbers would be the same, but in reality they differ, which shows the scalability of the exception handling.
Linux and Windows CoreCLR are in a similar ballpark, but the NativeAOT scenarios behave quite differently on Windows and Linux. On Linux, NativeAOT basically doesn’t scale – it is 10 times slower than the single-threaded case. As mentioned before, it was running on 10 threads on a machine with 10 physical cores.
On Windows, NativeAOT scales badly for small stack depths, but it improves with growing stack depth; at the measurement point of 10 frames, it is already the best of all the .NET scenarios.
MultiVsSingleThreaded

CoreCLR vs NativeAOT

The following graph compares CoreCLR with NativeAOT. You can see that on Linux, NativeAOT is about 4 times faster than CoreCLR in the multi-threaded scenario when the exception is thrown and caught in the same frame, but with growing stack depth, they quickly converge to the same performance. In the single-threaded scenario, the curve is similar, only the initial difference is much larger (CoreCLR being 14 times slower than NativeAOT), and CoreCLR is still about twice as slow at the 10-frame depth.
On Windows, the behavior is different. In the single-threaded case, NativeAOT is always about 4 times faster than CoreCLR, independent of the stack depth. In the multi-threaded case, NativeAOT starts at about twice as fast as CoreCLR, but it becomes better and better with growing stack depth.
CoreCLRVsNativeAOT

Linux vs Windows

The following two graphs compare Linux and Windows performance for the same scenarios. The vertical axis represents the ratio between the Linux and Windows execution times.
Regarding the single-threaded performance, NativeAOT on Linux is better than on Windows when the throw and catch are in the same frame, but with growing depth of the stack to unwind, it quickly becomes worse, approaching 4 times worse performance. CoreCLR is different, starting at twice as slow and then gradually getting slightly better. Both indicate worse stack unwinding performance of libunwind compared to the Windows-style unwinder.
LinuxVsWindowsSingleThreadedSync
The multi-threaded case is similar to the single-threaded one, except that for NativeAOT, the difference is much more pronounced. That most likely means that in addition to unwinding with libunwind being slower than with the Windows-style unwinder, we may have a scaling issue in the NativeAOT stack unwinding code. Also, the C++ version’s performance difference being much smaller on Linux than NativeAOT’s may indicate that the LLVM libunwind that NativeAOT uses for stack unwinding on Linux performs worse than the unwinder in the C++ runtime (glibc).
LinuxVsWindowsMultiThreadedSync

Hardware vs Software Exceptions

The following two graphs compare performance of software and hardware exception handling. For the hardware exception handling, a null reference exception generated by an access to a property of a null string was used.
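The hardware-exception test case boils down to something like the following (a minimal sketch): the null dereference raises a CPU access violation that the runtime translates into a managed NullReferenceException.

using System;

static class NullDerefDemo
{
    static void Main()
    {
        string s = null;
        try
        {
            _ = s.Length; // hardware exception: the OS reports an access violation
        }
        catch (NullReferenceException)
        {
            // the runtime converted the hardware fault into a managed exception
        }
    }
}
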
In the single-threaded scenario, you can see that in Linux CoreCLR, hardware exception handling is faster by 20-30% for small stack depths and becomes the same at 100 frames of stack depth. That is most likely because hardware exception handling doesn’t use any C++ exception handling, in contrast to software exception handling, which throws PAL_SEHException to initialize the exception handling. At higher stack depths, the price of the C++ handling gets amortized, so the software and hardware exception performance converges.
For Windows CoreCLR, hardware exception handling performs worse than software exception handling. For small stack depths, it is about twice as slow, but with a growing number of frames it converges to the same performance. That makes sense, as it shows the overhead of executing the vectored exception handler. The larger the number of frames the exception needs to propagate through, the more that cost gets amortized.
The Linux NativeAOT behaves very similarly except for the case of a throw and catch within the same frame (stack depth being zero).
An outlier is Windows NativeAOT, where for a throw and catch within the same frame, hardware exception handling is 4 times slower than software exception handling. It gets better with a growing number of frames the exception propagates through, but it still ends up 20% slower than the software case at 100 frames of stack depth.
The outlier case was profiled and diffed using PerfView; the details are in the profiling section later in this document.
HardwareVsSoftwareSingleThreaded
For the multi-threaded scenarios, Linux CoreCLR and Windows CoreCLR behave roughly the same way as in the single-threaded ones. Linux NativeAOT is actually better, with hardware exception handling performance about the same as software exception handling. The difference between Windows NativeAOT hardware and software exception handling performance is much smaller than in the single-threaded case.
HardwareVsSoftwareMultiThreaded

Profiling

After reviewing the measured results, a sampling profiler was used to profile the single and multi-threaded scenarios where the difference between Linux and Windows or CoreCLR and NativeAOT was significant. A depth of 10 frames was used for the profiling. On Windows, PerfView was used to capture the profile; on Linux, the perf tool served the same purpose.
The profiling revealed some interesting details, described in the following paragraphs.

Linux CoreCLR – single-threaded

These are the top exclusive samples for the single-threaded synchronous case with an exception propagation depth of 0 frames:

Overhead Shared Object Symbol
+ 8,29% libgcc_s.so.1 [.] _Unwind_Find_FDE
+ 2,51% libc.so.6 [.] __memmove_sse2_unaligned_erms
+ 2,14% ld-linux-x86-64.so.2 [.] _dl_find_object
+ 1,94% libstdc++.so.6.0.30 [.] __gxx_personality_v0
+ 1,85% libgcc_s.so.1 [.] 0x00000000000157eb
+ 1,77% libc.so.6 [.] __memset_sse2_unaligned_erms
+ 1,36% ld-linux-x86-64.so.2 [.] __tls_get_addr
+ 1,28% libcoreclr.so [.] ExceptionTracker::ProcessManagedCallFrame
+ 1,26% libcoreclr.so [.] apply_reg_state
+ 1,12% libcoreclr.so [.] OOPStackUnwinderAMD64::UnwindPrologue
+ 1,08% libgcc_s.so.1 [.] 0x0000000000016990
+ 1,08% libcoreclr.so [.] ExceptionTracker::ProcessOSExceptionNotification

It is obvious that a lot of time (~13%) is spent in the C++ exception handling (_Unwind_Find_FDE, _dl_find_object, __gxx_personality_v0 and the unrecognized 0x00000000000157eb). When a software managed exception is thrown, there are actually two C++ throws / catches before we start propagating the managed exception through the managed code. We first throw PAL_SEHException, then we catch it only to call an exception filter and rethrow it again. The catch / rethrow emulates the Windows SEH exception filter capability, which doesn’t exist in C++ exception handling. This filter is used to replace the source address of the exception in the exception record, because the RaiseException used to throw the PAL_SEHException fills that address with the address from which RaiseException was called, while we need the managed code exception source address there. But it is possible to do better on Unix: we can add an overload of RaiseException with an extra argument specifying the exception source address. That way, one throw / catch can be eliminated.
The other catch catches the rethrown exception, because unlike with SEH exception handling on Windows, C++ exceptions cannot propagate through our managed code. The reason is that the C++ unwinder in the C++ runtime has no way to accept the extra information necessary for unwinding our generated code. So we need to catch the exception in the IL_Throw helper and then call our managed exception handling code. Obviously, we could also get rid of this catch if we didn’t throw the PAL_SEHException in the first place and invoked our managed exception handling code directly instead.
It was implemented this way because it made the implementation mostly uniform with Windows and we didn’t consider exception handling performance to be that important at that time.
There is also about 4.5% coming from memmove and memset. These stem from clearing and copying the Windows CONTEXT data structure during exception handling. It seems this might be lowered by copying only a part of the CONTEXT structure, namely just the non-volatile registers, in some cases.
Since the change to remove the filtering was a very simple one, it was made during the experiment to see how much impact it would have. It turned out to result in a significant improvement in exception handling performance, as you can see from the following graph.
SingleThreadedSyncModified
A change to remove the RaiseException from IL_Throw completely is more involved, so it is left as one of the enhancements to do as a follow-up to this experiment. The C++ exception handling code is still at the top of the exclusive samples even after the previous change, so another nice boost in performance is expected.

Linux CoreCLR – multi-threaded

In the multi-threaded perf trace, there is another significant contributor besides the C++ exception handling. This is a trace for the multi-threaded synchronous case with an exception propagation depth of 10 frames:

Overhead Shared Object Symbol
+ 20,47% libcoreclr.so [.] ExecutionManager::FindCodeRangeWithLock
+ 11,06% libcoreclr.so [.] ExecutionManager::IsManagedCodeWithLock
+ 5,08% libcoreclr.so [.] ExecutionManager::FindCodeRange
+ 3,52% libc.so.6 [.] __memmove_sse2_unaligned_erms
+ 2,51% libgcc_s.so.1 [.] _Unwind_Find_FDE
+ 2,44% libc.so.6 [.] __memset_sse2_unaligned_erms
+ 1,68% libcoreclr.so [.] ExceptionTracker::ProcessManagedCallFrame
+ 1,64% libcoreclr.so [.] SpinLock::AcquireLock
+ 1,55% libcoreclr.so [.] ExceptionTracker::ProcessOSExceptionNotification
+ 1,47% ld-linux-x86-64.so.2 [.] __tls_get_addr
+ 1,30% libcoreclr.so [.] OOPStackUnwinderAMD64::UnwindPrologue
+ 1,29% libcoreclr.so [.] GcInfoDecoder::GcInfoDecoder
+ 1,22% libcoreclr.so [.] ExecutionManager::IsManagedCodeWorker
+ 1,16% libcoreclr.so [.] HashMap::LookupValue
+ 1,12% libcoreclr.so [.] OOPStackUnwinderAMD64::VirtualUnwind
+ 1,09% libcoreclr.so [.] ProcessCLRException
+ 1,04% libcoreclr.so [.] EECodeInfo::Init
+ 0,92% libcoreclr.so [.] ExceptionTracker::GetOrCreateTracker
+ 0,88% libcoreclr.so [.] EEJitManager::FindMethodCode
+ 0,84% libcoreclr.so [.] IJitManager::GetFuncletStartAddress
+ 0,78% libcoreclr.so [.] ExceptionTracker::HandleNestedExceptionEscape
+ 0,65% libstdc++.so.6.0.30 [.] __gxx_personality_v0
+ 0,63% ld-linux-x86-64.so.2 [.] _dl_find_object
+ 0,59% libcoreclr.so [.] StackTraceInfo::SaveStackTrace
+ 0,53% libgcc_s.so.1 [.] 0x00000000000157eb
+ 0,52% libcoreclr.so [.] EEJitManager::JitCodeToMethodInfo

You can see ExecutionManager::FindCodeRangeWithLock, ExecutionManager::IsManagedCodeWithLock and ExecutionManager::FindCodeRange taking 36.61% of the time. These functions look up information on managed methods during stack unwinding.
This is yet another opportunity for performance improvement. Since async code often involves multiple threads, it is expected to help async exception handling quite a bit.

Linux NativeAOT – single-threaded

The perf trace shows that around 69% of the time for the case of an exception propagation stack depth of 10 frames is spent in the libunwind library unwinding the stack frames. This is expected: during exception handling, most of the time should be spent doing the actual unwinding and handling, and the topmost contributors are quite simple functions. The memset is also a non-trivial contributor here, but it is used by libunwind internally to clear a number of relatively small data structures.
There is also pthread_mutex_lock/unlock contributing about 3.5%, but all the callers of these are in the dl_iterate_phdr Linux API that walks the list of shared libraries in the current process. This API is in turn used by libunwind.
Overall, there doesn’t seem to be an opportunity for improvement here other than a possible replacement of the LLVM libunwind.
Here is the top of the perf trace:

Overhead Shared Object Symbol
+ 18,40% ehperf [.] libunwind::LocalAddressSpace::getEncodedP
+ 18,17% ehperf [.] libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions
+ 9,70% ehperf [.] libunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE
+ 7,98% ehperf [.] libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseCIE
+ 6,63% ehperf [.] libunwind::DwarfInstructions<libunwind::LocalAddressSpace, Registers_REGDISPLAY>::stepWithDwarf
+ 4,51% ehperf [.] libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE
+ 3,52% libc.so.6 [.] __memset_sse2_unaligned_erms
+ 2,44% ehperf [.] libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::R
+ 2,20% libc.so.6 [.] __GI___dl_iterate_phdr
+ 1,79% libc.so.6 [.] pthread_mutex_unlock@@GLIBC_2.2.5
+ 1,72% libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
+ 1,60% ehperf [.] StackFrameIterator::NextInternal
+ 1,45% ehperf [.] libunwind::DwarfInstructions<libunwind::LocalAddressSpace, Registers_REGDISPLAY>::getSavedRegister
+ 0,99% [kernel.kallsyms] [k] __default_send_IPI_dest_field
0,95% ehperf [.] UnwindHelpers::StepFrame
+ 0,83% ehperf [.] UnixNativeCodeManager::UnwindStackFrame
+ 0,70% ehperf [.] UnixNativeCodeManager::IsFunclet
0,67% ehperf [.] __unw_init_local
+ 0,64% ehperf [.] UnixNativeCodeManager::FindMethodInfo
0,59% ehperf [.] libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::R
+ 0,53% libc.so.6 [.] __memmove_sse2_unaligned_erms
+ 0,52% ehperf [.] S_P_CoreLib_System_Exception__AppendExceptionStackFrame
+ 0,51% ehperf [.] __unw_getcontext
0,51% ehperf [.] FindProcInfo

Linux NativeAOT – multi-threaded

In the multi-threaded scenario, pthread_mutex_lock/unlock and the related kernel futex / spinlock machinery are the dominant contributors to the time, consuming more than 28% of it. And as in the single-threaded scenario, all of this usage comes from the dl_iterate_phdr Linux API.
The results of this scenario in the absolute performance graphs show that the performance gets worse with growing stack depth much faster than on the other platforms / runtimes. That means the cost of unwinding a single stack frame is the highest of all the cases. The single-threaded scenario, though, shows a completely different picture, where the NativeAOT and CoreCLR per-frame unwind times grow by about the same amount with a growing number of frames. That seems to confirm that the locks are the culprit here.
But since the locks are inside the libunwind code, there doesn’t seem to be an opportunity for improvement other than a possible replacement of the LLVM libunwind.

Overhead Shared Object Symbol
+ 11,09% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath.part.0
+ 6,51% ehperf [.] libunwind::LocalAddressSpace::getEncodedP
+ 6,12% libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5
+ 5,09% ehperf [.] libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseFDEInstructions
+ 3,89% [kernel.kallsyms] [k] __entry_text_start
+ 3,67% libc.so.6 [.] pthread_mutex_unlock@@GLIBC_2.2.5
+ 3,26% [kernel.kallsyms] [k] futex_wake
+ 3,20% ehperf [.] libunwind::EHHeaderParser<libunwind::LocalAddressSpace>::findFDE
+ 2,81% libc.so.6 [.] __GI___lll_lock_wait
+ 2,67% [kernel.kallsyms] [k] futex_wait_setup
+ 2,65% [kernel.kallsyms] [k] entry_SYSCALL_64_safe_stack
+ 2,51% ehperf [.] libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseCIE
+ 2,23% libc.so.6 [.] __GI___dl_iterate_phdr
+ 1,89% ehperf [.] libunwind::DwarfInstructions<libunwind::LocalAddressSpace, Registers_REGDISPLAY>::stepWithDwarf
+ 1,84% [kernel.kallsyms] [k] syscall_return_via_sysret
+ 1,81% [kernel.kallsyms] [k] __get_user_nocheck_4
+ 1,71% ehperf [.] libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE
+ 1,20% libc.so.6 [.] __memset_sse2_unaligned
+ 1,16% libc.so.6 [.] __memset_sse2_unaligned_erms
+ 1,14% ehperf [.] libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getInfoFromDwarfSection
+ 1,08% [kernel.kallsyms] [k] psi_group_change
+ 1,08% ehperf [.] StackFrameIterator::NextInternal
+ 0,95% [kernel.kallsyms] [k] _raw_spin_lock
+ 0,95% [kernel.kallsyms] [k] hash_futex
+ 0,88% ld-linux-x86-64.so.2 [.] _dl_tls_get_addr_soft
+ 0,72% [kernel.kallsyms] [k] futex_wait
+ 0,67% [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 0,64% [kernel.kallsyms] [k] try_to_wake_up
+ 0,56% [kernel.kallsyms] [k] __schedule
+ 0,53% [kernel.kallsyms] [k] finish_task_switch.isra.0
+ 0,52% ehperf [.] UnixNativeCodeManager::UnwindStackFrame
0,51% ehperf [.] libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::setInfoBasedOnIPRegister
+ 0,51% [kernel.kallsyms] [k] _raw_spin_lock_irqsave

Windows CoreCLR – single-threaded

The raw performance trace shows that 10.4% of the time is spent in the ntdll!RtlAcquireSRWLockShared and ntdll!RtlReleaseSRWLockShared functions on Windows. These guard the function table lookup, so they were folded in the PerfView tool to get a higher-level picture.
The resulting trace is shown below. It shows that 44% of the time is spent in OS unwinding and function lookup. This is something we could only improve if we stopped using Windows SEH for managed exception handling and moved to a different scheme, e.g. the one NativeAOT uses.
However, there might be opportunities for smaller improvements even with the current exception handling. 2.2% of the time is spent in coreclr!ExecutionManager::FindCodeRangeWithLock, which was identified as a bottleneck on Linux. And as can be seen from the trace, there are many other contributors in the range of 1..2.5% that together sum to 40% of the exclusive time. So chances are that by scraping off small bits here and there we might be able to get a visible performance improvement.

Name Exc % Exc Inc % Inc
ntdll!RtlpxLookupFunctionTable 11.4 4,525 11.4 4,525
ntdll!RtlpUnwindPrologue 11.2 4,441 11.2 4,441
ntdll!RtlLookupFunctionEntry 7.2 2,857 28.4 11,271
ntdll!RtlpxVirtualUnwind 6.5 2,579 17.7 7,020
ntdll!RtlpLookupDynamicFunctionEntry 3.6 1,425 9.8 3,889
coreclr!EEJitManager::JitCodeToMethodInfo 2.9 1,167 2.9 1,167
ntdll!RtlVirtualUnwind 2.9 1,137 17.9 7,099
ntoskrnl!EtwpWriteUserEvent 2.5 990 4.3 1,708
coreclr!ExceptionTracker::ProcessManagedCallFrame 2.4 941 18.7 7,405
coreclr!ProcessCLRException 2.4 938 93.3 36,969
ntdll!LdrpDispatchUserCallTarget 2.2 871 2.2 871
coreclr!ExecutionManager::FindCodeRangeWithLock 2.2 868 2.2 868
coreclr!memset 2.0 793 2.0 793
coreclr!ExceptionTracker::ProcessOSExceptionNotification 1.9 742 31.9 12,622
coreclr!SString::Replace 1.8 720 1.8 720
ntoskrnl!EtwpReserveTraceBuffer 1.8 718 1.8 718
coreclr!FillRegDisplay 1.8 709 1.8 709
ntdll!NtTraceEvent 1.7 673 7.1 2,803
coreclr!GetRuntimeFunctionCallback 1.7 660 6.1 2,404
coreclr!__InternalCxxFrameHandler<__FrameHandler4> 1.6 615 1.7 660
coreclr!EECodeInfo::Init 1.5 582 6.7 2,643
coreclr!StackFrameIterator::NextRaw 1.4 554 3.3 1,304
coreclr!GcInfoDecoder::GcInfoDecoder 1.4 554 1.4 554
coreclr!ETW::SamplingLog::SaveCurrentStack 1.2 484 41.6 16,495
coreclr!StackTraceArray::Append 1.2 474 1.2 474
coreclr!Thread::VirtualUnwindCallFrame 1.2 473 43.7 17,316
coreclr!PrettyPrintType 1.1 429 1.1 429
ntoskrnl!NtTraceEvent 1.1 424 5.4 2,132
coreclr!Thread::VirtualUnwindNonLeafCallFrame 1.1 417 19.5 7,715
coreclr!ExceptionTracker::InitializeCrawlFrame 1.0 415 4.5 1,787
ucrtbase!__stdio_common_vsnprintf_s 1.0 411 1.0 411
ntdll!RtlUnwindEx 1.0 402 68.9 27,282
coreclr!StackTraceInfo::SaveStackTrace 1.0 377 2.1 851

Windows CoreCLR – multi-threaded

As in the single-threaded scenarios, a non-trivial amount of time is spent in the ntdll!RtlAcquireSRWLockShared and ntdll!RtlReleaseSRWLockShared functions on Windows, and this time also in ntdll!RtlBackoff. These three sum to 65.3% exclusive! So as in the single-threaded scenario, they were folded in the PerfView tool to get a higher-level picture.
The following perf trace shows that 74.6% of the time is spent in unwinding and related functions in Windows. As for the rest, we can see 2.1% of the time spent in coreclr!ExecutionManager::FindCodeRangeWithLock, as in the single-threaded case. But there is also 3% of the time spent in coreclr!MethodDesc::GetFullMethodInfo, which formats a method signature. Looking at this method’s source in PerfView, it seems we are burning 40% of that time in the CQuickBytes constructor, which boils down to zeroing a ~512B buffer, then about 25% in converting UTF-8 to 16-bit chars and 25% in getting the method signature data. It seems we may be able to improve this method.
But overall, we could only improve the performance significantly if we stopped using Windows SEH for managed exception handling and moved to a different scheme, e.g. the one NativeAOT uses.

Name Exc % Exc Inc % Inc
ntdll!RtlpxLookupFunctionTable 62.7 534,210 62.7 534,252
ntdll!RtlpLookupDynamicFunctionEntry 4.7 40,256 7.2 61,220
coreclr!MethodDesc::GetFullMethodInfo 3.0 25,254 3.0 25,254
coreclr!ExceptionTracker::ProcessOSExceptionNotification 2.4 20,072 28.4 241,759
ntdll!RtlpUnwindPrologue 2.1 17,748 2.1 17,748
coreclr!ExecutionManager::FindCodeRangeWithLock 2.1 17,528 2.1 17,529
ntdll!RtlLookupFunctionEntry 1.9 16,128 71.7 611,602
ntoskrnl!RtlpUnwindPrologue 1.8 15,489 1.8 15,489
ntoskrnl!RtlpLookupFunctionEntryForStackWalks 1.8 15,486 1.8 15,486
ntdll!RtlpxVirtualUnwind 1.4 12,135 3.5 29,885
ntoskrnl!EtwpWriteUserEvent 1.3 11,254 1.3 11,254
coreclr!StackTraceInfo::SaveStackTrace 1.2 10,271 1.2 10,272
coreclr!EECodeInfo::Init 1.2 9,948 3.2 27,478
coreclr!ProcessCLRException 1.1 9,541 91.9 783,179
coreclr!Thread::StackWalkFramesEx 1.1 9,125 1.1 9,125

Hardware vs Software exceptions on Windows NativeAOT

The graph comparing hardware and software exception performance in the performance measurements section has shown that the Windows NativeAOT case is an outlier with the biggest difference. So PerfView was used to capture the single-threaded case with 0 frames of depth for both the hardware and software cases, and the traces were then diffed. Here is the result:

Name Exc % Exc Inc % Inc
OS <<ntoskrnl!KiPageFault>> 57.1 881 57.1 881
OS <<ntoskrnl!KiSystemServiceCopyEnd>> 23.5 363 23.5 363
OS <<ntdll!KiUserExceptionDispatch>> 14.9 230 15.1 233
module ehperf <<ehperf!RhpThrowHwEx>> 7.1 109 30.7 474

This shows that the biggest portion of the difference is caused by the page fault handling of the null reference and the related OS exception dispatching. It is interesting to note that there is no such big difference in Windows CoreCLR. That is most likely because software exception handling performance is much better on NativeAOT, so the added overhead of the null reference handling in the OS is more pronounced.

Leveraging NativeAOT exception handling in CoreCLR

The investigations made in this experiment have confirmed that NativeAOT exception handling is superior to CoreCLR’s, except for the Linux multi-threaded case. That proved that attempting to make use of the NativeAOT way of exception handling in CoreCLR would be worth trying, especially on Windows.
So another experiment, to assess feasibility, complexity and potential performance improvements, was performed. To enable a better understanding of what was done as part of the experiment, let’s explain the differences between the NativeAOT and CoreCLR exception handling from a high-level point of view.
The CoreCLR exception handling is based on Windows SEH (structured exception handling) on Windows and on a rough emulation of it on Unix. Let’s leave the Unix version aside and describe the Windows way. The exception handling, including the related stack unwinding and the lookup of unwinding information for all functions on the unwound part of the stack, is performed inside the Windows OS. Our runtime gets called back by the OS on each managed code frame so that it can check whether the specific frame is going to handle the exception, invoke the finally blocks along the way, and invoke the catch block that handles the exception. The exception handling is a two-pass process, which means the frames are processed twice. The first pass looks for the handling frame and also builds the exception stack trace. The second pass goes again through all the frames up to the handling frame and then invokes the catch block; this time, the frames are unwound and their stack space reclaimed.
Since the SEH also works for C++ frames and some assembler helper frames, it works transparently even for cases when managed frames are interleaved with native frames.
Because, as mentioned above, SEH is executed by the OS, and the investigations have shown that the majority of the time during exception handling is spent in OS code, we cannot do much to improve its performance.
On Unix, the exception handling uses an emulation of Windows SEH for the managed code frames and standard C++ handling for native code frames. At the boundaries between these two worlds, it switches between the two exception handling mechanisms. To stay on the safe side, this propagation of exceptions through managed / native boundaries is only allowed for the runtime’s native code, not for any user code.

NativeAOT exception handling, on the contrary, has its main part implemented fully in managed code in the runtime. Only the stack walking is implemented in native code and uses the Windows APIs / Unix libunwind for moving from frame to frame. In NativeAOT, managed exceptions never cross managed / native boundaries, so it only needs to process managed frames. In addition to being managed, the exception handling code is very simple and easy to reason about; it consists of only about 300 lines of code. This managed code uses a couple of simple helpers written in native code to iterate over the stack frames and to invoke the finally blocks, catch blocks and filters. Like SEH, it uses two-pass exception handling.
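
Conceptually, the managed two-pass dispatch can be pictured like this (a toy sketch only; the type and member names are invented here, and the real NativeAOT code is structured differently):

using System;
using System.Collections.Generic;

class Frame
{
    public Action RunFinally;              // finally block, if any
    public Func<Exception, bool> CanCatch; // catch clause / filter, if any
    public Action<Exception> RunCatch;
}

static class TwoPassDispatch
{
    // Frames are ordered from the throw site towards the caller.
    public static void Dispatch(Exception ex, List<Frame> frames)
    {
        // Pass 1: locate the handling frame without unwinding anything.
        // (This is also where the real code builds the stack trace and
        // runs exception filters.)
        int handler = frames.FindIndex(f => f.CanCatch?.Invoke(ex) == true);
        if (handler < 0)
            Environment.FailFast("unhandled exception", ex);

        // Pass 2: revisit the frames up to the handler, running finally
        // blocks; the real code also unwinds each frame, reclaiming its
        // stack space, and finally invokes the catch funclet.
        for (int i = 0; i < handler; i++)
            frames[i].RunFinally?.Invoke();
        frames[handler].RunCatch?.Invoke(ex);
    }
}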

Now that you have a rough understanding of how the two ways of exception handling work, let’s discuss the experiment of using the NativeAOT way of exception handling in CoreCLR. As mentioned above, the NativeAOT exception handling code has two major parts - the managed code and the native code helpers. The idea was to take the managed code as is, implement the native helpers in CoreCLR using the stack walker that is already there (used mostly for GC stack walks and a few other things), and then see whether there are any missing pieces and whether it can be made to work at least to some extent. This plan was executed and it went very well. A managed application containing a copy of the NativeAOT exception handling code was created. Then the native helpers were added to coreclr.dll and the calls to them in the exception handling code were changed to plain pinvokes. Finally, a managed "Throw" method was added that captures the current processor state (using an additional native helper) and invokes the NativeAOT exception handling code. To test it, a simple piece of testing code was added that throws an exception over a bunch of frames using the Throw method, catches it, and prints a message in the catch and after the catch. This testing code was then used to debug the implementation. Except for a few minor rough edges, it went smoothly, and it took only two days from start to end to make it work.
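The shape of the experiment’s entry point was roughly the following (a schematic sketch; the helper and method names here are placeholders, not the actual identifiers used in the experiment):

using System;
using System.Runtime.InteropServices;

static class ExperimentalThrow
{
    // Hypothetical native helper added to coreclr.dll that captures the
    // current processor state (registers, stack and instruction pointers).
    [DllImport("coreclr", EntryPoint = "EHExp_CaptureContext")]
    static extern IntPtr CaptureContext();

    // Used by the test code instead of the C# "throw" keyword.
    public static void Throw(Exception ex)
    {
        IntPtr context = CaptureContext();
        // Hand off to the NativeAOT exception handling code ported into
        // the managed application, which walks the stack and invokes
        // funclets through further pinvoke helpers.
        DispatchException(context, ex);
    }

    static void DispatchException(IntPtr context, Exception ex)
    {
        throw ex; // placeholder body so that the sketch compiles
    }
}
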
That was very encouraging, so as a next step, the code of the testing application used during the whole exception handling experiment was modified so that instead of using the "throw" keyword, it used the "Throw" method described above. Then the testing app was executed for single-threaded and multi-threaded synchronous exception handling with the same set of exception propagation stack depths. The results comparing the performance to regular CoreCLR and NativeAOT can be seen in the graphs below.
ExperimentalCoreCLRSingleThreaded
ExperimentalCoreCLRMultiThreaded
The results clearly show that it would definitely be beneficial to move to the NativeAOT way of exception handling in CoreCLR. Besides the performance benefits, the code unification would also be nice. There are obviously several things to do to get from the experimental port of the code to a final state: for example, hardware exception handling, exception rethrowing, debugger support, exception filters, finally block invocation, ensuring proper coordination with the GC, replacing the IL_Throw native helper invocation with a call to the managed exception handling code, interoperability with native exceptions, etc.

Summary

This section summarizes the findings scattered over the previous paragraphs.

  • The Linux CoreCLR software exception handling performance can be improved significantly by modifying the IL_Throw helper not to use C++ exception handling. There are two C++ exception throws / catches before we start to unwind the managed exception. One of the throw / catch pairs was trivial to remove, and that was already done during the experiments. The other one will require a somewhat more involved, but still relatively simple, change.
  • Linux NativeAOT has the worst multi-threaded performance of all the Windows / Linux and CoreCLR / NativeAOT combinations for exceptions propagated over two or more frames. Comparison with the single-threaded performance shows that the most likely culprits are locks in the LLVM libunwind. It would be interesting to try to replace the LLVM libunwind with the other libunwind that we use in CoreCLR, or even to consider developing a simplistic DWARF unwinder that would support only the limited subset of unwind instructions that our JIT generates.
  • In Linux CoreCLR, the lookup of information on managed functions in ExecutionManager::FindCodeRangeWithLock, ExecutionManager::IsManagedCodeWithLock and ExecutionManager::FindCodeRange takes over 36% of the total time. Investigations have shown that the lookup used by these methods is a linear search through a linked list. David Wrighton mentioned to me recently that he has been working on improving this code significantly.
  • The async exception handling performance measurements have shown that it basically boils down to the synchronous exception handling performance of a throw and catch over a single frame, multiplied by the number of async frames unwound. Effectively, we rethrow the exception at each async boundary and catch it before leaving the async block. Asynchronous code also involves multiple threads, so any improvements in multi-threaded exception handling will directly translate into improvements in async exception handling performance.
    In addition to these incremental improvements, Stephen Toub has come up with an idea to change the code generated for each async await that is not in a user-written try block. We would essentially extend the awaiter pattern with an additional method to get the exception, if any. Then, instead of throwing the exception and catching it right away, we would just add the current frame to the exception’s stack trace and pass the exception object on. There are some slight concerns, since it would make the non-exceptional code path a bit more complex, but the effect may be negligible. This would require changes only in Roslyn and the async scheduling machinery; no changes in the exception handling itself would be necessary. Stephen is planning to do an experiment in this area; a sketch of the idea follows this list.
  • Linux CoreCLR performance traces have shown that memcpy / memset take about 4.5-6% of the performance trace. These calls clear / copy the Windows-style CONTEXT, which is about 1kB in size. It seems we could copy just the non-volatile registers (at least in some of the cases) and save a couple of percent here.
  • The machine used to run the test scenarios had 10 physical cores with hyper-threading enabled, so Environment.ProcessorCount reports 20. It was discovered that running 20 threads with EH in parallel in the multi-threaded scenario made the measured times per thread more than two times slower than running on 10 threads in parallel. After discovering that, all the multi-threaded measurements were redone with 10 threads. Since the multi-threaded scenarios were spending a lot of time in locks, it seems that hyper-threading performs badly under lock contention. While this is orthogonal to the exception handling experiment, it would be worth investigating in detail in our scalability experiments.
  • Leveraging NativeAOT exception handling in CoreCLR seems to be feasible and beneficial, based on the quick testing port. Besides the performance gain, the benefit of unification of the exception handling code between NativeAOT and CoreCLR would also be nice.
  • The upper estimate of the cost of leveraging NativeAOT exception handling in CoreCLR is 12 dev weeks. This includes both the implementation and testing.
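
To make the awaiter idea from the async bullet above more concrete, an extended awaiter could look roughly like this (entirely hypothetical; the interface and method names are invented for illustration, and the real design would come out of Stephen’s experiment):

using System;
using System.Runtime.CompilerServices;

// A hypothetical extension of the awaiter pattern: the generated state
// machine could ask the awaiter for the exception instead of having
// GetResult() throw it.
public interface IExceptionAwareAwaiter<T> : ICriticalNotifyCompletion
{
    bool IsCompleted { get; }
    Exception TryGetException(); // returns null on successful completion
    T GetResult();
}

// The compiler-generated continuation for an await that is not inside a
// user-written try block could then do, in pseudocode:
//
//   Exception ex = awaiter.TryGetException();
//   if (ex != null)
//   {
//       AppendCurrentFrameToStackTrace(ex); // keep the stack trace complete
//       builder.SetException(ex);           // pass the exception object on
//       return;                             // no throw / catch at this boundary
//   }
//   var result = awaiter.GetResult();       // non-exceptional path unchanged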

@filipnavara
Member

filipnavara commented Mar 3, 2023

Thanks for the report. Regarding the dl_iterate_phdr bottleneck, can you check whether llvm-libunwind is built with the LIBUNWIND_USE_FRAME_HEADER_CACHE option or not? Caching the results could affect the performance quite significantly.

(Quick search seems to suggest that the option is not used, so adding it to src/coreclr/nativeaot/Runtime/CMakeLists.txt and re-running the multithreaded Linux AOT test would be interesting.)

I tested the LIBUNWIND_USE_FRAME_HEADER_CACHE option and it did not produce a throughput improvement.

@filipnavara
Member

filipnavara commented Mar 4, 2023

I did an experiment implementing a trivial cache for findUnwindSections during unwinding. The idea is that if the address is in the same text segment as the last frame, we can reuse the last unwind sections and skip dl_iterate_phdr. In a multi-threaded scenario on my machine (Ryzen 7950X), the throughput went from ~19,500 exceptions per second (eps) to ~35,000 eps. That’s nearly an 80% improvement.

Now, the trivial implementation was intentionally a bit too trivial: I used a global variable with a lock. That still creates a bottleneck of sorts, and it ignores shared library unloading. That said, if we assume that the code on the stack won’t be unloaded during exception unwinding, the potential for improvement is quite significant for a rather trivial change. It doesn’t even need any modification in llvm-libunwind itself.

(Perhaps we can just look up the unwind sections in UnixCodeManager once and reuse them everywhere we know the address is managed code.)

@filipnavara
Member

I have rewritten the findUnwindSections experiment above, and have now measured it in the correct build configuration.

The throughput is now ~145,000 exceptions per second, so roughly 7x faster.

Diff: https://github.com/dotnet/runtime/compare/main...filipnavara:runtime:cache_unwind_sections?expand=1
Test code: https://gist.github.com/filipnavara/9dca9d78bf2a768a9512afe9233d4cbe (compiled and published as -c Debug)

@En3Tho
Contributor

En3Tho commented Mar 5, 2023

I wonder if there can be some kind of "pgo"-like exception handling. Like locating the most exception-frequent paths and somehow optimizing those even further?

@wasabii

wasabii commented Apr 24, 2023

I'm the current maintainer of IKVM. This would be lovely. ;)

Stack walking performance in .NET has always been pretty terrible. But, given the nature of IKVM, I've noticed it a lot lately. Enough that I googled about it and found this issue.

I don't know anything about the internals of the CLR on this. But on the JVM, it tends to be so efficient that it's used for numerous tasks, even security checks to determine the calling path, etc.

@mangod9
Member

mangod9 commented Jul 3, 2024

Closing since this has been completed.

@mangod9 mangod9 closed this as completed Jul 3, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 3, 2024