
[NativeAOT] Cache location of unwind sections #82994

Merged: 4 commits into dotnet:main on Mar 7, 2023

Conversation

filipnavara
Member

@filipnavara filipnavara commented Mar 5, 2023

Abstract

In issue #77568, exception handling performance was tested in various scenarios. For Linux AOT, a bottleneck was identified in the findUnwindSections method. Specifically, in the multi-threaded scenario there is a significant performance penalty due to the use of the dl_iterate_phdr API, which internally takes a lock.

A simple observation is that nearly all the frames we try to unwind belong to compiled managed code, which always uses the same unwind table. We can cache the value upfront and avoid all the lookups entirely.

Another side-effect of this is that it also helps the code paths that do thread hijacking during GC, and it potentially avoids some locks in those code paths.

Implementation

The implementation moves the FindProcInfo and VirtualUnwind methods into UnixCodeManager, where the UnwindInfoSections value is cached.

The llvm-libunwind API offers two ways to inject cached information about the unwind sections. It can be done through a custom AddressSpace class implementation, which has the benefit that the high-level C++ API can be reused by switching a single template parameter. Alternatively, the low-level C++ API can be used directly, with the information passed to it. Since the unwinding code already used the low-level API in most cases, I opted for that route.

Testing

Test code

using System.Diagnostics;
using System.Runtime.CompilerServices;

internal class Program
{
    static int exceptionsHandled = 0;

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CallMe(int i)
    {
        if (i == 0)
        {
            throw new NotImplementedException();
        }

        CallMe(i - 1);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void CatchMe()
    {
        try
        {
            CallMe(100);
        }
        catch (NotImplementedException)
        {
            Interlocked.Increment(ref exceptionsHandled);
        }
    }

    private static void ThreadEntrypoint()
    {
        while (true)
        {
            CatchMe();
        }
    }

    private static void Main(string[] args)
    {
        int savedExceptionsHandled = 0;
        for (int i = 0; i < 10; i++)
        {
            new Thread(ThreadEntrypoint).Start();
        }
        Thread.Sleep(5000);
        savedExceptionsHandled = exceptionsHandled;
        Console.WriteLine($"Exceptions per second: {savedExceptionsHandled / 5}");
        Environment.Exit(0);
    }
}

The test code was injected into an empty application created with dotnet new console and then compiled with dotnet publish -p:PublishAot=true -r linux-x64 -c Debug.

My test configuration is a Ryzen 7950X machine running Ubuntu 22.04.2 LTS in the Windows Subsystem for Linux. The baseline is .NET 8 Preview 1, where I get ~19,500 exceptions per second. With this PR I get around 145,000 exceptions per second, more than 7 times the baseline throughput.

I also briefly tested on a MacBook Air M1 in the osx-arm64 configuration. Throughput with this PR is about 1.78 times the .NET 8 Preview 1 baseline.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 5, 2023
@ghost

ghost commented Mar 5, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

TBD: Just testing build on different configurations...

Author: filipnavara
Assignees: -
Labels:

community-contribution, area-NativeAOT-coreclr

Milestone: -

@filipnavara filipnavara marked this pull request as ready for review March 5, 2023 14:59
@janvorli
Member

janvorli commented Mar 6, 2023

@filipnavara the result is awesome. I have run the tests I used in my analysis with this change. Originally, Linux NativeAOT was clearly not scaling at all; now the multi-threaded performance is only about 10% worse than the single-threaded one.

Member

@janvorli janvorli left a comment


LGTM, thank you!

@VSadov
Member

VSadov commented Mar 6, 2023

This can improve GC root reporting too as that performs stack walks and in server GC case does it on multiple threads.

@VSadov
Member

VSadov commented Mar 6, 2023

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@VSadov VSadov left a comment


Very nice! Thanks!!

@VSadov VSadov merged commit 013ca67 into dotnet:main Mar 7, 2023
@filipnavara filipnavara deleted the cache_unwind_sections branch March 7, 2023 06:37
@marek-safar marek-safar changed the title [NativeAOT] Experiment: Cache location of unwind sections [NativeAOT] Cache location of unwind sections Mar 9, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Apr 8, 2023