
Emit AVX-512 vector instructions #8264

Closed
itsamelambda opened this issue May 31, 2017 · 11 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), avx512 (Related to the AVX-512 architecture), enhancement (Product code improvement that does NOT require public API changes/additions), optimization, tenet-performance (Performance related issue)

Comments

@itsamelambda

With the announcement of Skylake-X, AVX-512 is going mainstream.

The CLR should emit AVX-512 vector instructions that System.Numerics.Vector can use.

category:cq
theme:vector-codegen
skill-level:expert
cost:extra-large
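Why JIT support matters for `System.Numerics.Vector<T>` in particular: code written against it is width-agnostic, because the number of elements per vector (`Vector<T>.Count`) is fixed by the runtime for the machine it runs on, not by the source. A minimal sketch of such a loop (the method name `SumVectorized` is illustrative, not from this thread):

```csharp
using System;
using System.Numerics;

// Sums a float span with System.Numerics.Vector<T>. The number of lanes per
// step (Vector<float>.Count) is chosen by the runtime for the current machine,
// not by this source code.
static float SumVectorized(ReadOnlySpan<float> values)
{
    int width = Vector<float>.Count;
    Vector<float> acc = Vector<float>.Zero;
    int i = 0;

    // Main loop: add one full vector of elements per iteration.
    for (; i <= values.Length - width; i += width)
    {
        acc += new Vector<float>(values.Slice(i, width));
    }

    // Reduce the lanes of the accumulator to a single float.
    float sum = Vector.Dot(acc, Vector<float>.One);

    // Scalar tail for elements that do not fill a whole vector.
    for (; i < values.Length; i++)
    {
        sum += values[i];
    }
    return sum;
}

Console.WriteLine($"Vector<float>.Count = {Vector<float>.Count}");
Console.WriteLine(SumVectorized(new float[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }));
```

With SSE-only code generation `Vector<float>.Count` is 4 and with AVX2 it is 8; a 512-bit `Vector<T>` would make it 16 without any change to the loop.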

@jkotas
Member

jkotas commented May 31, 2017

cc @CarolEidt

@RussKeldorph
Contributor

@dotnet/jit-contrib

@oscarbg

oscarbg commented Apr 11, 2018

News on this?

@CarolEidt
Contributor

This is likely to be a large work item, and I don't know how high this will land on the priority list for future work.
Much of the work will be up-front investigation work, and that could be particularly amenable to community contribution:

  • What are the VM requirements? That is, how does the VM communicate with the various OSes, not just to identify whether the OS supports the feature, but also to determine how the extended register context is (or should be) saved?
  • What are the ABI implications? That is, how are 512-bit vectors passed in the standard calling convention (on each OS), and how do they affect a potential future implementation of the vector calling convention?
  • Do the AVX-512 instructions have characteristics that must be modeled by the JIT and that are not exposed by the existing hardware intrinsics?
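For context on the first question: in later runtimes (the sketch assumes .NET 8 or later, which added the `Avx512F` and `Vector512` types) the outcome of the CPU and OS checks is surfaced to managed code as `IsSupported`/`IsHardwareAccelerated` properties, which the JIT treats as constants. A minimal sketch that only reports what the current machine and runtime expose:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Each value reflects the runtime's own decision about the feature, made at
// startup from CPU support and OS register-state support (plus any config
// switches). Requires .NET 8 or later for Avx512F and Vector512.
bool avx512f = Avx512F.IsSupported;
bool vector512Accelerated = Vector512.IsHardwareAccelerated;
int vectorBytes = Vector<byte>.Count; // width of System.Numerics.Vector<T> in bytes

Console.WriteLine($"Avx512F.IsSupported:             {avx512f}");
Console.WriteLine($"Vector512.IsHardwareAccelerated: {vector512Accelerated}");
Console.WriteLine($"Vector<byte>.Count:              {vectorBytes}");
```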

@zhongkaifu

Can you please prioritize it? Our project depends heavily on System.Numerics.Vector, and we are using Intel(R) Xeon(R) Platinum 8168 CPUs (Skylake).

@tannergooding
Member

@zhongkaifu, do you have any numbers on how much of a performance increase AVX-512 is (both for a specific workload, and for applications as a whole)?

@zhongkaifu

@tannergooding Our existing code gets a 2x performance increase going from 128-bit to 256-bit AVX vectors. We cannot test AVX-512, because System.Numerics.Vector doesn't support it yet.

@tannergooding
Member

@zhongkaifu, it may be worth getting some experimental numbers using native code. Having some information showing that this improves your overall scenario would help to prioritize the work appropriately.

As with many new ISAs or alternative algorithms, using AVX-512 isn't always a clear-cut perf gain, and benchmarking/profiling is important. Depending on the processor, workload, etc., AVX-512 instructions can temporarily reduce the processor's clock frequency and hurt the overall performance of the process (or of other processes).

A simple search will turn up blog posts from various consumers and technical sheets from Intel that describe both the benefits and the drawbacks of AVX-512.

You might want to see the following from Cloudflare: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

and this spec update from Intel (see erratum 24, among others): https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

@zhongkaifu

Thanks @tannergooding. I've read these articles, and they are really helpful; I wasn't aware of this problem before. I may use MKL to run some tests and figure out how much gain we can get.

@EgorBo
Member

EgorBo commented Jan 3, 2019

https://godbolt.org/z/bX3h2h
I ported a piece of HashCode.cs to C and let clang use AVX-512; it emitted `vprold` instead of `vpsrld`/`vpslld`/`vpor` 🙂, so I tried vectorizing it in C#:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// seedVec (the initial per-lane state) is assumed to be defined in the enclosing
// class, together with Prime1, Prime2 and MixFinal as in HashCode.cs.
// Requires AVX2 (for the variable shifts) and SSE4.1.
public static int Combine(Vector128<uint> values)
{
    Vector128<uint> hash = seedVec;

    hash = Sse2.Add(hash, Sse41.MultiplyLow(values, Vector128.Create(Prime2)));

    // these three instructions (vpslld, vpsrld, vpor) could be a single `vprold` with AVX-512
    hash = Sse2.Or(
        Sse2.ShiftLeftLogical(hash, 13),
        Sse2.ShiftRightLogical(hash, 19));

    hash = Sse41.MultiplyLow(hash, Vector128.Create(Prime1));

    // same here, but with per-lane rotate counts: a single `vprolvd`
    hash = Sse2.Or(
        Avx2.ShiftLeftLogicalVariable(hash, Vector128.Create(1u, 7u, 12u, 18u)),
        Avx2.ShiftRightLogicalVariable(hash, Vector128.Create(31u, 25u, 20u, 14u)));

    // horizontal sum and add 16 to the result
    var hashAsInt32 = hash.AsInt32();
    hashAsInt32 = Ssse3.HorizontalAdd(hashAsInt32, hashAsInt32);
    hashAsInt32 = Ssse3.HorizontalAdd(hashAsInt32, hashAsInt32);
    var sum16 = Sse41.Extract(hashAsInt32.AsUInt32(), 0) + 16;

    return (int)MixFinal(sum16);
}
```
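Both commented spots in the snippet above are 32-bit lane rotations: `(x << 13) | (x >> 19)` rotates left by 13, and the variable shifts rotate the four lanes by 1, 7, 12 and 18. A minimal sketch, assuming .NET 8 or later (where the `Avx512F.VL` intrinsics exist), of how the rotations could be written so that AVX-512VL hardware gets a single `vprold`/`vprolvd`, with shift-and-or fallbacks elsewhere. The helper names `RotateLeft13` and `RotateLeftPerLane` are hypothetical:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Rotate every 32-bit lane left by 13 bits: (x << 13) | (x >> 19).
static Vector128<uint> RotateLeft13(Vector128<uint> v)
{
    if (Avx512F.VL.IsSupported)
    {
        // One VPROLD instruction (AVX-512F + AVX-512VL).
        return Avx512F.VL.RotateLeft(v, 13);
    }
    // Portable fallback: shift left, shift right, OR.
    return Vector128.ShiftLeft(v, 13) | Vector128.ShiftRightLogical(v, 19);
}

// Rotate each lane left by its own count (counts expected in 0..31).
static Vector128<uint> RotateLeftPerLane(Vector128<uint> v, Vector128<uint> counts)
{
    if (Avx512F.VL.IsSupported)
    {
        // One VPROLVD instruction.
        return Avx512F.VL.RotateLeftVariable(v, counts);
    }
    if (Avx2.IsSupported)
    {
        // VPSLLVD + VPSRLVD + VPOR. A right-shift count of 32 (when a count
        // is 0) produces 0 in that lane, so the lane is still just v.
        Vector128<uint> rightCounts = Vector128.Create(32u) - counts;
        return Avx2.ShiftLeftLogicalVariable(v, counts)
             | Avx2.ShiftRightLogicalVariable(v, rightCounts);
    }
    // Scalar fallback for hardware without variable vector shifts.
    return Vector128.Create(
        BitOperations.RotateLeft(v.GetElement(0), (int)counts.GetElement(0)),
        BitOperations.RotateLeft(v.GetElement(1), (int)counts.GetElement(1)),
        BitOperations.RotateLeft(v.GetElement(2), (int)counts.GetElement(2)),
        BitOperations.RotateLeft(v.GetElement(3), (int)counts.GetElement(3)));
}

var input = Vector128.Create(0x80000001u, 0x12345678u, 1u, 0xFFFFFFFFu);
Console.WriteLine(RotateLeft13(input));
Console.WriteLine(RotateLeftPerLane(input, Vector128.Create(1u, 7u, 12u, 18u)));
```

Because `IsSupported` checks are JIT-time constants, only one branch of each helper survives in the generated code for a given machine.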

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label Oct 13, 2022
@BruceForstall BruceForstall removed arch-x86 arch-x64 JitUntriaged CLR JIT issues needing additional triage labels Oct 26, 2022
@BruceForstall
Member

I'm going to close this in favor of #77034

@ghost ghost locked as resolved and limited conversation to collaborators Nov 30, 2022

10 participants