
Optimize WithLower, WithUpper, Create, AsInt64, AsUInt64, AsDouble with ARM64 hardware intrinsics #37139

Merged
merged 12 commits into dotnet:master
Jun 3, 2020

Conversation

@kunalspathak (Member) commented May 29, 2020:

Optimizes the following APIs using hardware intrinsics: WithLower, WithUpper, Create, AsInt64, AsUInt64 and AsDouble.

Update:

After talking to Tanner and Egor, we decided not to optimize the Insert+Extract combination in the JIT, but to do it in a separate PR when we implement the InsertSelectedScalar intrinsic. So currently, WithUpper, WithLower and Create are optimized in managed code. Here is the code we generate after this change:

withlower_after_nojit.txt
withupper_after_nojit.txt
create_after_nojit.txt

Contributes to #33308 and #33496.
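
For reference, a minimal sketch of the managed-code approach this describes, assuming the AdvSimd.Insert + GetElement pattern (a reconstruction consistent with the umov/ins pair shown later in this thread, not the exact PR source):

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;

    static class WithUpperSketch
    {
        // Reconstruction: WithUpper expressed in managed code via AdvSimd.Insert.
        public static Vector128<ulong> WithUpper(Vector128<ulong> vector, Vector64<ulong> value)
        {
            if (AdvSimd.IsSupported)
            {
                // Read the scalar out of 'value' (umov x0, v17.d[0]) and
                // insert it into lane 1 of 'vector' (ins v16.d[1], x0).
                return AdvSimd.Insert(vector, 1, value.GetElement(0));
            }

            // The real code falls back to a software implementation here.
            throw new PlatformNotSupportedException();
        }
    }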

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 29, 2020
@kunalspathak (Member, Author):

@echesakovMSFT , @tannergooding , @TamarChristinaArm , @BruceForstall

@TamarChristinaArm (Contributor):

Thanks @kunalspathak!

These look great! For the .AsXX variants you've only posted the Vector64 variants; I assume the Vector128 ones are the same except that they use a q register?

As a general question, these .AsXX are reinterpret casts, right? I.e. the bits are just reinterpreted, so why is any code generated at all? I'm guessing in this case it's because live ranges aren't tracked across BBs?

Also, the Create variants have some unneeded moves:

        0EB01E10          mov     v16.8b, v16.8b
        6E180630          mov     v16.d[1], v17.d[0]
        4EB01E00          mov     v0.16b, v16.16b

The first one is a no-op, and the last one could be avoided by allocating to v0 in the first place. But that looks like a general issue, judging from the dumps.

Unrelated question to your change, but why does it allocate so much stack space?

A9BD7BFD stp fp, lr, [sp,#-48]! seems to allocate 48 bytes but only stores 16.

There's also a weird gap between the stores:

        FD0017A0          str     d0, [fp,#40]
        FD000FA1          str     d1, [fp,#24]

Could it be thinking that internally all vectors are 16 bytes?

If they were placed next to each other you could optimize these with stp and ldp; to be exact, your code in the second BB could be a single ldr q0, ... if the stores are ordered correctly, and you wouldn't need the inserts. (Though tracking the live ranges would fix all of this.)

@@ -937,13 +937,23 @@ static Vector128<ulong> SoftwareFallback(ulong e0, ulong e1)
/// <returns>A new <see cref="Vector128{Byte}" /> initialized from <paramref name="lower" /> and <paramref name="upper" />.</returns>
public static unsafe Vector128<byte> Create(Vector64<byte> lower, Vector64<byte> upper)
Member:

These methods should be aggressively inlined. It might also be good to treat them as intrinsic and create the appropriate nodes in importation or lowering, so they don't impact inlining of other methods due to increased IL size or the additional locals introduced.
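
For example, aggressive inlining here just means annotating the helpers (an illustrative sketch; CombineHalves is a hypothetical name, not an API from the PR):

    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;

    static class InliningSketch
    {
        // AggressiveInlining asks the JIT to inline even when the combined
        // intrinsic + software-fallback body exceeds the usual IL-size budget.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static Vector128<ulong> CombineHalves(Vector64<ulong> lower, Vector64<ulong> upper)
            => lower.ToVector128Unsafe().WithUpper(upper);
    }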

Member Author:

I changed the implementation to import to the appropriate instructions. Curious: what is the preferable way to do these things, importation or lowering, and what are the advantages of each? One advantage I see in doing it early is that the nodes pass through other optimizations (if applicable).

@tannergooding (Member):

These look great! For the .AsXX variants you've only posted the Vector64 variants; I assume the Vector128 ones are the same except that they use a q register?

That is actually the codegen for the fallback case (not relevant to ARM64 except for indirect calls, like via a delegate) and is due to the fallback using return Unsafe.As<Vector64<T>, Vector64<U>>(ref vector). In the intrinsic case these are fully elided in importation, so they don't even create nodes and can't generate additional code.
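
In other words, the fallback body is just a bit reinterpretation. A minimal sketch of that pattern (illustrative, using AsInt64 on Vector64<double> as the example):

    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;

    static class FallbackSketch
    {
        // Software fallback shape: the same 64 bits, reinterpreted as a new
        // element type. No instruction is needed for this on the intrinsic
        // path, which is why the importer can elide it entirely.
        public static Vector64<long> AsInt64(Vector64<double> vector)
            => Unsafe.As<Vector64<double>, Vector64<long>>(ref vector);
    }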

Code review threads (resolved) on: src/coreclr/src/jit/gentree.cpp, src/coreclr/src/jit/hwintrinsic.cpp, src/coreclr/src/jit/hwintrinsiccodegenarm64.cpp, src/coreclr/src/jit/lowerarmarch.cpp, and src/coreclr/src/jit/codegenlinear.cpp.

@echesakov (Contributor) left a review:

Changes look good

@kunalspathak kunalspathak merged commit 9c5d406 into dotnet:master Jun 3, 2020
@kunalspathak (Member, Author):

Sorry @TamarChristinaArm, I forgot to reply.

Thanks @kunalspathak!

These look great! For the .AsXX variants you've only posted the Vector64 variants; I assume the Vector128 ones are the same except that they use a q register?

Correct. We already optimize Vector128.AsXX() while this PR optimizes Vector64.AsXX().

As a general question, these .AsXX are reinterpret casts, right? I.e. the bits are just reinterpreted, so why is any code generated at all? I'm guessing in this case it's because live ranges aren't tracked across BBs?

I believe so. There is an overall problem where arguments are pushed to the stack and retrieved; related: #35635. Perhaps @CarolEidt might know the exact reasons.

Also, the Create variants have some unneeded moves:

        0EB01E10          mov     v16.8b, v16.8b
        6E180630          mov     v16.d[1], v17.d[0]
        4EB01E00          mov     v0.16b, v16.16b

The first one is a no-op, and the last one could be avoided by allocating to v0 in the first place. But that looks like a general issue, judging from the dumps.

This has changed in this PR. See my updated comments in the PR description.

        4E083E20          umov    x0, v17.d[0]
        4E181C10          ins     v16.d[1], x0
        4EB01E00          mov     v0.16b, v16.16b

Unrelated question to your change, but why does it allocate so much stack space?

A9BD7BFD stp fp, lr, [sp,#-48]! seems to allocate 48 bytes but only stores 16.

There's also a weird gap between the stores:

        FD0017A0          str     d0, [fp,#40]
        FD000FA1          str     d1, [fp,#24]

Could it be thinking that internally all vectors are 16 bytes?

If they were placed next to each other you could optimize these with stp and ldp; to be exact, your code in the second BB could be a single ldr q0, ... if the stores are ordered correctly, and you wouldn't need the inserts. (Though tracking the live ranges would fix all of this.)

Looks like genTotalFrameSize in most cases returns 48. @CarolEidt, can you confirm why there is a gap?

@CarolEidt (Contributor):

Looks like genTotalFrameSize in most cases returns 48. @CarolEidt, can you confirm why there is a gap?

It certainly seems as if the assignment of frame locations is allocating 16 bytes for the 8-byte vectors. It would take some investigation to figure out why.

@CarolEidt (Contributor):

I think the problem is that Compiler::getSIMDTypeAlignment, which is called by Compiler::lvaAllocLocalAndSetVirtualOffset, always returns 16 for TARGET_ARM64.

@TamarChristinaArm (Contributor):

Thanks @kunalspathak!

This has changed in this PR. See my updated comments in PR description.

        4E083E20          umov    x0, v17.d[0]
        4E181C10          ins     v16.d[1], x0
        4EB01E00          mov     v0.16b, v16.16b

Hmm, why is it moving between register files now, though? I would have expected the same mov v16.d[1], v17.d[0] as before.

@kunalspathak (Member, Author):

Hmm, why is it moving between register files now, though? I would have expected the same mov v16.d[1], v17.d[0] as before.

Yes. Initially I was doing that optimization, generating mov dstReg[index1], srcReg[index2], in the JIT, but we decided to hold off for now and instead do it once we implement InsertSelectedScalar (hopefully sometime soon).
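
For the curious, a hypothetical sketch of what that might look like once InsertSelectedScalar is exposed (the exact overload shape is an assumption on my part):

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;

    static class InsertSelectedScalarSketch
    {
        // Hypothetical: a lane-to-lane insert should map to a single
        // 'mov v16.d[1], v17.d[0]' instead of the umov/ins pair.
        public static Vector128<ulong> WithUpper(Vector128<ulong> vector, Vector64<ulong> value)
        {
            if (AdvSimd.Arm64.IsSupported)
            {
                return AdvSimd.Arm64.InsertSelectedScalar(
                    vector, 1, value.ToVector128Unsafe(), 0);
            }
            throw new PlatformNotSupportedException();
        }
    }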

@kunalspathak (Member, Author):

Opened #37429 to track the alignment questions that @TamarChristinaArm had.

@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020