
Optimize WithUpper/WithLower with InsertSelectedScalar, SpanHelpers.Sequence APIs #38075

Merged · 2 commits into dotnet:master on Jun 24, 2020

Conversation

kunalspathak
Member

As mentioned in #37139, use AdvSimd.Arm64.InsertSelectedScalar() to better optimize WithUpper() and WithLower(), and hence Vector128.Create(Vector64, Vector64).

Also added a note in SpanHelpers.SequenceEqual() and SpanHelpers.SequenceCompareTo() explaining that we are not optimizing them with ARM64 intrinsics for now, because doing so gives similar (sometimes smaller) wins compared to the existing vectorized versions.
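For context, a minimal sketch of the WithUpper idea, assuming AdvSimd.Arm64 support (a hypothetical standalone helper, not the exact code in this PR): inserting the single 64-bit lane of the Vector64 argument into the upper lane of the Vector128 maps to one INS instruction via InsertSelectedScalar.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class WithUpperSketch
{
    // Assumes AdvSimd.Arm64.IsSupported; emits INS Vd.D[1], Vn.D[0], i.e. copies
    // lane 0 of 'upper' into lane 1 of 'vector' without going through memory.
    public static Vector128<ulong> WithUpperViaIns(Vector128<ulong> vector, Vector64<ulong> upper)
    {
        return AdvSimd.Arm64.InsertSelectedScalar(vector, 1, upper.ToVector128Unsafe(), 0);
    }
}
```

WithLower is the symmetric case (insert into lane 0), and Vector128.Create(Vector64, Vector64) benefits because it is built on top of WithUpper.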

@kunalspathak
Member Author

kunalspathak commented Jun 18, 2020

@TamarChristinaArm
Contributor

Thanks @kunalspathak, the WithUpper and WithLower changes look good. I assume again that the redundant move goes away once we no longer spill the inputs at the first BB end.

I am however surprised that you don't see a performance gain with SpanHelpers.Sequence for two reasons:

  1. For spans longer than 31 bytes you can use ldp instead of ldr which has lower latency than 2x ldr.

  2. For non-full vectors you can use unaligned accesses. What does

// Do final compare as Vector<byte>.Count from end rather than start
if (LoadVector(ref first, lengthToExamine) == LoadVector(ref second, lengthToExamine))

return when lengthToExamine is 5? Does it issue five 1-byte loads, or a 4-byte and a 1-byte load?
If the original length was 21 bytes, for instance, the optimal sequence would have been 16 + 8: issue an unaligned read and re-read 3 bytes from the previous load.
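To illustrate that 16 + 8 scheme for a 21-byte span, here is a hedged sketch as a standalone helper (hypothetical name, not SpanHelpers code; it assumes both spans are exactly 21 bytes):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class UnalignedTailSketch
{
    // For 21-byte inputs: one 16-byte vector compare covers bytes 0..15, and one
    // unaligned 8-byte compare covers bytes 13..20, re-reading 3 already-checked bytes.
    public static bool Equals21(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        Vector128<byte> headA = MemoryMarshal.Read<Vector128<byte>>(a);
        Vector128<byte> headB = MemoryMarshal.Read<Vector128<byte>>(b);
        ulong tailA = MemoryMarshal.Read<ulong>(a.Slice(a.Length - sizeof(ulong)));
        ulong tailB = MemoryMarshal.Read<ulong>(b.Slice(b.Length - sizeof(ulong)));
        return headA.Equals(headB) && tailA == tailB;
    }
}
```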

@kunalspathak
Member Author

Thanks for taking a look @TamarChristinaArm

I assume again that the redundant move goes away once we no longer spill the inputs at the first BB end.

Yes.

I am however surprised that you don't see a performance gain with SpanHelpers.Sequence

I think you meant SequenceEqual.

  1. For spans longer than 31 bytes you can use ldp instead of ldr which has lower latency than 2x ldr.

Currently we don't combine two ldr instructions from consecutive memory locations into an ldp. I have opened #35132 and #35130 to track it. By the way, just in case you didn't see it, we have an epic issue #35853 to track all the ARM64 optimization opportunities. You might want to check the issues in the peephole optimization category, in case you have any input.

return when lengthToExamine is 5? Does it issue five 1-byte loads, or a 4-byte and a 1-byte load?

For an original length of 21 bytes, after the first 16-byte comparison completes, it does another 16-byte load starting at (length - 16), i.e. bytes 5..20, for the final comparison (if that's what you wanted to ask). This is true for the existing implementation as well as for my attempt with Vector128. You can see my implementation of SequenceEqual here and let me know if we can do better. Another reason for not seeing wins is that SequenceEqual used to be pre-compiled with crossgen, but once I add ARM64 intrinsics it stops being crossgen'd and is always JITted. I have opened #38060 to make sure that method bodies with ARM64 intrinsics get crossgen'd. When this change was applied, I didn't see much win either. See the generated code; it is almost identical except for the MinAcross vs. MinPairwise usage.
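To make that concrete, a hedged sketch of such a final overlapping compare, assuming AdvSimd.Arm64 support (hypothetical helper, not the exact SpanHelpers code):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class FinalCompareSketch
{
    // Re-read the last 16 bytes starting at (length - 16); for a 21-byte input
    // this covers bytes 5..20, overlapping bytes that were already compared.
    public static bool TailEquals(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        int start = a.Length - Vector128<byte>.Count;
        Vector128<byte> va = MemoryMarshal.Read<Vector128<byte>>(a.Slice(start));
        Vector128<byte> vb = MemoryMarshal.Read<Vector128<byte>>(b.Slice(start));

        // "All lanes equal" check: the pairwise minimum of the comparison mask is
        // all-ones only if every byte matched (the uminp + umov + cmn #1 pattern).
        Vector128<byte> mask = AdvSimd.CompareEqual(va, vb);
        return AdvSimd.Arm64.MinPairwise(mask, mask).AsUInt64().ToScalar() == ulong.MaxValue;
    }
}
```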

Regarding SequenceCompareTo, you can see my implementation here. I used MinPairwise to quickly check whether a 16-byte chunk matched. However, on a mismatch inside the loop, or for the final comparison, I had to use CompareEqual, etc. to find the matching length. I was surprised that this is slower for cases where the mismatch is in the first 7~8 bytes, where sequential comparison is faster. Sequential comparison does memory loads (at most 15 of them), while we don't need that many when ARM64 intrinsics are used. That should make the ARM64 version faster, but unfortunately I don't see that in the perf numbers. Let me know if you have any suggestions.
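For reference, one possible way to recover the matching length from a 16-byte chunk once a mismatch is detected, shown as a hedged sketch (hypothetical helper; not necessarily what the linked implementation does):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class MatchLengthSketch
{
    // Returns how many leading bytes of the two vectors are equal (0..16).
    public static int MatchingBytes(Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<byte> mask = AdvSimd.CompareEqual(left, right); // 0xFF where equal, 0x00 where not
        ulong lo = mask.AsUInt64().ToScalar();
        ulong hi = mask.AsUInt64().GetElement(1);

        if (lo != ulong.MaxValue)
        {
            return BitOperations.TrailingZeroCount(~lo) >> 3;   // first mismatch is in bytes 0..7
        }
        return 8 + (BitOperations.TrailingZeroCount(~hi) >> 3); // mismatch (if any) is in bytes 8..15
    }
}
```

This extra CompareEqual/extract work on a mismatch is one plausible reason the intrinsic path can lose to sequential comparison when the mismatch sits in the first few bytes.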

See the assembly output below; the code size with my changes also increased.

Here are the perf numbers:

| Method | MisMatchIndex | Before | After | Diff |
|-------------------|------:|-------:|-------:|-----:|
| SequenceCompareTo | -1 | 131.48 | 123.62 | -6% |
| SequenceCompareTo | 0 | 17.2 | 23.52 | 37% |
| SequenceCompareTo | 496 | 63.9 | 70.98 | 11% |
| SequenceCompareTo | 497 | 63.06 | 73.15 | 16% |
| SequenceCompareTo | 498 | 64.37 | 73.34 | 14% |
| SequenceCompareTo | 499 | 65.54 | 71.23 | 9% |
| SequenceCompareTo | 500 | 67.58 | 70.8 | 5% |
| SequenceCompareTo | 501 | 67.86 | 74.63 | 10% |
| SequenceCompareTo | 502 | 69 | 73.12 | 6% |
| SequenceCompareTo | 503 | 71.99 | 73.07 | 2% |
| SequenceCompareTo | 504 | 71.44 | 71.37 | 0% |
| SequenceCompareTo | 505 | 72.65 | 71.31 | -2% |
| SequenceCompareTo | 506 | 73.8 | 70.86 | -4% |
| SequenceCompareTo | 507 | 74.81 | 71.22 | -5% |
| SequenceCompareTo | 508 | 76.01 | 71.3 | -6% |
| SequenceCompareTo | 509 | 78.69 | 71.26 | -9% |
| SequenceCompareTo | 510 | 80.22 | 71.28 | -11% |
| SequenceCompareTo | 511 | 81.02 | 71.42 | -12% |

Benchmark code:

// Note: the snippet references s_source and a constant Iters that are declared
// elsewhere in the benchmark class; the field below is assumed for completeness.
private static byte[] s_source;

[Params(
    -1, 0, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511)]
public int MisMatchIndex { get; set; }

[Benchmark(OperationsPerInvoke = Iters)]
public int SequenceCompareTo()
{
    byte[] input1 = s_source;
    byte[] input2 = input1.ToArray();

    if (MisMatchIndex != -1)
    {
        input2[MisMatchIndex] = 5; // something other than the value originally at MisMatchIndex
    }
    }
    Span<byte> span1 = new Span<byte>(input1);
    ReadOnlySpan<byte> span2 = new ReadOnlySpan<byte>(input2);

    int total = 0;
    for (int i = 0; i < Iters; i++)
    {
        total += span1.SequenceCompareTo(span2);
    }

    return total;
}

[GlobalSetup]
public void Setup()
{
    var input = Enumerable.Range(0, 512).Select(b => (byte)b);
    s_source = input.Concat(input).ToArray();
}

(Note: -1 means both spans have the same contents.)

@TamarChristinaArm
Contributor

By the way, just in case you didn't see it, we have an epic issue #35853 to track all the ARM64 optimization opportunities. You might want to check the issues in the peephole optimization category, in case you have any input.

Ah, thanks! I've subscribed to it.

For an original length of 21 bytes, after the first 16-byte comparison completes, it does another 16-byte load starting at (length - 16), i.e. bytes 5..20, for the final comparison (if that's what you wanted to ask).

Yes, though your Q-form loads are usually the slowest ones, and older cores can't dual-issue them. Using the smallest possible loads for the unaligned access would be optimal, but probably not that critical.
I think you'd get more bang for the buck with load pairs, so I'd go with that first.

You can see my implementation of SequenceEqual here and let me know if we can do better.

Thanks! That looks fine. I think the only gains you can get here are with load pairs: aside from throughput, they also allow you to AND the vectors on the SIMD side and do a single MINP and a single transfer to the genreg side. So these aren't gains you can easily get with a peephole, since the optimization would, I think, only be valid in this context.

So what I mean is: right now, in order to process 32 bytes of data you do

- 4 loads
- 2 cmp
- 2 minwise
- 2 SIMD -> genreg transfers
- 2 NE comparisons against 0xFFFFFFFF

but with LDP you could do

- 2 loads
- 3 and
- 1 minp
- 1 SIMD -> genreg transfer
- 1 NE comparison against 0xFFFFFFFF

where your first two ANDs are dual-issued.

I expect this to be faster but for this to work you'd need something like LoadPairVectors128.
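A hedged sketch of the combining step once the four vectors are in registers (real AdvSimd calls; the paired-load intrinsic itself would still be needed to reduce the number of load instructions):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class CombinedCompareSketch
{
    // Check 32 bytes (a0/a1 vs b0/b1) with a single SIMD -> genreg transfer:
    // AND the two equality masks together, then do one MinPairwise + one compare.
    public static bool Equals32(Vector128<byte> a0, Vector128<byte> b0,
                                Vector128<byte> a1, Vector128<byte> b1)
    {
        Vector128<byte> mask = AdvSimd.And(AdvSimd.CompareEqual(a0, b0),
                                           AdvSimd.CompareEqual(a1, b1));
        return AdvSimd.Arm64.MinPairwise(mask, mask).AsUInt64().ToScalar() == ulong.MaxValue;
    }
}
```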

I suspect you also have some long-term gains to get here based on addressing modes and codegen, e.g.

	offset += Vector128.Size;
	vecResult = AdvSimd.CompareEqual(LoadVector128(ref first, offset), LoadVector128(ref second, offset));
	if (AdvSimd.Arm64.MinPairwise(vecResult, vecResult).AsUInt64().ToScalar() != ulong.MaxValue)
	{
		goto NotEqual;
	}

and initialize offset to -Vector128.Size;

Your loads then become loads with a constant offset and writeback, so you remove the addition.
The branches then become conditional,

so something like

neg x1, x1
cbnz x1, NotEqual
cmp lengthToExamine, offset
bgt .loop_start

These should speed things up along with ldp.

When this change was applied, I didn't see much win either. See the generated code; it is almost identical except for the MinAcross vs. MinPairwise usage.

System_Private_CoreLib!System.SpanHelpers.SequenceEqual(Byte ByRef, Byte ByRef, UIntPtr)+0x164:
 ldr         q16,[x19,x23]
 ldr         q17,[x20,x23]
 cmeq        v16.16b,v16.16b,v17.16b
 uminp       v16.16b,v16.16b,v16.16b
 umov        x11,v16.d[0]
 cmn         x11,#1
 bne         00007FFC4DDF156C  Branch

Yeah, that looks as good as it can get for two 128-bit vectors. On older cores you may see some difference using AND instead of CMEQ here, which would be a wash on newer ones.

@kunalspathak kunalspathak merged commit bf8aba0 into dotnet:master Jun 24, 2020
@kunalspathak kunalspathak deleted the withupper-lower branch June 24, 2020 15:04
@kunalspathak
Member Author

SequenceCompareTo and SequenceEqual contribute towards #33707 and #33308.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020