
Optimize WithUpper/WithLower with InsertSelectedScalar, SpanHelpers.Sequence APIs #38075

Merged · 2 commits into dotnet:master on Jun 24, 2020

Conversation

kunalspathak
Member

As mentioned in #37139, use AdvSimd.Arm64.InsertSelectedScalar() to better optimize WithUpper() and WithLower(), and hence Vector128.Create(Vector64, Vector64).

Also added a note in SpanHelpers.SequenceEqual() and SpanHelpers.SequenceCompareTo() explaining that we are not optimizing them with ARM64 intrinsics for now, because doing so gives similar (sometimes smaller) wins compared to the existing vectorized versions.
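For context, a minimal sketch of the WithUpper idea, assuming AdvSimd.Arm64 support (a hypothetical standalone helper, not the exact code in this PR): inserting the single 64-bit lane of the Vector64 argument into the upper lane of the Vector128 maps to one INS instruction via InsertSelectedScalar.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class WithUpperSketch
{
    // Assumes AdvSimd.Arm64.IsSupported; emits INS Vd.D[1], Vn.D[0], i.e. copies
    // lane 0 of 'upper' into lane 1 of 'vector' without going through memory.
    public static Vector128<ulong> WithUpperViaIns(Vector128<ulong> vector, Vector64<ulong> upper)
    {
        return AdvSimd.Arm64.InsertSelectedScalar(vector, 1, upper.ToVector128Unsafe(), 0);
    }
}
```

WithLower is the symmetric case (insert into lane 0), and Vector128.Create(Vector64, Vector64) benefits because it is built on top of WithUpper.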

@kunalspathak
Member Author

kunalspathak commented Jun 18, 2020

@TamarChristinaArm
Contributor

Thanks @kunalspathak, the WithUpper and WithLower changes look good. I assume again that the redundant move goes away once we no longer spill the inputs at the first BB end.

I am however surprised that you don't see a performance gain with SpanHelpers.Sequence for two reasons:

  1. For spans longer than 31 bytes you can use ldp instead of ldr which has lower latency than 2x ldr.

  2. For non-full vectors you can use unaligned accesses. What does

// Do final compare as Vector<byte>.Count from end rather than start
if (LoadVector(ref first, lengthToExamine) == LoadVector(ref second, lengthToExamine))

return when lengthToExamine is 5? Does it issue five 1-byte loads, or a 4-byte and a 1-byte load?
If the original length was 21 bytes, for instance, the optimal sequence would have been 16 + 8: issue an unaligned read and re-read 3 bytes from the previous load.
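To illustrate that 16 + 8 scheme for a 21-byte span, here is a hedged sketch as a standalone helper (hypothetical name, not SpanHelpers code; it assumes both spans are exactly 21 bytes):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class UnalignedTailSketch
{
    // For 21-byte inputs: one 16-byte vector compare covers bytes 0..15, and one
    // unaligned 8-byte compare covers bytes 13..20, re-reading 3 already-checked bytes.
    public static bool Equals21(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        Vector128<byte> headA = MemoryMarshal.Read<Vector128<byte>>(a);
        Vector128<byte> headB = MemoryMarshal.Read<Vector128<byte>>(b);
        ulong tailA = MemoryMarshal.Read<ulong>(a.Slice(a.Length - sizeof(ulong)));
        ulong tailB = MemoryMarshal.Read<ulong>(b.Slice(b.Length - sizeof(ulong)));
        return headA.Equals(headB) && tailA == tailB;
    }
}
```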

@kunalspathak
Member Author

Thanks for taking a look @TamarChristinaArm

I assume again that the redundant move goes away once we no longer spill the inputs at the first BB end.

Yes.

I am however surprised that you don't see a performance gain with SpanHelpers.Sequence

I think you meant SequenceEqual.

  1. For spans longer than 31 bytes you can use ldp instead of ldr which has lower latency than 2x ldr.

Currently we don't combine two ldr instructions from consecutive memory locations into an ldp. I have opened #35132 and #35130 to track it. By the way, just in case you didn't see it, we have an epic issue #35853 to track all the ARM64 optimization opportunities. You might want to check the issues in the peephole optimization category, in case you have any input.

return when lengthToExamine is 5? Does it issue five 1-byte loads, or a 4-byte and a 1-byte load?

For an original length of 21 bytes, after the first 16-byte comparison completes, it does another 16-byte load starting at (length - 16), i.e. bytes 5..20, for the final comparison (if that's what you wanted to ask). This is true for the existing implementation as well as for my attempt with Vector128. You can see my implementation of SequenceEqual here and let me know if we can do better. Another reason for not seeing wins is that SequenceEqual used to be pre-compiled with crossgen, but once I add ARM64 intrinsics it stops being crossgen'd and is always JITted. I have opened #38060 to make sure that method bodies with ARM64 intrinsics get crossgen'd. When this change was applied, I didn't see much win either. See the generated code; it is almost identical except for the MinAcross vs. MinPairwise usage.
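To make that concrete, a hedged sketch of such a final overlapping compare, assuming AdvSimd.Arm64 support (hypothetical helper, not the exact SpanHelpers code):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class FinalCompareSketch
{
    // Re-read the last 16 bytes starting at (length - 16); for a 21-byte input
    // this covers bytes 5..20, overlapping bytes that were already compared.
    public static bool TailEquals(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        int start = a.Length - Vector128<byte>.Count;
        Vector128<byte> va = MemoryMarshal.Read<Vector128<byte>>(a.Slice(start));
        Vector128<byte> vb = MemoryMarshal.Read<Vector128<byte>>(b.Slice(start));

        // "All lanes equal" check: the pairwise minimum of the comparison mask is
        // all-ones only if every byte matched (the uminp + umov + cmn #1 pattern).
        Vector128<byte> mask = AdvSimd.CompareEqual(va, vb);
        return AdvSimd.Arm64.MinPairwise(mask, mask).AsUInt64().ToScalar() == ulong.MaxValue;
    }
}
```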

Regarding SequenceCompareTo, you can see my implementation here. I used MinPairwise to quickly check whether a 16-byte chunk matched. However, on a mismatch inside the loop, or for the final comparison, I had to use CompareEqual, etc. to find the matching length. I was surprised that this is slower for cases where the mismatch is in the first 7~8 bytes, where sequential comparison is faster. Sequential comparison does memory loads (at most 15 of them), while we don't need that many when ARM64 intrinsics are used. That should make the ARM64 version faster, but unfortunately I don't see that in the perf numbers. Let me know if you have any suggestions.
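For reference, one possible way to recover the matching length from a 16-byte chunk once a mismatch is detected, shown as a hedged sketch (hypothetical helper; not necessarily what the linked implementation does):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class MatchLengthSketch
{
    // Returns how many leading bytes of the two vectors are equal (0..16).
    public static int MatchingBytes(Vector128<byte> left, Vector128<byte> right)
    {
        Vector128<byte> mask = AdvSimd.CompareEqual(left, right); // 0xFF where equal, 0x00 where not
        ulong lo = mask.AsUInt64().ToScalar();
        ulong hi = mask.AsUInt64().GetElement(1);

        if (lo != ulong.MaxValue)
        {
            return BitOperations.TrailingZeroCount(~lo) >> 3;   // first mismatch is in bytes 0..7
        }
        return 8 + (BitOperations.TrailingZeroCount(~hi) >> 3); // mismatch (if any) is in bytes 8..15
    }
}
```

This extra CompareEqual/extract work on a mismatch is one plausible reason the intrinsic path can lose to sequential comparison when the mismatch sits in the first few bytes.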

See the assembly output below; the code size with my changes also increased.

Here are the perf numbers:

| Method | MisMatchIndex | Before | After | Diff |
|-------------------|------:|-------:|-------:|-----:|
| SequenceCompareTo | -1 | 131.48 | 123.62 | -6% |
| SequenceCompareTo | 0 | 17.2 | 23.52 | 37% |
| SequenceCompareTo | 496 | 63.9 | 70.98 | 11% |
| SequenceCompareTo | 497 | 63.06 | 73.15 | 16% |
| SequenceCompareTo | 498 | 64.37 | 73.34 | 14% |
| SequenceCompareTo | 499 | 65.54 | 71.23 | 9% |
| SequenceCompareTo | 500 | 67.58 | 70.8 | 5% |
| SequenceCompareTo | 501 | 67.86 | 74.63 | 10% |
| SequenceCompareTo | 502 | 69 | 73.12 | 6% |
| SequenceCompareTo | 503 | 71.99 | 73.07 | 2% |
| SequenceCompareTo | 504 | 71.44 | 71.37 | 0% |
| SequenceCompareTo | 505 | 72.65 | 71.31 | -2% |
| SequenceCompareTo | 506 | 73.8 | 70.86 | -4% |
| SequenceCompareTo | 507 | 74.81 | 71.22 | -5% |
| SequenceCompareTo | 508 | 76.01 | 71.3 | -6% |
| SequenceCompareTo | 509 | 78.69 | 71.26 | -9% |
| SequenceCompareTo | 510 | 80.22 | 71.28 | -11% |
| SequenceCompareTo | 511 | 81.02 | 71.42 | -12% |

Benchmark code:

// Note: the snippet references s_source and a constant Iters that are declared
// elsewhere in the benchmark class; the field below is assumed for completeness.
private static byte[] s_source;

[Params(
    -1, 0, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511)]
public int MisMatchIndex { get; set; }

[Benchmark(OperationsPerInvoke = Iters)]
public int SequenceCompareTo()
{
    byte[] input1 = s_source;
    byte[] input2 = input1.ToArray();

    if (MisMatchIndex != -1)
    {
        input2[MisMatchIndex] = 5; // something other than the value originally at MisMatchIndex
    }
    }
    Span<byte> span1 = new Span<byte>(input1);
    ReadOnlySpan<byte> span2 = new ReadOnlySpan<byte>(input2);

    int total = 0;
    for (int i = 0; i < Iters; i++)
    {
        total += span1.SequenceCompareTo(span2);
    }

    return total;
}

[GlobalSetup]
public void Setup()
{
    var input = Enumerable.Range(0, 512).Select(b => (byte)b);
    s_source = input.Concat(input).ToArray();
}

(Note: -1 means both spans have the same contents.)

@TamarChristinaArm
Contributor

By the way, just in case you didn't see it, we have an epic issue #35853 to track all the ARM64 optimization opportunities. You might want to check the issues in the peephole optimization category, in case you have any input.

Ah, thanks! I've subscribed to it.

For an original length of 21 bytes, after the first 16-byte comparison completes, it does another 16-byte load starting at (length - 16), i.e. bytes 5..20, for the final comparison (if that's what you wanted to ask).

Yes, though your Q-form loads are usually the slowest ones, and older cores can't dual-issue them. Using the smallest possible loads for the unaligned access would be optimal, but probably not that critical.
I think you'd get more bang for the buck with load pairs, so I'd go with that first.

You can see my implementation of SequenceEqual here and let me know if we can do better.

Thanks! That looks fine. I think the only gains you can get here are with load pairs: aside from throughput, they also allow you to AND the vectors on the SIMD side and do a single MINP and a single transfer to the genreg side. So these aren't gains you can easily get with a peephole, since the optimization would, I think, only be valid in this context.

So what I mean is: right now, in order to process 32 bytes of data you do

- 4 loads
- 2 cmp
- 2 minwise
- 2 SIMD -> genreg transfers
- 2 NE comparisons against 0xFFFFFFFF

but with LDP you could do

- 2 loads
- 3 and
- 1 minp
- 1 SIMD -> genreg transfer
- 1 NE comparison against 0xFFFFFFFF

where your first two ANDs are dual-issued.

I expect this to be faster but for this to work you'd need something like LoadPairVectors128.
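A hedged sketch of the combining step once the four vectors are in registers (real AdvSimd calls; the paired-load intrinsic itself would still be needed to reduce the number of load instructions):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class CombinedCompareSketch
{
    // Check 32 bytes (a0/a1 vs b0/b1) with a single SIMD -> genreg transfer:
    // AND the two equality masks together, then do one MinPairwise + one compare.
    public static bool Equals32(Vector128<byte> a0, Vector128<byte> b0,
                                Vector128<byte> a1, Vector128<byte> b1)
    {
        Vector128<byte> mask = AdvSimd.And(AdvSimd.CompareEqual(a0, b0),
                                           AdvSimd.CompareEqual(a1, b1));
        return AdvSimd.Arm64.MinPairwise(mask, mask).AsUInt64().ToScalar() == ulong.MaxValue;
    }
}
```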

I suspect you also have some long-term gains to get here based on addressing modes and codegen, e.g.

	offset += Vector128.Size;
	vecResult = AdvSimd.CompareEqual(LoadVector128(ref first, offset), LoadVector128(ref second, offset));
	if (AdvSimd.Arm64.MinPairwise(vecResult, vecResult).AsUInt64().ToScalar() != ulong.MaxValue)
	{
		goto NotEqual;
	}

and initialize offset to -Vector128.Size;

Your loads then become loads with a constant offset and writeback, so you remove the addition.
The branches then become conditional,

so something like

neg x1, x1
cbnz x1, NotEqual
cmp lengthToExamine, offset
bgt .loop_start

These should speed things up along with ldp.

When this change was applied, I didn't see much win either. See the generated code; it is almost identical except for the MinAcross vs. MinPairwise usage.

System_Private_CoreLib!System.SpanHelpers.SequenceEqual(Byte ByRef, Byte ByRef, UIntPtr)+0x164:
 ldr         q16,[x19,x23]
 ldr         q17,[x20,x23]
 cmeq        v16.16b,v16.16b,v17.16b
 uminp       v16.16b,v16.16b,v16.16b
 umov        x11,v16.d[0]
 cmn         x11,#1
 bne         00007FFC4DDF156C  Branch

Yeah, that looks as good as it can get for two 128-bit vectors. On older cores you may see some difference using AND instead of CMEQ here, which would be a wash on newer ones.

@kunalspathak kunalspathak merged commit bf8aba0 into dotnet:master Jun 24, 2020
@kunalspathak kunalspathak deleted the withupper-lower branch June 24, 2020 15:04
@kunalspathak
Member Author

SequenceCompareTo and SequenceEqual contribute towards #33707 and #33308.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020