
implement SIMD UniformInt #561

Merged
merged 2 commits into from
Aug 3, 2018
Conversation

TheIronBorn
Collaborator

Not sure what we should do about benchmarks. We really ought to have some benchmarks for UniformInt but there are 36 different vectors we could benchmark.

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 19, 2018

I don’t know wasm well enough to know why the test is failing

@pitdicker
Contributor

> I don’t know wasm well enough to know why the test is failing

And it worked yesterday 😢. Seems to be rust-lang/rust#52353

@TheIronBorn
Collaborator Author

I've got a sample_single implementation in another branch with benchmarks: https://github.com/TheIronBorn/rand/blob/simd/benches/gen_range.rs

The performance seems very hardware dependent

@dhardy dhardy added the D-review Do: needs review label Jul 20, 2018
Contributor

@sicking sicking left a comment


This looks great to me FWIW!!

// The `range == 0` case is complicated with multiple lanes,
// avoid it.
assert!(range.eq($unsigned::splat(0)).none(),
"Uniform::new_inclusive called with the full range");
Contributor


One way to actually implement this would be to set zone to any value here for lanes where range == 0. Then in sample, change the return statement to:

return range.ne($unsigned::splat(0)).select(self.low + $ty::from(hi), v);

It obviously adds some overhead, but so does our handling in the non-simd case.

It's possible that the overhead is bigger here though. In particular I'm not sure if the $unsigned::splat(0) call will get compiled into something fast. It would for non-simd types, but I don't know SIMD instructions well enough to know if that's the case here.

If there's no way to make the splat operation fast, you could also try:

return range.ne(hi).select(self.low + $ty::from(hi), v);

fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Self::X {
let range = $unsigned::from(self.range);
// TODO: ensure these casts are a no-op
let zone = $unsigned::from($signed::from(self.zone));
Contributor


I think simply doing

$unsigned::from($signed::from(self.zone));

should work. We need multiple casts in the non-simd case because we're also changing the bit-width, but that's not the case here AFAICT.

Contributor


Err.. I mean

$unsigned::from(self.zone);

$(
let v: &[($ty, $ty)] = &[
($ty::splat(0), $ty::splat(10)),
($ty::splat(10), $ty::splat(127)),
Contributor


Could you fold this into the existing macro by using FloatSIMDUtils::cast_from_int and FloatSIMDUtils::all_le?

Or rather, replace the existing macro with your new macro, and just add the scalar types to it.

I guess we'd still have to keep the full-range tests separate unless we make the changes above.

@sicking
Contributor

sicking commented Jul 21, 2018

I should also add that my approval is entirely non-official. It just means that I think the code looks good 😃

@TheIronBorn
Collaborator Author

Here's the _mm_mullo_pi16 PR: rust-lang/stdarch#478

Currently investigating the range == 0 case.

@TheIronBorn
Collaborator Author

Added code to handle the full range case. Builds fail because of the stdsimd changes. I can fix that when those changes stabilize.

There seems to be no significant change in benchmarks

 name                        no_full_range_support ns/iter  full_range_support ns/iter  diff ns/iter   diff %  speedup
+simd::distr_uniform_i16x16  13,230 (2418 MB/s)             12,070 (2651 MB/s)                -1,160   -8.77%   x 1.10 
+simd::distr_uniform_i16x2   8,769 (456 MB/s)               7,062 (566 MB/s)                  -1,707  -19.47%   x 1.24 
+simd::distr_uniform_i16x32  43,309 (1477 MB/s)             35,561 (1799 MB/s)                -7,748  -17.89%   x 1.22 
+simd::distr_uniform_i16x4   7,993 (1000 MB/s)              6,250 (1280 MB/s)                 -1,743  -21.81%   x 1.28 
+simd::distr_uniform_i16x8   3,695 (4330 MB/s)              3,409 (4693 MB/s)                   -286   -7.74%   x 1.08 
-simd::distr_uniform_i32x16  46,434 (1378 MB/s)             54,196 (1180 MB/s)                 7,762   16.72%   x 0.86 
-simd::distr_uniform_i32x2   6,253 (1279 MB/s)              7,861 (1017 MB/s)                  1,608   25.72%   x 0.80 
-simd::distr_uniform_i32x4   7,447 (2148 MB/s)              8,922 (1793 MB/s)                  1,475   19.81%   x 0.83 
+simd::distr_uniform_i32x8   18,559 (1724 MB/s)             17,130 (1868 MB/s)                -1,429   -7.70%   x 1.08 
+simd::distr_uniform_i64x2   6,651 (2405 MB/s)              6,313 (2534 MB/s)                   -338   -5.08%   x 1.05 
-simd::distr_uniform_i64x4   15,708 (2037 MB/s)             18,424 (1736 MB/s)                 2,716   17.29%   x 0.85 
+simd::distr_uniform_i64x8   27,984 (2287 MB/s)             27,929 (2291 MB/s)                   -55   -0.20%   x 1.00 
+simd::distr_uniform_i8x16   15,296 (1046 MB/s)             12,748 (1255 MB/s)                -2,548  -16.66%   x 1.20 
+simd::distr_uniform_i8x2    10,303 (194 MB/s)              8,399 (238 MB/s)                  -1,904  -18.48%   x 1.23 
+simd::distr_uniform_i8x32   27,191 (1176 MB/s)             25,844 (1238 MB/s)                -1,347   -4.95%   x 1.05 
+simd::distr_uniform_i8x4    10,192 (392 MB/s)              10,137 (394 MB/s)                    -55   -0.54%   x 1.01 
+simd::distr_uniform_i8x64   146,547 (436 MB/s)             120,560 (530 MB/s)               -25,987  -17.73%   x 1.22 
-simd::distr_uniform_i8x8    11,385 (702 MB/s)              13,822 (578 MB/s)                  2,437   21.41%   x 0.82 

Contributor

@sicking sicking left a comment


This looks great! I'll defer to others for the test stuff.

let ints_to_reject = (unsigned_max - range + 1) % modulo;
// `(unsigned_max - 0 + 1) % x` => `0 % x` => `0` so
// the zone for the full range case is `unsigned_max` which
// means only one sample is needed.
Contributor


Any value is OK, since the sample code checks (rand * range).low <= zone, and when range is 0 the low part is always zero, so the test always passes. unsigned_max here is as good as any other value.
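A rough scalar model of that acceptance test may help (illustrative only; the PR's code is the SIMD macro, and full-range lanes are additionally handled by the select suggested above):

fn accept(v: u32, range: u32, zone: u32) -> Option<u32> {
    // widening multiply: `hi` is the candidate offset from `low`,
    // `lo` is the part tested against the rejection zone
    let t = (v as u64) * (range as u64);
    let (hi, lo) = ((t >> 32) as u32, t as u32);
    if lo <= zone { Some(hi) } else { None }
}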

Collaborator Author


unsigned_max is nice as it allows a faster select. That, and the "range is 0" reasoning, should probably be documented above this line though.
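For illustration, a scalar sketch (not the PR's code) showing why the all-ones replacement makes the select cheap, since the eq-comparison mask can simply be OR'd into range:

fn modulo_for_lane(range: u32) -> u32 {
    // the SIMD `eq` comparison yields an all-ones lane where range == 0
    let eq_zero_mask = if range == 0 { u32::max_value() } else { 0 };
    // selecting unsigned_max for those lanes collapses to a bitwise OR
    range | eq_zero_mask
}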

let my_uniform = Uniform::new_inclusive(low, high);
for _ in 0..1000 {
let v: $ty = rng.sample(my_uniform);
assert!($le(low, v) && $le(v, high));
Contributor


You could introduce something like FloatSIMDUtils and FloatAsSIMD::splat, but for integer types. That would allow a lot of this test code to be written more nicely.

But I'll defer to @dhardy and @pitdicker.

Member


It could, but what's written here is sufficient.

@TheIronBorn
Collaborator Author

Updated for packed_simd

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 27, 2018

If we want to include benchmarks, 16-bit, 32-bit, and 64-bit lane widths will likely have very different performance on most hardware. At the very least doing benchmarks for u16x8, u32x4, u64x2, or other minimally supported widths might be sufficient.

Member

@dhardy dhardy left a comment


As has already been said, this looks really good! Thanks @TheIronBorn!

There's no sample_single implementation; possibly it could benefit from the same range << range.leading_zeros() optimisation, but at this stage it doesn't seem important.

A couple of little points about arithmetic here — if it's wrapping by default (unlike other Rust numeric types) then there are no issues. (But are there plans to change this in the future?)
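For context, a hedged scalar sketch of the zone approximation that optimisation refers to (illustrative names and types, not rand's actual sample_single):

fn sample_single_sketch<R: rand::Rng + ?Sized>(low: u32, high: u32, rng: &mut R) -> u32 {
    let range = high.wrapping_sub(low); // half-open [low, high); range > 0 assumed
    // shifting the range up approximates the acceptance zone without a division
    let zone = (range << range.leading_zeros()).wrapping_sub(1);
    loop {
        let v: u32 = rng.gen();
        let t = (v as u64) * (range as u64);
        let (hi, lo) = ((t >> 32) as u32, t as u32);
        if lo <= zone {
            return low.wrapping_add(hi);
        }
    }
}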


#[inline] // if the range is constant, this helps LLVM to do the
// calculations at compile-time.
// TODO: test this ^ for SIMD
Member


Are you planning on testing? It seems fine. Note that #[inline] is needed to allow inlining in external crates, unless using (full) link-time optimisation, so although the compiler might inline this anyway in internal tests, it wouldn't otherwise when called from a separate lib/bin.

"Uniform::new_inclusive called with `low > high`");
let unsigned_max = ::core::$u_scalar::MAX;

let range: $unsigned = ((high - low) + 1).cast();
Member


Is SIMD arithmetic wrapping by default?

// replacing 0 with `unsigned_max` allows a faster `select`
// with bitwise OR
let modulo = not_full_range.select(range, $unsigned::splat(unsigned_max));
let ints_to_reject = (unsigned_max - range + 1) % modulo;
Member


If range is 0, this will overflow — or wrap to 0?

Collaborator Author


wrap to 0
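(Concretely, with wrapping arithmetic and range == 0: unsigned_max - 0 + 1 wraps to 0, 0 % modulo is 0, so ints_to_reject is 0 and the resulting zone is unsigned_max, matching the comment above.)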

// bulk implementation
($(($unsigned:ident, $signed:ident),)+ $u_scalar:ident) => {
$(
uniform_simd_int_impl!($unsigned, $unsigned, $signed, $u_scalar);
Member


The $signed parameter appears to never be used.

@TheIronBorn
Collaborator Author

There is no easy SIMD leading zeros. I implemented one on my other branch using a bit-twiddling hack, but its performance seems too hardware-dependent.

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 28, 2018

All packed_simd arithmetic is wrapping by default at the moment, though the RFC indicates we might have to replace the operators with wrapping_ variants at some point: https://github.com/gnzlbg/rfcs/blob/ppv/text/0000-ppv.md#wrapping-arithmetic-operations

@dhardy
Member

dhardy commented Jul 28, 2018

Could you then add a note where we rely on wrapping? It might help avoid bugs in the future.

@TheIronBorn
Collaborator Author

Assembly inspection confirms range calculations are done at compile-time

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 29, 2018

packed_simd has u128xN vectors, which allow an up to 2x faster implementation of wmul for u64xN vectors. It requires importing a platform intrinsic at the moment though.

(512-bit vectors will still need the __mulddi3 implementation)
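A scalar illustration of the trick (the SIMD version would do the same lane-wise with packed_simd's u128xN types; the exact cast API is not shown here):

fn wmul_u64(a: u64, b: u64) -> (u64, u64) {
    // widen, multiply once, then split the product into high and low halves
    let t = (a as u128) * (b as u128);
    ((t >> 64) as u64, t as u64)
}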

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 29, 2018

 name                 64bit_mulddi3 ns/iter  64bit_cast ns/iter  diff ns/iter   diff %  speedup 
 simd_uniform_i64x2   12,167 (1315 MB/s)     7,580 (2110 MB/s)         -4,587  -37.70%   x 1.61 
 simd_uniform_i64x4   25,930 (1234 MB/s)     14,782 (2164 MB/s)       -11,148  -42.99%   x 1.75 

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 29, 2018

Turns out a scalar leading zeros implementation is still fast enough. I'll investigate if there's a similar slowdown for i8/16.

Edit: perhaps only because I was using flat vectors, u32x4::splat(high) etc
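For reference, a hedged sketch of what "scalar leading zeros" means here, assuming packed_simd's extract/replace API:

fn leading_zeros_u32x4(v: packed_simd::u32x4) -> packed_simd::u32x4 {
    let mut out = packed_simd::u32x4::splat(0);
    for i in 0..4 {
        // use the scalar intrinsic per lane instead of a SIMD bit hack
        out = out.replace(i, v.extract(i).leading_zeros());
    }
    out
}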

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 30, 2018

Scalar leading zeros is still faster even when the compiler can't (theoretically) optimize it to a single u32::leading_zeros for the u32x4::splat(range) case.

I'm currently investigating some of the other methods listed here: http://www.pcg-random.org/posts/bounded-rands.html. SIMD division is extremely slow, so I expect only the bitmask method will be comparable.
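Roughly, the bitmask method from that post looks like this in scalar form (sketch only; a SIMD version would track not-yet-accepted lanes with a mask):

fn bitmask_sample<R: rand::Rng + ?Sized>(range_inclusive: u32, rng: &mut R) -> u32 {
    // mask covering the smallest power of two above the inclusive range
    let mask = u32::max_value() >> (range_inclusive | 1).leading_zeros();
    loop {
        let v = rng.gen::<u32>() & mask;
        if v <= range_inclusive {
            return v;
        }
    }
}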

@pitdicker
Contributor

pitdicker commented Jul 30, 2018

One optimization that might be worthwhile here is to check for small ranges. I.e. when the range fits in 16 bits, don't use the full 32 bits and a widening multiply, but just a mask and a normal multiply. I imagine this to be an advantage in the common case, because ranges are rarely as large as the integer type can hold.

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 30, 2018

You're suggesting that with a small range of UniformInt<u32x4> we switch algorithms using u16x4? (Not sure what you mean by "mask and normal multiply".)

Edit: oh, are you suggesting switching from u32x4.wmul(range) to (u32x4 * range) & 0xfff? I don't know how much speedup we'll get, but I'll look into it. (The compiler currently doesn't actually cast to larger vectors; it does mul + mul + swizzle, spread out to avoid latency. That should probably be better noted in the comments.)
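My reading of the suggestion, as a scalar sketch (hypothetical helper; the rejection step and the all-lanes-small check are omitted for brevity):

fn small_range_sample(v: u32, range: u32) -> u32 {
    // when `range` fits in 16 bits, a plain 32-bit multiply already holds
    // the "widened" product: keep 16 random bits and take the high half
    debug_assert!(range != 0 && range <= 1 << 16);
    ((v & 0xffff) * range) >> 16
}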

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 30, 2018

That gives a large speedup in an ideal situation (most likely though because it needs only half as many random bits).

A proper implementation would be difficult and likely slow because some lanes may still have a large range. We could apply it only when all the lanes are small.

(less speedup with fewer lanes; the lower bit requirement is outweighed by the overhead)

test wide_mul::simd_uniform_i32x16          ... bench:         916 ns/iter (+/- 203) = 1117 MB/s
test wide_mul::simd_uniform_i32x2           ... bench:         102 ns/iter (+/- 21) = 1254 MB/s
test wide_mul::simd_uniform_i32x4           ... bench:         247 ns/iter (+/- 50) = 1036 MB/s
test wide_mul::simd_uniform_i32x8           ... bench:         390 ns/iter (+/- 104) = 1312 MB/s
test wide_mul_small::simd_uniform_i32x16    ... bench:         202 ns/iter (+/- 89) = 5069 MB/s
test wide_mul_small::simd_uniform_i32x2     ... bench:          39 ns/iter (+/- 40) = 3282 MB/s
test wide_mul_small::simd_uniform_i32x4     ... bench:         136 ns/iter (+/- 110) = 1882 MB/s
test wide_mul_small::simd_uniform_i32x8     ... bench:          85 ns/iter (+/- 87) = 6023 MB/s

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 31, 2018

It might be a nice optimization for shuffles or other cases where we know more about the range.

@TheIronBorn
Collaborator Author

The bitmask method seems better

 name                        wide_mul ns/iter    bitmask_simple ns/iter  diff ns/iter   diff %  speedup 
-simd_uniform_i16x16         13,282 (2467 MB/s)  42,491 (771 MB/s)             29,209  219.91%   x 0.31 
-simd_uniform_i16x2          4,446 (921 MB/s)    18,856 (217 MB/s)             14,410  324.11%   x 0.24 
-simd_uniform_i16x32         38,002 (1724 MB/s)  87,600 (748 MB/s)             49,598  130.51%   x 0.43 
-simd_uniform_i16x4          4,656 (1759 MB/s)   20,315 (403 MB/s)             15,659  336.32%   x 0.23 
-simd_uniform_i16x8          6,408 (2556 MB/s)   24,405 (671 MB/s)             17,997  280.85%   x 0.26 
+simd_uniform_i32x16         50,561 (1296 MB/s)  26,833 (2442 MB/s)           -23,728  -46.93%   x 1.88 
-simd_uniform_i32x2          5,257 (1558 MB/s)   5,411 (1513 MB/s)                154    2.93%   x 0.97 
+simd_uniform_i32x4          11,600 (1412 MB/s)  8,789 (1864 MB/s)             -2,811  -24.23%   x 1.32 
+simd_uniform_i32x8          19,774 (1657 MB/s)  14,907 (2198 MB/s)            -4,867  -24.61%   x 1.33 
+simd_uniform_i64x2          9,799 (1672 MB/s)   8,965 (1827 MB/s)               -834   -8.51%   x 1.09 
+simd_uniform_i64x4          16,291 (2011 MB/s)  14,064 (2329 MB/s)            -2,227  -13.67%   x 1.16 
+simd_uniform_i64x8          30,050 (2180 MB/s)  28,011 (2339 MB/s)            -2,039   -6.79%   x 1.07 
+simd_uniform_i8x16          16,832 (973 MB/s)   12,637 (1296 MB/s)            -4,195  -24.92%   x 1.33 
+simd_uniform_i8x2           5,822 (351 MB/s)    5,363 (381 MB/s)                -459   -7.88%   x 1.09 
+simd_uniform_i8x32          24,748 (1324 MB/s)  17,520 (1870 MB/s)            -7,228  -29.21%   x 1.41 
+simd_uniform_i8x4           7,536 (543 MB/s)    6,640 (616 MB/s)                -896  -11.89%   x 1.13 
+simd_uniform_i8x64          147,202 (445 MB/s)  109,336 (599 MB/s)           -37,866  -25.72%   x 1.35 
+simd_uniform_i8x8           9,910 (826 MB/s)    9,794 (836 MB/s)                -116   -1.17%   x 1.01 
+simd_uniform_single_i16x16  82,276 (398 MB/s)   63,487 (516 MB/s)            -18,789  -22.84%   x 1.30 
+simd_uniform_single_i16x2   22,122 (185 MB/s)   16,473 (248 MB/s)             -5,649  -25.54%   x 1.34 
-simd_uniform_single_i16x32  175,026 (374 MB/s)  404,042 (162 MB/s)           229,016  130.85%   x 0.43 
+simd_uniform_single_i16x4   35,129 (233 MB/s)   21,397 (382 MB/s)            -13,732  -39.09%   x 1.64 
+simd_uniform_single_i16x8   27,776 (589 MB/s)   26,615 (615 MB/s)             -1,161   -4.18%   x 1.04 
+simd_uniform_single_i32x16  81,288 (806 MB/s)   53,903 (1215 MB/s)           -27,385  -33.69%   x 1.51 
+simd_uniform_single_i32x2   9,129 (897 MB/s)    8,398 (975 MB/s)                -731   -8.01%   x 1.09 
+simd_uniform_single_i32x4   17,449 (938 MB/s)   14,151 (1157 MB/s)            -3,298  -18.90%   x 1.23 
+simd_uniform_single_i32x8   40,181 (815 MB/s)   32,625 (1004 MB/s)            -7,556  -18.80%   x 1.23 
+simd_uniform_single_i64x2   24,001 (682 MB/s)   11,544 (1419 MB/s)           -12,457  -51.90%   x 2.08 
+simd_uniform_single_i64x4   40,520 (808 MB/s)   23,478 (1395 MB/s)           -17,042  -42.06%   x 1.73 
+simd_uniform_single_i64x8   85,092 (770 MB/s)   48,127 (1361 MB/s)           -36,965  -43.44%   x 1.77 
+simd_uniform_single_i8x16   66,271 (247 MB/s)   54,895 (298 MB/s)            -11,376  -17.17%   x 1.21 
+simd_uniform_single_i8x2    23,645 (86 MB/s)    18,929 (108 MB/s)             -4,716  -19.95%   x 1.25 
-simd_uniform_single_i8x32   139,309 (235 MB/s)  348,528 (94 MB/s)            209,219  150.18%   x 0.40 
+simd_uniform_single_i8x4    28,646 (142 MB/s)   23,976 (170 MB/s)             -4,670  -16.30%   x 1.19 
-simd_uniform_single_i8x64   967,238 (67 MB/s)   986,099 (66 MB/s)             18,861    1.95%   x 0.98 
+simd_uniform_single_i8x8    29,513 (277 MB/s)   27,354 (299 MB/s)             -2,159   -7.32%   x 1.08 

Benchmarks are here: https://github.com/TheIronBorn/rand/blob/simd/benches/simd_uniform_int.rs. (The single-modulo and double-division methods are too slow.)

The bitmask method slows down when the range is far from the next power of two, as is happening with the simd_uniform_i16x* cases above.

Using a better range:

 name                 wide_mul ns/iter    bitmask_simple ns/iter  diff ns/iter   diff %  speedup 
+simd_uniform_i16x16  17,330 (1890 MB/s)  13,225 (2477 MB/s)            -4,105  -23.69%   x 1.31 
+simd_uniform_i16x2   7,075 (578 MB/s)    5,299 (772 MB/s)              -1,776  -25.10%   x 1.34 
+simd_uniform_i16x32  51,725 (1267 MB/s)  29,134 (2249 MB/s)           -22,591  -43.68%   x 1.78 
+simd_uniform_i16x4   6,961 (1176 MB/s)   5,676 (1443 MB/s)             -1,285  -18.46%   x 1.23 
-simd_uniform_i16x8   9,380 (1746 MB/s)   11,650 (1406 MB/s)             2,270   24.20%   x 0.81 

There is a more complex bitmask method which tries to use the remaining bits of the random integer upon rejection. If it could be optimized better, it may be a further improvement.

@TheIronBorn
Collaborator Author

The biased widening multiply method is much faster, so it might be worth providing an extra, faster version.
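For reference, the biased variant is just the widening multiply with no rejection step (scalar sketch; the bias is small when the range is much smaller than 2^32):

fn biased_sample(v: u32, range: u32) -> u32 {
    // the high half of the widening multiply maps v roughly uniformly into [0, range)
    (((v as u64) * (range as u64)) >> 32) as u32
}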

@dhardy
Member

dhardy commented Aug 1, 2018

@TheIronBorn let me know when you think this is ready for merging — I'm not sure if you're still tweaking.

@TheIronBorn
Collaborator Author

If we provide an API which allows let x: u32x4 = UniformInt::new(0u32, 100u32).sample(rng) then the small widening multiplication optimization could work.

@TheIronBorn
Collaborator Author

Still tweaking but I think either the widening multiply method or the bitmask method would work well.

The bitmask method seems best for sample_single. The approximated zone of the widening multiplication method might be slowing it down too much.

@TheIronBorn
Collaborator Author

I think this is fine to merge now; sample_single or other optimizations can be done later.

@TheIronBorn
Collaborator Author

When the range is flat (i.e. u32x4::splat) it seems wide_mul is best. I don't think there's much use case for a varying range, so we can probably just use wide_mul for sample and bitmask for sample_single.

If you think a varying range could be used often, do speak up.

@dhardy
Member

dhardy commented Aug 2, 2018

@TheIronBorn could you squash some of the commits together please (i.e. eliminate most of the fix/refactor commits)?

@TheIronBorn
Collaborator Author

Squashed

@dhardy dhardy merged commit fa269c3 into rust-random:master Aug 3, 2018