
implement SIMD UniformInt #561

Merged
merged 2 commits into from
Aug 3, 2018
Conversation

TheIronBorn
Collaborator

Not sure what we should do about benchmarks. We really ought to have some benchmarks for UniformInt but there are 36 different vectors we could benchmark.

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 19, 2018

I don’t know wasm well enough to know why the test is failing

@pitdicker
Contributor

> I don’t know wasm well enough to know why the test is failing

And it worked yesterday 😢. Seems to be rust-lang/rust#52353

@TheIronBorn
Collaborator Author

I've got a sample_single implementation in another branch with benchmarks: https://github.com/TheIronBorn/rand/blob/simd/benches/gen_range.rs

The performance seems very hardware dependent

@dhardy dhardy added the D-review Do: needs review label Jul 20, 2018
Contributor

@sicking sicking left a comment


This looks great to me FWIW!!

// The `range == 0` case is complicated with multiple lanes,
// avoid it.
assert!(range.eq($unsigned::splat(0)).none(),
"Uniform::new_inclusive called with the full range");
Contributor


One way to actually implement this would be to set zone to any value here for lanes where range == 0. Then in sample, change the return statement to:

return range.ne($unsigned::splat(0)).select(self.low + $ty::from(hi), v);

It obviously adds some overhead, but so does our handling in the non-simd case.

It's possible that the overhead is bigger here though. In particular I'm not sure if the $unsigned::splat(0) call will get compiled into something fast. It would for non-simd types, but I don't know SIMD instructions well enough to know if that's the case here.

If there's no way to make the splat operation fast, you could also try:

return range.ne(hi).select(self.low + $ty::from(hi), v);

fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Self::X {
let range = $unsigned::from(self.range);
// TODO: ensure these casts are a no-op
let zone = $unsigned::from($signed::from(self.zone));
Contributor


I think simply doing

$unsigned::from($signed::from(self.zone));

should work. We need multiple casts in the non-simd case because we're also changing the bit-width, but that's not the case here AFAICT.

Contributor


Err.. I mean

$unsigned::from(self.zone);

$(
let v: &[($ty, $ty)] = &[
($ty::splat(0), $ty::splat(10)),
($ty::splat(10), $ty::splat(127)),
Contributor


Could you fold this into the existing macro by using FloatSIMDUtils::cast_from_int and FloatSIMDUtils::all_le?

Or rather, replace the existing macro with your new macro, and just add the scalar types to it.

I guess we'd still have to keep the full-range tests separate unless we make the changes above.

@sicking
Contributor

sicking commented Jul 21, 2018

I should also add that my approval is entirely non-official. It just means that I think the code looks good 😃

@TheIronBorn
Collaborator Author

Here's the _mm_mullo_pi16 PR: rust-lang/stdarch#478

Currently investigating the range == 0 case.

@TheIronBorn
Collaborator Author

Added code to handle the full range case. Builds fail because of the stdsimd changes. I can fix that when those changes stabilize.

There seems to be no significant change in benchmarks

 name                        no_full_range_support ns/iter  full_range_support ns/iter  diff ns/iter   diff %  speedup
+simd::distr_uniform_i16x16  13,230 (2418 MB/s)             12,070 (2651 MB/s)                -1,160   -8.77%   x 1.10 
+simd::distr_uniform_i16x2   8,769 (456 MB/s)               7,062 (566 MB/s)                  -1,707  -19.47%   x 1.24 
+simd::distr_uniform_i16x32  43,309 (1477 MB/s)             35,561 (1799 MB/s)                -7,748  -17.89%   x 1.22 
+simd::distr_uniform_i16x4   7,993 (1000 MB/s)              6,250 (1280 MB/s)                 -1,743  -21.81%   x 1.28 
+simd::distr_uniform_i16x8   3,695 (4330 MB/s)              3,409 (4693 MB/s)                   -286   -7.74%   x 1.08 
-simd::distr_uniform_i32x16  46,434 (1378 MB/s)             54,196 (1180 MB/s)                 7,762   16.72%   x 0.86 
-simd::distr_uniform_i32x2   6,253 (1279 MB/s)              7,861 (1017 MB/s)                  1,608   25.72%   x 0.80 
-simd::distr_uniform_i32x4   7,447 (2148 MB/s)              8,922 (1793 MB/s)                  1,475   19.81%   x 0.83 
+simd::distr_uniform_i32x8   18,559 (1724 MB/s)             17,130 (1868 MB/s)                -1,429   -7.70%   x 1.08 
+simd::distr_uniform_i64x2   6,651 (2405 MB/s)              6,313 (2534 MB/s)                   -338   -5.08%   x 1.05 
-simd::distr_uniform_i64x4   15,708 (2037 MB/s)             18,424 (1736 MB/s)                 2,716   17.29%   x 0.85 
+simd::distr_uniform_i64x8   27,984 (2287 MB/s)             27,929 (2291 MB/s)                   -55   -0.20%   x 1.00 
+simd::distr_uniform_i8x16   15,296 (1046 MB/s)             12,748 (1255 MB/s)                -2,548  -16.66%   x 1.20 
+simd::distr_uniform_i8x2    10,303 (194 MB/s)              8,399 (238 MB/s)                  -1,904  -18.48%   x 1.23 
+simd::distr_uniform_i8x32   27,191 (1176 MB/s)             25,844 (1238 MB/s)                -1,347   -4.95%   x 1.05 
+simd::distr_uniform_i8x4    10,192 (392 MB/s)              10,137 (394 MB/s)                    -55   -0.54%   x 1.01 
+simd::distr_uniform_i8x64   146,547 (436 MB/s)             120,560 (530 MB/s)               -25,987  -17.73%   x 1.22 
-simd::distr_uniform_i8x8    11,385 (702 MB/s)              13,822 (578 MB/s)                  2,437   21.41%   x 0.82 

Contributor

@sicking sicking left a comment


This looks great! I'll defer to others for the test stuff.

let ints_to_reject = (unsigned_max - range + 1) % modulo;
// `(unsigned_max - 0 + 1) % x` => `0 % x` => `0` so
// the zone for the full range case is `unsigned_max` which
// means only one sample is needed.
Contributor


Any value is OK, since the sample code checks (rand * range).low <= zone, and when range is 0 the low part is always zero, so the test always passes. unsigned_max here is as good as any other value.
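A rough scalar model of that acceptance test may help (illustrative only; the PR's code is the SIMD macro, and full-range lanes are additionally handled by the select suggested above):

fn accept(v: u32, range: u32, zone: u32) -> Option<u32> {
    // widening multiply: `hi` is the candidate offset from `low`,
    // `lo` is the part tested against the rejection zone
    let t = (v as u64) * (range as u64);
    let (hi, lo) = ((t >> 32) as u32, t as u32);
    if lo <= zone { Some(hi) } else { None }
}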

Collaborator Author


unsigned_max is nice as it allows a faster select. That, and the "range is 0" reasoning, should probably be documented above this line though.
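For illustration, a scalar sketch (not the PR's code) showing why the all-ones replacement makes the select cheap, since the eq-comparison mask can simply be OR'd into range:

fn modulo_for_lane(range: u32) -> u32 {
    // the SIMD `eq` comparison yields an all-ones lane where range == 0
    let eq_zero_mask = if range == 0 { u32::max_value() } else { 0 };
    // selecting unsigned_max for those lanes collapses to a bitwise OR
    range | eq_zero_mask
}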

let my_uniform = Uniform::new_inclusive(low, high);
for _ in 0..1000 {
let v: $ty = rng.sample(my_uniform);
assert!($le(low, v) && $le(v, high));
Contributor


You could introduce something like FloatSIMDUtils and FloatAsSIMD::splat, but for integer types. That would allow a lot of this test code to be written more nicely.

But I'll defer to @dhardy and @pitdicker.

Member


It could, but what's written here is sufficient.

@TheIronBorn
Collaborator Author

Updated for packed_simd

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 27, 2018

If we want to include benchmarks, 16-bit, 32-bit, and 64-bit lane widths will likely have very different performance on most hardware. At the very least doing benchmarks for u16x8, u32x4, u64x2, or other minimally supported widths might be sufficient.

Member

@dhardy dhardy left a comment


As has already been said, this looks really good! Thanks @TheIronBorn!

There's no sample_single implementation; possibly it could benefit from the same range << range.leading_zeros() optimisation, but at this stage it doesn't seem important.

A couple of little points about arithmetic here — if it's wrapping by default (unlike other Rust numeric types) then there are no issues. (But are there plans to change this in the future?)
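For context, a hedged scalar sketch of the zone approximation that optimisation refers to (illustrative names and types, not rand's actual sample_single):

fn sample_single_sketch<R: rand::Rng + ?Sized>(low: u32, high: u32, rng: &mut R) -> u32 {
    let range = high.wrapping_sub(low); // half-open [low, high); range > 0 assumed
    // shifting the range up approximates the acceptance zone without a division
    let zone = (range << range.leading_zeros()).wrapping_sub(1);
    loop {
        let v: u32 = rng.gen();
        let t = (v as u64) * (range as u64);
        let (hi, lo) = ((t >> 32) as u32, t as u32);
        if lo <= zone {
            return low.wrapping_add(hi);
        }
    }
}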


#[inline] // if the range is constant, this helps LLVM to do the
// calculations at compile-time.
// TODO: test this ^ for SIMD
Member


Are you planning on testing? It seems fine. Note that #[inline] is needed to allow inlining in external crates, unless using (full) link-time optimisation, so although the compiler might inline this anyway in internal tests, it wouldn't otherwise when called from a separate lib/bin.

"Uniform::new_inclusive called with `low > high`");
let unsigned_max = ::core::$u_scalar::MAX;

let range: $unsigned = ((high - low) + 1).cast();
Member


Is SIMD arithmetic wrapping by default?

// replacing 0 with `unsigned_max` allows a faster `select`
// with bitwise OR
let modulo = not_full_range.select(range, $unsigned::splat(unsigned_max));
let ints_to_reject = (unsigned_max - range + 1) % modulo;
Member


If range is 0, this will overflow — or wrap to 0?

Collaborator Author


wrap to 0
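(Concretely, with wrapping arithmetic and range == 0: unsigned_max - 0 + 1 wraps to 0, 0 % modulo is 0, so ints_to_reject is 0 and the resulting zone is unsigned_max, matching the comment above.)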

// bulk implementation
($(($unsigned:ident, $signed:ident),)+ $u_scalar:ident) => {
$(
uniform_simd_int_impl!($unsigned, $unsigned, $signed, $u_scalar);
Member


The $signed parameter appears to never be used.

@TheIronBorn
Collaborator Author

There is no easy SIMD leading zeros. I implemented one on my other branch using a bit-twiddling hack, but its performance seems too hardware-dependent.

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 28, 2018

All packed_simd arithmetic is wrapping by default at the moment, though the RFC indicates we might have to replace the operators with wrapping_ variants at some point: https://github.com/gnzlbg/rfcs/blob/ppv/text/0000-ppv.md#wrapping-arithmetic-operations

@dhardy
Member

dhardy commented Jul 28, 2018

Could you then add a note where we rely on wrapping? It might help avoid bugs in the future.

@TheIronBorn
Collaborator Author

Assembly inspection confirms range calculations are done at compile-time

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 29, 2018

packed_simd has u128xN vectors, which allow an up to 2x faster implementation of wmul for u64xN vectors. It requires importing a platform intrinsic at the moment though.

(512-bit vectors will still need the __mulddi3 implementation)
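A scalar illustration of the trick (the SIMD version would do the same lane-wise with packed_simd's u128xN types; the exact cast API is not shown here):

fn wmul_u64(a: u64, b: u64) -> (u64, u64) {
    // widen, multiply once, then split the product into high and low halves
    let t = (a as u128) * (b as u128);
    ((t >> 64) as u64, t as u64)
}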

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 29, 2018

 name                 64bit_mulddi3 ns/iter  64bit_cast ns/iter  diff ns/iter   diff %  speedup 
 simd_uniform_i64x2   12,167 (1315 MB/s)     7,580 (2110 MB/s)         -4,587  -37.70%   x 1.61 
 simd_uniform_i64x4   25,930 (1234 MB/s)     14,782 (2164 MB/s)       -11,148  -42.99%   x 1.75 

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 29, 2018

Turns out a scalar leading zeros implementation is still fast enough. I'll investigate if there's a similar slowdown for i8/16.

Edit: perhaps only because I was using flat vectors, u32x4::splat(high) etc
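For reference, a hedged sketch of what "scalar leading zeros" means here, assuming packed_simd's extract/replace API:

fn leading_zeros_u32x4(v: packed_simd::u32x4) -> packed_simd::u32x4 {
    let mut out = packed_simd::u32x4::splat(0);
    for i in 0..4 {
        // use the scalar intrinsic per lane instead of a SIMD bit hack
        out = out.replace(i, v.extract(i).leading_zeros());
    }
    out
}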

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 30, 2018

Scalar leading zeros is still faster even when the compiler can't (theoretically) optimize it to a single u32::leading_zeros for the u32x4::splat(range) case.

I'm currently investigating some of the other methods listed here: http://www.pcg-random.org/posts/bounded-rands.html. SIMD division is extremely slow, so I expect only the bitmask method will be comparable.
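Roughly, the bitmask method from that post looks like this in scalar form (sketch only; a SIMD version would track not-yet-accepted lanes with a mask):

fn bitmask_sample<R: rand::Rng + ?Sized>(range_inclusive: u32, rng: &mut R) -> u32 {
    // mask covering the smallest power of two above the inclusive range
    let mask = u32::max_value() >> (range_inclusive | 1).leading_zeros();
    loop {
        let v = rng.gen::<u32>() & mask;
        if v <= range_inclusive {
            return v;
        }
    }
}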

@pitdicker
Contributor

pitdicker commented Jul 30, 2018

One optimization that might be worthwhile here is to check for small ranges. I.e. when the range fits in 16 bits, don't use the full 32 bits and a widening multiply, but just a mask and a normal multiply. I imagine this to be an advantage in the common case, because ranges are rarely as large as the integer type can hold.

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 30, 2018

You're suggesting that with a small range of UniformInt<u32x4> we switch algorithms using u16x4? (Not sure what you mean by "mask and normal multiply".)

Edit: oh, are you suggesting switching from u32x4.wmul(range) to (u32x4 * range) & 0xfff? I don't know how much speedup we'll get, but I'll look into it. (The compiler currently doesn't actually cast to larger vectors; it does mul + mul + swizzle, spread out to avoid latency. That should probably be better noted in the comments.)
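My reading of the suggestion, as a scalar sketch (hypothetical helper; the rejection step and the all-lanes-small check are omitted for brevity):

fn small_range_sample(v: u32, range: u32) -> u32 {
    // when `range` fits in 16 bits, a plain 32-bit multiply already holds
    // the "widened" product: keep 16 random bits and take the high half
    debug_assert!(range != 0 && range <= 1 << 16);
    ((v & 0xffff) * range) >> 16
}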

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 30, 2018

That gives a large speedup in an ideal situation (most likely though because it needs only half as many random bits).

A proper implementation would be difficult and likely slow because some lanes may still have a large range. We could apply it only when all the lanes are small.

(less speedup with fewer lanes; the lower bit requirement is outweighed by the overhead)

test wide_mul::simd_uniform_i32x16          ... bench:         916 ns/iter (+/- 203) = 1117 MB/s
test wide_mul::simd_uniform_i32x2           ... bench:         102 ns/iter (+/- 21) = 1254 MB/s
test wide_mul::simd_uniform_i32x4           ... bench:         247 ns/iter (+/- 50) = 1036 MB/s
test wide_mul::simd_uniform_i32x8           ... bench:         390 ns/iter (+/- 104) = 1312 MB/s
test wide_mul_small::simd_uniform_i32x16    ... bench:         202 ns/iter (+/- 89) = 5069 MB/s
test wide_mul_small::simd_uniform_i32x2     ... bench:          39 ns/iter (+/- 40) = 3282 MB/s
test wide_mul_small::simd_uniform_i32x4     ... bench:         136 ns/iter (+/- 110) = 1882 MB/s
test wide_mul_small::simd_uniform_i32x8     ... bench:          85 ns/iter (+/- 87) = 6023 MB/s

@TheIronBorn
Collaborator Author

TheIronBorn commented Jul 31, 2018

It might be a nice optimization for shuffles or other cases where we know more about the range.

@TheIronBorn
Collaborator Author

The bitmask method seems better

 name                        wide_mul ns/iter    bitmask_simple ns/iter  diff ns/iter   diff %  speedup 
-simd_uniform_i16x16         13,282 (2467 MB/s)  42,491 (771 MB/s)             29,209  219.91%   x 0.31 
-simd_uniform_i16x2          4,446 (921 MB/s)    18,856 (217 MB/s)             14,410  324.11%   x 0.24 
-simd_uniform_i16x32         38,002 (1724 MB/s)  87,600 (748 MB/s)             49,598  130.51%   x 0.43 
-simd_uniform_i16x4          4,656 (1759 MB/s)   20,315 (403 MB/s)             15,659  336.32%   x 0.23 
-simd_uniform_i16x8          6,408 (2556 MB/s)   24,405 (671 MB/s)             17,997  280.85%   x 0.26 
+simd_uniform_i32x16         50,561 (1296 MB/s)  26,833 (2442 MB/s)           -23,728  -46.93%   x 1.88 
-simd_uniform_i32x2          5,257 (1558 MB/s)   5,411 (1513 MB/s)                154    2.93%   x 0.97 
+simd_uniform_i32x4          11,600 (1412 MB/s)  8,789 (1864 MB/s)             -2,811  -24.23%   x 1.32 
+simd_uniform_i32x8          19,774 (1657 MB/s)  14,907 (2198 MB/s)            -4,867  -24.61%   x 1.33 
+simd_uniform_i64x2          9,799 (1672 MB/s)   8,965 (1827 MB/s)               -834   -8.51%   x 1.09 
+simd_uniform_i64x4          16,291 (2011 MB/s)  14,064 (2329 MB/s)            -2,227  -13.67%   x 1.16 
+simd_uniform_i64x8          30,050 (2180 MB/s)  28,011 (2339 MB/s)            -2,039   -6.79%   x 1.07 
+simd_uniform_i8x16          16,832 (973 MB/s)   12,637 (1296 MB/s)            -4,195  -24.92%   x 1.33 
+simd_uniform_i8x2           5,822 (351 MB/s)    5,363 (381 MB/s)                -459   -7.88%   x 1.09 
+simd_uniform_i8x32          24,748 (1324 MB/s)  17,520 (1870 MB/s)            -7,228  -29.21%   x 1.41 
+simd_uniform_i8x4           7,536 (543 MB/s)    6,640 (616 MB/s)                -896  -11.89%   x 1.13 
+simd_uniform_i8x64          147,202 (445 MB/s)  109,336 (599 MB/s)           -37,866  -25.72%   x 1.35 
+simd_uniform_i8x8           9,910 (826 MB/s)    9,794 (836 MB/s)                -116   -1.17%   x 1.01 
+simd_uniform_single_i16x16  82,276 (398 MB/s)   63,487 (516 MB/s)            -18,789  -22.84%   x 1.30 
+simd_uniform_single_i16x2   22,122 (185 MB/s)   16,473 (248 MB/s)             -5,649  -25.54%   x 1.34 
-simd_uniform_single_i16x32  175,026 (374 MB/s)  404,042 (162 MB/s)           229,016  130.85%   x 0.43 
+simd_uniform_single_i16x4   35,129 (233 MB/s)   21,397 (382 MB/s)            -13,732  -39.09%   x 1.64 
+simd_uniform_single_i16x8   27,776 (589 MB/s)   26,615 (615 MB/s)             -1,161   -4.18%   x 1.04 
+simd_uniform_single_i32x16  81,288 (806 MB/s)   53,903 (1215 MB/s)           -27,385  -33.69%   x 1.51 
+simd_uniform_single_i32x2   9,129 (897 MB/s)    8,398 (975 MB/s)                -731   -8.01%   x 1.09 
+simd_uniform_single_i32x4   17,449 (938 MB/s)   14,151 (1157 MB/s)            -3,298  -18.90%   x 1.23 
+simd_uniform_single_i32x8   40,181 (815 MB/s)   32,625 (1004 MB/s)            -7,556  -18.80%   x 1.23 
+simd_uniform_single_i64x2   24,001 (682 MB/s)   11,544 (1419 MB/s)           -12,457  -51.90%   x 2.08 
+simd_uniform_single_i64x4   40,520 (808 MB/s)   23,478 (1395 MB/s)           -17,042  -42.06%   x 1.73 
+simd_uniform_single_i64x8   85,092 (770 MB/s)   48,127 (1361 MB/s)           -36,965  -43.44%   x 1.77 
+simd_uniform_single_i8x16   66,271 (247 MB/s)   54,895 (298 MB/s)            -11,376  -17.17%   x 1.21 
+simd_uniform_single_i8x2    23,645 (86 MB/s)    18,929 (108 MB/s)             -4,716  -19.95%   x 1.25 
-simd_uniform_single_i8x32   139,309 (235 MB/s)  348,528 (94 MB/s)            209,219  150.18%   x 0.40 
+simd_uniform_single_i8x4    28,646 (142 MB/s)   23,976 (170 MB/s)             -4,670  -16.30%   x 1.19 
-simd_uniform_single_i8x64   967,238 (67 MB/s)   986,099 (66 MB/s)             18,861    1.95%   x 0.98 
+simd_uniform_single_i8x8    29,513 (277 MB/s)   27,354 (299 MB/s)             -2,159   -7.32%   x 1.08 

Benchmarks are here: https://github.com/TheIronBorn/rand/blob/simd/benches/simd_uniform_int.rs. (The single-modulo and double-division methods are too slow.)

The bitmask method slows down when the range is far from the next power of two, as is happening with the simd_uniform_i16x* cases above.

Using a better range:

 name                 wide_mul ns/iter    bitmask_simple ns/iter  diff ns/iter   diff %  speedup 
+simd_uniform_i16x16  17,330 (1890 MB/s)  13,225 (2477 MB/s)            -4,105  -23.69%   x 1.31 
+simd_uniform_i16x2   7,075 (578 MB/s)    5,299 (772 MB/s)              -1,776  -25.10%   x 1.34 
+simd_uniform_i16x32  51,725 (1267 MB/s)  29,134 (2249 MB/s)           -22,591  -43.68%   x 1.78 
+simd_uniform_i16x4   6,961 (1176 MB/s)   5,676 (1443 MB/s)             -1,285  -18.46%   x 1.23 
-simd_uniform_i16x8   9,380 (1746 MB/s)   11,650 (1406 MB/s)             2,270   24.20%   x 0.81 

There is a more complex bitmask method which tries to use the remaining bits of the random integer upon rejection. If it could be optimized better, it may be a further improvement.

@TheIronBorn
Collaborator Author

The biased widening multiply method is much faster, so it might be worth providing an extra, faster version.
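For reference, the biased variant is just the widening multiply with no rejection step (scalar sketch; the bias is small when the range is much smaller than 2^32):

fn biased_sample(v: u32, range: u32) -> u32 {
    // the high half of the widening multiply maps v roughly uniformly into [0, range)
    (((v as u64) * (range as u64)) >> 32) as u32
}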

@dhardy
Member

dhardy commented Aug 1, 2018

@TheIronBorn let me know when you think this is ready for merging — I'm not sure if you're still tweaking.

@TheIronBorn
Collaborator Author

If we provide an API which allows let x: u32x4 = UniformInt::new(0u32, 100u32).sample(rng) then the small widening multiplication optimization could work.

@TheIronBorn
Collaborator Author

Still tweaking but I think either the widening multiply method or the bitmask method would work well.

The bitmask method seems best for sample_single. The approximated zone of the widening multiplication method might be slowing it down too much.

@TheIronBorn
Collaborator Author

I think this is fine to merge now; sample_single or other optimizations can be done later.

@TheIronBorn
Collaborator Author

When the range is flat (i.e. u32x4::splat) it seems wide_mul is best. I don't think there's much use case for a varying range, so we can probably just use wide_mul for sample and bitmask for sample_single.

If you think a varying range could be used often, do speak up.

@dhardy
Member

dhardy commented Aug 2, 2018

@TheIronBorn could you squash some of the commits together please (i.e. eliminate most of the fix/refactor commits)?

@TheIronBorn
Collaborator Author

Squashed

@dhardy dhardy merged commit fa269c3 into rust-random:master Aug 3, 2018