Begin notes on implementing Select
mlochbaum committed Jul 14, 2024
1 parent c341fe6 commit a3ca1a3
Showing 6 changed files with 120 additions and 2 deletions.
1 change: 1 addition & 0 deletions docs/implementation/primitive/index.html
@@ -13,6 +13,7 @@ <h1 id="primitive-implementation-notes"><a class="header" href="#primitive-imple
<li><a href="sort.html">Sorting</a></li>
<li><a href="search.html">Searching</a></li>
<li><a href="flagsort.html">Sortedness flags</a></li>
<li><a href="select.html">Select</a></li>
<li><a href="transpose.html">Transpose</a></li>
<li><a href="take.html">Take and Drop</a></li>
<li><a href="random.html">Randomness</a></li>
2 changes: 1 addition & 1 deletion docs/implementation/primitive/search.html
@@ -19,7 +19,7 @@ <h2 id="lookup-tables"><a class="header" href="#lookup-tables">Lookup tables</a>
<p>For the purposes of these notes, a lookup table is storage, indexed by some key, that contains at most one entry per key. This means reading the value for a given key is a simple load—differing from a hash table, which might have collisions where multiple keys indicate the same entry. Lookup table operations are very fast, but only if the table remains in cache. So they're useful when the number of possible values (that is, size of the table) is small: a 1-byte or 2-byte type, or small-range integers. You might expect the entire table has to be initialized, but it doesn't always: see <a href="#sparse-and-reverse-lookups">sparse lookups</a>.</p>
<p>For example, a lookup table algorithm for dyadic <code><span class='Function'></span></code> might traverse <code><span class='Value'>𝕨</span></code>, writing each value's index to the table. Doing this step in reverse index order makes sure the lowest index &quot;wins&quot;. Similarly, empty entries must be initialized to <code><span class='Function'></span><span class='Value'>𝕨</span></code> beforehand. Then the result is <code><span class='Value'>𝕩</span><span class='Function'></span><span class='Value'>t</span></code> where <code><span class='Value'>t</span></code> is the table constructed this way. A nonzero minimum value can be handled for free by subtracting it from the table pointer.</p>
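A scalar sketch of this table construction may help make it concrete. The function name and the table-size parameter <code>m</code> are illustrative, not from any real implementation:

```python
def index_of(w, x, m):
    """Lookup-table index-of (dyadic BQN search) for nonnegative
    integers below m. Scalar sketch of the scheme described above."""
    n = len(w)
    table = [n] * m                  # "not found" entries initialized to the length of w
    for i in reversed(range(n)):     # reverse order so the lowest index wins
        table[w[i]] = i
    return [table[v] for v in x]     # each lookup is a plain load
```

For example, `index_of([3,1,4,1], [1,5,3], 8)` gives `[1, 4, 0]`: the duplicate value 1 resolves to index 1 because the reverse traversal wrote index 3 first and then overwrote it with index 1.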
<p>Set operations can be handled with a packed bit table, but reading a bit is slower so this should be done only if the space savings are really needed. With sparse lookups this seems to be very rare.</p>
<p>A 1-byte lookup can be packed into vector registers for extra-fast searching. To look up a byte, select the appropriate byte from the table with the top 5 bits, and a mask from another table with the bottom 3. Put these together and pack into bits with compare-movemask.</p>
<p>A 1-bit lookup can be packed into vector registers for extra-fast searching, as described in <a href="select.html#small-range-selection">small-range selection</a>.</p>
<h2 id="hash-tables"><a class="header" href="#hash-tables">Hash tables</a></h2>
<p>A hash table is a more sophisticated design where there are more possible keys than table entries. For good performance it depends on not having too many <em>actual</em> keys packed into a small space, which is why this method is named after the hash function. If the data is expected to be random then no hash function is needed (the identity function can be used), but that's no good for search functions. Hash tables generally degrade to the performance of a linear lookup if the hash is defeated, so it's ideal to have a way to escape and use a sorting-based method if too many hashes collide.</p>
<p>Hashing is really the only way to get a performant lookup on arbitrary data. For 2-byte and small-range data, lookups are better, and lookup with <a href="#partitioning">partitioning</a> is better for 4-byte arguments that outgrow the cache (&gt;1e5 elements to be hashed or so).</p>
50 changes: 50 additions & 0 deletions docs/implementation/primitive/select.html
@@ -0,0 +1,50 @@
<head>
<link href="../../favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="../../style.css" rel="stylesheet"/>
<title>BQN: Implementation of Select</title>
</head>
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
<h1 id="implementation-of-select"><a class="header" href="#implementation-of-select">Implementation of Select</a></h1>
<p><a href="../../doc/select.html">Select</a> is just a CPU load, right? Well, there is a bounds check first, but besides that, memory operations are slow. Yes, even if you use SIMD gathers—these get rid of some minor instruction dispatching overhead but they're still executed as a bunch of separate memory requests. This means that, while there's no replacement for select in the general case, there are many sub-cases that can be handled faster in other ways.</p>
<p>The two important kinds of instructions for selection are the aforementioned <em>gather</em> instructions, which select multiple values from memory given a vector of indices, and <em>shuffle</em> instructions, which implement selection on a vector of indices and a vector of values. Shuffles don't touch memory and are many times faster.</p>
<h2 id="bounds-checking"><a class="header" href="#bounds-checking">Bounds checking</a></h2>
<p>Before doing any indexing you need to be sure that the indices are valid. And in a language like BQN that supports negative indices, they need to be wrapped around or otherwise accounted for. The obvious way to do this is to check before each load instruction, which is fairly expensive with scalar loads but much better with SIMD checking/wrapping and gathers.</p>
<p>An alternative approach is to do vectorized checking in blocks on <code><span class='Value'>𝕨</span></code> before selecting with that block. In this case, instead of simply testing whether all the indices fit, it's best to compute the minimum and maximum of each block. At one min and one max instruction per vector of indices this is probably faster than comparison in most cases, and it allows for additional optimizations:</p>
<ul>
<li>If the minimum is at least 0, there are no negative indices, so wrapping isn't needed. And if the maximum is less than 0, there are no <em>non</em>-negative indices, and a global offset can be used instead of wrapping.</li>
<li>If the minimum and maximum are equal, all indices are the same, so that cell of <code><span class='Value'>𝕩</span></code> can be copied without checking <code><span class='Value'>𝕨</span></code> again.</li>
<li>More generally, if the minimum and maximum are close, an appropriate small-range method can be used.</li>
</ul>
<p>After the range is found, another pass can wrap indices if needed by computing <code><span class='Value'>𝕨</span><span class='Function'>+</span><span class='Value'>n</span><span class='Function'>×</span><span class='Value'>𝕨</span><span class='Function'>&lt;</span><span class='Number'>0</span></code> with <code><span class='Value'>n</span><span class='Gets'>←</span><span class='Function'>≠</span><span class='Value'>𝕩</span></code>, which vectorizes easily. Unfortunately, the range after wrapping is unknown: 0 and -1 map to 0 and n-1, which span the entire range and aren't ruled out by any range that requires wrapping. Mixed-sign indices should be fairly rare, so it's probably fine to ignore small-range optimization in this case, but there could be another range check built into the wrapping code. When the range is small but crosses 0, it's also possible to copy the end of <code><span class='Value'>𝕩</span></code> and then the beginning, and select from this with an offset but no wrapping.</p>
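A scalar model of this blocked check-then-select scheme might look as follows. The block size and function name are made up, and the real version would accumulate the minimum and maximum with SIMD rather than Python's <code>min</code>/<code>max</code>:

```python
def select_checked(w, x, block=1024):
    """Blocked bounds check and wrap for selection, as sketched above.
    Scalar stand-in for SIMD min/max accumulation per block."""
    n = len(x)
    out = []
    for b in range(0, len(w), block):
        blk = w[b:b+block]
        if not blk:
            continue
        lo, hi = min(blk), max(blk)      # one min and one max per "vector"
        if lo < -n or hi >= n:
            raise IndexError("index out of range")
        if lo == hi:                     # all indices equal: plain copy
            out += [x[lo]] * len(blk)
            continue
        if lo < 0 <= hi:                 # mixed signs: wrap with i + n*(i<0)
            blk = [i + n * (i < 0) for i in blk]
        elif hi < 0:                     # all negative: one global offset
            blk = [i + n for i in blk]
        out += [x[i] for i in blk]
    return out
```

The constant-block and all-negative branches correspond to the min/max special cases listed above; the mixed-sign branch is the wrapping pass whose output range is unknown.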
<p>Extracting the max and min from an accumulator vector and dispatching to a special case has a non-negligible constant cost, so range checking does need to have a block size of many vector registers. This means the instructions mostly won't slip into time spent waiting on memory as they would with interleaved checks. The loss is less bad for smaller index types as SIMD range checking is faster, and I found that with generic code relying on auto-vectorization, separating the range check was better for 1-byte indices, about the same for 2-byte, and slower for larger indices. With any form of SIMD selection, incorporating a range check and wrap with selection will be faster because it makes better use of available throughput, so choosing to range-test first sacrifices raw speed for better special-case handling.</p>
<h2 id="small-range-selection"><a class="header" href="#small-range-selection">Small-range selection</a></h2>
<p>When <code><span class='Value'>𝕩</span></code> is small, or when values from <code><span class='Value'>𝕨</span></code> fit into a small range and thus select from a small slice of <code><span class='Value'>𝕩</span></code>, selection with vector shuffles can be much faster than memory-based selection. The relevant instructions in x86 are the SSSE3 shuffle and its AVX2 extension, which work on 1-byte values, and AVX2 vpermd (intrinsic permutevar8x32) on 4-byte values. NEON is cleaner and has all sorts of table lookup instructions called vtbl.</p>
<p>Vector shuffles are often effective even when the selection doesn't fit into a single register (or lane). Multiple shuffles can be combined with a blend or some bit-bashing emulation. Another trick applies to data wider than indices such as selecting 2-byte values with 1-byte indices: the indices could be expanded by zipping <code><span class='Number'>2</span><span class='Function'>×</span><span class='Value'>i</span></code> with <code><span class='Number'>1</span><span class='Function'>+</span><span class='Number'>2</span><span class='Function'>×</span><span class='Value'>i</span></code>, but a better method might be to <em>unzip</em> the values <code><span class='Value'>𝕩</span></code> before starting, and do the selection by shuffling both halves and then zipping those together.</p>
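The unzip trick can be modeled in scalar code. This is only a sketch of the data movement: the real version shuffles the two byte planes in vector registers with the same 1-byte index vector, and the function name is hypothetical:

```python
def select_wide(idx, vals):
    """Select 16-bit values with 8-bit indices by unzipping the values
    into byte planes, shuffling each plane, and re-zipping the results."""
    lo = [v & 0xFF for v in vals]    # unzip: plane of low bytes
    hi = [v >> 8 for v in vals]      # unzip: plane of high bytes
    # Two byte "shuffles" with the same indices, zipped back together:
    return [lo[i] | (hi[i] << 8) for i in idx]
```

Unzipping is done once per argument, so its cost is amortized when the same values are selected from repeatedly, whereas expanding the indices costs extra work per index.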
<p>The 1-byte shuffle can be used to do a 1-<em>bit</em> lookup from a table of up to 256 bits packed into vector registers. To look up a bit, select the appropriate byte from the table with the top 5 bits—if only 4 are used, a single shuffle instruction is enough, and for all 5 a blend of two shuffles is needed. Then use the bottom 3 bits to get a mask from another table where entry <code><span class='Value'>i</span></code> is <code><span class='Number'>1</span><span class='Function'>&lt;&lt;</span><span class='Value'>i</span></code>. AND the mask with the byte to select the right bit, then pack into bits with a greater-than-0 comparison and movemask.</p>
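Scalar arithmetic makes the bit-addressing in this scheme explicit (a sketch of the addressing only; the byte select and masking are shuffles and a vector AND in the real version):

```python
def bit_lookup(table_bytes, keys):
    """1-bit lookup from a 256-bit table packed as 32 bytes: the top
    5 bits of a key pick a byte, the bottom 3 pick a bit within it."""
    return [(table_bytes[k >> 3] >> (k & 7)) & 1 for k in keys]
```

For instance, with a table whose only set bit is bit 2 of byte 1 (covering key 8·1+2 = 10), key 10 looks up 1 and its neighbors look up 0.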
<p>SIMD selection can be relevant to <a href="search.html#lookup-tables">lookup tables</a> for search functions, particularly bit selection for Member of.</p>
<h2 id="large-range-selection"><a class="header" href="#large-range-selection">Large-range selection</a></h2>
<p>When the selected elements don't fit in registers (and some other special case like sortedness doesn't apply…), you'll need a memory access per element. With AVX2 or ARM SVE there's a modest improvement for using gather instructions, except with Intel CPUs prior to Skylake (2015) where that's actually slower. If the supported types are wider than the ones you want to use, you can widen the indices, or select a larger element type and then narrow the elements.</p>
<p>For larger <code><span class='Value'>𝕩</span></code> arrays, random accesses can run out of cache space and become very expensive. As with most cache-incoherent operations, <a href="search.html#partitioning">radix partitioning</a> can be used to improve cache utilization. Partially sort the indices, perform selection, undo the sorting. However, the decision of whether to use it is much more difficult than for hash-based searches or shuffling, where accesses are random by design. Selection indices are quite likely to have some sort of locality that allows cache lines to be reused naturally. If this is the case, ordinary selection will be pretty fast, and partitioning is a lot of overhead to add to it. The only way I can think of to really test whether partitioning is needed is to sample some indices and find the number of unique cache lines represented. Figuring out how to choose the sample and interpret the statistics is tricky, and varying cache sizes and other effects will mean that you'll always get some edge cases wrong. But the downside of running a cache-incoherent selection naively is quite bad, so it's worth a try?</p>
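A minimal model of the partition-select-restore idea, assuming a simple equal-width bucketing (a real implementation would radix-partition the indices in place and pick the bucket count from cache sizes):

```python
def select_partitioned(w, x, buckets=4):
    """Cache-friendly selection sketch: partition the indices by the
    region of x they touch, gather within each partition, then scatter
    results back to the original order."""
    n = len(x)
    step = (n + buckets - 1) // buckets
    # Stable partition of index positions by bucket (the "partial sort"):
    order = sorted(range(len(w)), key=lambda j: w[j] // step)
    out = [None] * len(w)
    for j in order:          # gathers now touch x one region at a time
        out[j] = x[w[j]]
    return out
```

The result is identical to plain selection; only the memory access order changes, which is why deciding whether the reordering pays for itself is the hard part.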
<h2 id="sorted-indices"><a class="header" href="#sorted-indices">Sorted indices</a></h2>
<p>When the indices <code><span class='Value'>𝕨</span></code> in <code><span class='Value'>𝕨</span><span class='Function'></span><span class='Value'>𝕩</span></code> are sorted, slices from <code><span class='Value'>𝕨</span></code> often fit into a small range. A particular possibility of interest is <code><span class='Paren'>(</span><span class='Function'>+</span><span class='Modifier'>´</span><span class='Value'>bool</span><span class='Paren'>)</span><span class='Function'></span><span class='Value'>𝕩</span></code>, where any slice of <code><span class='Value'>𝕨</span></code> selects from an equal or smaller (fewer elements) slice of <code><span class='Value'>𝕩</span></code>.</p>
<p>Taking any vector register out of <code><span class='Value'>𝕨</span></code>, several cases might apply:</p>
<ul>
<li>The indices are sparse, so the first one is far from the others. Scalar selection is needed.</li>
<li>Only the first few indices fall within a vector of <code><span class='Value'>𝕩</span></code>. An option is to use a shuffle for these indices, then start over with a vector beginning at the one after.</li>
<li>All indices fit into a vector of <code><span class='Value'>𝕩</span></code>. They can be handled with a shuffle, and possibly the <code><span class='Value'>𝕩</span></code> values could be retained for the next iteration.</li>
<li>All indices are the same. Copying the selected element of <code><span class='Value'>𝕩</span></code> is still a shuffle, but maybe the same vector works for more indices too? Hitting the speed of memset if there are many equal indices would be nice.</li>
</ul>
<p>In the second case (partial shuffle), the number of handled indices can be found by comparing the index vector to the maximum allowed index (for example, first index plus 15), then using count-trailing-zeros on the resulting mask.</p>
<p>Choosing between these cases at every step would have high branching costs if the indices aren't very predictable. One way to keep branching costs down is to force each method to do some minimum amount of work if chosen. A hybrid of sparse and dense selection might search for places where dense selection applies by comparing a vector of indices, plus 15, to the vector shifted by 4. It can do sparse selection up to this point; when it hits the dense selection, getting to handle 4 indices quickly would outweigh the cost of a branch (if the number used is tuned properly and not something I made up).</p>
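The dense-window test for the partial-shuffle case can be sketched as follows, assuming 16-element vectors (the name and loop are illustrative; the real version is a vector compare against first-index-plus-15 followed by movemask and count-trailing-zeros):

```python
def count_dense(idx, pos, lanes=16):
    """Count how many sorted indices starting at pos fall within one
    16-element window of x, i.e. are at most idx[pos] + 15."""
    limit = idx[pos] + lanes - 1
    k = 0
    while pos + k < len(idx) and idx[pos + k] <= limit:
        k += 1
    return k   # at least 1; these can be handled by a single shuffle
```

On `[0, 3, 15, 16, 40]` starting at position 0, the window is indices 0–15, so the first three indices are handled by one shuffle and the next iteration starts at 16.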
<p>A method with less branching is to take statistics in blocks. These might be used either as a hint, choosing between strategies that are fully general but adapted to different cases, or a proof, enabling a strategy that has some requirement. Some strategies that might be chosen this way are:</p>
<ul>
<li>Scalar selection.</li>
<li>Take a vector of indices, do as many as possible (it's at least 1), repeat until finished.</li>
<li>A hybrid dense/sparse method like the one just discussed.</li>
<li>Full vector selection, requiring indices 16 apart to differ by ≤16.</li>
<li>memset, requiring indices to all be the same.</li>
</ul>
<h2 id="select-cells"><a class="header" href="#select-cells">Select-cells</a></h2>
<p>Applying the same selection to each cell of an array, that is, <code><span class='Value'>inds</span><span class='Modifier2'></span><span class='Function'></span><span class='Modifier'>˘</span> <span class='Value'>arr</span></code>, can be optimized similarly to small-range selection and is relevant to other array manipulation such as multidimensional Take, Drop, Replicate, and Rotate. For example <code><span class='Value'>r</span><span class='Modifier2'></span><span class='Function'>/</span><span class='Modifier'>˘</span> <span class='Value'>𝕩</span></code> may be best implemented as <code><span class='Paren'>(</span><span class='Function'>/</span><span class='Value'>r</span><span class='Paren'>)</span><span class='Modifier2'></span><span class='Function'></span><span class='Modifier'>˘</span> <span class='Value'>𝕩</span></code>. It can also be applied to operations like <code><span class='Function'></span><span class='Modifier2'></span><span class='Number'>2</span></code> on multiple trailing axes, by treating them as a single axis and constructing indices appropriately.</p>
<p>The basic idea is to optimize if each argument and result &quot;row&quot; fits in a vector, and apply shuffles to each row, over-reading argument and overlapping result rows as necessary. If multiple rows fit in a vector, the indices can be extended accordingly, adding another copy plus the cell size of <code><span class='Value'>𝕩</span></code>, another plus twice the cell size, and so on. This requires a masked write at the end, which is an improvement as the single-row strategy might require multiple masked writes. There's no obvious way to extend the selection size to a full vector, as (assuming argument and result cell sizes are the same for simplicity) a given aligned vector in the result may take values not just from the corresponding argument vector, but from adjacent ones as well.</p>
<p>Multi-vector selection can also be used of course, increasing the argument row size that can be handled. Because <code><span class='Value'>𝕨</span></code> will be reused, it can be preprocessed to make this more effective. If the selection instruction returns 0 for some indices (x86 shuffles do this if the top bit is set), <code><span class='Value'>𝕨</span></code> can be split so that the selection step looks like the bitwise or of multiple selections, <code><span class='Value'>sel</span><span class='Paren'>(</span><span class='Value'>a</span><span class='Separator'>,</span><span class='Value'>x</span><span class='Paren'>)</span><span class='Function'>|</span><span class='Value'>sel</span><span class='Paren'>(</span><span class='Value'>b</span><span class='Separator'>,</span><span class='Value'>x</span><span class='Paren'>)</span><span class='Function'>|</span><span class='Value'></span></code>. There's an index vector for each index range <code><span class='Paren'>(</span><span class='Value'>k</span><span class='Function'>×</span><span class='Number'>16</span><span class='Paren'>)</span> <span class='Function'></span> <span class='Value'>i</span> <span class='Function'>&lt;</span> <span class='Paren'>(</span><span class='Value'>k</span><span class='Function'>+</span><span class='Number'>1</span><span class='Paren'>)</span><span class='Function'>×</span><span class='Number'>16</span></code> which is set to <code><span class='Value'>i</span><span class='Function'>-</span><span class='Value'>k</span><span class='Function'>×</span><span class='Number'>16</span></code> for indices <code><span class='Value'>i</span></code> that fall in the range and 128 for other indices. If zeroing selection isn't available, other possibilities with blend or xor aren't too much worse. Increasing the result row size is much simpler, just do multiple selections on the same set of argument vectors. 
For <code><span class='Value'>x</span></code> input vectors and <code><span class='Value'>y</span></code> results you need <code><span class='Value'>x</span><span class='Function'>×</span><span class='Value'>y</span></code> index vectors, so vector selection quickly becomes impractical as sizes increase.</p>
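For reference, the semantics that all of these row-shuffle strategies implement are simply (scalar sketch, assuming a list-of-rows representation):

```python
def select_cells(inds, arr):
    """Apply one index vector to every row of arr: the scalar meaning
    of select-under-cells that the vector strategies above optimize."""
    return [[row[i] for i in inds] for row in arr]
```

Because `inds` is fixed across rows, any preprocessing of the index vectors (extension to multiple rows, splitting by index range) is paid for once and reused for the whole array.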
1 change: 1 addition & 0 deletions implementation/primitive/README.md
@@ -10,6 +10,7 @@ Commentary on the best methods I know for implementing various primitives. Often
- [Sorting](sort.md)
- [Searching](search.md)
- [Sortedness flags](flagsort.md)
- [Select](select.md)
- [Transpose](transpose.md)
- [Take and Drop](take.md)
- [Randomness](random.md)