Improve slice::binary_search_by #127007

Closed
wants to merge 1 commit

Conversation

@krtab (Contributor) commented Jun 26, 2024

This PR aims to improve the performance of `std::slice::binary_search`.

EDIT: The proposed implementation changed, so the rest of this comment is outdated. See #127007 (comment) for an up-to-date presentation of the PR.

It reduces the total instruction count for the `u32` monomorphization but, perhaps more remarkably, removes 2 of the 12 instructions of the main loop (on x86).

It changes `test_binary_search_implementation_details()`, so it may warrant a crater run.

I will document it much more if this is shown to be interesting on benchmarks. Could we start with a timer run first?
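
For orientation, here is a minimal safe-Rust sketch of the loop shape suggested by the "After the PR" listing below (a reconstruction for illustration, not the PR's actual diff; the `binary_search_sketch` name and the `Ord`-based comparison are placeholders):

```rust
use core::cmp::Ordering::*;

// Illustrative reconstruction (not the PR's actual code) of a binary
// search whose `Less`/`Greater` arms the optimizer can lower to CMOVs.
fn binary_search_sketch<T: Ord>(s: &[T], target: &T) -> Result<usize, usize> {
    let mut base = 0usize;
    let mut size = s.len();
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        // In bounds: `base + size <= s.len()` holds on every iteration.
        match s[mid].cmp(target) {
            Less => {
                // Keep the upper half, including `mid` itself; the
                // epilogue below resolves the final candidate.
                base = mid;
                size -= half;
            }
            Greater => size = half,
            Equal => return Ok(mid),
        }
    }
    if size == 1 {
        match s[base].cmp(target) {
            Equal => Ok(base),
            Less => Err(base + 1),
            Greater => Err(base),
        }
    } else {
        Err(base) // empty slice
    }
}
```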

Before the PR

```asm
        mov     eax, 1
        test    rsi, rsi
        je      .LBB0_1
        mov     rcx, rdx
        mov     rdx, rsi
        mov     ecx, dword ptr [rcx]
        xor     esi, esi
        mov     r8, rdx
.LBB0_3:
        shr     rdx
        add     rdx, rsi
        mov     r9d, dword ptr [rdi + 4*rdx]
        cmp     r9d, ecx
        je      .LBB0_4
        lea     r10, [rdx + 1]
        cmp     r9d, ecx
        cmova   r8, rdx
        cmovb   rsi, r10
        mov     rdx, r8
        sub     rdx, rsi
        ja      .LBB0_3
        mov     rdx, rsi
        ret
.LBB0_1:
        xor     edx, edx
        ret
.LBB0_4:
        xor     eax, eax
        ret
```

After the PR

```asm
        mov     ecx, dword ptr [rdx]
        xor     eax, eax
        xor     edx, edx
.LBB1_1:
        cmp     rsi, 1
        jbe     .LBB1_2
        mov     r9, rsi
        shr     r9
        lea     r8, [r9 + rdx]
        sub     rsi, r9
        cmp     dword ptr [rdi + 4*r8], ecx
        cmovb   rdx, r8
        cmova   rsi, r9
        jne     .LBB1_1
        mov     rdx, r8
        ret
.LBB1_2:
        test    rsi, rsi
        je      .LBB1_3
        xor     eax, eax
        cmp     dword ptr [rdi + 4*rdx], ecx
        setne   al
        adc     rdx, 0
        ret
.LBB1_3:
        mov     eax, 1
        ret
```

@rustbot (Collaborator) commented Jun 26, 2024

r? @m-ou-se

rustbot has assigned @m-ou-se.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 26, 2024
@krtab krtab force-pushed the improv_binary_search branch 2 times, most recently from 7aa251c to e4a10ae on June 26, 2024 21:34
@krtab krtab marked this pull request as draft June 26, 2024 21:48

@krtab (Contributor, Author) commented Jun 26, 2024

I've run benches with `./x bench core --test-args binary_search` and so far this PR is unfortunately not an improvement. This is likely because the added number of iterations is not compensated by the reduced number of instructions per iteration.

@krtab krtab force-pushed the improv_binary_search branch 2 times, most recently from 27237e5 to a545bb3 on June 26, 2024 23:40

@krtab (Contributor, Author) commented Jun 27, 2024

A different strategy bore fruit! It relies on the fact that, except for ZSTs, `slice.len()` is at most `isize::MAX`, so `(right + left) / 2` cannot overflow and is a sound way to compute the midpoint. By treating the ZST case separately, we can "revert" 3eb5bee and win some instructions.
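
As a sketch of what this strategy looks like (an assumed shape inferred from the description and the assembly below, not the exact diff; the ZST special case is elided):

```rust
use core::cmp::Ordering::{self, *};

// Assumed shape of the classic midpoint loop described above. For non-ZST
// element types, `s.len() <= isize::MAX as usize`, so `left + right`
// cannot overflow `usize` and `(left + right) / 2` is a valid midpoint.
fn binary_search_by_sketch<T, F>(s: &[T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&T) -> Ordering,
{
    let mut left = 0usize;
    let mut right = s.len();
    while left < right {
        let mid = (left + right) / 2;
        match f(&s[mid]) {
            Less => left = mid + 1,
            Greater => right = mid,
            Equal => return Ok(mid),
        }
    }
    Err(left)
}
```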

Special case: ZSTs whose comparison returns `Equal`

Assembly:

Before

```asm
        mov     rdx, rsi
        xor     eax, eax
        test    rsi, rsi
        sete    al
        shr     rdx
        ret
```

After

```asm
        xor     eax, eax
        test    rsi, rsi
        sete    al
        xor     edx, edx
        ret
```

usize on x86_64

Assembly

Before

```asm
example::custom_binary_search_gt::hb56715903c41dc94:
        mov     eax, 1
        test    rsi, rsi
        je      .LBB0_1
        mov     rcx, rdx
        mov     rdx, rsi
        mov     rcx, qword ptr [rcx]
        xor     esi, esi
        mov     r8, rdx
.LBB0_3:
        shr     rdx
        add     rdx, rsi
        mov     r9, qword ptr [rdi + 8*rdx]
        cmp     r9, rcx
        je      .LBB0_4
        lea     r10, [rdx + 1]
        cmp     r9, rcx
        cmova   r8, rdx
        cmovb   rsi, r10
        mov     rdx, r8
        sub     rdx, rsi
        ja      .LBB0_3
        mov     rdx, rsi
        ret
.LBB0_1:
        xor     edx, edx
        ret
.LBB0_4:
        xor     eax, eax
        ret
```

After

```asm
example::custom_binary_search_1::hc0743890bf3b6f85:
        mov     rcx, qword ptr [rdx]
        xor     eax, eax
        xor     edx, edx
.LBB1_1:
        cmp     rsi, rdx
        jbe     .LBB1_2
        lea     r8, [rsi + rdx]
        shr     r8
        lea     r9, [r8 + 1]
        cmp     qword ptr [rdi + 8*r8], rcx
        cmovb   rdx, r9
        cmova   rsi, r8
        jne     .LBB1_1
        mov     rdx, r8
        ret
.LBB1_2:
        mov     eax, 1
        ret
```

Benches

Before

```console
$ ./x bench core --test-args binary_search
benchmarks:
    slice::binary_search_l1            20.69/iter +/- 0.45
    slice::binary_search_l1_with_dups  12.17/iter +/- 0.06
    slice::binary_search_l1_worst_case  9.21/iter +/- 0.03
    slice::binary_search_l2            31.81/iter +/- 0.25
    slice::binary_search_l2_with_dups  21.05/iter +/- 0.29
    slice::binary_search_l2_worst_case 15.46/iter +/- 0.32
    slice::binary_search_l3            75.05/iter +/- 0.95
    slice::binary_search_l3_with_dups  53.66/iter +/- 0.89
    slice::binary_search_l3_worst_case 27.74/iter +/- 0.03
```

After

```console
$ ./x bench core --test-args binary_search
benchmarks:
    slice::binary_search_l1            20.19/iter +/- 3.60
    slice::binary_search_l1_with_dups  14.24/iter +/- 0.57
    slice::binary_search_l1_worst_case  8.77/iter +/- 3.00
    slice::binary_search_l2            29.81/iter +/- 0.64
    slice::binary_search_l2_with_dups  19.45/iter +/- 1.32
    slice::binary_search_l2_worst_case 14.08/iter +/- 0.65
    slice::binary_search_l3            70.80/iter +/- 6.43
    slice::binary_search_l3_with_dups  52.35/iter +/- 2.77
    slice::binary_search_l3_worst_case 23.62/iter +/- 2.73
```

@lqd (Member) commented Jun 29, 2024

> Could we start with a timer run first?

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 29, 2024
@bors (Contributor) commented Jun 29, 2024

⌛ Trying commit 2f5eec9 with merge 3dc1b1e...

bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 29, 2024
Improve slice::binary_search_by

@bors (Contributor) commented Jun 29, 2024

☀️ Try build successful - checks-actions
Build commit: 3dc1b1e (3dc1b1e098adb3f6866155993a91f1cad3513a9c)


@rust-timer (Collaborator)

Finished benchmarking commit (3dc1b1e): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.5% | [-0.5%, -0.5%] | 1 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

Results (primary 0.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 2.5% | [2.5%, 2.5%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -2.5% | [-2.5%, -2.5%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [-2.5%, 2.5%] | 2 |

Cycles

Results (primary -2.6%, secondary 2.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.1% | [2.1%, 2.1%] | 1 |
| Improvements ✅ (primary) | -2.6% | [-2.6%, -2.6%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -2.6% | [-2.6%, -2.6%] | 1 |

Binary size

Results (primary -0.0%, secondary -0.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
|:---|:---:|:---:|:---:|
| Regressions ❌ (primary) | 0.1% | [0.0%, 0.2%] | 4 |
| Regressions ❌ (secondary) | 0.5% | [0.5%, 0.5%] | 1 |
| Improvements ✅ (primary) | -0.1% | [-0.2%, -0.1%] | 6 |
| Improvements ✅ (secondary) | -0.1% | [-0.2%, -0.0%] | 41 |
| All ❌✅ (primary) | -0.0% | [-0.2%, 0.2%] | 10 |

Bootstrap: 694.857s -> 695.931s (0.15%)
Artifact size: 324.45 MiB -> 324.42 MiB (-0.01%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 29, 2024
@Mili-ssm commented Jun 29, 2024

Hi, I discovered this PR this morning; yesterday I wrote a post on Zulip with some ideas on how to improve the std binary search. In case it is useful here, this is the Zulip post: Proposal for Binary Search Algorithm Enhancing.

I am not sure whether it is desirable, better, or worse, but since the algorithm is under review, maybe it is helpful.

@krtab krtab marked this pull request as ready for review July 1, 2024 11:56
Except for ZSTs, `[T].len()` is less than `isize::MAX`.
By treating the ZST case separately, we can "revert"
3eb5bee, remove the `size` local variable
and win two instructions.
@krtab (Contributor, Author) commented Jul 1, 2024

All benchmarks are good and I added comments, so this is ready for review.

Comment on lines +2820 to +2822
```rust
// SAFETY: We have that left < right, so
// 0 <= left <= mid < right <= self.len()
// and the indexing is in-bounds.
```

Within the safety comment, there could be an explicit mention of where the mid relations come from.
As a reader, it's easier to evaluate the logic in the comment without needing to consult the line of code above at the same time.

Suggested change:

```diff
-// SAFETY: We have that left < right, so
-// 0 <= left <= mid < right <= self.len()
-// and the indexing is in-bounds.
+// SAFETY: The while condition ensures left < right,
+// so `left <= (left + right) / 2 < right`. Thus,
+// 0 <= left <= mid < right <= self.len()
+// and the indexing is in-bounds.
```

@Miksel12 commented Jul 1, 2024

This binary search implementation by @orlp might also be interesting to look at: https://orlp.net/blog/bitwise-binary-search/

@orlp (Contributor) commented Jul 1, 2024

@Miksel12 That article is mostly for educational purposes. I believe this to be the fastest pattern possible: https://mhdm.dev/posts/sb_lower_bound/#update (if you can get the compiler to generate cmovs).
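
For reference, a Rust paraphrase of that branchless lower-bound pattern (my rendition of the idea behind the linked post, not its exact code; the `lower_bound` name is a placeholder):

```rust
// Paraphrase of a branchless lower bound: each iteration halves the
// window and updates `base` with a simple conditional add that
// compilers can lower to a CMOV.
fn lower_bound<T: Ord>(s: &[T], target: &T) -> usize {
    let mut base = 0usize;
    let mut len = s.len();
    while len > 0 {
        let half = len / 2;
        // Skip the first `len - half` elements when the probe is too small.
        if s[base + half] < *target {
            base += len - half;
        }
        len = half;
    }
    base // index of the first element >= target
}
```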

@krtab (Contributor, Author) commented Jul 2, 2024

> @Miksel12 That article is mostly for educational purposes. I believe this to be the fastest pattern possible: https://mhdm.dev/posts/sb_lower_bound/#update (if you can get the compiler to generate cmovs).

The massive difference is that this version does not have an early return for the equality case. Removing this early return from our implementation would also bring the instruction count down, but I don't think that is what we want.

@orlp (Contributor) commented Jul 2, 2024

@krtab I don't think the early return is valuable. It only comes into play if you're exceptionally lucky, and then it only saves O(log n) iterations in the best case. It adds an extra branch to every iteration every time you're not exceptionally lucky.

To be more precise, the expected number of comparisons you save on random data is 1/2 · (0 + 1/2 · (1 + 1/2 · (2 + …))) = 0/2 + 1/4 + 2/8 + 3/16 + … = 1. That's right: on average, the early return saves only a single comparison on random data.
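
Spelling that sum out (my reconstruction of the model: an early hit that skips the remaining k iterations occurs with probability 2^-(k+1) on random data):

```latex
% Expected iterations saved by the early return on random data:
% a hit saves k remaining iterations with probability 2^{-(k+1)}.
\[
  \mathbb{E}[\text{saved}]
  = \sum_{k \ge 0} \frac{k}{2^{k+1}}
  = \frac{1}{2} \sum_{k \ge 1} \frac{k}{2^{k}}
  = \frac{1}{2} \cdot 2
  = 1.
\]
```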

For what it's worth, in my experience you almost always want partition_point and not binary_search.
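
As a quick illustration of that advice, `slice::partition_point` is a stable std method that expresses lower and upper bounds directly (the example values are mine):

```rust
fn main() {
    let v = [1, 2, 4, 4, 7];
    // `partition_point` returns the index of the first element for which
    // the predicate is false, i.e. the insertion point that keeps the
    // slice sorted.
    assert_eq!(v.partition_point(|&x| x < 4), 2); // lower bound of 4
    assert_eq!(v.partition_point(|&x| x <= 4), 4); // upper bound of 4
}
```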

@Amanieu (Member) commented Jul 26, 2024

In #128254 I actually revert to the old algorithm, which turns out to be much faster once you ensure that LLVM doesn't turn the CMOVs into branches.

@bors (Contributor) commented Aug 2, 2024

☔ The latest upstream changes (presumably #128254) made this pull request unmergeable. Please resolve the merge conflicts.

@Amanieu (Member) commented Sep 18, 2024

Superseded by #128254. However, I still want to thank @krtab for your effort in looking into this.

@Amanieu Amanieu closed this Sep 18, 2024
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.