Async performance regression #70488

goffrie · 2020-03-28T01:14:33Z

See #69033 (comment); @kpp reports a 200%+ regression on some futures microbenchmarks, e.g.:

future::ready/async_combinators                                                                             
                        time:   [1.5699 ns 1.5743 ns 1.5793 ns]
                        change: [+279.32% +282.34% +285.20%] (p = 0.00 < 0.05)
                        Performance has regressed.

The benchmarks are at https://github.com/kpp/futures-async-combinators/.

I bisected this to 57e1da5, which most likely implicates #69837.

The text was updated successfully, but these errors were encountered:

jonas-schievink · 2020-03-28T06:19:41Z

Can reproduce. The regression comes from changing the discriminant type from u32 to u8 (or whatever fits best, which almost always is u8), exposing the niche does not regress performance much.

Isn't x86 generally less efficient when dealing with <32 bit words? In that case we might want to consider doing this optimization conditionally, since it can be a somewhat significant RAM saving on smaller microcontroller targets. I could imagine using -C opt-level=s to decide this, but I don't think there's any precedence for doing something like that at the moment.

Mark-Simulacrum · 2020-03-28T16:44:52Z

I think the slowdown is likely some sort of bug in our codegen or LLVM perhaps?

This is a single iteration on nightly (unrolled 8 times in the actual benchmark code):

    0.00 :   18a260:       mov    BYTE PTR [rsp+0x8],0x0
    0.80 :   18a265:       movzx  ecx,BYTE PTR [rsp]
    0.99 :   18a269:       movzx  ecx,BYTE PTR [rsp+0x1]
    1.09 :   18a26e:       movzx  ecx,BYTE PTR [rsp+0x2]
    0.87 :   18a273:       movzx  ecx,BYTE PTR [rsp+0x3]
    0.97 :   18a278:       movzx  ecx,BYTE PTR [rsp+0x4]
    0.83 :   18a27d:       movzx  ecx,BYTE PTR [rsp+0x5]
    1.32 :   18a282:       movzx  ecx,BYTE PTR [rsp+0x6]
    0.87 :   18a287:       movzx  ecx,BYTE PTR [rsp+0x7]
    0.84 :   18a28c:       movzx  ecx,BYTE PTR [rsp+0x9]
    0.86 :   18a291:       movzx  ecx,BYTE PTR [rsp+0xa]
    1.01 :   18a296:       movzx  ecx,BYTE PTR [rsp+0xb]
    1.04 :   18a29b:       movzx  ecx,BYTE PTR [rsp+0x8]

On the good commit, it looks like this (and is only unrolled 4 times):

   17.76 :   183f62:       mov    DWORD PTR [rsp],0x0
    0.00 :   183f69:       mov    edx,DWORD PTR [rsp+0x4]
    3.26 :   183f6d:       mov    edx,DWORD PTR [rsp+0x8]
    0.00 :   183f71:       mov    edx,DWORD PTR [rsp]

Trying to reduce this down, I came up with this example:

use std::future::Future;

pub async fn foo(v: i32) -> i32 { v }

macro_rules! wrap {
    ($v:expr) => {
        async move { ($v).await }
    }
}

pub fn bar(v: i32) -> impl Future<Output = i32> {
    wrap!(wrap!(wrap!(wrap!(wrap!(wrap!(foo(v)))))))
}

In godbolt, the nightly has an offset of 52 -- and that increases by 8 each additional wrap! -- whereas stable stays constant at 4, the size of i32 (presumably). I think that's the bug here, though I'm not sure how the discriminant size managed to affect that. Maybe previously we were getting split into two separate LLVM "words" via SROA or something and now we aren't since we fit into a single register or something like that? That could I guess plausibly lead to problems here; my assumption is that the benchmark is loading many more words from the stack on nightly because the "size" of the future has increased drastically? OTOH, size_of_val etc do not report a size increase -- so why the offset is increasing I'm not sure. But I suspect it's for the same reason we're seeing more stack loads.

Nightly:

example::foo:
        mov     eax, edi
        ret

example::bar:
        mov     rax, rdi
        mov     dword ptr [rdi], esi
        mov     byte ptr [rdi + 52], 0
        ret

and on stable:

example::foo:
        mov     eax, edi
        ret

example::bar:
        mov     rax, rdi
        mov     dword ptr [rdi], esi
        mov     dword ptr [rdi + 4], 0
        ret

jonas-schievink · 2020-03-28T16:53:16Z

Thanks, that's helpful!

jonas-schievink · 2020-03-28T16:55:24Z

The dword vs. byte moves can be explained by the alignment changing from 4 (due to using u32 as the discriminant) to 1 (since we now use u8 as the discriminant), as long as nothing with a larger alignment is contained.

jonas-schievink · 2020-03-28T17:19:44Z

In godbolt, the nightly has an offset of 52 -- and that increases by 8 each additional wrap! -- whereas stable stays constant at 4, the size of i32 (presumably).

Not entirely certain, but I think this is just automatic field reordering doing its thing. The discriminant is now a u8 with alignment 1, so it gets moved to the very end of the memory layout to not waste space with padding. Previously it was a u32 with alignment 4, just like all the other fields in the generator, so reordering anything there doesn't make sense.

If I take your example and replace i32 with bool, the size of foo goes from 8 to 2 bytes (and its alignment from 4 to 1), while bar goes from 56 to 14 bytes (its alignment also goes from 4 to 1).

So far this looks like a fairly expected outcome of my PR; it's pretty well-known that smaller alignment can generate less efficient code (OTOH apparently modern x86 chips can do unaligned moves just as fast as aligned ones, so maybe the byte-wise moves should not be emitted, but in the end this is completely up to LLVM). Since, AFAICT, it should only affect futures that do not store any data with alignment in them, the impact should be pretty minimal in practice, so we can also consider closing this without a fix.

(this does make me wish that we had better layout debugging capabilities in rustc though)

memoryruins · 2020-03-28T19:06:09Z

this does make me wish that we had better layout debugging capabilities in rustc

A #[rustc_layout(debug)] attribute was added in #69901 (edit: heh, just noticed you're the first comment on it ^^). With the help of type_alias_impl_trait, rustc output some layout information for bar's future: playground

jonas-schievink · 2020-03-28T19:11:58Z

Ah, nice to see that land, that's certainly a step in the right direction

nagisa · 2020-03-28T20:56:44Z

@jonas-schievink a quick test to see if this is indeed an alignment problem is to stick a #[repr(align(4))]-equivalent onto the generator.

jonas-schievink · 2020-03-28T21:43:49Z

That does not seem to do anything, hmm... Maybe I'm doing it wrong though. Would be useful to get an MCVE that exhibits the inefficient codegen.

jonas-schievink · 2020-03-28T21:45:50Z

I've just checked the benchmark code, and it does not actually benchmark anything except copying the futures around. They are never polled because they are never spawned onto an executor.

The benchmark results would make sense here if the alignment is really the cause.

kpp · 2020-03-28T22:11:54Z

How so?

    executor::block_on(async {
        let mut group = c.benchmark_group("future::ready");

        group.bench_function("futures", |b| {
            b.iter(move || async {
                black_box(futures::future::ready(42)).await
            })
        });

We got block_on and .await.

jonas-schievink · 2020-03-28T22:12:00Z

Enabling Criterion's real_blackbox feature significantly speeds up the future::ready benchmarks. The default implementation does more copying.

With the feature enabled the benchmarks now take <1ns, so yeah, they were really just benchmarking the memcpy in black_box.

jonas-schievink · 2020-03-28T22:14:38Z

@kpp block_on only blocks on the outer async block, which does not await anything. The futures created by the closure passed to iter are just dropped without being spawned or awaited (which is what is actually getting benchmarked here). You can test this by putting a panic!() inside the inner async blocks. It is never reached.

kpp · 2020-03-28T22:23:31Z

Ooops. Never felt more embarrassed.

kpp · 2020-03-28T22:38:26Z

Jonas, does this sounds better to you?

fn bench_ready(c: &mut Criterion) {
    let mut group = c.benchmark_group("future::ready");

    group.bench_function("futures", |b| {
        b.iter(|| {
            executor::block_on(async {
                black_box(futures::future::ready(42)).await;
            })
        })
    });

jonas-schievink · 2020-03-28T22:47:19Z

I'd wrap the block_on in a black_box to make sure the entire thing gets evaluated (and not just the expression producing the future)

kpp · 2020-03-28T22:51:35Z

black_box(futures::future::ready(42).await); ?

jonas-schievink · 2020-03-29T20:45:41Z

Reproduction:

use std::{mem, ptr};

use std::future::Future;

async fn ready<T>(t: T) -> T {
    t
}

fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}

#[inline(never)]
fn iter<O, R>(mut routine: R)
where
    R: FnMut() -> O,
{
    loop {
        black_box(routine());
    }
}

pub unsafe fn ready_bench() {
    iter(move || async {
        black_box(ready(42)).await
    });
}

On stable:

playground::iter:
	sub	rsp, 16

.LBB0_1:
	mov	dword ptr [rsp], 0
	mov	eax, dword ptr [rsp + 4]
	mov	eax, dword ptr [rsp + 8]
	mov	eax, dword ptr [rsp]
	jmp	.LBB0_1

playground::ready_bench:
	push	rax
	call	playground::iter
	ud2

On nightly:

playground::iter: # @playground::iter
# %bb.0:
	sub	rsp, 16

.LBB0_1:                                # =>This Inner Loop Header: Depth=1
	mov	byte ptr [rsp + 8], 0
	movzx	eax, byte ptr [rsp]
	movzx	eax, byte ptr [rsp + 1]
	movzx	eax, byte ptr [rsp + 2]
	movzx	eax, byte ptr [rsp + 3]
	movzx	eax, byte ptr [rsp + 4]
	movzx	eax, byte ptr [rsp + 5]
	movzx	eax, byte ptr [rsp + 6]
	movzx	eax, byte ptr [rsp + 7]
	movzx	eax, byte ptr [rsp + 9]
	movzx	eax, byte ptr [rsp + 10]
	movzx	eax, byte ptr [rsp + 11]
	movzx	eax, byte ptr [rsp + 8]
	jmp	.LBB0_1
                                        # -- End function

playground::ready_bench: # @playground::ready_bench
# %bb.0:
	push	rax
	call	playground::iter
	ud2
                                        # -- End function

goffrie · 2020-03-29T20:56:31Z

Hm, I guess this is not really a bug then.

…

On Sun., Mar. 29, 2020, 13:45 Jonas Schievink, ***@***.***> wrote: Reproduction <https://play.rust-lang.org/?version=nightly&mode=release&edition=2018&gist=fc0db5e71c27b16c010db95c9053b6ac> : use std::{mem, ptr}; use std::future::Future; async fn ready<T>(t: T) -> T { t } fn black_box<T>(dummy: T) -> T { unsafe { let ret = std::ptr::read_volatile(&dummy); std::mem::forget(dummy); ret } } #[inline(never)]fn iter<O, R>(mut routine: R)where R: FnMut() -> O, { loop { black_box(routine()); } } pub unsafe fn ready_bench() { iter(move || async { black_box(ready(42)).await }); } On stable: playground::iter: sub rsp, 16 .LBB0_1: mov dword ptr [rsp], 0 mov eax, dword ptr [rsp + 4] mov eax, dword ptr [rsp + 8] mov eax, dword ptr [rsp] jmp .LBB0_1 playground::ready_bench: push rax call playground::iter ud2 On nightly: playground::iter: # @playground::iter # %bb.0: sub rsp, 16 .LBB0_1: # =>This Inner Loop Header: Depth=1 mov byte ptr [rsp + 8], 0 movzx eax, byte ptr [rsp] movzx eax, byte ptr [rsp + 1] movzx eax, byte ptr [rsp + 2] movzx eax, byte ptr [rsp + 3] movzx eax, byte ptr [rsp + 4] movzx eax, byte ptr [rsp + 5] movzx eax, byte ptr [rsp + 6] movzx eax, byte ptr [rsp + 7] movzx eax, byte ptr [rsp + 9] movzx eax, byte ptr [rsp + 10] movzx eax, byte ptr [rsp + 11] movzx eax, byte ptr [rsp + 8] jmp .LBB0_1 # -- End function playground::ready_bench: # @playground::ready_bench # %bb.0: push rax call playground::iter ud2 # -- End function — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#70488 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJO2TUM5ALCGWNW2ABK3WTRJ6XQFANCNFSM4LVMOP5Q> .

jonas-schievink · 2020-03-29T21:49:36Z

The actual LLVM type did change to have an alignment of 1:

; stable
%"std::future::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [0 x i32], i32, [2 x i32] }

; nightly
%"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>" = type { [0 x i32], %"ready_bench::{{closure}}::{{closure}}", [0 x i32] }
%"ready_bench::{{closure}}::{{closure}}" = type { [8 x i8], i8, [3 x i8] }

One odd thing is that the volatile load in the LLVM IR is passed align 8, so I'm wondering why that isn't used to make the load more efficient here.

; playground::iter
; Function Attrs: noinline noreturn nounwind nonlazybind uwtable
define internal fastcc void @_ZN10playground4iter17h1130617e37f08519E() unnamed_addr #0 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %_3 = alloca %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", align 8
  %_3.0.sroa_cast = bitcast %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3 to i8*
  %_3.8.sroa_idx = getelementptr inbounds %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, i64 0, i32 1, i32 1
  br label %bb1

bb1:                                              ; preds = %bb1, %start
  call void @llvm.lifetime.start.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
  store i8 0, i8* %_3.8.sroa_idx, align 8, !alias.scope !2
  %_3.0. = load volatile %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>", %"core::future::from_generator::GenFuture<ready_bench::{{closure}}::{{closure}}>"* %_3, align 8, !alias.scope !8, !noalias !11
  call void @llvm.lifetime.end.p0i8(i64 12, i8* nonnull %_3.0.sroa_cast)
  br label %bb1
}

LifeIsStrange · 2020-03-30T00:11:17Z

@jonas-schievink

Isn't x86 generally less efficient when dealing with <32 bit words?

Typically, CPUs are fastest at operating on integers of their native integer size. On x86, 32-bit arithmetics are can be twice as fast compare to 16-bit operands.

Cf: http://www.idryman.org/blog/2012/11/21/integer-promotion/

In C this potential cause of performance regression cannot happen because of Integer promotion

If so, couldn't rust support integer promotion? At the very least as an optional flag in order to enforce performance garanties?
In the context of this issue, toggling on this flag would allow to test the hypothesis and instantaneously resolve or not the root cause of the performance regression.

jonas-schievink · 2020-03-30T23:03:23Z

The PR that caused this intentionally did the opposite of integer promotion (narrowing a u32 to a u8), just not in a way that's visible to users. Generally rustc/LLVM is free to do integer promotion or narrowing whenever it sees fit, as long as it doesn't change the observed behavior of the program.

I haven't completely nailed down this issue yet as I'm not an LLVM expert, but it seems like it's caused by LLVM using only the LLVM type for alignment decisions, not the explicit alignment of the load (Presumably this doesn't always happen for loads or it would have been caught earlier? Are only volatile loads affected?).

LifeIsStrange · 2020-03-30T23:09:30Z

I have no idea if this is really relevant to the conversation but LLVM is introducing a new alignment type that could solve some issues http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html

tmandry · 2020-03-31T17:27:34Z

We discussed this in the wg-async-foundations meeting and several people wanted to close as won't fix. Without evidence that this causes performance regressions in real-world code, it may not be worth pursuing further. We also don't think it's worth making generators bigger or changing their alignment to improve a microbenchmark.

That said, it does seem like LLVM could be generating better assembly, so pinging those folks for more input:

@rustbot ping llvm

rustbot · 2020-03-31T17:27:37Z

Hey LLVM ICE-breakers! This bug has been identified as a good
"LLVM ICE-breaking candidate". In case it's useful, here are some
instructions for tackling these sorts of bugs. Maybe take a look?
Thanks! <3

cc @comex @cuviper @DutchGhost @hanna-kruppe @hdhoang @heyrutvik @JOE1994 @jryans @mmilenko @nagisa @nikic @Noah-Kennedy @SiavoshZarrasvand @spastorino @vertexclique @vgxbj

comex · 2020-03-31T23:00:59Z

For the "reproduction", the issue isn't the alignment; it's the way the example reimplements black_box in terms of a volatile load of an entire struct, as an attempt to make it work on stable. The actual std::hint::black_box is not implemented this way; if you change the example to use it, it will only work on nightly, but it doesn't show the wasteful assembly.

Longer explanation:

LLVM implements volatile loads/stores of structs as volatile loads/stores of their fields; if the struct consists of N fields, it will always emit N separate load/store instructions. (Without volatile it will optimize them into something nicer.) Note that LLVM does this regardless of the size of struct, as long as the LLVM IR has a load of struct type (load volatile %AlignedBytes, %AlignedBytes* %src). But if the struct is small enough (say, 8 u8s), then rustc will instead produce LLVM IR containing a load of integer type (load volatile i64, i64* %1), so you get a single machine load instruction even though it's volatile. Interestingly, Clang does the same thing for C code. I wonder why.

Regardless, the result is unpredictable, nonintuitive, and undocumented, in both Rust and C. Luckily, nobody actually uses volatile loads/stores of entire structs in cases where it matters, like accessing hardware MMIO registers: the registers may or may not be represented in a struct, but the actual loads and stores are done to individual fields.

jonas-schievink · 2020-03-31T23:32:26Z

LLVM implements volatile loads/stores of structs as volatile loads/stores of their fields; if the struct consists of N fields, it will always emit N separate load/store instructions.

Ah, I see, that explains the emitted code, thanks! It also explains what was observed in #66595 I suppose.

Well, I don't think there's much of a point in leaving this issue open then, so closing.

MikailBag · 2020-04-01T12:28:07Z

To sum up.
Am I right than performance regression happens only when all conditions are met?

Custom black_box, based on volatile (instead of std::hint::black_box)
Future is created, but neither polled not awaited.
Future is tiny.

Centril added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. I-slow Issue: Problems and improvements with respect to performance of generated code. I-nominated A-async-await Area: Async & Await P-high High priority labels Mar 28, 2020

jonas-schievink self-assigned this Mar 28, 2020

jonas-schievink added the A-codegen Area: Code generation label Mar 28, 2020

jonas-schievink added the E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example label Mar 28, 2020

kpp mentioned this issue Mar 28, 2020

Fix benches: actually poll futures in benches kpp/futures-async-combinators#35

Merged

jonas-schievink added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Mar 30, 2020

rustbot added the ICEBreaker-LLVM Bugs identified for the LLVM ICE-breaker group label Mar 31, 2020

jonas-schievink removed their assignment Mar 31, 2020

tmandry added P-low Low priority and removed P-high High priority labels Mar 31, 2020

jonas-schievink closed this as completed Mar 31, 2020

comex mentioned this issue Apr 1, 2020

Rust compiler crashes during compilation #66595

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Async performance regression #70488

Async performance regression #70488

goffrie commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

Mark-Simulacrum commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

memoryruins commented Mar 28, 2020 •

edited

Loading

jonas-schievink commented Mar 28, 2020

nagisa commented Mar 28, 2020 •

edited

Loading

jonas-schievink commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

kpp commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

kpp commented Mar 28, 2020

kpp commented Mar 28, 2020 •

edited

Loading

jonas-schievink commented Mar 28, 2020

kpp commented Mar 28, 2020

jonas-schievink commented Mar 29, 2020

goffrie commented Mar 29, 2020 via email

jonas-schievink commented Mar 29, 2020

LifeIsStrange commented Mar 30, 2020 •

edited

Loading

jonas-schievink commented Mar 30, 2020

LifeIsStrange commented Mar 30, 2020

tmandry commented Mar 31, 2020

rustbot commented Mar 31, 2020

comex commented Mar 31, 2020

jonas-schievink commented Mar 31, 2020

MikailBag commented Apr 1, 2020

Async performance regression #70488

Async performance regression #70488

Comments

goffrie commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

Mark-Simulacrum commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

memoryruins commented Mar 28, 2020 • edited Loading

jonas-schievink commented Mar 28, 2020

nagisa commented Mar 28, 2020 • edited Loading

jonas-schievink commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

kpp commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

jonas-schievink commented Mar 28, 2020

kpp commented Mar 28, 2020

kpp commented Mar 28, 2020 • edited Loading

jonas-schievink commented Mar 28, 2020

kpp commented Mar 28, 2020

jonas-schievink commented Mar 29, 2020

goffrie commented Mar 29, 2020 via email

jonas-schievink commented Mar 29, 2020

LifeIsStrange commented Mar 30, 2020 • edited Loading

jonas-schievink commented Mar 30, 2020

LifeIsStrange commented Mar 30, 2020

tmandry commented Mar 31, 2020

rustbot commented Mar 31, 2020

comex commented Mar 31, 2020

jonas-schievink commented Mar 31, 2020

MikailBag commented Apr 1, 2020

memoryruins commented Mar 28, 2020 •

edited

Loading

nagisa commented Mar 28, 2020 •

edited

Loading

kpp commented Mar 28, 2020 •

edited

Loading

LifeIsStrange commented Mar 30, 2020 •

edited

Loading