
AVX512 code generated for i32 array sum is worse than code by clang 5 #48287

Closed
Djuffin opened this issue Feb 17, 2018 · 6 comments
Labels
A-SIMD: Area: SIMD (Single Instruction Multiple Data)
C-enhancement: Category: An issue proposing an enhancement or a PR with one.
I-slow: Issue: Problems and improvements with respect to performance of generated code.
T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments


Djuffin commented Feb 17, 2018

Demo: https://godbolt.org/g/vqB6oj

I tried this code:

pub struct v {
    val: [i32; 16],
}

pub fn test(a: v, b: v) -> v {
    let mut res = v { val: [0; 16] };

    for i in 0..16 {
        res.val[i] = a.val[i] + b.val[i];
    }
    return res;
}
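As an aside, the same loop can be written with iterators instead of indexing; this is a sketch of my own (the names `V` and `add` are not from the issue), and index-free iteration merely removes bounds checks, which is one common obstacle to vectorization, though the thread below points at missing copy elision as the actual culprit here:

```rust
// Sketch only: an iterator-based version of the reporter's loop.
// `zip` removes the explicit indexing (and its bounds checks), which
// can make a loop easier for LLVM to vectorize.
pub struct V {
    val: [i32; 16],
}

pub fn add(a: &V, b: &V) -> V {
    let mut res = V { val: [0; 16] };
    for (r, (x, y)) in res.val.iter_mut().zip(a.val.iter().zip(b.val.iter())) {
        *r = x + y;
    }
    res
}

fn main() {
    let a = V { val: [1; 16] };
    let b = V { val: [2; 16] };
    let res = add(&a, &b);
    // Every lane should hold 1 + 2 = 3.
    assert!(res.val.iter().all(|&v| v == 3));
    println!("ok");
}
```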

Compiled it with
rustc --crate-type=lib -C opt-level=3 -C target-cpu=skylake-avx512 --emit asm test.rs

I expected to see this happen:

  vmovdqu32 zmm0, zmmword ptr [rsp + 72]
  vpaddd zmm0, zmm0, zmmword ptr [rsp + 8]
  vmovdqu32 zmmword ptr [rdi], zmm0
  mov rax, rdi
  vzeroupper
  ret

Instead, this happened:

	movq	$0, 56(%rsp)
	vmovdqu	(%rdx), %ymm0
	vpaddd	(%rsi), %ymm0, %ymm0
	vmovdqu	%ymm0, (%rsp)
	movl	32(%rdx), %eax
	addl	32(%rsi), %eax
	movl	%eax, 32(%rsp)
	movl	36(%rdx), %eax
	addl	36(%rsi), %eax
	movl	%eax, 36(%rsp)
	movl	40(%rdx), %eax
	addl	40(%rsi), %eax
	movl	%eax, 40(%rsp)
	movl	44(%rdx), %eax
	addl	44(%rsi), %eax
	movl	%eax, 44(%rsp)
	movl	48(%rdx), %eax
	addl	48(%rsi), %eax
	movl	%eax, 48(%rsp)
	movl	52(%rdx), %eax
	addl	52(%rsi), %eax
	movl	%eax, 52(%rsp)
	movl	56(%rdx), %eax
	addl	56(%rsi), %eax
	movl	%eax, 56(%rsp)
	movl	60(%rdx), %eax
	addl	60(%rsi), %eax
	movl	%eax, 60(%rsp)
	vmovdqu	(%rsp), %ymm0
	vmovdqu	32(%rsp), %ymm1
	vmovdqu	%ymm1, 32(%rdi)
	vmovdqu	%ymm0, (%rdi)
	movq	%rdi, %rax
	addq	$64, %rsp
	retq

Meta

~$ rustc --version --verbose
rustc 1.24.0 (4d90ac38c 2018-02-12)
binary: rustc
commit-hash: 4d90ac38c0b61bb69470b61ea2cccea0df48d9e5
commit-date: 2018-02-12
host: x86_64-unknown-linux-gnu
release: 1.24.0
LLVM version: 4.0
matthiaskrgr (Member) commented:

Funny, when I change 16 to 17 in the Rust code

pub struct v {
    val: [i32; 17],
}

pub fn test(a: v, b: v) -> v {
    let mut res = v { val: [0; 17] };

    for i in 0..17 {
        res.val[i] = a.val[i] + b.val[i];
    }
    return res;
}

I get

example::test:
  push rbp
  mov rbp, rsp
  sub rsp, 72
  mov dword ptr [rbp - 8], 0
  mov qword ptr [rbp - 16], 0
  vmovdqu32 zmm0, zmmword ptr [rdx]
  vpaddd zmm0, zmm0, zmmword ptr [rsi]
  vmovdqu32 zmmword ptr [rbp - 72], zmm0
  mov eax, dword ptr [rdx + 64]
  add eax, dword ptr [rsi + 64]
  mov dword ptr [rbp - 8], eax
  mov dword ptr [rdi + 64], eax
  vmovdqu ymm0, ymmword ptr [rbp - 72]
  vmovdqu ymm1, ymmword ptr [rbp - 40]
  vmovdqu ymmword ptr [rdi + 32], ymm1
  vmovdqu ymmword ptr [rdi], ymm0
  mov rax, rdi
  add rsp, 72
  pop rbp
  ret

Is this closer to the clang instructions?

nagisa (Member) commented Feb 17, 2018

The referenced issue #48293 has a better explanation of what is happening.

AronParker (Contributor) commented:

I was just about to post this issue myself; good thing someone else already did. Clang only produces this "good" code for C++, not for C. On Reddit, people came to the conclusion that this is due to copy elision (in particular, return value optimization), which is done in C++ but apparently not in C or Rust.
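The copy-elision point can be illustrated in Rust with a hypothetical rewrite of my own (not from the issue): building the result array in a single expression leaves no separate zero-initialized temporary that must later be copied into the caller's return slot, which is exactly the copy that RVO elides in C++. Note that `std::array::from_fn` is stable only since Rust 1.63, so it was not available when this issue was filed:

```rust
use std::array;

pub struct V {
    val: [i32; 16],
}

// Sketch: construct the result array in one expression instead of
// mutating a zero-initialized local, so there is no intermediate
// `res` on the stack to copy into the return slot.
pub fn add(a: &V, b: &V) -> V {
    V {
        val: array::from_fn(|i| a.val[i] + b.val[i]),
    }
}

fn main() {
    let a = V { val: [2; 16] };
    let b = V { val: [5; 16] };
    // Every lane should hold 2 + 5 = 7.
    assert!(add(&a, &b).val == [7; 16]);
    println!("ok");
}
```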

jonas-schievink (Contributor) commented:

this is due to copy elision (in particular return value optimization)

In that case, #47954 might help, right?

pietroalbini added the I-slow, C-enhancement, T-compiler, and A-SIMD labels on Feb 20, 2018
GodTamIt commented:
This no longer seems to be a problem with the latest versions of both rustc and clang: https://gcc.godbolt.org/z/c4187cno3

nikic (Contributor) commented Feb 19, 2022

And it looks like this has been the case for quite a while already, since Rust 1.52.

Worth mentioning that LLVM intentionally does not use 512-bit vectors here by default.

nikic closed this as completed on Feb 19, 2022