
CI: dist-s390x-linux build went from 40min. to 160min with new LLVM pass manager #89609

Closed
hkratz opened this issue Oct 6, 2021 · 20 comments · Fixed by #94764
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
I-compiletime Issue: Problems and improvements with respect to compile times.
O-SystemZ Target: SystemZ processors (s390x)
P-high High priority
regression-from-stable-to-nightly Performance or correctness regression from stable to nightly.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@hkratz
Contributor

hkratz commented Oct 6, 2021

CI time quadrupled when the new pass manager was enabled: the dist-s390x-linux build went from about 40 minutes to 160.

Right now that does not really matter much, because GitHub's Apple CI runners use very old hardware and take about the same time, but once that is fixed this job will block a faster overall CI. No idea if that platform has any significant user base.

Maybe disable the new PM for that platform or report it upstream?

@workingjubilee
Member

This regression absolutely needs to be reported upstream, as LLVM developers have discussed completely removing the old pass manager within one or two releases.

@hkratz
Contributor Author

hkratz commented Oct 6, 2021

Most time is apparently being spent compiling rustc_ast_lowering:

Building stage1 compiler artifacts (x86_64-unknown-linux-gnu -> s390x-unknown-linux-gnu)
...
[RUSTC-TIMING] rustc_ast_lowering test:false 6556.620

https://github.com/rust-lang-ci/rust/runs/3708445190?check_suite_focus=true#step:26:12286

This crate usually takes less than 30 seconds to build.

@cuviper cuviper added O-SystemZ Target: SystemZ processors (s390x) I-compiletime Issue: Problems and improvements with respect to compile times. A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 6, 2021
@hkratz
Contributor Author

hkratz commented Oct 7, 2021

The build timeout in #88379 (comment) might be caused by this as well.

@estebank
Contributor

estebank commented Oct 7, 2021

Most time is apparently being spent compiling rustc_ast_lowering

That is very surprising; I am wondering what we could be doing for that not to be O(n).

@Mark-Simulacrum
Member

cc @rust-lang/wg-llvm

I am a little tempted to suggest a patch that passes -Zno-new-pass-manager (or whatever the flag is) on s390x, but that seems pretty unfortunate as well, given the relatively quick deprecation timeline.

@camelid
Member

camelid commented Oct 7, 2021

Is there anything I can do to unblock #88379? Or will it require the patch that @Mark-Simulacrum is suggesting?

@camelid
Member

camelid commented Oct 7, 2021

Just checking, has the regression been reported upstream yet? I didn't see a report when I searched the LLVM Bugzilla. (I don't have an LLVM Bugzilla account, so I can't report it.)

@cuviper
Member

cuviper commented Oct 7, 2021

I'm going to try to find the culprit, but in the meantime, to unblock folks, that hack could be an extra target condition in should_use_new_llvm_pass_manager. Note that the host arch is apparently not the issue, since these builds all run on x86_64, cross-compiling to various targets.
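
A minimal sketch of what that extra condition could look like, with an explicit -Z override taking precedence (the function name comes from this comment; the bare parameters and plumbing are illustrative assumptions, not rustc's actual signature):

// Sketch only: rustc's real function takes session/config types,
// not these bare parameters.
fn should_use_new_llvm_pass_manager(user_opt: Option<bool>, target_arch: &str) -> bool {
    // An explicit -Z new-llvm-pass-manager=(yes|no) always wins...
    if let Some(explicit) = user_opt {
        return explicit;
    }
    // ...otherwise, default the new pass manager off for s390x
    // until the regression is fixed.
    target_arch != "s390x"
}

fn main() {
    assert!(should_use_new_llvm_pass_manager(None, "x86_64")); // default: on
    assert!(!should_use_new_llvm_pass_manager(None, "s390x")); // hack: off
    assert!(should_use_new_llvm_pass_manager(Some(true), "s390x")); // override wins
}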

@cuviper
Member

cuviper commented Oct 8, 2021

I have the dist-s390x-linux docker build running locally, with LLVM debuginfo, and perf top shows a couple hotspots.

First it was sitting here at about 31% of the samples:

llvm::Instruction* const* llvm::SmallVectorTemplateCommon<llvm::Instruction*, void>::reserveForParamAndGetAddressImpl<llvm::SmallVectorTemplateBase<llvm::Instruction*, true> >(llvm::SmallVectorTemplateBase<llvm::Instruction*, true>*, llvm::Instruction* const&, unsigned long) at /checkout/src/llvm-project/llvm/include/llvm/ADT/SmallVector.h:217
 (inlined by) llvm::SmallVectorTemplateBase<llvm::Instruction*, true>::reserveForParamAndGetAddress(llvm::Instruction*&, unsigned long) at /checkout/src/llvm-project/llvm/include/llvm/ADT/SmallVector.h:522
 (inlined by) llvm::SmallVectorTemplateBase<llvm::Instruction*, true>::push_back(llvm::Instruction*) at /checkout/src/llvm-project/llvm/include/llvm/ADT/SmallVector.h:547
 (inlined by) PushDefUseChildren at /checkout/src/llvm-project/llvm/lib/Analysis/ScalarEvolution.cpp:4382
 (inlined by) llvm::ScalarEvolution::forgetLoop(llvm::Loop const*) at /checkout/src/llvm-project/llvm/lib/Analysis/ScalarEvolution.cpp:7487

Then it moved away from that to a new hotspot at 28%:

llvm::ilist_detail::node_options<llvm::Instruction, false, false, void>::const_pointer llvm::ilist_detail::NodeAccess::getValuePtr<llvm::ilist_detail::node_options<llvm::Instruction, false, false, void> >(llvm::ilist_node_impl<llvm::ilist_detail::node_options<llvm::Instruction, false, false, void> > const*) at /checkout/src/llvm-project/llvm/include/llvm/ADT/ilist_node.h:184
 (inlined by) llvm::ilist_detail::SpecificNodeAccess<llvm::ilist_detail::node_options<llvm::Instruction, false, false, void> >::getValuePtr(llvm::ilist_node_impl<llvm::ilist_detail::node_options<llvm::Instruction, false, false, void> > const*) at /checkout/src/llvm-project/llvm/include/llvm/ADT/ilist_node.h:229
 (inlined by) llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction, false, false, void>, true, true>::operator*() const at /checkout/src/llvm-project/llvm/include/llvm/ADT/ilist_iterator.h:139
 (inlined by) llvm::simple_ilist<llvm::Instruction>::back() const at /checkout/src/llvm-project/llvm/include/llvm/ADT/simple_ilist.h:141
 (inlined by) llvm::BasicBlock::getTerminator() const at /checkout/src/llvm-project/llvm/lib/IR/BasicBlock.cpp:151

Next it moved to this at 49%:

llvm::MachineBasicBlock** std::__copy_move<false, false, std::random_access_iterator_tag>::__copy_m<std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**>(std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**) at /usr/include/c++/5/bits/stl_algobase.h:340
 (inlined by) llvm::MachineBasicBlock** std::__copy_move_a<false, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**>(std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**) at /usr/include/c++/5/bits/stl_algobase.h:402
 (inlined by) llvm::MachineBasicBlock** std::__copy_move_a2<false, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**>(std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**) at /usr/include/c++/5/bits/stl_algobase.h:440
 (inlined by) llvm::MachineBasicBlock** std::copy<std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**>(std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**) at /usr/include/c++/5/bits/stl_algobase.h:472
 (inlined by) llvm::MachineBasicBlock** std::__uninitialized_copy<true>::__uninit_copy<std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**>(std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**) at /usr/include/c++/5/bits/stl_uninitialized.h:93
 (inlined by) llvm::MachineBasicBlock** std::uninitialized_copy<std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**>(std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**) at /usr/include/c++/5/bits/stl_uninitialized.h:126
 (inlined by) void llvm::SmallVectorTemplateBase<llvm::MachineBasicBlock*, true>::uninitialized_copy<std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**>(std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, llvm::MachineBasicBlock**) at /checkout/src/llvm-project/llvm/include/llvm/ADT/SmallVector.h:490
 (inlined by) void llvm::SmallVectorImpl<llvm::MachineBasicBlock*>::append<std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, void>(std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >) at /checkout/src/llvm-project/llvm/include/llvm/ADT/SmallVector.h:652
 (inlined by) llvm::MachineBasicBlock** llvm::SmallVectorImpl<llvm::MachineBasicBlock*>::insert<std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, void>(llvm::MachineBasicBlock**, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >, std::reverse_iterator<__gnu_cxx::__normal_iterator<llvm::MachineBasicBlock**, std::vector<llvm::MachineBasicBlock*, std::allocator<llvm::MachineBasicBlock*> > > >) at /checkout/src/llvm-project/llvm/include/llvm/ADT/SmallVector.h:851
 (inlined by) llvm::LiveVariables::MarkVirtRegAliveInBlock(llvm::LiveVariables::VarInfo&, llvm::MachineBasicBlock*, llvm::MachineBasicBlock*, llvm::SmallVectorImpl<llvm::MachineBasicBlock*>&) at /checkout/src/llvm-project/llvm/lib/CodeGen/LiveVariables.cpp:112

It's still going, but I'm not sure this is helpful enough to keep watching this way...

@cuviper
Member

cuviper commented Oct 8, 2021

See also #89524, which is not s390x specific.

@camelid
Member

camelid commented Oct 8, 2021

Based on those results, it looks to me (albeit as someone who's not knowledgeable about LLVM internals) like maybe LLVM is building some really large data structures that are slow to process. Could that be it?

@hkratz
Contributor Author

hkratz commented Oct 8, 2021

I have just tried DEPLOY=1 src/ci/docker/run.sh dist-s390x-linux for #88379 with the new pass manager disabled and it built in about 30 minutes locally. So this change is definitely blocked by this issue.

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this issue Oct 8, 2021
…anager_on_s390x_take_two, r=nagisa

Default to disabling the new pass manager for the s390x arch targets.

This hack disables the new LLVM pass manager by default for s390x arch targets until the performance issues are fixed (see rust-lang#89609). The command line option `-Z new-llvm-pass-manager=(yes|no)` continues to take precedence over this default.
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 8, 2021
…ager_on_s390x_take_two, r=nagisa

@cuviper
Member

cuviper commented Oct 9, 2021

I managed to collect -Csave-temps for the troublesome crate:

13:33:31.205643159  2.5M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.no-opt.bc
13:33:43.311381229  1.6M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.thin-lto-input.bc
13:33:43.578375452  1.6M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.thin-lto-after-rename.bc
13:33:43.811370411  1.6M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.thin-lto-after-resolve.bc
13:33:43.977366820  1.6M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.thin-lto-after-internalize.bc
13:33:44.487355785  2.0M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.thin-lto-after-import.bc
13:33:44.682351566  2.0M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.thin-lto-after-patch.bc
13:48:17.348469820   33M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.thin-lto-after-pm.bc
13:48:23.212342944   33M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.bc
14:45:58.202304051   11M  rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.o

So it took almost 15 minutes from thin-lto-after-patch to thin-lto-after-pm (running the pass manager), growing from 2 to 33 MB of LLVM bitcode. Then it took almost an hour to actually codegen that into an object file. At one point during that hour, I noticed the process was using 24GB of resident memory!

$ wc -lL *.ll
   157863      2306 rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.thin-lto-after-patch.ll
  4056416    238276 rustc_ast_lowering-7fc619e50cfb03b2.rustc_ast_lowering.797ce196-cgu.0.rcgu.thin-lto-after-pm.ll

In those 4 million lines of LLVM IR, about 2.7 million are just in 2 instantiations of LoweringContext::lower_opaque_impl_trait! The longest line, the 238276 characters reported by wc -L above, is the phi line below:

106767:                                           ; preds = %754032
  %106768 = landingpad { i8*, i32 }
          cleanup
  br label %106769

106769:                                           ; preds = %106767, %106765, ...etc.
  %106770 = phi { i8*, i32 } [ %86006, %86005 ], [ %86008, %86007 ], ...etc.
  invoke fastcc void @"_ZN4core3ptr101drop_in_place$LT$rustc_ast_lowering..lifetimes_from_impl_trait_bounds..ImplTraitLifetimeCollector$GT$17hde8a472161ebd31bE"(%"lifetimes_from_impl_trait_bounds::ImplTraitLifetimeCollector"* nonnull %15650) #25
          to label %754701 unwind label %754477

I cut them off with "etc.", but the full preds and phi lists are each 11574 entries long! AFAICS, all of those predecessors are simple landingpad blocks like the one I included here. The second lower_opaque_impl_trait has the same thing, just 238274 characters long instead -- I think only from a couple of shorter %N numbers.

The same blocks in thin-lto-after-patch (before pm) look like this:

243:                                              ; preds = %259
  %244 = landingpad { i8*, i32 }
          cleanup
  br label %247

245:                                              ; preds = %227
  %246 = landingpad { i8*, i32 }
          cleanup
  br label %247

247:                                              ; preds = %245, %243
  %248 = phi { i8*, i32 } [ %244, %243 ], [ %246, %245 ]
  invoke fastcc void @"_ZN4core3ptr101drop_in_place$LT$rustc_ast_lowering..lifetimes_from_impl_trait_bounds..ImplTraitLifetimeCollector$GT$17hde8a472161ebd31bE"(%"lifetimes_from_impl_trait_bounds::ImplTraitLifetimeCollector"* nonnull %30) #33
          to label %749 unwind label %570
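
For what it's worth, that IR shape has a plain Rust analogue (a minimal sketch, not code from rustc_ast_lowering): every call that can unwind while a value with drop glue is live needs its own landingpad cleanup edge, and those edges all merge into one phi before the destructor runs. Aggressive inlining multiplies the call sites, and the merged phi's predecessor list grows with them.

struct Collector;

impl Drop for Collector {
    fn drop(&mut self) {
        println!("cleanup");
    }
}

fn may_unwind(x: u32) {
    if x == u32::MAX {
        panic!("unlikely");
    }
}

fn visit(a: u32, b: u32, c: u32) {
    let _collector = Collector; // live across every call below
    // Each of these call sites gets its own landingpad edge that must
    // drop `_collector`; all the edges merge into one cleanup block.
    may_unwind(a);
    may_unwind(b);
    may_unwind(c);
}

fn main() {
    visit(1, 2, 3);
}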

@cuviper
Member

cuviper commented Oct 9, 2021

Here's the input bitcode before the pass manager: rustc_ast_lowering-cgu.0.rcgu.thin-lto-after-patch.bc.gz

Even opt -O1 behaves badly, while opt -O1 --enable-new-pm=0 runs quickly without blowing up the bitcode.

@cuviper
Member

cuviper commented Oct 9, 2021

cc @nikic -- this is another new-pm problem, although it appears to be specific to SystemZ.

@workingjubilee workingjubilee added the regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. label Oct 10, 2021
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Oct 10, 2021
@cuviper
Member

cuviper commented Oct 12, 2021

Filed: https://bugs.llvm.org/show_bug.cgi?id=52146
(now migrated to llvm/llvm-project#51488)

@apiraino
Contributor

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Oct 14, 2021
@cuviper cuviper self-assigned this Oct 21, 2021
@cuviper
Member

cuviper commented Oct 21, 2021

Status: #89666 disabled newPM when targeting s390x, but this is a temporary solution because oldPM will be removed eventually. The underlying issue does seem to be excessive inlining, similar to #89524 and others from newPM.

@nikic
Contributor

nikic commented Jan 7, 2022

If anyone is wondering why this one only occurs on SystemZ: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h#L39 Apparently this target just increases all inlining thresholds by a factor of three...
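
To make the arithmetic concrete, here is a back-of-the-envelope sketch of the inliner's cost check (illustrative names, not LLVM's actual API; 225 is LLVM's default inline threshold):

fn would_inline(call_cost: i32, threshold: i32, multiplier: i32) -> bool {
    // The inliner accepts a call site whose estimated cost stays
    // under the target-scaled threshold.
    call_cost < threshold * multiplier
}

fn main() {
    let cost = 400; // estimated cost of some callee
    let threshold = 225; // LLVM's default inline threshold
    assert!(!would_inline(cost, threshold, 1)); // generic target: rejected
    assert!(would_inline(cost, threshold, 3)); // SystemZ (x3): accepted
}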

@cuviper
Member

cuviper commented Jan 7, 2022

Aha, thanks! The only other targets that change that multiplier are NVPTX (5) and AMDGPU (11).

AMD: See, NVIDIA only goes to 5 -- This one goes to 11! 🤘
