
[experimental] Allow for RLIBs to only contain metadata. #48373

Closed

Conversation

michaelwoerister
Member

This is an experimental implementation of #38913. I want to try and see if I can get it through a perf.rlo run. No real need to review this yet.

@rust-highfive
Collaborator

r? @cramertj

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 20, 2018
@michaelwoerister
Member Author

@bors try

@bors
Contributor

bors commented Feb 20, 2018

⌛ Trying commit 0c034ad50096b3fcb2d1c29aa538ef5ad22d6603 with merge 1f12186df3cc7f2c3af0b55b9089a7b22fead331...

@pietroalbini pietroalbini added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 20, 2018
@bors
Contributor

bors commented Feb 20, 2018

☀️ Test successful - status-travis
State: approved= try=True

@Mark-Simulacrum
Member

er, well, not entirely sure this is even somewhat meaningful, but... http://perf.rust-lang.org/compare.html?start=27a046e9338fb0455c33b13e8fe28da78212dedc&end=1f12186df3cc7f2c3af0b55b9089a7b22fead331&stat=instructions%3Au

@michaelwoerister
Member Author

Alright, interesting :)

I think this mainly shows that we are not measuring quite the right thing with our benchmarks. MIR-only RLIBs postpone any LLVM work until it is actually needed. In many of the benchmarks we are only compiling RLIBs, though, so we never need to generate code. However, I don't think that's a realistic use case: if you are compiling just for the diagnostics, cargo check is the way to go and should even be a little faster than MIR-only RLIBs. If our benchmarks included compiling tests, executables, or anything other than an RLIB, the results would probably look very different.

I'll try to gather some more useful data.

@michaelwoerister
Member Author

So, I ran some benchmarks and the numbers are not all that encouraging, at least for now. While we generate a lot less code overall, it seems that the single-threaded trans phase in the final crate is bottlenecking us so much that compilation actually takes longer in many cases.

Here are some of the numbers I collected. The tables show the aggregate time spent on various tasks while compiling the whole crate graph. In many cases we do less work overall, but wall-clock time increases due to worse parallelization. The last table shows the number of functions we instantiate for the whole crate graph; it becomes substantially smaller in most cases. Maybe we should look into re-using generic instances separately if MIR-only RLIBs are blocked on parallelizing trans.

ripgrep - cargo build

| Phase | regular | MIR-only | % |
| --- | --- | --- | --- |
| LLVM codegen passes | 33.90 | 32.52 | 95.9 % |
| LLVM function passes | 1.39 | 1.35 | 97.5 % |
| LLVM module passes | 2.18 | 1.95 | 89.8 % |
| MonoItem collection | 2.80 | 2.09 | 74.4 % |
| translation | 23.73 | 19.97 | 84.1 % |
| LLVM total | 37.46 | 35.83 | 95.6 % |
| BUILD total | 20.92 | 26.14 | 125.0 % |

encoding-rs - cargo test --no-run

| Phase | regular | MIR-only | % |
| --- | --- | --- | --- |
| LLVM codegen passes | 13.11 | 7.28 | 55.6 % |
| LLVM function passes | 0.57 | 0.33 | 58.1 % |
| LLVM module passes | 0.90 | 0.44 | 48.7 % |
| MonoItem collection | 1.19 | 0.69 | 58.1 % |
| translation | 8.68 | 6.08 | 70.1 % |
| LLVM total | 14.59 | 8.06 | 55.2 % |
| BUILD total | 15.73 | 14.37 | 91.4 % |

webrender - cargo build

| Phase | regular | MIR-only | % |
| --- | --- | --- | --- |
| LLVM codegen passes | 109.42 | 69.17 | 63.2 % |
| LLVM function passes | 4.55 | 3.10 | 68.2 % |
| LLVM module passes | 1.63 | 1.06 | 64.7 % |
| MonoItem collection | 10.70 | 5.64 | 52.7 % |
| translation | 102.95 | 58.70 | 57.0 % |
| LLVM total | 115.60 | 73.33 | 63.4 % |
| BUILD total | 72.30 | 68.64 | 94.9 % |

futures-rs - cargo test --no-run

| Phase | regular | MIR-only | % |
| --- | --- | --- | --- |
| LLVM codegen passes | 41.19 | 48.67 | 118.1 % |
| LLVM function passes | 1.68 | 1.93 | 115.0 % |
| LLVM module passes | 0.21 | 0.22 | 107.3 % |
| MonoItem collection | 5.86 | 6.90 | 117.8 % |
| translation | 55.48 | 69.84 | 125.9 % |
| LLVM total | 43.08 | 50.82 | 118.0 % |
| BUILD total | 17.28 | 19.18 | 111.0 % |

tokio-webpush-simple - cargo build

| Phase | regular | MIR-only | % |
| --- | --- | --- | --- |
| LLVM codegen passes | 33.98 | 22.55 | 66.4 % |
| LLVM function passes | 1.53 | 0.95 | 62.0 % |
| LLVM module passes | 0.30 | 0.20 | 66.8 % |
| MonoItem collection | 3.80 | 2.11 | 55.5 % |
| translation | 39.09 | 22.99 | 58.8 % |
| LLVM total | 35.81 | 23.70 | 66.2 % |
| BUILD total | 22.28 | 21.54 | 96.7 % |

Number of LLVM function definitions generated for whole crate graph

| Benchmark | MIR-only | regular |
| --- | --- | --- |
| ripgrep | 22683 | 34239 |
| encoding-rs test | 8393 | 15116 |
| webrender | 72238 | 114239 |
| futures-rs test | 57565 | 46935 |
| tokio-webpush-simple | 27346 | 44961 |
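
For context, here is a minimal sketch (hypothetical names; the modules only stand in for separate crates so the snippet compiles as a single file) of the generic-instance duplication that sharing monomorphizations, or MIR-only RLIBs, would avoid:

```rust
// Hypothetical three-crate graph, written as modules so this compiles
// as one file. In reality `util`, `a`, and `b` would each be a crate.
mod util {
    // Generic helper defined in an upstream "crate". Its
    // monomorphizations are emitted in whichever crate uses them.
    pub fn dedup<T: Ord>(mut v: Vec<T>) -> Vec<T> {
        v.sort();
        v.dedup();
        v
    }
}

mod a {
    // First downstream "crate": instantiating `dedup::<u32>` here
    // emits an LLVM definition for it into this crate's object code.
    pub fn unique_ids(ids: Vec<u32>) -> Vec<u32> {
        crate::util::dedup(ids)
    }
}

mod b {
    // Second downstream "crate": the same instantiation is emitted
    // again, duplicating the LLVM definition across the crate graph.
    pub fn unique_codes(codes: Vec<u32>) -> Vec<u32> {
        crate::util::dedup(codes)
    }
}

fn main() {
    println!("{:?}", a::unique_ids(vec![3, 1, 3]));
    println!("{:?}", b::unique_codes(vec![5, 2, 2]));
}
```

With regular RLIBs, `dedup::<u32>` would be codegened once per downstream crate; with MIR-only RLIBs all codegen happens in the final leaf crate, so each instantiation is emitted only once there.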

cc @rust-lang/compiler

@hanna-kruppe
Contributor

it seems that the single-threaded trans phase in the final crate is bottlenecking us so much that compilation actually takes longer in many cases.

Huh? LLVM IR generation and optimization is parallelized with multiple CGUs (and indeed the times you report are generally smaller for MIR-only rlibs). Are you referring to the steps before that, like translation item collection, CGU partitioning, etc.?

@nikomatsakis
Contributor

@rkruppe I think @michaelwoerister is referring to the rustc code for generating the LLVM IR, not the LLVM execution.

@nikomatsakis
Contributor

Seems like a good reason to push harder on parallel rustc =)

@nikomatsakis
Contributor

@michaelwoerister

Do you have any explanation for this?

futures-rs test | 57565 | 46935

Why would it produce more with MIR-only rlibs?

@hanna-kruppe
Contributor

@nikomatsakis Oh! I was under the impression that's parallelized as well. Is it not?

@japaric japaric mentioned this pull request Feb 22, 2018
@michaelwoerister
Member Author

Seems like a good reason to push harder on parallel rustc =)

Yes.

Do you have any explanation for this?

futures-rs test | 57565 | 46935

Why would it produce more with MIR-only rlibs?

My theory is that it's because futures-rs has lots of tests, each of which gets compiled into its own executable. So while we get more sharing within each executable, we lose out on sharing between executables. At a certain number of leaf crates, not sharing non-generics outweighs the improved sharing of generic instances. I did not really verify this hypothesis, though.
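
As a rough sketch of that hypothesis (`mylib`, `encode`, and `decode` are hypothetical stand-ins for the library under test): each file under `tests/` becomes its own test executable, and with MIR-only RLIBs each of those executables has to codegen even the non-generic parts of the library from MIR again, since there is no upstream object code to link against.

```rust
// Sketch only; `mylib` is a hypothetical library crate under test.

// tests/encode.rs -- compiled into its own test executable, which
// must codegen the `mylib` functions it reaches from MIR.
#[test]
fn encodes_empty_input() {
    assert_eq!(mylib::encode(b""), Vec::<u8>::new());
}

// tests/decode.rs -- a second executable that repeats that codegen
// work instead of reusing object code that a regular rlib would
// already contain.
#[test]
fn decodes_empty_input() {
    assert_eq!(mylib::decode(b""), Vec::<u8>::new());
}
```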

@michaelwoerister
Member Author

Oh! I was under the impression that's parallelized as well. Is it not?

It's not, unfortunately. Translation reads the tcx, which cannot be accessed from multiple threads.

@retep998
Member

My theory is that it's because futures-rs has lots of tests, each of which gets compiled into its own executable. So while we get more sharing within each executable, we lose out on sharing between executables. At a certain number of leaf crates, not sharing non-generics outweighs the improved sharing of generic instances. I did not really verify this hypothesis, though.

So what you're saying is we need a workspace-wide incremental codegen cache? So if one leaf does codegen, it gets cached and other leaf crates can reuse the existing work?

@michaelwoerister
Member Author

So what you're saying is we need a workspace-wide incremental codegen cache? So if one leaf does codegen, it gets cached and other leaf crates can reuse the existing work?

I'm not sure if we could make this work. Maybe. It would solve part of the problem.

@bors
Contributor

bors commented Feb 27, 2018

☔ The latest upstream changes (presumably #48449) made this pull request unmergeable. Please resolve the merge conflicts.

@michaelwoerister
Member Author

Posted the results of this experiment in #38913. Closing.

bors added a commit that referenced this pull request Mar 6, 2018
Don't recompute SymbolExportLevel for upstream crates.

The data collected in #48373 suggests that we can avoid generating up to 30% of the LLVM definitions by only instantiating function monomorphizations once within a given crate graph. Some more data, collected with a [proof-of-concept implementation](https://github.com/michaelwoerister/rust/commits/share-generics) of re-using monomorphizations, which is less efficient than the MIR-only RLIB approach, suggests that we can still save around 25% of LLVM definitions.

So far, this PR only cleans up handling of symbol export status. Too early to review still.