
wasmparser: Performance regressions since v0.100.2 #1701

Open
Robbepop opened this issue Jul 26, 2024 · 12 comments

Comments

@Robbepop
Contributor

Robbepop commented Jul 26, 2024

For technical reasons I had to stay on the (by now) ancient wasmparser v0.100.2 in Wasmi, via a fork called wasmparser-nostd. The fork had no significant impact on performance.

Recently I updated Wasmi to the newest wasmparser v0.214.0 and unfortunately saw massive performance regressions, especially for Wasm validation. Many Wasmi benchmarks regressed by 15-35%, and some even by 45%, compared to the Wasmi that uses the ancient wasmparser v0.100.2.

The Wasmi upgrade PR is here: wasmi-labs/wasmi#1141

Maybe I made some mistakes when upgrading that would explain the performance regressions.

Since Wasmi is very focused on fast translation times, these performance regressions are unfortunately a showstopper for adopting the new wasmparser versions.

@alexcrichton
Member

Do you have a link to the module with the largest regression? That would probably be the easiest to start with: evaluating validation performance on that module and seeing what sticks out in perf.

@Robbepop
Contributor Author

Robbepop commented Jul 26, 2024

Do you mean Wasm module?

From what I saw, tiny_keccak is a good candidate, but there were others.
To benchmark tiny_keccak translation test cases in Wasmi, use:

cargo bench --bench benches translate/tiny_keccak/

The tiny_keccak benchmark can be found here:
https://github.com/wasmi-labs/rust-benchmarks/tree/main/cases/tiny_keccak

Update

I just found one reason why it regressed: I benchmarked without enabling the no-hash-maps feature. For Wasmi's uses, BTreeMap is generally faster than HashMap, and the old wasmparser-nostd fork always used BTreeMaps. With this feature enabled, local benchmarks still indicate regressions, but they are less severe, ranging between 10-25% instead of 15-35%. The unchecked (no validation) case now even shows slightly improved performance, which strongly indicates to me that the regressions we see come from Wasm validation.

[screenshot: benchmark results]
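The idea behind a no-hash-maps style feature can be sketched as a single type alias that the rest of the codebase uses, so the map implementation can be swapped in one place. This is a minimal illustration, assuming a hypothetical alias name (`TypeIndexMap`), not wasmparser's actual internals:

```rust
use std::collections::BTreeMap;

// Hypothetical alias: the rest of the code depends only on this name, so
// switching between BTreeMap and HashMap is a one-line change (in the real
// crate this would sit behind a Cargo feature).
type TypeIndexMap<V> = BTreeMap<u32, V>;

fn main() {
    let mut types: TypeIndexMap<&str> = TypeIndexMap::new();
    types.insert(0, "func");
    types.insert(1, "memory");
    // For the small, dense index sets typical of Wasm sections, a BTreeMap
    // avoids per-lookup hashing overhead, which is the rationale given above.
    assert_eq!(types.get(&0), Some(&"func"));
    assert_eq!(types.get(&2), None);
}
```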

@alexcrichton
Member

Ok, I've been poking at this a bit, and so far what I've found is that #701, the first commit after v0.100.0, led to a precipitous drop in performance. That was known at the time, however. There are other, smaller regressions here and there, but they're proving more difficult to pin down.

Profiling the code itself does not show any unexpected hot spots, which means the cost is spread out across many places.

Various tweaks help here and there, but nothing is a major improvement. For example, outlining string-based errors into their own #[cold] function sometimes helps and sometimes doesn't.
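The error-outlining tweak mentioned above can be sketched as follows: the expensive string formatting is moved into a separate function marked #[cold] so the hot validation path stays small and branch-predictor friendly. The names (`LocalError`, `check_local`, `unknown_local`) are illustrative, not wasmparser's actual API:

```rust
// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
struct LocalError(String);

// The #[cold] hint tells the compiler this path is rarely taken;
// #[inline(never)] keeps the string formatting out of the caller's body.
#[cold]
#[inline(never)]
fn unknown_local(idx: u32) -> LocalError {
    LocalError(format!("unknown local {idx}"))
}

// Hot path: a single comparison, with error construction outlined.
fn check_local(num_locals: u32, idx: u32) -> Result<(), LocalError> {
    if idx < num_locals {
        Ok(())
    } else {
        Err(unknown_local(idx))
    }
}

fn main() {
    assert!(check_local(4, 3).is_ok());
    assert!(check_local(4, 7).is_err());
}
```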

One improvement that has shown promise is skipping initialization checking for locals entirely. This check was first added for the function-references proposal, and later proposals like gc rely on it as well. Skipping it yields a ~5% gain (ish), but requires completely removing the check, so the only way to achieve this would be to add compile-time features to wasmparser.

Overall my guess is that the Wasm language has simply grown more complicated over time, which has slowed down the validator. I don't know of a great way to handle this other than adding a set of Cargo compile-time features, one per Wasm proposal. That would not be easy to design, but I think it would provide the wins necessary here.

@Robbepop
Contributor Author

Robbepop commented Jul 26, 2024

Thank you for the investigation, @alexcrichton!

If your assessment is correct, I have to admit it is a bit worrisome to me that the Wasm spec seems to be developing in a direction that may not be suitable for all of its original users and stakeholders.

I remember you said that function-references would incur a performance overhead, and in hindsight I was rather optimistic that this penalty would surely be dealt with over time.

My assumption of how Wasm features are intended to work was that they should (ideally) not significantly influence performance when unused. Only if I actually used the function-references proposal in a significant way in my Wasm blob would I "suffer" the performance penalties associated with its additional complexity. Though I totally understand that it is probably hard enough already to keep up with all the Wasm developments.

Conditional compilation wouldn't solve the problem from my point of view, since eventually all Wasm runtimes (including Wasmi) want to adopt all stabilized features anyway. The Wasm standard is not modularized at the moment and thus does not allow implementations to be picky.

@Robbepop
Contributor Author

Robbepop commented Jul 27, 2024

I just ran flamegraph on the translate/tiny_keccak/checked/lazy-translation test case:

main: [flamegraph screenshot]

PR: [flamegraph screenshot]

From what it looks like, I can confirm your assessment that multiple parts of the wasmparser validation pipeline seem to have regressed in performance. Especially notable parts (by the looks of it) are:

  • Validator::type_section
  • ModuleParser::process_end
  • VisitOperator::visit_local_get
  • VisitOperator::visit_end

> One improvement that has shown promise is skipping initialization checking for locals entirely.

I forgot to ask last time: What exactly does this mean? What is initialization checking of locals and where can I find the code?

Looking just at translate_operators we can also see some significant shifts:
main: [flamegraph screenshot]
PR: [flamegraph screenshot]

@Robbepop
Contributor Author

Robbepop commented Jul 29, 2024

In my last performance assessment I made a mistake: I did not also enable no-hash-maps on the main branch when benchmarking main against the PR, which made the results look better for the PR than they actually are. The corrected results look even worse:

We see a performance regression of 15-45%. Unchecked workloads without Wasm validation still show no regression.

[screenshot: corrected benchmark results]

@alexcrichton
Member

I would be hesitant to generalize from wasm-tools's implementation to spec development as a whole and to other engines; we could simply be bitten by our own decisions here. This could be wasmparser-specific and not affect other implementations, I'm not sure. When I've benchmarked historically, for example, I believe v8's validator (via node) has consistently been faster than wasmparser's. Basically, I don't think wasmparser was ever at the peak of validation performance.

I also saw hot spots similar to yours locally, but I was unable to figure out how best to optimize them, and local attempts weren't bearing much fruit.

The initialization checking looks like this: an array lookup plus a boolean test of what's in the array. My guess is that everything is a well-predicted branch and in-cache, since accesses typically go to the same locals. The ~5% perf win I got (ish) was from removing those checks entirely. That's not spec-compliant, so we can't land it, but it was the best I could do.
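The check described above (array lookup plus boolean test) can be sketched roughly as follows. This is an illustrative reconstruction, not wasmparser's actual code; the names `InitChecker` and `check_local_get` are hypothetical:

```rust
// Tracks which locals have been initialized; one bool per local.
struct InitChecker {
    initialized: Vec<bool>,
}

impl InitChecker {
    // Validating local.get: one bounds lookup plus one boolean test,
    // as described in the comment above.
    fn check_local_get(&self, idx: usize) -> Result<(), String> {
        match self.initialized.get(idx) {
            Some(true) => Ok(()),
            Some(false) => Err(format!("uninitialized local {idx}")),
            None => Err(format!("unknown local {idx}")),
        }
    }
}

fn main() {
    let checker = InitChecker { initialized: vec![true, false] };
    assert!(checker.check_local_get(0).is_ok());
    assert!(checker.check_local_get(1).is_err()); // uninitialized
    assert!(checker.check_local_get(2).is_err()); // out of bounds
}
```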

@Robbepop
Contributor Author

Robbepop commented Jul 29, 2024

You are right, this might very well be just an implementation detail that can be fixed.

Thanks a lot for the link to the local.get checks. I will have a look at the codebase to see whether it is possible to apply further optimizations without removing those checks. For example, I could imagine using in-place bitfields for up to 32 (or 64) locals (the common case) instead of a Vec<bool>. Maybe there are more things that could be improved. I will experiment with this a bit and come back.
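The bitfield idea could look roughly like this: a single u64 tracks initialization of up to 64 locals inline, with a heap-allocated fallback beyond that. A minimal sketch with a hypothetical type name (`InitMask`), not Wasmi or wasmparser code:

```rust
// Small-size-optimized initialization mask: no heap allocation for
// functions with at most 64 locals (the common case).
enum InitMask {
    Small(u64),
    Large(Vec<bool>),
}

impl InitMask {
    fn new(len: usize) -> Self {
        if len <= 64 {
            InitMask::Small(0)
        } else {
            InitMask::Large(vec![false; len])
        }
    }

    fn set(&mut self, idx: usize) {
        match self {
            InitMask::Small(bits) => *bits |= 1 << idx,
            InitMask::Large(v) => v[idx] = true,
        }
    }

    fn get(&self, idx: usize) -> bool {
        match self {
            InitMask::Small(bits) => (*bits >> idx) & 1 == 1,
            InitMask::Large(v) => v[idx],
        }
    }
}

fn main() {
    let mut mask = InitMask::new(10); // stays on the stack
    mask.set(3);
    assert!(mask.get(3));
    assert!(!mask.get(4));
}
```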

@eqrion
Collaborator

eqrion commented Jul 29, 2024

If it helps, here is the implementation in SpiderMonkey [1]. We do an upfront check for whether the local index is less than the index of the first non-nullable local, and skip any boolean checking in that case. Otherwise, we consult a bitmap. For setting a local, we also skip any consultation of the bitmap in the case where the local has already been set [2].

So for applications that are not using any GC types, the only cost on local.get/set is the upfront check against the first non-nullable local index (which will always fail).

[1] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#347-352
[2] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#2283-2285
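The SpiderMonkey scheme described above can be sketched in Rust as follows (illustrative names, not SpiderMonkey's actual C++ API): locals below the first non-nullable index take a single always-taken branch, and only the non-defaultable tail consults a bitmap.

```rust
// Initialization state following the scheme sketched by eqrion above.
struct LocalInit {
    // Index of the first non-nullable (non-defaultable) local. For modules
    // without GC types, this equals the local count.
    first_non_defaultable: u32,
    // Tracks only the tail of non-defaultable locals.
    bitmap: Vec<bool>,
}

impl LocalInit {
    fn is_initialized(&self, idx: u32) -> bool {
        // Fast path: defaultable locals are always initialized, so no
        // bitmap consultation is needed.
        if idx < self.first_non_defaultable {
            return true;
        }
        self.bitmap[(idx - self.first_non_defaultable) as usize]
    }
}

fn main() {
    // Two defaultable locals, then two non-defaultable ones (only the
    // second of which has been set).
    let init = LocalInit { first_non_defaultable: 2, bitmap: vec![false, true] };
    assert!(init.is_initialized(0));
    assert!(!init.is_initialized(2));
    assert!(init.is_initialized(3));
}
```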

@Robbepop
Contributor Author

> If it helps, here is the implementation in SpiderMonkey [1]. We do an upfront check for whether the local index is less than the index of the first non-nullable local, and skip any boolean checking in that case. Otherwise, we consult a bitmap. For setting a local, we also skip any consultation of the bitmap in the case where the local has already been set [2].
>
> So for applications that are not using any GC types, the only cost on local.get/set is the upfront check against the first non-nullable local index (which will always fail).
>
> [1] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#347-352 [2] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#2283-2285

Thanks so much for sharing this information! That's invaluable. :)

@alexcrichton
Member

Indeed thanks @eqrion! I suspect that both SpiderMonkey (and possibly v8 too) would be good sources of inspiration for algorithmic changes to the validator in wasmparser.

Around locals specifically, another thing wasmparser does is that it doesn't maintain a plain Vec<ValType> for locals; instead it has a scheme where the first N locals are stored in a Vec<ValType> but the rest are stored in a Vec<(u32, ValType)> for a more compressed representation, like Wasm itself has. This representation is probably too clever by half and probably isn't benefiting all that much. Simplifying it to just Vec<ValType> would probably help clean up the code a bit and remove a branch here and there.
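The two-tier layout described above can be sketched like this, assuming runs in the compressed tail are keyed by the highest local index they cover. The names (`CompressedLocals`) and the exact run encoding are illustrative, not wasmparser's actual representation:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum ValType { I32, I64 }

struct CompressedLocals {
    // First N locals stored flat for O(1) lookup.
    flat: Vec<ValType>,
    // Remaining locals run-length encoded as (highest index covered, type),
    // sorted by index and looked up via binary search.
    packed: Vec<(u32, ValType)>,
}

impl CompressedLocals {
    fn get(&self, idx: u32) -> Option<ValType> {
        // Fast path for the flat prefix; this extra branch is the cost the
        // comment above mentions.
        if let Some(ty) = self.flat.get(idx as usize) {
            return Some(*ty);
        }
        // Slow path: binary search the compressed tail.
        match self.packed.binary_search_by_key(&idx, |&(end, _)| end) {
            Ok(i) => Some(self.packed[i].1),
            Err(i) => self.packed.get(i).map(|&(_, ty)| ty),
        }
    }
}

fn main() {
    let locals = CompressedLocals {
        flat: vec![ValType::I32, ValType::I32],
        // One run: locals 2..=100 are all i64.
        packed: vec![(100, ValType::I64)],
    };
    assert_eq!(locals.get(0), Some(ValType::I32));
    assert_eq!(locals.get(50), Some(ValType::I64));
    assert_eq!(locals.get(101), None);
}
```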

Another possibility: there is currently zero unsafe code in the validator, which, while a great property, I don't personally consider set in stone forever. For example, using the SpiderMonkey idea for checking local initialization, the first index of a non-nullable local could be set to the number of locals in a function if they're all nullable. That way, after that "bounds check", the access into the actual types array could be done in an unchecked fashion, leaving just a single condition in the validation of local.get. This could also very well be "too clever by half", and introducing unsafe code is definitely something we'd want to do carefully.
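A rough sketch of that idea, under the stated invariant that first_non_defaultable never exceeds the types array's length (all names hypothetical; a real implementation would need a careful safety argument before landing anywhere):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum ValType { I32, FuncRef }

struct LocalTypes {
    // Invariant (must be enforced at construction):
    // first_non_defaultable <= types.len()
    first_non_defaultable: usize,
    types: Vec<ValType>,
}

impl LocalTypes {
    // Single-branch fast path for local.get on defaultable locals.
    fn get_defaultable(&self, idx: usize) -> Option<ValType> {
        if idx < self.first_non_defaultable {
            // SAFETY: idx < first_non_defaultable <= types.len(),
            // per the struct invariant, so the index is in bounds.
            Some(unsafe { *self.types.get_unchecked(idx) })
        } else {
            // Slow path would consult the initialization bitmap instead.
            None
        }
    }
}

fn main() {
    let locals = LocalTypes {
        first_non_defaultable: 1,
        types: vec![ValType::I32, ValType::FuncRef],
    };
    assert_eq!(locals.get_defaultable(0), Some(ValType::I32));
    assert_eq!(locals.get_defaultable(1), None);
}
```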

@Robbepop
Contributor Author

Robbepop commented Oct 6, 2024

I updated the Wasmi PR to wasmparser v0.218.0, improved and extended the translation benchmarks, and reran them to get a better understanding of where the performance regressions are located and how they could be fixed.

Despite the performance regressions, I am considering merging the PR to finally make use of the newer wasmparser versions, but I am keen on fixing the regressions, especially for the important lazy modes.

Link: wasmi-labs/wasmi#1141 (comment)


3 participants