Decide on a policy for Rayon usage #4952

Open
michaelsproul opened this issue Nov 27, 2023 · 2 comments

michaelsproul (Member) commented Nov 27, 2023

Description

I've spent the last few days of downtime after Devconnect mulling over how to deal with Rayon and deadlocks (rayon-rs/rayon#592).

We are faced with an acute instance of this problem in #4507 (comment).

Background

Rayon can deadlock if both of the following conditions are true:

  1. Lock L is held while calling into Rayon (e.g. calling rayon::join while holding a mutex).
  2. Lock L is accessed from within Rayon tasks (e.g. calling mutex.lock() within par_iter).

Deadlocks can occur in multiple ways. If the thread holding the lock calls into Rayon and steals some of the type (2) jobs that require the lock, those jobs can never complete because of the circular dependency: they are waiting for the thread to release the lock, while the thread is waiting for them to complete before its Rayon call returns. There is a similar variant when the thread holding the lock is itself a Rayon worker thread, in which case the deadlock occurs because that thread tries to re-lock a lock it already holds.
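To make the hazard concrete, here is a minimal sketch (not taken from the codebase; the static cache and function names are purely illustrative) in which both conditions hold:

```rust
use std::sync::Mutex;
use rayon::prelude::*;

// Illustrative only: some shared state guarded by a plain mutex.
static CACHE: Mutex<Vec<u64>> = Mutex::new(Vec::new());

// Condition 1: hold the lock while calling into Rayon.
fn hold_lock_and_fan_out(items: &[u64]) -> u64 {
    let _guard = CACHE.lock().unwrap();
    // The calling thread participates in the parallel iterator and may steal
    // jobs of type (2) below while still holding `_guard`.
    items.par_iter().map(|&x| score(x)).sum()
}

// Condition 2: a Rayon task takes the same lock.
fn score(x: u64) -> u64 {
    // If this runs on the thread that already holds `_guard`, or on a worker
    // that `hold_lock_and_fan_out` is waiting for, neither side can make
    // progress and we deadlock.
    CACHE.lock().unwrap().push(x);
    x
}
```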

Observations

  • Weakening either of the conditions is sufficient to avoid the deadlock. We can either avoid holding locks while calling into Rayon, or avoid obtaining locks from within Rayon tasks.
  • There is spooky action-at-a-distance: the deadlock is a property of lock usage across the entire codebase, which makes it difficult to check for manually.
  • We currently use Rayon in only two places: in the op pool for attestation packing, and while tree hashing beacon states.

Strategies

Avoid Rayon

We could use Tokio (or another futures executor?) instead, as tokio::sync::Mutex is explicitly safe to hold across an .await. I had a go at benchmarking tree hashing with Tokio and it was around 3.5x slower (sigp/milhouse#31). We probably don't want to switch tree hashing out, but we could switch out some of the other Rayon call sites that are less performance-sensitive.

Problematically, Tokio doesn't cope well with compute-heavy tasks spawned on the main executor, so we may need a separate async executor for compute-heavy tasks. Using Tokio's blocking threads is not desirable, because then we can't use tokio::sync::Mutex.
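As a rough sketch of that direction (the runtime name, thread count, and task are hypothetical, not existing Lighthouse code):

```rust
use std::sync::Arc;
use tokio::runtime::Builder;
use tokio::sync::Mutex;

// Hypothetical second runtime dedicated to compute-heavy tasks, so the main
// executor's workers aren't starved.
fn compute_runtime() -> tokio::runtime::Runtime {
    let threads = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);
    Builder::new_multi_thread()
        .worker_threads(threads)
        .thread_name("compute")
        .build()
        .expect("failed to build compute runtime")
}

// tokio::sync::Mutex is explicitly safe to hold across an .await point.
async fn compute_task(state: Arc<Mutex<Vec<u8>>>) -> usize {
    let guard = state.lock().await;
    // ... compute-heavy work on `guard`, possibly spanning awaits ...
    guard.len()
}
```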

Static Analysis

I've opened an issue on the lockbud tool to detect Rayon-related deadlocks (BurtonQin/lockbud#55). The author of lockbud previously used it to find double-locking bugs in Lighthouse, so it would be cool to support the tool, contribute to its development, and run it on CI. Even if we do detect deadlocks using the tool, we still need one of the other strategies to mitigate them.

Avoid Locks

Rayon is only dangerous when locks are involved, so we could refactor code to avoid using locks inside Rayon tasks. This is potentially difficult to do comprehensively without a static analyser, as it can be unclear whether some function is running under Rayon, or may call into Rayon. For tree-hashing it is easy to guarantee that we don't obtain any non-local locks, so it is more obviously safe. In the op pool, we are currently faced with a mess of deeply nested function calls and interactions with the fork choice lock, which are more difficult to eyeball. If we had a persistent and cheap-to-copy data structure for fork choice, then the op pool could use a copy rather than locking.
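For illustration, the "use a copy rather than locking" idea might look something like this (ForkChoiceSnapshot and the function are hypothetical stand-ins, not real types):

```rust
use std::sync::RwLock;
use rayon::prelude::*;

// Hypothetical cheap-to-clone snapshot of fork choice state.
#[derive(Clone)]
struct ForkChoiceSnapshot {
    head_scores: Vec<u64>,
}

fn pack_in_parallel(fork_choice: &RwLock<ForkChoiceSnapshot>) -> u64 {
    // Take the lock briefly to clone, and release it *before* entering Rayon,
    // so no Rayon task ever touches the lock (weakening condition 2).
    let snapshot = fork_choice.read().unwrap().clone();
    snapshot.head_scores.par_iter().copied().sum()
}
```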

michaelsproul added the bug, code-quality and RFC (Request for comment) labels on Nov 27, 2023
sauliusgrigaitis commented

Rayon is great for initial parallelization, but it's hard when you hit its limitations. We are moving away from Rayon at Grandine, but we still use it in some places. A separate Tokio thread pool for CPU-intensive tasks is a great option.

winksaville (Contributor) commented

I personally like using channels/messaging to facilitate parallelism.
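For completeness, a minimal sketch of that style using a standard-library channel (entirely illustrative):

```rust
use std::sync::mpsc;
use std::thread;

// Workers own their data and send results back over a channel, so the
// parallel section needs no shared locks at all.
fn sum_in_workers(chunks: Vec<Vec<u64>>) -> u64 {
    let (tx, rx) = mpsc::channel();
    for chunk in chunks {
        let tx = tx.clone();
        thread::spawn(move || {
            let partial: u64 = chunk.iter().sum();
            tx.send(partial).expect("receiver dropped");
        });
    }
    drop(tx); // close the channel once all worker senders finish
    rx.iter().sum()
}
```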
