
Experimental: Introduce a pool of query planners #4897

Merged

o0Ignition0o merged 32 commits into dev from igni/query_planner_pool on Apr 9, 2024

Conversation

o0Ignition0o
Contributor

@o0Ignition0o o0Ignition0o commented Apr 3, 2024

Experimental: Introduce a pool of query planners (PR #4897)

The router supports a new experimental feature: a pool of query planners to parallelize query planning.

You can configure query planner pools with the supergraph.query_planner.experimental_parallelism option:

supergraph:
  query_planner:
    experimental_parallelism: auto # number of available cpus

Its value is the number of query planners that run in parallel, and its default value is 1. You can set it to the special value auto to automatically match the number of available CPUs.
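For instance, assuming the option shown above, a fixed pool of four planners would be configured like this (the value 4 is purely illustrative):

```yaml
supergraph:
  query_planner:
    experimental_parallelism: 4 # run four query planners in parallel
```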

You can discuss and comment on query planner pools in this GitHub discussion.

By @xuorig and @o0Ignition0o in #4897


router-perf bot commented Apr 3, 2024

CI performance tests

  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • xlarge-request - Stress test with 10 MB request payload
  • step - Basic stress test that steps up the number of users over time

o0Ignition0o and others added 16 commits April 3, 2024 14:20
This change addresses contention we didn't see before the pool of planners was introduced. Borrow semantics and locks make for a surprising pattern where a lock is held a bit too long.

This changeset addresses it, and we expect a performance boost at scale / under heavy load.
*Description here*

Fixes #**issue_number**

<!-- start metadata -->
---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR
is marked ready-for-review.

- [ ] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests, please document the manual testing (extensively) in the
Exceptions.

Co-authored-by: Marc-Andre Giroux <mgiroux@netflix.com>
@o0Ignition0o o0Ignition0o changed the title Igni/query planner pool Experimental: Introduce a pool of query planners Apr 4, 2024
o0Ignition0o and others added 2 commits April 5, 2024 09:58
Co-authored-by: Edward Huang <edward.huang@apollographql.com>
@o0Ignition0o o0Ignition0o marked this pull request as ready for review April 5, 2024 08:45
Contributor

@Geal Geal left a comment


Could you mention it in the docs?
I am not sure auto will be the best option here. Tokio already spawns threads according to the available capacity, so if there are as many planner threads as there are cores, we risk having no capacity left to handle requests, because the planners do blocking work.
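The sizing concern above can be sketched in a few lines of Rust. This is a hypothetical helper (the function name and the reserve of two cores are illustrative, not the router's actual policy) that caps a pool below the detected core count so the async runtime keeps some threads free:

```rust
use std::thread;

// Hypothetical sizing helper: cap the planner pool below the machine's
// core count so the async runtime keeps some threads free for request
// handling. Reserving two cores is an illustrative choice, not the
// router's actual policy.
fn planner_pool_size(reserved_for_runtime: usize) -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // Never return zero: we always want at least one planner.
    cores.saturating_sub(reserved_for_runtime).max(1)
}

fn main() {
    println!("planner pool size: {}", planner_pool_size(2));
}
```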

Contributor

@garypen garypen left a comment


This seems to be different to the approach of using web workers that you were discussing earlier in the week. Was there a problem with that approach? I'm asking because I'm curious, but I think this is a better approach than that anyway.

One other thing worth asking: instead of managing a queue using async_channel, maybe put some kind of adaptive, load-shedding queuing model in front of the pool? e.g. https://crates.io/crates/little-loadshedder, and expose the configuration so that users can express a queue wait time in configuration?

apollo-router/src/error.rs Outdated Show resolved Hide resolved
apollo-router/src/plugins/cache/entity.rs Show resolved Hide resolved
@o0Ignition0o
Contributor Author

This seems to be different to the approach of using web workers that you were discussing earlier in the week. Was there a problem with that approach? I'm asking because I'm curious, but I think this is a better approach than that anyway.

It turns out both approaches are equivalent in terms of runtime capabilities. Thankfully, a V8 runtime is initialized in a static :D

The good news is we should be able to drop in any planner implementation in the future.

One other thing worth asking: instead of managing a queue using async_channel, maybe put some kind of adaptive, load-shedding queuing model in front of the pool? e.g. https://crates.io/crates/little-loadshedder, and expose the configuration so that users can express a queue wait time in configuration?

This could be worth considering as a follow-up. I'm fairly happy with the MPMC approach, since workers pick up new jobs as soon as they're ready to handle them.
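The pick-when-ready pattern described above can be sketched with a shared queue. The router itself uses an async_channel MPMC channel; this std-only sketch (all names illustrative, and summing numbers standing in for planning queries) shows the same idea with threads sharing one receiver:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Minimal std-only sketch of the "workers pull jobs when ready" pattern.
// A Mutex-wrapped Receiver shared between threads plays the role of the
// MPMC channel: job acquisition is serialized, processing is parallel.
fn run_pool(pool_size: usize, jobs: Vec<u64>) -> u64 {
    let (tx, rx) = mpsc::channel::<u64>();
    let rx = Arc::new(Mutex::new(rx));
    let handles: Vec<_> = (0..pool_size)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut local_sum = 0u64;
                loop {
                    // Each worker grabs the next job as soon as it is free.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(n) => local_sum += n, // stand-in for planning work
                        Err(_) => break,         // channel closed: no more jobs
                    }
                }
                local_sum
            })
        })
        .collect();
    for j in jobs {
        tx.send(j).unwrap();
    }
    drop(tx); // close the queue so idle workers exit
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let total = run_pool(4, (1..=10).collect());
    println!("processed total: {total}"); // 55
}
```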

@o0Ignition0o o0Ignition0o enabled auto-merge (squash) April 8, 2024 12:48
Contributor

@BrynCooke BrynCooke left a comment


Needs a config metric.

.changesets/exp_carton_ginger_magnet_beacon.md Outdated Show resolved Hide resolved
Contributor

@garypen garypen left a comment


Does anyone else see this warning:

garypen@Garys-MacBook-Pro router % cargo check                         
warning: /Users/garypen/dev/router/apollo-router/Cargo.toml: file `/Users/garypen/dev/router/apollo-router/benches/planner.rs` found to be present in multiple build targets:
  * `example` target `planner`
  * `bench` target `planner`
<etc...>

Maybe address this? I don't know if it really matters, but ...

@o0Ignition0o
Contributor Author

@shorgi shorgi mentioned this pull request Apr 8, 2024
@garypen
Contributor

garypen commented Apr 9, 2024

@garypen this is odd, I don't see several targets in the Cargo.toml:

https://github.com/apollographql/router/pull/4897/files#diff-aca654efc6c22bebf4bd167370ab3bf380f3e086befe3d7c6761a8f7eb59d89c

The warning is removed if you add:

autobenches = false

to the [package] section in apollo-router/Cargo.toml. Target autodiscovery is the cause of the problem.

I think you need to decide if this is the behaviour you want wrt other benches/examples etc...
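Concretely, the suggested fix would be a one-line addition to the manifest's [package] section (a sketch; other manifest fields are omitted):

```toml
# apollo-router/Cargo.toml
[package]
# Disable bench-target autodiscovery so benches/planner.rs is only
# built for the targets that are declared explicitly.
autobenches = false
```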

@o0Ignition0o
Contributor Author

@garypen great catch! Fixed in d492701.

@o0Ignition0o o0Ignition0o enabled auto-merge (squash) April 9, 2024 13:33
@o0Ignition0o o0Ignition0o merged commit 99824bf into dev Apr 9, 2024
13 of 14 checks passed
@o0Ignition0o o0Ignition0o deleted the igni/query_planner_pool branch April 9, 2024 13:42
o0Ignition0o pushed a commit that referenced this pull request Apr 16, 2024
Docs for query planner pool
(#4897)
@abernix abernix mentioned this pull request Apr 22, 2024
7 participants