
Track execute() and enqueue() tasks separately from scheduled tasks. #2645

Merged

Conversation


@gmilos gmilos commented Feb 12, 2024

Motivation:

SelectableEventLoop uses ScheduledTask to track tasks scheduled for some specific point in time, but also tasks that are supposed to be executed immediately.

This causes performance issues when the number of immediate tasks grows, as they need to be kept sorted in the _scheduledTasks PriorityQueue. Additionally, that data structure is protected by lock(s), which causes lock contention from other threads.

Finally, immediate tasks differ from future scheduled tasks in that they don't have a failFn to handle destruction at the end of SelectableEventLoop.run() execution.

Modifications:

  • Modify ScheduledTask to only track closure tasks and require failFn to be provided.
  • Retain UnderlyingTask for tracking either a closure or an ErasedUnownedJob for async-await tasks.
  • Add SelectableEventLoop._immediateTasks to track immediate tasks (collection of UnderlyingTask-s).
  • Split production / consumption of the tasks to match the data structure changes described above.
  • Modify the defer block of SelectableEventLoop.run() to execute immediate tasks, instead of dropping them silently.
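The shape of the split can be sketched roughly like this. This is a hypothetical, simplified illustration (TaskQueues, schedule, and execute are stand-in names, not the real NIO API; the actual SelectableEventLoop uses NIO's PriorityQueue, a Deque, and locking):

```swift
// Hypothetical sketch of keeping immediate tasks out of the priority queue.
struct TaskQueues {
    // Scheduled tasks keep deadline order (stand-in for a binary heap).
    var scheduledTasks: [(deadline: Int, task: () -> Void)] = []
    // Immediate tasks only need FIFO order: a flat collection avoids the
    // O(log n) heap maintenance and shortens the time spent under the lock.
    var immediateTasks: [() -> Void] = []

    mutating func schedule(deadline: Int, _ task: @escaping () -> Void) {
        scheduledTasks.append((deadline: deadline, task: task))
        scheduledTasks.sort { $0.deadline < $1.deadline } // heap sift-up stand-in
    }

    mutating func execute(_ task: @escaping () -> Void) {
        immediateTasks.append(task) // O(1) append, no ordering work
    }
}
```

The key design point is that execute() and enqueue() callers never pay for deadline ordering they don't need.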

Result:

Performance of workloads relying on queuing lots of tasks gets better.


gmilos commented Feb 12, 2024

The test failure is ... because the number of allocations was lower than expected. By about 1k. That's a good thing.

17:35:51  ++ assert_greater_than 155002 156050
17:35:51  ++ [[ ! 155002 -gt 156050 ]]

And ... I see the fix just got merged: e4102ae#diff-660361849e6074a975e1787c5836a6138eef55096f2b80fb76d5da0af73b48d5R40

@gmilos gmilos requested a review from Lukasa February 12, 2024 18:57
@gmilos gmilos marked this pull request as ready for review February 12, 2024 18:57

gmilos commented Feb 12, 2024

@weissi this is the change we discussed in person.


gmilos commented Feb 12, 2024

@swift-server-bot test this please


@Lukasa Lukasa left a comment


Sources/NIOPosix/SelectableEventLoop.swift (review comment, resolved)
// therefore the best course of action is to run them.
for task in immediateTasksCopy {
self.run(task)
}
Contributor:

This is a fairly substantial behavioural change. Previously these tasks got dropped. Normally I’d say the logic here is good but we should prepare ourselves for a few subtle bugs popping out of this.

@gmilos gmilos Feb 12, 2024

Yes, true. But without this, applications would have their Tasks/functions never executed at all. I don't think any behaviour can be worse than that. We haven't noticed because, before Tasks, this path wasn't frequently exercised. But yes, there could be some unintended consequences.

Member:

Agreed it's a substantial change but it's definitely a good one. I think this will actually make tearing down an EventLoop much easier than it was before this patch. Previously we just couldn't implement this sensibly, now we can.

But of course, it'll likely trigger a few subtle changes here or there but I can't see this changing anything that was reliably working fine for the worse.

Sources/NIOPosix/SelectableEventLoop.swift (two review comments, resolved)

weissi commented Feb 12, 2024

@swift-nio-bot test perf please


weissi commented Feb 12, 2024

@swift-nio-bot perf test please


weissi commented Feb 12, 2024

@gmilos nice!

asyncwriter_single_writes_1M_times | 1.464787299 | 1.596832264 | current | -8%


8% better in one of the perf tests that actually uses async stuff. That's expected to be one of the few ones because most perf tests won't use execute. But that one will!


Weirdly, it claims that this one is 8% worse

schedule_and_run_100k_tasks | 0.252803345 | 0.233980813 | previous | 8%



gmilos commented Feb 13, 2024

@weissi how much signal is in the performance tests? Looks like there are a few that did regress, including:

lock_8_threads_10M_ops | 0.957658591 | 0.873560162 | previous | 9%


gmilos commented Feb 13, 2024

And another:

future_whenallcomplete_100k_deferred_on_loop | 0.084619949 | 0.08085791 | previous | 4%

There seems to be a pattern where the tests with large queues degraded in performance. I'll take a look.

@gmilos gmilos force-pushed the gm-selectableeventloop-job-queueing-optimisations branch from daf6e3d to 1b7ad7d Compare February 13, 2024 10:04

gmilos commented Feb 13, 2024

@swift-nio-bot perf test please


gmilos commented Feb 13, 2024

I reviewed the test responsible for:

lock_8_threads_10M_ops | 0.957658591 | 0.873560162 | previous | 9%

This doesn't exercise my code change in any meaningful way. So the degradation must be due to an unrelated difference. I'm reviewing the test @weissi pointed out:

schedule_and_run_100k_tasks | 0.252803345 | 0.233980813 | previous | 8%

This may be down to me.


gmilos commented Feb 13, 2024

I reviewed:

schedule_and_run_100k_tasks | 0.252803345 | 0.233980813 | previous | 8%

In the current version of the PR (which isn't the same as what was perf tested before), I see a ~2x improvement. I had to raise the number of repeats by 100x to make it human scale. I also recorded samples, which point to better ARC performance with the simpler task type stored in ScheduledTask.


gmilos commented Feb 13, 2024

@swift-server-bot test this please


weissi commented Feb 13, 2024

@swift-server-bot test perf please


weissi commented Feb 13, 2024

@swift-server-bot perf test please


weissi commented Feb 13, 2024

@swift-server-bot add to allowlist


weissi commented Feb 13, 2024

@Lukasa the perf tests are broken: Both of the results are from 3rd Jan (run id 156). Don't know why it still posts them as if they were current

@gmilos gmilos force-pushed the gm-selectableeventloop-job-queueing-optimisations branch from e05c0c7 to a51573a Compare February 13, 2024 15:41

gmilos commented Feb 13, 2024

I did a manual performance test run on my desktop. I ran the perf tests on main and on my branch. They show major wins:

future_whenallsucceed_10k_deferred_off_loop	0.011824291	0.0222875	current	-46%
future_whenallcomplete_10k_deferred_off_loop	0.005605708	0.016276791	current	-65%
schedule_and_run_100k_tasks	0.108089916	0.200684584	current	-46%
execute_100k_tasks	0.014859583	0.153192833	current	-90%

There are a couple of regressions too:
1.

lock_8_threads_10M_ops	0.215040375	0.18947725	previous	13%

but it looks like the lock tests are noisy (some of the others go faster, so it's probably just natural variability in the OS locking)

schedule_100k_tasks	0.037446041	0.033186875	previous	12%

When I started looking into the problem, I realised the test is badly implemented. The test prewarms the heap with 100k tasks, runs them to completion, and hands over to the measured test run. But the measured test loop then runs 10x, placing 10x as many tasks into the heap as the prewarm did. We're therefore testing the efficiency of heap doubling and not just the scheduling. I was able to confirm this is responsible for the test instability by running it with a much bigger set of tasks:

measuring: schedule_10000k_tasks: 3.708622625, 4.334766542, 3.842835292, 4.574104792, 7.119135958, 4.038103833, 4.324497875, 4.464178167, 4.49384075, 4.640537542,

The runs range from 3.7 to 7.1.

I'll try to fix this test separately from this PR.
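The arithmetic behind the flaw is simple enough to show directly (numbers taken from the comment above; the variable names are illustrative, not from the actual performance test):

```swift
// Back-of-envelope illustration of the prewarm mismatch: the prewarm
// sizes the heap for one batch, but the measured loop inserts ten.
let prewarmTasks = 100_000
let measuredLoopIterations = 10
let tasksInsertedDuringMeasurement = prewarmTasks * measuredLoopIterations
// 1_000_000 tasks go into a heap prewarmed for 100_000, so the measured
// time includes repeated storage doubling, not just scheduling cost.
```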

I'm also attaching the full manual perf test results.

With all of this, @Lukasa the PR is ready for re-review.

swift-nio-perf-test-comparison-82fd942745b11ccebbf0db3e9e4bf150b60e5e44-to-e05c0c73d658ea6c4a38df866294f00941479b55.md


gmilos commented Feb 13, 2024

@swift-server-bot test this please

2 similar comments

gmilos commented Feb 13, 2024

@swift-server-bot test this please


gmilos commented Feb 13, 2024

@swift-server-bot test this please


weissi commented Feb 13, 2024

I did a manual performance test run on my desktop. I ran the perf tests on main and on my branch. They show major wins:

future_whenallsucceed_10k_deferred_off_loop	0.011824291	0.0222875	current	-46%
future_whenallcomplete_10k_deferred_off_loop	0.005605708	0.016276791	current	-65%
schedule_and_run_100k_tasks	0.108089916	0.200684584	current	-46%
execute_100k_tasks	0.014859583	0.153192833	current	-90%

nice!!


There are a couple of regressions too: 1.

lock_8_threads_10M_ops 0.215040375 0.18947725 previous 13%


but it looks like the lock tests are noisy (some of the others go faster, so it's probably just natural variability in the OS locking)

yes, ignore the locks one for this. They depend on the contention reached.

schedule_100k_tasks	0.037446041	0.033186875	previous	12%

When I started looking into the problem, I realised the test is badly implemented. The test prewarms the heap with 100k tasks, runs them to completion, and hands over to the measured test run. But the measured test loop then runs 10x, placing 10x as many tasks into the heap as the prewarm did. We're therefore testing the efficiency of heap doubling and not just the scheduling. I was able to confirm this is responsible for the test instability by running it with a much bigger set of tasks:

measuring: schedule_10000k_tasks: 3.708622625, 4.334766542, 3.842835292, 4.574104792, 7.119135958, 4.038103833, 4.324497875, 4.464178167, 4.49384075, 4.640537542,

The runs range from 3.7 to 7.1.

I'll try to fix this test separately from this PR.

awesome, thank you!!

Co-authored-by: Johannes Weiss <johannesweiss@apple.com>

gmilos commented Feb 14, 2024

@weissi that's all the feedback handled. lmk if there is any more.


gmilos commented Feb 14, 2024

@swift-server-bot test this please

1 similar comment

gmilos commented Feb 14, 2024

@swift-server-bot test this please


@weissi weissi left a comment


Thank you! That looks good to me.

@Lukasa Lukasa added the semver/minor (Adds new public API.) label and removed the semver/patch (No public API change.) label Mar 1, 2024

Lukasa commented Mar 1, 2024

From a "strict semver" standpoint this is a patch change, but I'm going to take some editorial discretion and mark it "minor". We're making a subtle but observable behavioural change, and while it will probably be fine, I want to make it possible for users to ensure they don't have to deal with both variants, by calling this a minor instead of a patch.


@Lukasa Lukasa left a comment


Generally looking really nice. I have a few small refactoring suggestions, as we're in the space and you've already started with refactors ;)

Sources/NIOPosix/SelectableEventLoop.swift (three review comments, resolved)
}
}

assert(self.tasksCopy.count <= Self.taskCopyBatchSize)
Contributor:

This assertion seems like it is straightforward to break. We can add two tasks in each loop iteration above, but we only check the length once. I think the loop above needs an extra length check.

Contributor Author:

Good catch. Reorganising the loop to check the length of the tasksCopy array at each point we're mutating (adding to) it.

Contributor Author:

Heh, I tried to write a test for this, and realised the code before was fine, but only just. It's because:

  • if _scheduledTasks and _immediateTasks are non-empty, we'd be adding 2 items to the tasksCopy each time round the loop. And taskCopyBatchSize is even, so one test every 2 items is sufficient
  • if either _scheduledTasks or _immediateTasks run out of tasks, we'd be testing the size of tasksCopy once per each addition. This applies even if we just run out when we're terminating the batch.

This wasn't intentional micro-optimisation, so I'm adding the additional length check back to make the code safer against future refactoring (including making the size of the batch odd).
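The invariant being discussed can be sketched hypothetically. The names tasksCopy and taskCopyBatchSize follow the diff; everything else here (Int stand-ins for tasks, removeLast in place of the real queue operations) is simplified and not the actual NIO code:

```swift
// Sketch of a batched drain that checks the length after *every* append,
// so the assert survives future refactors (including an odd batch size).
let taskCopyBatchSize = 4096
var scheduledReady = Array(repeating: 0, count: 10_000) // due scheduled tasks
var immediate = Array(repeating: 1, count: 10_000)      // immediate tasks
var tasksCopy: [Int] = []
tasksCopy.reserveCapacity(taskCopyBatchSize)

outer: while !scheduledReady.isEmpty || !immediate.isEmpty {
    if !scheduledReady.isEmpty {
        tasksCopy.append(scheduledReady.removeLast())
        if tasksCopy.count >= taskCopyBatchSize { break outer }
    }
    if !immediate.isEmpty {
        tasksCopy.append(immediate.removeLast())
        if tasksCopy.count >= taskCopyBatchSize { break outer }
    }
}
assert(tasksCopy.count <= taskCopyBatchSize)
```

With only one check per iteration the invariant holds just by parity (two appends, even batch size); the per-append check removes that hidden coupling.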

Contributor:

Yeah my concern was very much about future refactors 😉

immediateTasksCopy.reserveCapacity(self._immediateTasks.count)
while let immediate = self._immediateTasks.popFirst() {
immediateTasksCopy.append(immediate)
}
Contributor:

Instead of this while let, can we use swap?

Contributor Author:

This is removing tasks from self._immediateTasks and putting them to a different array immediateTasksCopy. So swap isn't applicable, unless there is some other swap I'm not aware of.
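For context, one reading of the swap suggestion is the standard-library `swap(_:_:)`, which exchanges the whole backing storage with an empty container in O(1) rather than popping element by element. Whether that fits here depends on the surrounding locking and ownership; this is a hedged sketch with stand-in values, not the actual NIO code:

```swift
// Swap-based drain: exchange storage wholesale instead of popping in a loop.
var pendingImmediateTasks = [10, 20, 30]   // stand-in for the Deque of tasks
var immediateTasksCopy: [Int] = []
swap(&pendingImmediateTasks, &immediateTasksCopy)
// The original is left empty; the copy now owns all the queued tasks.
```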


@gmilos gmilos requested a review from Lukasa March 4, 2024 16:17
@gmilos gmilos force-pushed the gm-selectableeventloop-job-queueing-optimisations branch from 0ab81f4 to 4fb9f15 Compare March 5, 2024 08:17
@gmilos
Copy link
Contributor Author

gmilos commented Mar 5, 2024

Looks like Deque has different growth properties compared to CircularBuffer. I had to bump some allocation-count budgets to account for that.


gmilos commented Mar 5, 2024



// nextScheduledTaskDeadline is the overall next deadline, but iff there are no more immediate tasks left.
return moreImmediateTasksToConsider ? now : nextScheduledTaskDeadline
Contributor:

Will we ever hit now? We can only get here if both moreImmediateTasksToConsider and moreScheduledTasksToConsider are both false, which should imply we never hit now.

Contributor Author:

Ah, yes. This changed because the batch size checks got moved to each of the if-s, and not the loop overall. I think the statement is harmless, but can be simplified. Checking it carefully.

@gmilos gmilos Mar 5, 2024

The statement was indeed harmless: while it would always return the deadline for the next scheduled task, that deadline is only enacted iff there are no more tasks to run. In all other cases we have to recheck, because more tasks may get queued up. Still, this deserved a cleanup, with now returned if there are any immediate tasks left, plus an associated assert. I also added a test that tries to get immediate tasks stuck (it never reproduced a problem, but it's a useful defence for the future).
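The cleaned-up deadline choice can be sketched as follows. The names approximate those in the diff, but this is a simplified illustration (Int in place of NIODeadline, nil meaning no pending scheduled tasks), not the actual NIO code:

```swift
// Pick the event loop's next wake-up time.
func nextWakeup(now: Int,
                moreImmediateTasksToConsider: Bool,
                nextScheduledTaskDeadline: Int?) -> Int? {
    // Any immediate task left means we must come straight back around.
    if moreImmediateTasksToConsider { return now }
    // Otherwise sleep until the earliest scheduled deadline, if any.
    return nextScheduledTaskDeadline
}
```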

@gmilos gmilos merged commit 5e47077 into apple:main Mar 5, 2024
10 checks passed
Labels
semver/minor Adds new public API.