add poller autoscaling in activity and decision workers #1186

Merged · 7 commits merged into uber-go:master on Sep 2, 2022

Conversation

@shijiesheng (Contributor) commented on Aug 30, 2022

What changed?

  • add pollerAutoScalerOptions
  • surface user-set options and the feature flag that enables poller autoscaling
  • start autoscaling in the workers

Why?

How did you test it?

Potential risks

@coveralls commented on Aug 31, 2022

Pull Request Test Coverage Report for Build 0182fefc-7a13-49a1-8887-001a9f9b0ce0

  • 79 of 81 (97.53%) changed or added relevant lines in 3 files are covered.
  • 13 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.04%) to 64.164%

Changes Missing Coverage:
File | Covered Lines | Changed/Added Lines | %
internal/internal_worker_base.go | 32 | 34 | 94.12%

Files with Coverage Reduction:
File | New Missed Lines | %
internal/compatibility/thrift/enum.go | 4 | 21.2%
internal/compatibility/thrift/types.go | 9 | 48.42%
Totals Coverage Status
Change from base Build 0182dc27-c213-4eb3-a77e-c81ef840dc7b: 0.04%
Covered Lines: 12598
Relevant Lines: 19634

💛 - Coveralls

@shijiesheng changed the title from "[WIP] add poller autoscaling in activity and decision workers" to "add poller autoscaling in activity and decision workers" on Aug 31, 2022

// optional: Sets the interval of poller autoscaling, between which poller autoscaler changes the poller count
// based on poll result. It takes effect if FeatureFlags.PollerAutoScalerEnabled is set to true.
// Default value is 1 min
A reviewer (Contributor) commented:

Is 1 minute fast enough?

@shijiesheng (Author) replied:

I'm still experimenting on this. Since we ensure at least one poller is always working, the impact of this parameter should be low.
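
For context, a hypothetical sketch of how a user might opt in. Only FeatureFlags.PollerAutoScalerEnabled is named in this PR (in the doc comment quoted above); where that flag lives in worker.Options and the interval option name below are assumptions for illustration, not the PR's actual API:

```go
// Hypothetical sketch only. FeatureFlags.PollerAutoScalerEnabled is the flag
// quoted in the doc comment above; the surrounding worker.Options wiring and
// PollerAutoScalerCooldown are assumed names used purely for illustration.
opts := worker.Options{
	FeatureFlags: worker.FeatureFlags{
		PollerAutoScalerEnabled: true, // turn the autoscaler on
	},
	// Assumed option name; the PR documents a 1-minute default interval.
	PollerAutoScalerCooldown: time.Minute,
}
w := worker.New(service, "sample-domain", "sample-tasklist", opts)
_ = w
```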

targetMilliUsage uint64,
isDryRun bool,
cooldownTime time.Duration,
options pollerAutoScalerOptions,
logger *zap.Logger,
hooks ...func()) *pollerAutoScaler {
ctx, cancel := context.WithCancel(context.Background())
A reviewer (Contributor) commented:

is this a sign we should be passing in a context? e.g. to propagate shutdown information maybe.
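
A minimal sketch of that suggestion, keeping the parameters from the hunk above; the constructor name and struct fields here are assumptions. The caller supplies the parent context, so worker shutdown propagates into the autoscaler instead of the autoscaler owning a detached context.Background():

```go
// Sketch: derive the autoscaler's context from a caller-supplied parent so
// shutdown propagates naturally. Names other than the quoted parameters are
// assumptions, not the PR's actual code.
func newPollerScaler(
	parent context.Context, // new: caller-provided parent context
	targetMilliUsage uint64,
	isDryRun bool,
	cooldownTime time.Duration,
	options pollerAutoScalerOptions,
	logger *zap.Logger,
	hooks ...func(),
) *pollerAutoScaler {
	// was: context.WithCancel(context.Background())
	ctx, cancel := context.WithCancel(parent)
	return &pollerAutoScaler{
		ctx:    ctx,
		cancel: cancel,
		// ... remaining fields wired up as in the PR
	}
}
```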


// Stop stops the poller autoscaler
func (p *pollerAutoScaler) Stop() {
p.cancel()
A reviewer (Contributor) commented:

not relevant to this PR, but in general: we probably want a (time-bounded?) blocking "... and wait for shutdown to complete" behavior here.
otherwise it's impossible to rule out dangling goroutines (e.g. goleak tests) or race conditions (background loop triggering at the same time as cancelation).
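
A minimal sketch of the time-bounded blocking Stop the comment describes; the wg and logger fields are assumptions here, standing in for however the autoscaler tracks its background goroutine:

```go
// Sketch: cancel the autoscaler's context, then wait (with an upper bound)
// for the background loop to finish so callers can't race with it.
func (p *pollerAutoScaler) Stop() {
	p.cancel()
	done := make(chan struct{})
	go func() {
		p.wg.Wait() // assumed sync.WaitGroup incremented by the autoscale loop
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(10 * time.Second):
		p.logger.Warn("poller autoscaler did not stop within timeout")
	}
}
```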

Comment on lines 278 to 285
if bw.pollLimiter == nil || bw.pollLimiter.Wait(bw.limiterContext) == nil {
if bw.pollerAutoScaler != nil {
if pErr := bw.pollerAutoScaler.Acquire(1); pErr == nil {
defer bw.pollerAutoScaler.Release(1)
} else {
bw.logger.Warn("poller auto scaler acquire error", zap.Error(pErr))
}
}
@Groxx (Contributor) commented on Aug 31, 2022:

somewhat pedantic / realistic values seem unlikely to cause problems here, but I'm pretty sure this ordering means this is possible:

  • autoscale from N to 1
  • allow N to get through pollLimiter, while autoscaler allows one long request and blocks N-1
  • long request finishes, N-1 run through as fast as possible, ignoring the rate limiter
    • sequentially, as autoscaler is at 1. or 2 at once if at 2, etc.

swapping the order seems better, since "waiting on the per-second request limiter" feels like "a concurrent poller" to me. the request itself isn't running, but we've reserved space for it to run "on time". so we'll degrade "concurrent requests on the network" (allowing it to go lower than specified, though it is in principle just a maximum) in favor of preserving "maximum requests per second".
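
A sketch of the swapped ordering described above, using the same identifiers as the quoted hunk: hold the autoscaler permit first, then wait on the per-second limiter, so a request queued behind the limiter still counts toward the concurrent-poller budget.

```go
// Sketch of the reordering suggested in this thread (same identifiers as the
// hunk quoted above): acquire the autoscaler permit before the rate limiter.
if bw.pollerAutoScaler != nil {
	if pErr := bw.pollerAutoScaler.Acquire(1); pErr == nil {
		defer bw.pollerAutoScaler.Release(1)
	} else {
		bw.logger.Warn("poller auto scaler acquire error", zap.Error(pErr))
	}
}
if bw.pollLimiter == nil || bw.pollLimiter.Wait(bw.limiterContext) == nil {
	// ... issue the poll request as before
}
```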

@Groxx (Contributor) left a review comment:

A minor thing worth changing, but otherwise this all looks reasonable / is mostly "just" a (good) refactor.

so 👍 structurally LGTM. I should read more of the autoscaler before reviewing in more depth though.


unrelated / not a problem for any realistic values for us, but the semaphore library doesn't detect overflows: https://github.com/marusama/semaphore/blob/master/semaphore.go#L98
we should probably make a PR for that. it'd be easy to fix.
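
Not the library's actual code, just a generic illustration of the kind of guard that comment has in mind when a new limit is derived from arithmetic on the old one:

```go
// Generic overflow guard (illustrative only): reject a limit change whose
// signed arithmetic wrapped around or produced a non-positive limit.
func nextLimit(current, delta int32) (int32, error) {
	next := current + delta
	if delta > 0 && next < current {
		return 0, fmt.Errorf("semaphore limit overflow: %d + %d", current, delta)
	}
	if next < 1 {
		return 0, fmt.Errorf("semaphore limit must be positive, got %d", next)
	}
	return next, nil
}
```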

@shijiesheng merged commit 3a730da into uber-go:master on Sep 2, 2022
maeer pushed a commit to maeer/cadence-client that referenced this pull request Jun 29, 2023
* add pollerAutoScalerOptions

* add tests for options

* change autoscaler interface to include Stop method

* add poller auto scaler to task worker

* gracefully stop autoscaler

* move up autoscaler in pollTask

* add unit tests