
Fix rebalancing-tasks bug and add tests #900

Merged

Conversation


@shrinandthakkar (Collaborator) commented on Mar 25, 2022

Problem:

Both the LoadBasedPartition and StickyPartition strategies follow the StickyMulticastStrategy to assign datastream tasks across the available instances.

In the StickyMulticastStrategy, the last step of the process rebalances the tasks across instances so that the per-instance task counts adhere to the imbalance threshold.

In other words, the difference between the max #tasks assigned to any instance and the min #tasks assigned to any other instance should not exceed the imbalance threshold.

The current implementation has a minor bug that, in some cases, fails to achieve that rebalance. The resulting imbalance causes validation failures and thus availability misses in the mirror-making service.


Failure Scenario: Observed in a production service with 6 instances connected to ZooKeeper and 538 tasks in total.

We saw from the production logs that the tasks were distributed across the instances with counts: 89, 89, 89, 89, 93, 89.

Given that we enforce an imbalance threshold of 1, the distribution should have been something like:
89, 89, 90, 90, 90, 90 (which still sums to 538)

In the rebalance code pasted below:

tasksTotal = 538
minTasksPerInstance = 538 / 6 = 89 (integer division)

Since the minimum number of tasks assigned to any instance (89) is not less than minTasksPerInstance, the while loop never executes and no rebalance is triggered, even though max - min = 93 - 89 = 4 exceeds the threshold of 1.

// STEP 3: Trigger rebalance if the difference in task counts is more than the configured threshold
if (newAssignment.get(instancesBySize.get(instancesBySize.size() - 1)).size()
    - newAssignment.get(instancesBySize.get(0)).size() > _imbalanceThreshold) {
  int tasksTotal = newAssignment.values().stream().mapToInt(Set::size).sum();
  int minTasksPerInstance = tasksTotal / instances.size();

  // some rebalance to increase the task count in instances below the minTasksPerInstance
  while (newAssignment.get(instancesBySize.get(0)).size() < minTasksPerInstance) {
    ...
  }
}
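
For concreteness, here is a minimal, self-contained sketch (an illustration only, not code from the patch) of why the loop body never runs with the production numbers above:

public class OldConditionCheck {
  public static void main(String[] args) {
    int tasksTotal = 89 * 5 + 93;              // 538 tasks in total
    int minTasksPerInstance = tasksTotal / 6;  // integer division: 89
    int minAssigned = 89;                      // smallest instance's task count
    // Old loop guard: rebalance only while minAssigned < minTasksPerInstance.
    System.out.println(minAssigned < minTasksPerInstance);  // false -> loop never runs
    // Yet max (93) - min (89) = 4, well above the imbalance threshold of 1.
  }
}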

Solution:

Update the rebalancing loop condition to keep rebalancing while the min #tasks assigned plus the imbalance threshold is less than the max #tasks assigned, terminating only once max - min <= imbalance threshold.
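
As a sanity check, here is a minimal, self-contained simulation (task counts only; instance and task objects omitted, so this is not the actual patch) showing that the corrected condition converges to the expected distribution from the scenario above:

import java.util.Arrays;

public class FixedConditionSim {
  public static void main(String[] args) {
    int imbalanceThreshold = 1;
    int[] counts = {89, 89, 89, 89, 89, 93};  // observed production distribution
    Arrays.sort(counts);
    // Fixed condition: keep moving one task from the most-loaded instance
    // to the least-loaded one while min + threshold < max.
    while (counts[0] + imbalanceThreshold < counts[counts.length - 1]) {
      counts[counts.length - 1]--;
      counts[0]++;
      Arrays.sort(counts);  // re-order by size, mirroring instancesBySize
    }
    System.out.println(Arrays.toString(counts));  // [89, 89, 90, 90, 90, 90]
  }
}

After three moves the loop stops because 89 + 1 is no longer less than 90, leaving a max-min difference of exactly the threshold.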



@vmaheshw (Collaborator) left a comment

Can you improve the description, explaining the problem in detail and how you are solving it?

@shrinandthakkar (Author)

> Can you improve the description, explaining the problem in detail and how you are solving it?

Done

@@ -205,7 +205,8 @@ public StickyMulticastStrategy(Optional<Integer> maxTasks, int imbalanceThreshold
     int minTasksPerInstance = tasksTotal / instances.size();

     // some rebalance to increase the task count in instances below the minTasksPerInstance
-    while (newAssignment.get(instancesBySize.get(0)).size() < minTasksPerInstance) {
+    while (newAssignment.get(instancesBySize.get(0)).size() + _imbalanceThreshold < newAssignment.get(
+        instancesBySize.get(instancesBySize.size() - 1)).size()) {
A collaborator commented on the new condition:

This is a lot to parse in one line. Consider pulling out into a couple vars so the logic is more obvious.
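
One hypothetical way to apply that suggestion (variable names are illustrative, not from the merged code; note that the counts must be re-read each iteration because the loop mutates the assignment and re-sorts instancesBySize):

int minTaskCount = newAssignment.get(instancesBySize.get(0)).size();
int maxTaskCount = newAssignment.get(instancesBySize.get(instancesBySize.size() - 1)).size();
while (minTaskCount + _imbalanceThreshold < maxTaskCount) {
  // ... move one task from the largest instance to the smallest ...
  minTaskCount = newAssignment.get(instancesBySize.get(0)).size();
  maxTaskCount = newAssignment.get(instancesBySize.get(instancesBySize.size() - 1)).size();
}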

@shrinandthakkar merged commit b821911 into linkedin:master on Apr 12, 2022