Add Stats to DatastreamTaskImpl #855

Merged
merged 5 commits into from
Oct 25, 2021

vmaheshw
Collaborator

We frequently hear requests for task-level metrics for diagnostics that can be retrieved through the brooklin-service endpoint.

LoadBasedPartitionAssignmentStrategy distributes partitions evenly based on load. To debug and validate that distribution, we need to be able to pull metrics at the task level and run offline analytics on the data.

This PR exposes a new knob, stats, that can be used to save task-level stats in ZooKeeper. The saved stats can then be retrieved in the same way as data from the other endpoints.
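
As an illustration of the kind of offline analysis described above, here is a minimal sketch that checks how evenly load is spread across tasks. It assumes the saved stats have already been fetched and reduced to one throughput total per task; the class name, method name, and input shape are hypothetical and not part of this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical offline check, not Brooklin code: given one throughput total
// per task (assumed to come from the saved task stats), report how evenly
// the load is spread.
class TaskLoadSpread {
    /** Ratio of the busiest task's load to the mean load; 1.0 means perfectly even. */
    static double maxToMeanRatio(Map<String, Long> throughputByTask) {
        if (throughputByTask.isEmpty()) {
            throw new IllegalArgumentException("no task stats supplied");
        }
        long max = Long.MIN_VALUE;
        long sum = 0;
        for (long value : throughputByTask.values()) {
            max = Math.max(max, value);
            sum += value;
        }
        double mean = (double) sum / throughputByTask.size();
        // With zero total load every task is equally (un)loaded.
        return mean == 0.0 ? 1.0 : max / mean;
    }

    public static void main(String[] args) {
        Map<String, Long> stats = new LinkedHashMap<>();
        stats.put("task-0", 100L);
        stats.put("task-1", 100L);
        stats.put("task-2", 400L);
        // The busiest task carries twice the mean load here.
        System.out.println(maxToMeanRatio(stats));
    }
}
```

A ratio close to 1.0 would suggest the assignment is balanced; what threshold counts as "skewed" is a judgment call for whoever runs the diagnostics.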

@vmaheshw vmaheshw marked this pull request as draft October 19, 2021 07:50
@vmaheshw vmaheshw marked this pull request as ready for review October 19, 2021 17:54
Collaborator

@surajkn surajkn left a comment

Just curious, why do we want to save these stats in ZK instead of simply reporting them as task-level metrics? Is it because in Ingraphs it's hard or impossible to identify a specific task's metrics?

@vmaheshw
Collaborator Author

Just curious, why do we want to save these stats in ZK instead of simply reporting them as task-level metrics? Is it because in Ingraphs it's hard or impossible to identify a specific task's metrics?

The number of metrics we can emit from the container is limited. Also, if we want to build a diagnostics command that collects this information from large clusters and analyzes it, that is difficult to do with metrics. Finally, these metrics are emitted only by the leader, and after a leader switch they will not be emitted until the datastream is restarted.

@vmaheshw vmaheshw requested a review from surajkn October 20, 2021 21:33
surajkn
surajkn previously approved these changes Oct 21, 2021
@@ -681,6 +682,11 @@ private void addTaskNodes(String instance, DatastreamTaskImpl task) {
KeyBuilder.datastreamTaskState(_cluster, task.getConnectorType(), task.getDatastreamTaskName());
_zkclient.ensurePath(taskStatePath);

// save the task stats.
Collaborator

Update the task node's directory structure in the method description above. This is a new subdirectory "stats" under the task, correct?

Collaborator Author

No, the "stats" directory will be inside the "state" directory and will be conditional.
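
To make the layout in this reply concrete, here is a minimal sketch of how the paths could be composed. The path template, the helper names, and the `saveStats` condition are all assumptions for illustration; this is not Brooklin's `KeyBuilder` API, and the PR does not spell out here what the condition is.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: the path layout and every name below are assumptions,
// not Brooklin's actual KeyBuilder implementation.
class TaskZkPaths {
    /** Assumed layout: /{cluster}/connectors/{connectorType}/{taskName}/state */
    static String taskStatePath(String cluster, String connectorType, String taskName) {
        return String.format("/%s/connectors/%s/%s/state", cluster, connectorType, taskName);
    }

    /** The "stats" node sits under the task's "state" node, not directly under the task. */
    static String taskStatsPath(String cluster, String connectorType, String taskName) {
        return taskStatePath(cluster, connectorType, taskName) + "/stats";
    }

    /**
     * Lists the nodes that would be created for a task. The stats node is
     * conditional; saveStats is a stand-in for whatever the real condition is.
     */
    static List<String> nodesToEnsure(String cluster, String connectorType, String taskName,
                                      boolean saveStats) {
        List<String> nodes = new ArrayList<>();
        nodes.add(taskStatePath(cluster, connectorType, taskName));
        if (saveStats) {
            nodes.add(taskStatsPath(cluster, connectorType, taskName));
        }
        return nodes;
    }
}
```

For example, `TaskZkPaths.nodesToEnsure("c", "kafka", "t1", false)` would return only the state path, while passing `true` would also include the nested stats path.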

DatastreamTaskImpl newTask) {
PartitionAssignmentStatPerTask stat = PartitionAssignmentStatPerTask.fromJson(((DatastreamTaskImpl) task).getStats());
if (partitionInfoMap.isEmpty()) {
stat.isThroughputRateLatest = false;
Collaborator

Does it make sense to have a timestamp field here instead of having the latest flag, so that we get a sense of the last partition throughput distribution more accurately?

Collaborator Author

Yes, we can add a timestamp. We still need the latest flag, because not all partition assignments use throughput-based balancing.

Collaborator Author

I will address it separately.

Collaborator

gotcha! thanks
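
For reference, the flag-plus-timestamp idea discussed in this thread could look roughly like the sketch below. Only `isThroughputRateLatest` and the empty-map check come from the diff; the class, the `throughputUpdatedAtMs` field, and the `update` method are hypothetical, since the timestamp was deferred to a separate change.

```java
import java.util.Map;

// Hypothetical sketch of a per-task stat that keeps both the "latest" flag
// and a timestamp. Not the PartitionAssignmentStatPerTask class from the PR.
class ThroughputStatSketch {
    // Field name taken from the diff: false when no throughput data was
    // available for the most recent assignment.
    boolean isThroughputRateLatest;

    // Hypothetical field: when throughput data was last applied, or -1 if never.
    long throughputUpdatedAtMs = -1L;

    /** Mirrors the diff's check: an empty partition info map means the data is stale. */
    void update(Map<String, Long> partitionInfoMap, long nowMs) {
        if (partitionInfoMap.isEmpty()) {
            // Keep the old timestamp so readers can tell how stale the data is.
            isThroughputRateLatest = false;
            return;
        }
        isThroughputRateLatest = true;
        throughputUpdatedAtMs = nowMs;
    }
}
```

Keeping both fields reflects the point made above: the flag says whether the current assignment used throughput data at all, while the timestamp says how old the last throughput-based distribution is.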

@vmaheshw vmaheshw merged commit 7c0aa1d into linkedin:master Oct 25, 2021
vmaheshw added a commit to vmaheshw/brooklin that referenced this pull request Mar 1, 2022
4 participants