feat: Add opentelemetry-instrumentation-nats package #667

Closed
wants to merge 6 commits into main from feat-add-nats

Conversation

ekosz

@ekosz ekosz commented Sep 19, 2021

Which problem is this PR solving?

There is no instrumentation for nats.

Short description of the changes

This adds a new instrumentation package for the NATS messaging system, specifically for its nats.js library.

Work In Progress

Currently this PR is a work in progress. This is my first time contributing an OpenTelemetry package and I wanted to get the PR out early in a draft state to make sure I'm on the right path before moving further along.

TODOS

  • Add to the examples
  • More unit tests
  • Finish up the readme
  • What else have I forgotten?

Open Questions

  • How does one deal with wrapping generators? The nats.js library exposes its subscriptions via async iterators. This poses a challenge with the ContextAPI#with function: I'm not sure how to combine #with with a yield call (see the sketch after this list). This is currently causing a test to fail, as the context does not propagate properly without it.
  • Am I setting the correct messaging attributes? This seems to be the first contrib package that adds instrumentation to a messaging library. I tried to make sure I followed the messaging standard, but I would love confirmation on that.
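
To make the first question concrete, here is a rough sketch of what I would like a subscription wrapper to do. The tracedMessages name is mine; only the @opentelemetry/api calls are real:

const { context, trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('nats-sketch');

// A nats.js Subscription is an async iterable, so consumers write
// `for await (const msg of sub) { ... }`. Ideally each iteration of that
// loop would run with the message's span active.
async function* tracedMessages(sub) {
  for await (const msg of sub) {
    const span = tracer.startSpan(`${msg.subject} process`);
    const msgContext = trace.setSpan(context.active(), span);
    // context.with(msgContext, callback) only keeps msgContext active while
    // `callback` runs, but `yield` is a syntax error inside a nested
    // callback, and the consumer's loop body runs while this generator is
    // suspended -- so there is nothing to hand to context.with() here.
    yield msg; // consumer code runs now, still under the old context
    span.end();
  }
}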

@linux-foundation-easycla

linux-foundation-easycla bot commented Sep 19, 2021

CLA Signed

The committers are authorized under a signed CLA.

@codecov

codecov bot commented Sep 19, 2021

Codecov Report

Merging #667 (d34cba0) into main (648ffb7) will decrease coverage by 0.13%.
The diff coverage is n/a.

❗ Current head d34cba0 differs from pull request most recent head aaa58dc. Consider uploading reports for the commit aaa58dc to get more accurate results

@@            Coverage Diff             @@
##             main     #667      +/-   ##
==========================================
- Coverage   96.82%   96.68%   -0.14%     
==========================================
  Files           9       13       +4     
  Lines         630      634       +4     
  Branches      124      124              
==========================================
+ Hits          610      613       +3     
- Misses         20       21       +1     
Impacted Files Coverage Δ
...metry-instrumentation-document-load/src/version.ts 100.00% <0.00%> (ø)
...ry-instrumentation-user-interaction/src/version.ts 100.00% <0.00%> (ø)
...apackages/auto-instrumentations-web/src/version.ts 0.00% <0.00%> (ø)
...web/opentelemetry-plugin-react-load/src/version.ts 100.00% <0.00%> (ø)

@ekosz ekosz force-pushed the feat-add-nats branch 3 times, most recently from d1970e5 to 0bce19a on September 19, 2021 at 21:02
@aricart

aricart commented Sep 22, 2021

First, let me say that I appreciate the OpenTelemetry integration for nats.js! This looks like it’ll work!

As is, this will provide some great information, but bear in mind that it will come at the expense of additional effort maintaining this codebase. NATS releases that change internals of the nats.js library (which is shared by different runtimes, such as nats.js, nats.deno and nats.ws) could require changes here. If the module dependency is tightly pinned and you use the public APIs wherever possible, this shouldn’t be a significant issue.

You’ll also want to be aware of potential performance implications. Because the wrapped library effectively performs callouts to outside code, depending on the implementation of the tracing it could interrupt the normal flow of the client and create hard-to-diagnose issues, such as slow consumers and/or memory growth, that wouldn’t appear in the standard client. You may want to mention this in any corresponding documentation.

There are a couple of things you could consider instead. It would seem reasonable to add a generator that captures simple metrics to be gathered as needed, such as the number of messages sent or received per subject. Note this metric is only counts.

As for observability, with the NATS ecosystem, you have the ability to do some incredible things by creating a simple subscription that looks at all the messages being published and correlating them. This means that observability can be inserted into the system while respecting the data privacy between clients. This has the benefit that the work required for capturing and transmitting this information is not placed on the client being observed but on a separate process, which can then record, report, or summarize whatever metrics are required.
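
For example (untested sketch; the server URL is a placeholder), the observer process can be as small as a wildcard subscription that keeps per-subject counts:

const { connect } = require('nats');

async function main() {
  // A separate process: subscribe to every subject this client is allowed to see.
  const nc = await connect({ servers: 'nats://localhost:4222' });
  const counts = new Map();
  const sub = nc.subscribe('>');

  // Periodically report whatever summary is required.
  setInterval(() => console.log(Object.fromEntries(counts)), 10_000);

  for await (const msg of sub) {
    counts.set(msg.subject, (counts.get(msg.subject) ?? 0) + 1);
  }
}

main();

The publishing clients do no extra work here; the cost of counting and reporting lives entirely in this separate process.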

The one drawback is that identifying the publishing client will be difficult, as the entire point of NATS messaging is to de-couple producers from consumers. With that said, service clients at least can be identified easily if their reply inbox is coordinated with a well-known client name. Using some other internal metric generated by the client (which could be broadcast using NATS messages), you could relate specific load back to specific clients.

All in all, this looks like it’ll do the job and we really appreciate the contribution and hard work here!

@ekosz
Author

ekosz commented Sep 25, 2021

Hi @aricart. Thanks for taking a look at this PR! I understand your points completely, and I will definitely update the docs of this library to call out how it may have an effect on performance and/or memory growth.

But the main thing I was attempting to do with this addition was not necessarily metric emission (to your point, you can do that with a sidecar subscription) but distributed tracing: the ability to see a message emitted in one system, follow it to each subscriber, and then follow any additional messages those subscribers emit in turn. I think tracing is a critical tool for any large distributed system.

As for reaching into the private libraries/files of nats, you're again absolutely right that this approach is much less stable. But I don't think I could add the tracing I need with just the public API. AFAIK, the public API currently does not export the actual class that has the subscribe/publish/request functions; it only exports the connect function that creates an instance of that class. But for OT to instrument those calls it needs access to the class so it can patch its prototype. If we wanted to use only public APIs, I think we would need the nats team to expose more publicly, maybe some function hooks that libraries like OT can use to add additional instrumentation. Until then I think this will need to patch those private APIs. Unless there is a public API I'm missing? Or another idea you may have?
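
For reference, the usual pattern in these instrumentation packages looks roughly like the sketch below. NatsConnectionImpl is just a stand-in for the internal class I would have to reach for; it is not part of the public exports, which is exactly the problem:

const {
  InstrumentationBase,
  InstrumentationNodeModuleDefinition,
} = require('@opentelemetry/instrumentation');

class NatsInstrumentationSketch extends InstrumentationBase {
  constructor() {
    super('opentelemetry-instrumentation-nats-sketch', '0.0.1', {});
  }

  init() {
    return new InstrumentationNodeModuleDefinition('nats', ['>=2.0.0'], (moduleExports) => {
      // This is the step that needs the (non-public) class: wrap a method on
      // its prototype so every instance created by connect() is covered.
      this._wrap(
        moduleExports.NatsConnectionImpl.prototype,
        'publish',
        (original) =>
          function patchedPublish(...args) {
            // start a span, inject trace headers, then call through
            return original.apply(this, args);
          }
      );
      return moduleExports;
    });
  }
}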

Thank you again for the detailed response!

@ekosz
Author

ekosz commented Sep 25, 2021

@aricart Actually, thinking about it, I may only need the connect function after all. If I patch that function, I can return Proxy objects that hook into those subscribe/publish/request calls. I could also proxy the Message#respond calls, so I would no longer need to reach into the ProtocalImpl class.

I'll try that out next over the coming days and let you know how it goes.
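
Roughly what I have in mind (untested sketch; only connect/publish/subscribe/request come from the public nats.js API, everything else is made up):

const nats = require('nats');
const { trace } = require('@opentelemetry/api');

const tracer = trace.getTracer('nats-sketch');
const originalConnect = nats.connect;

async function patchedConnect(opts) {
  const nc = await originalConnect(opts);
  return new Proxy(nc, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (prop === 'publish') {
        return (subject, data, options) => {
          const span = tracer.startSpan(`${subject} send`);
          try {
            return value.call(target, subject, data, options);
          } finally {
            span.end();
          }
        };
      }
      // subscribe and request would get the same treatment, and the
      // Subscription / Msg objects they hand back would be proxied too so
      // that Msg#respond is also traced.
      return typeof value === 'function' ? value.bind(target) : value;
    },
  });
}

The instrumentation would then swap connect for patchedConnect when the module is loaded.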

@ekosz
Author

ekosz commented Sep 26, 2021

@aricart I've switched to using just the public APIs + Proxy objects. Thank you for pushing me in that direction. This should be much more stable now.

@ekosz
Author

ekosz commented Sep 26, 2021

@blumamir I still haven't figured out a great way to get around the issue with context.with and yield calls in generators. Do you / anyone on the OpenTelemetry team have a suggestion on what I may try next?

@blumamir
Member

@blumamir I still haven't figured out a great way to get around the issue with context.with and yield calls in generators. Do you / anyone on the OpenTelemetry team have a suggestion on what I may try next?

Hi @ekosz, thanks for taking the time to add this instrumentation. I previously tried to patch generators for the SQS messaging system in the aws-sdk instrumentation, and unfortunately had no luck finding a method that worked for all cases. This is currently an open issue for SQS which I hope we can solve one day.

I'll take a second look to try and think about it again.

Am I setting the correct messaging attributes? This seems to be the first contrib package that adds instrumentation to a messaging library. I tried to make sure I followed the messaging standard, but I would love confirmation on that.

There are instrumentations for the kafkajs, amqplib (RabbitMQ), and AWS SQS messaging systems hosted here which you can examine for reference. Also, it's worth mentioning that there is a SIG working on a new stable messaging specification, which will probably be ready in the coming months. The current specification is in alpha and will probably change soon.

@vmarchaud
Member

I still haven't figured out a great way to get around the issue with context.with and yield calls in generators. Do you / anyone on the OpenTelemetry team have a suggestion on what I may try next?

There is an effort to add manual context propagation using attach/detach but it's still WIP: open-telemetry/opentelemetry-js-api#123. Might be worth trying this after it's implemented inside the SDK.

@blumamir
Member

blumamir commented Sep 28, 2021

There is an effort to add manual context propagation using attach/detach but it's still WIP: open-telemetry/opentelemetry-js-api#123. Might be worth trying this after it's implemented inside the SDK.

The issue I encountered with that approach was that I was not able to guarantee that the detach function would be called to restore the context in the case of an exception or a premature return statement:

function * numbers() {
    yield 1;
    yield 2;
    yield 3;
}

const iterateOnNumbers = () => {
    try {
        for (const num of numbers()) {
            if (num === 2) {
                throw Error('throws without fully consuming the iterator till done');
            }
            console.log(`process num ${num}`);
        }
    } catch {
        console.log('exception handled');
    }
    console.log('what do we expect the context here to be?');
}

const f = () => {
    iterateOnNumbers();
    console.log('what do we expect the context here to be?');
}

f();

If we patch the generator iterator's next function and restore the context only when the iterator reports done, the above code will leak the second iteration's context into the code executed after the for loop.
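
To make the leak concrete, here is a toy stand-in for an attach/detach style context API (nothing below is the real OpenTelemetry API), with that next-patching approach applied to the example above:

// Toy context manager standing in for a hypothetical attach/detach API.
let activeContext = 'ROOT';
const attach = (ctx) => { const prev = activeContext; activeContext = ctx; return prev; };
const detach = (prev) => { activeContext = prev; };

// Patch an iterator so every next() call activates a fresh per-iteration
// context, restoring the previous one only when the iterator reports done.
function instrumentIterator(iter) {
  const originalNext = iter.next.bind(iter);
  let restore;
  let i = 0;
  iter.next = (...args) => {
    if (restore !== undefined) detach(restore);
    restore = attach(`iteration-${++i}`);
    const result = originalNext(...args);
    if (result.done) { detach(restore); restore = undefined; }
    return result;
  };
  return iter;
}

function* numbers() { yield 1; yield 2; yield 3; }

try {
  for (const num of instrumentIterator(numbers())) {
    if (num === 2) throw Error('throws without fully consuming the iterator');
    console.log(activeContext, num); // "iteration-1 1"
  }
} catch { /* exception handled */ }

// The loop exited through the throw, so the iterator's return() ran instead
// of the patched next() -- nothing ever detached the second context.
console.log(activeContext); // "iteration-2", not "ROOT"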

@vmarchaud
Member

The issue I encountered with that approach was that I was not able to guarantee that the detach function would be called to restore the context in the case of an exception or a premature return statement:

I never experimented with generators so I can't say for sure, but do they create a new async context for each function call? If so, this should be tracked by async hooks.

@blumamir
Member

I never experimented with generators so I can't say for sure, but do they create a new async context for each function call? If so, this should be tracked by async hooks.

Tested it now and unfortunately it's not creating an async hooks context.
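
A quick sketch that demonstrates this with the sync generator from the example above:

const async_hooks = require('async_hooks');
const fs = require('fs');

// Log every async resource creation. fs.writeSync is used because
// console.log itself creates async resources inside an init hook.
async_hooks.createHook({
  init(asyncId, type) { fs.writeSync(1, `init ${type}\n`); },
}).enable();

function* numbers() { yield 1; yield 2; yield 3; }

for (const num of numbers()) {
  fs.writeSync(1, `got ${num} in execution ${async_hooks.executionAsyncId()}\n`);
}
// No init lines are printed for the generator calls, and executionAsyncId()
// stays the same for every iteration -- the generator creates no async
// resource that async_hooks (or AsyncLocalStorage) could key a context to.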

@vmarchaud
Member

Tested it now and unfortunately it's not creating an async hooks context.

Well, if it doesn't create either an async context or a promise context (with promise_hooks), I don't think we can support it :/

@github-actions
Contributor

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Nov 29, 2021
@github-actions
Contributor

This PR was closed because it has been stale for 14 days with no activity.

@neeraj-ec

Any progress on this one? We are using NATS for inter-service communication in our NestJS microservices and need this for distributed tracing.
It seems this one was closed due to inactivity.
