Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing Context Preservation (async/await and various of libraries) #60

Closed
haotianw465 opened this issue Oct 19, 2018 · 32 comments
Closed

Comments

@haotianw465
Copy link
Contributor

haotianw465 commented Oct 19, 2018

Goal

The goal of this issue is to track all tracing context preservation related problems, dive deep on why some of them work and some of them don't, discuss and gather feedback for a way going forward to provide much better customer experience with X-Ray node tracing and further contribute/shape the node context propagation in general.

Problem Statement

The Node.js event loop model has its downsides. One of those is the context preservation through the call chain within a single http request. It is hard to bind arbitrary data (user name, transaction id, tracing context etc) to a full request/response cycle and all relevent context stops at event loop. It's hard to know what is the context for the next callback being executed by event loop.

There is a library continuation-local-storage (CLS for short) that solved most of the problem by binding request context to callbacks. However, this library has two limitations:

  • It's both a bug and a feature that CLS doesn't work with all node modules. The full reasoning from CLS author is here. In short it's difficult for CLS to know whether it should bind the context or not all the time.
  • Node community evolves fast and with the major adoption of async/await syntax from Node 8 CLS doesn't fit some new use cases anymore.

Challenges for X-Ray SDK for Node.js

With all the context preservation issues listed above, they are solvable for most of the node application owners who need to preserve contextual data within requests.

  • For the first issue, CLS provides various of wrappers with name like cls-* for explicitly binding context to callbacks in a particular module. Here is an example fork for node-redis that patches Redis driver for context preservation.
  • For the second issue (async/await), cls-hooked are reported to be worked well with Node 8 so far by several SDK users. Zone.js is also a potential candidate for the new dependency of the SDK for context binding.

These workarounds are not ideal but work for each service owner as each service only has limited dependency and the owner knows what dependencies the service depends on. It's inevitable and necessary work for service owners to understand this problem space and make necessary code changes since it is business data (user name, transaction id etc) through a request/response cycle.

But for X-Ray SDK one of the core functionalities is to preserve the tracing context through the async call chain and the SDK should abstract this technical challenge away from users who just want all traced functions work once they import the SDK. This issue #28 has some detailed discussion regarding this concern. There are also limitations and known issues on cls-hooked one of which is it doesn't work with some certain libraries (example) while the X-Ray SDK should try not to have the SDK users to deal with these caveats.

Going Forward

(This section will likely be updated more frequently based on both internal and external discussions and feedback.)

The SDK should re-evaluate CLS as its core dependency and leverage the new tools (cls-hooked or zone.js) for preserving context for most of the major use cases today as well as being flexible enough to be forward compatible with new coming features.

A short-term solution could be to start with an experimental branch that replaces CLS with cls-hooked or Zone.js and thorough testing with existing supported libraries and new use cases we intend to cover. Or an alternative approach is to have something in user code with

var cls = process.env.AWS_XRAY_EXPERIMENTAL_ASYNC_AWAIT ? require('cls-hooked') : require('continuation-local-storage');

as suggested by @nakedible-p but the issue is that it makes the bundle size more bloated when the SDK is a dependency. Additionally, these libraries are not owned by AWS X-Ray and some critical fixes/supports for the library from X-Ray's prospective might not be as critical for those libraries' owners.

Long Term: We will share this portion with broader audience after internal discussions. Any suggestion is welcome.

@rochdev
Copy link

rochdev commented Nov 19, 2018

We're having a similar conversation in opentracing/opentracing-javascript#112. Since the purpose and the concerns are the same, I'm thinking maybe you would be interested to join the conversation and maybe this can be made as a library that is not specific to either OpenTracing or AWS X-Ray.

Thoughts?

P.S.: there is a know memory leak in async-listener which is used by continuation-local-storage and promise context handling is different between continuation-local-storage and cls-hooked which is described in opentracing/opentracing-javascript#112 (comment). Please be careful if you allow switching between the 2.

@chrisradek
Copy link
Contributor

I've created an experimental branch to test out cls-hooks. The branch is currently published to npm using the @experimental prerelease tag. This is to immediately unblock usage of async/await and give us a chance to experiment with other options as well.

@dougmoscrop
Copy link

hey @chrisradek - do you expect this experimental release to work? I can't seen to pinpoint the issue but subsegments are missing, sometimes they seem to show up a minute or so after the x-ray trace, but other times never.

@dougmoscrop
Copy link

dougmoscrop commented Nov 30, 2018

Edit: rewriting this comment for brevity:

It looks like the last subsegment is not transmitted until the next trace.

Consider the following code:

'use strict';

const AWSXRay = require('aws-xray-sdk');

AWSXRay.capturePromise();

module.exports.handler = async () => {
  for (let i = 0; i < 10; i++) {
    console.log('loop', i);
    await new Promise(resolve => {
      AWSXRay.captureAsyncFunc(`loop ${i}`, async subsegment => {
        await new Promise(resolve => {
          setTimeout(resolve, 100);
        });
        subsegment.close();
        resolve();
      });
    });
  }

  return 'Done';
};

Logs look like this:

First invoke:

2018-11-30 19:03:42.544 +00:00 [DEBUG] Starting the AWS X-Ray SDK in automatic mode (default).
2018-11-30 19:03:42.550 +00:00 [DEBUG] AWS_XRAY_CONTEXT_MISSING is set. Configured context missing strategy to LOG_ERROR.
2018-11-30 19:03:42.681 +00:00 [DEBUG] Subsegment streaming threshold set to: 0
START RequestId: a72f62a2-f4d2-11e8-a842-a79ab7976182 Version: $LATEST
2018-11-30 14:03:42.688 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 0
2018-11-30 19:03:42.689 +00:00 [DEBUG] Lambda trace data found: Root=1-5c01898e-31a001dd2fdc0f7321499b69;Parent=037843747bb819be;Sampled=1
2018-11-30 19:03:42.689 +00:00 [DEBUG] Segment started: {"Root":"1-5c01898e-31a001dd2fdc0f7321499b69","Parent":"037843747bb819be","Sampled":"1"}
2018-11-30 14:03:42.794 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 1
2018-11-30 19:03:42.799 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"e7dfbfcd75436281"}
2018-11-30 19:03:42.799 +00:00 [DEBUG] UDP message sent: {"id":"e7dfbfcd75436281","name":"loop 0","start_time":1543604622.689,"end_time":1543604622.791,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
2018-11-30 14:03:42.895 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 2
2018-11-30 19:03:42.895 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"21ce9b9d08ce2054"}
2018-11-30 19:03:42.912 +00:00 [DEBUG] UDP message sent: {"id":"21ce9b9d08ce2054","name":"loop 1","start_time":1543604622.794,"end_time":1543604622.894,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
2018-11-30 14:03:42.995 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 3
2018-11-30 19:03:42.996 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"90cc23cf106b1681"}
2018-11-30 19:03:42.996 +00:00 [DEBUG] UDP message sent: {"id":"90cc23cf106b1681","name":"loop 2","start_time":1543604622.895,"end_time":1543604622.995,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
2018-11-30 14:03:43.096 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 4
2018-11-30 19:03:43.096 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"fd232f3f6d595171"}
2018-11-30 19:03:43.096 +00:00 [DEBUG] UDP message sent: {"id":"fd232f3f6d595171","name":"loop 3","start_time":1543604622.995,"end_time":1543604623.095,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
2018-11-30 14:03:43.197 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 5
2018-11-30 19:03:43.197 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"00d7bbc1755fe7d6"}
2018-11-30 19:03:43.197 +00:00 [DEBUG] UDP message sent: {"id":"00d7bbc1755fe7d6","name":"loop 4","start_time":1543604623.096,"end_time":1543604623.197,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
2018-11-30 14:03:43.298 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 6
2018-11-30 19:03:43.298 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"40e5a8db710501be"}
2018-11-30 19:03:43.299 +00:00 [DEBUG] UDP message sent: {"id":"40e5a8db710501be","name":"loop 5","start_time":1543604623.197,"end_time":1543604623.298,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
2018-11-30 14:03:43.398 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 7
2018-11-30 19:03:43.399 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"73333e6ea3021d16"}
2018-11-30 19:03:43.399 +00:00 [DEBUG] UDP message sent: {"id":"73333e6ea3021d16","name":"loop 6","start_time":1543604623.298,"end_time":1543604623.398,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
2018-11-30 14:03:43.499 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 8
2018-11-30 19:03:43.499 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"da8569a027a13dc0"}
2018-11-30 19:03:43.499 +00:00 [DEBUG] UDP message sent: {"id":"da8569a027a13dc0","name":"loop 7","start_time":1543604623.398,"end_time":1543604623.498,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
2018-11-30 14:03:43.600 (-05:00)	a72f62a2-f4d2-11e8-a842-a79ab7976182	loop 9
2018-11-30 19:03:43.600 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"047d2ef42419b69b"}
2018-11-30 19:03:43.600 +00:00 [DEBUG] UDP message sent: {"id":"047d2ef42419b69b","name":"loop 8","start_time":1543604623.499,"end_time":1543604623.6,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
END RequestId: a72f62a2-f4d2-11e8-a842-a79ab7976182
REPORT RequestId: a72f62a2-f4d2-11e8-a842-a79ab7976182	Duration: 1015.84 ms	Billed Duration: 1100 ms 	Memory Size: 1024 MB	Max Memory Used: 44 MB

Second invoke (you can see the first thing it does is send the last subsegment for trace 1-5c01898e-31a001dd2fdc0f7321499b69)

START RequestId: e44c38f6-f4d2-11e8-a610-a561c6876dc2 Version: $LATEST
2018-11-30 19:05:24.610 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c01898e-31a001dd2fdc0f7321499b69","id":"ce80317ec63a16de"}
2018-11-30 19:05:24.610 +00:00 [DEBUG] UDP message sent: {"id":"ce80317ec63a16de","name":"loop 9","start_time":1543604623.6,"end_time":1543604623.701,"type":"subsegment","parent_id":"037843747bb819be","trace_id":"1-5c01898e-31a001dd2fdc0f7321499b69"}
2018-11-30 14:05:24.611 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 0
2018-11-30 19:05:24.611 +00:00 [DEBUG] Lambda trace data found: Root=1-5c0189f4-df208b36e5d0be99515006f5;Parent=7fc7587c19777e0e;Sampled=1
2018-11-30 19:05:24.611 +00:00 [DEBUG] Segment started: {"Root":"1-5c0189f4-df208b36e5d0be99515006f5","Parent":"7fc7587c19777e0e","Sampled":"1"}
2018-11-30 14:05:24.712 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 1
2018-11-30 19:05:24.712 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c0189f4-df208b36e5d0be99515006f5","id":"592f5f95b4502724"}
2018-11-30 19:05:24.712 +00:00 [DEBUG] UDP message sent: {"id":"592f5f95b4502724","name":"loop 0","start_time":1543604724.611,"end_time":1543604724.712,"type":"subsegment","parent_id":"7fc7587c19777e0e","trace_id":"1-5c0189f4-df208b36e5d0be99515006f5"}
2018-11-30 14:05:24.813 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 2
2018-11-30 19:05:24.813 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c0189f4-df208b36e5d0be99515006f5","id":"f34c185682339d91"}
2018-11-30 19:05:24.814 +00:00 [DEBUG] UDP message sent: {"id":"f34c185682339d91","name":"loop 1","start_time":1543604724.712,"end_time":1543604724.813,"type":"subsegment","parent_id":"7fc7587c19777e0e","trace_id":"1-5c0189f4-df208b36e5d0be99515006f5"}
2018-11-30 14:05:24.913 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 3
2018-11-30 19:05:24.914 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c0189f4-df208b36e5d0be99515006f5","id":"8bf5b42ee0883e35"}
2018-11-30 19:05:24.914 +00:00 [DEBUG] UDP message sent: {"id":"8bf5b42ee0883e35","name":"loop 2","start_time":1543604724.813,"end_time":1543604724.913,"type":"subsegment","parent_id":"7fc7587c19777e0e","trace_id":"1-5c0189f4-df208b36e5d0be99515006f5"}
2018-11-30 14:05:25.014 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 4
2018-11-30 19:05:25.014 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c0189f4-df208b36e5d0be99515006f5","id":"f3e54a6646c673bf"}
2018-11-30 19:05:25.014 +00:00 [DEBUG] UDP message sent: {"id":"f3e54a6646c673bf","name":"loop 3","start_time":1543604724.913,"end_time":1543604725.013,"type":"subsegment","parent_id":"7fc7587c19777e0e","trace_id":"1-5c0189f4-df208b36e5d0be99515006f5"}
2018-11-30 14:05:25.115 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 5
2018-11-30 19:05:25.115 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c0189f4-df208b36e5d0be99515006f5","id":"d723888ad9e92f67"}
2018-11-30 19:05:25.116 +00:00 [DEBUG] UDP message sent: {"id":"d723888ad9e92f67","name":"loop 4","start_time":1543604725.014,"end_time":1543604725.115,"type":"subsegment","parent_id":"7fc7587c19777e0e","trace_id":"1-5c0189f4-df208b36e5d0be99515006f5"}
2018-11-30 14:05:25.215 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 6
2018-11-30 19:05:25.216 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c0189f4-df208b36e5d0be99515006f5","id":"ff429de06db0f5c2"}
2018-11-30 19:05:25.216 +00:00 [DEBUG] UDP message sent: {"id":"ff429de06db0f5c2","name":"loop 5","start_time":1543604725.115,"end_time":1543604725.215,"type":"subsegment","parent_id":"7fc7587c19777e0e","trace_id":"1-5c0189f4-df208b36e5d0be99515006f5"}
2018-11-30 14:05:25.315 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 7
2018-11-30 19:05:25.316 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c0189f4-df208b36e5d0be99515006f5","id":"6f36294a32882eaf"}
2018-11-30 19:05:25.316 +00:00 [DEBUG] UDP message sent: {"id":"6f36294a32882eaf","name":"loop 6","start_time":1543604725.215,"end_time":1543604725.315,"type":"subsegment","parent_id":"7fc7587c19777e0e","trace_id":"1-5c0189f4-df208b36e5d0be99515006f5"}
2018-11-30 14:05:25.417 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 8
2018-11-30 19:05:25.417 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c0189f4-df208b36e5d0be99515006f5","id":"1bf01e6cd64a2193"}
2018-11-30 19:05:25.417 +00:00 [DEBUG] UDP message sent: {"id":"1bf01e6cd64a2193","name":"loop 7","start_time":1543604725.316,"end_time":1543604725.417,"type":"subsegment","parent_id":"7fc7587c19777e0e","trace_id":"1-5c0189f4-df208b36e5d0be99515006f5"}
2018-11-30 14:05:25.517 (-05:00)	e44c38f6-f4d2-11e8-a610-a561c6876dc2	loop 9
2018-11-30 19:05:25.518 +00:00 [DEBUG] Subsegment sent: {"trace_id:"1-5c0189f4-df208b36e5d0be99515006f5","id":"26ecc12d7a9476bf"}
2018-11-30 19:05:25.518 +00:00 [DEBUG] UDP message sent: {"id":"26ecc12d7a9476bf","name":"loop 8","start_time":1543604725.417,"end_time":1543604725.517,"type":"subsegment","parent_id":"7fc7587c19777e0e","trace_id":"1-5c0189f4-df208b36e5d0be99515006f5"}
END RequestId: e44c38f6-f4d2-11e8-a610-a561c6876dc2
REPORT RequestId: e44c38f6-f4d2-11e8-a610-a561c6876dc2	Duration: 1015.74 ms	Billed Duration: 1100 ms 	Memory Size: 1024 MB	Max Memory Used: 44 MB

@chrisradek
Copy link
Contributor

@dougmoscrop
Thanks for reporting your findings on using the experimental branch. I'll take a look at your test case and investigate why the subsegment isn't being sent when each Lambda invocation ends.

By chance, have you seen this issue happening outside of Lambda functions?

@dougmoscrop
Copy link

Sorry, I don't have any non-Lambda based use of X-Ray.

I think if anything this can make it appear like the experimental branch isn't working although It seems a separate issue from actual async/await support.

@jboehle
Copy link

jboehle commented Jan 23, 2019

Any update on this? AWS Lambda has had a Node.js 8 runtime since 4/2/18. When is this SDK going to have a stable release that supports async/await?

@dougmoscrop
Copy link

I haven't had time to really dive in to it, but after deploying the experimental branch, our traces still lack hierarchy too. It looked like it was working but wasn't. I realize this is a crappy/unhelpfully vague comment, but yeah this incompatibility between AWS services is being woefully under-addressed :(

@chrisradek
Copy link
Contributor

@dougmoscrop
Can you share an example of what your traces look like that lack the hierarchy you'd expect?

@dougmoscrop
Copy link

dougmoscrop commented Jan 23, 2019

Sure, here's an example:

screen shot 2019-01-23 at 1 04 01 pm

This is a Lambda processing a Kinesis stream. It looks up a "pointer" from Dynamo to the current S3 key then it updates that object with the new records, saves it as a new S3 object, then moves the Dynamo pointer to the new S3 key (key is a uuid), then deletes the old one.

So it's Dynamo.Get -> S3.Get -> (business logic) -> S3.Put -> Dynamo.Update -> S3.Delete

We open a custom segment for this workflow called Context View. But most of the S3/Dynamo operations do not appear as children of it. You can see that they 'belong' to the Context View because it spans the entire time of all the operations. However, the records in our stream are 'grouped' in to separate Context Views depending on some property; so while the attached image shows only one, there could be a dozen or a hundred depending on what's in the Kinesis stream. When this happens, it's much less clear which sub-segments correlate to which job.

What I would expect is that simply all the S3 and Dynamo calls you see in the above screenshot show up as children of Context View.

Edit: Oh and then there are times where there's a parent-child hierarchy that is incorrect, such as:

screen shot 2019-01-23 at 1 19 39 pm

PutObject happens entirely after GetObject, not within it as a call. The gap you see is our "business logic" as I mentioned above, so its like this:

await s3.get(...)
await businessLogic(...)
await s3.put(...)

But it shows up as if s3.get was calling s3.put

Stuff like this makes it really hard to know what to believe, like seeing S3 taking 100+ms to get an object that is probably a kilobyte at most; we checked with AWS support and S3 logs show response times more like 13ms. It could totally be a bug in the SDK or the way we're calling it, but it makes us wonder if it's just doing accounting wrong. For that issue specifically, we can start instrumenting our code ourselves, but that kinda compromises the X-Ray value prop.

@styfle
Copy link

styfle commented Feb 1, 2019

Hello 👋

I want to be able to log the time it takes for an async function to return.

I tried the latest version of aws-xray-sdk available on npm (2.2.0) as well as the experimental version (2.1.0-experimental.1) and both of these resulted in the same error.

2019-02-01 20:34:44.496 +00:00 [ERROR] Error: Failed to get the current sub/segment from the context.
at Object.contextMissingLogError [as contextMissing] (/usr/src/app/node_modules/aws-xray-sdk-core/lib/context_utils.js:26:19)
at Object.getSegment (/usr/src/app/node_modules/aws-xray-sdk-core/lib/context_utils.js:92:45)
at Object.resolveSegment (/usr/src/app/node_modules/aws-xray-sdk-core/lib/context_utils.js:73:19)
at Object.captureAsyncFunc (/usr/src/app/node_modules/aws-xray-sdk-core/lib/capture.js:55:32)

Here's my code

const AWSXRay = require('aws-xray-sdk');
AWSXRay.captureAWS(require('aws-sdk'));
AWSXRay.middleware.enableDynamicNaming();

// This sleep function is to demonstrate an async function
function sleep(ms) {
    return new Promise((resolve, reject) => {
        setTimeout(resolve, ms);
    });
}

function captureAsync(func, args) {
	return new Promise((resolve, reject) => {
		AWSXRay.captureAsyncFunc(func.name, (seg) => {
			// The seg is `undefined` here
			func.apply(this, args)
				.then((result) => {
					seg.close();
					resolve(result);
				})
				.catch((error) => {
					seg.close(error);
					reject(error);
				});
		});
	});
}

const trace = (handler) => async (req, res) => {
	const open = AWSXRay.express.openSegment('some-name');
	if (!res.req) {
		// This was necessary for another error
		res.req = req;
	}
	open(req, res);
    // Throws in `captureAsync`
	await captureAsync(sleep, [1000]);
	const handlerResponse = await handler(req, res);
	const close = AWSXRay.express.closeSegment();
	close(req, res);
	return handlerResponse;
}

module.exports = { trace };

Am I using this API incorrectly or is this the same bug others are experiencing?

@chrisradek
Copy link
Contributor

@styfle
Are you running this example in Lambda or another environment?

@styfle
Copy link

styfle commented Feb 1, 2019

@chrisradek No lambda. This is running on my own machine in a Docker container.

@chrisradek
Copy link
Contributor

@styfle
This looks like an issue with the way you've added the x-ray middleware to your express app, combined with using async/await. You're right in that you would want to use the experimental release since you're using async/await.

I roughly converted your example to the following:

const trace = require('./trace').trace;
const AWSXRay = require('aws-xray-sdk');
const express = require('express');

const app = express();

const handle = trace(console.log);
const open = AWSXRay.express.openSegment('some-name');
const close = AWSXRay.express.closeSegment();
app.use(open);
app.get('/', handle);
app.use(close);
app.listen(3001);

I also removed the open/close segment logic from your trace handler:

const trace = (handler) => async (req, res) => {
	if (!res.req) {
		// This was necessary for another error
		res.req = req;
	}
	await captureAsync(sleep, [1000]);
	const handlerResponse = await handler(req, res);

	return handlerResponse;
}

It looks like we currently require the next function to be passed to the openSegment handler, which express will handle for us. I'm not sure what would happen if you also used multiple async middleware. Does the above change workaround the issue you're seeing?

This issue might be specific to our express middleware if it doesn't handle multiple async middlewares in the chain, so we should open another issue to track that if this is the case.

@styfle
Copy link

styfle commented Feb 1, 2019

@chrisradek Thanks for the response! 😃

I'm not using express, I'm using micro which looks a lot like the Lambda API.

Maybe the missing next function is my problem?

@chrisradek
Copy link
Contributor

@styfle
Ah, the middleware you were using is meant for express, so I'm not sure it will work with micro. You might be able to get it to 'work' by passing the rest of your function as a callback to the open method you've defined, and calling close inside that as well.

You can also instrument your code manually. Here's an example of what that could look like, though I haven't used micro so not sure if there are any problems with this approach:

const AWSXRay = require('aws-xray-sdk');

async function example(foo, bar) {
  const namespace = AWSXRay.getNamespace();
  const segment = new AWSXRay.Segment('some-name');
  return await namespace.runAndReturn(async () => {
    AWSXRay.setSegment(segment);

    await captureAsync(sleep, [1000]);
    await sleep(100);

    segment.close();
    return 'done';
  });
}

// This sleep function is to demonstrate an async function
function sleep(ms) {
  return new Promise((resolve, reject) => {
      setTimeout(resolve, ms);
  });
}

function captureAsync(func, args) {
  return new Promise((resolve, reject) => {
    AWSXRay.captureAsyncFunc(func.name, (seg) => {
      // The seg is `undefined` here
      func.apply(this, args)
        .then((result) => {
          seg.close();
          resolve(result);
        })
        .catch((error) => {
          seg.close(error);
          reject(error);
        });
    });
  });
}

example('a', 'b');

This example creates a segment everytime example (the handler) is called, and closes it when it is done.

@styfle
Copy link

styfle commented Feb 2, 2019

@chrisradek You are amazing!!!!! 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉

I spent hours trying to figure this one out and your solution (with a minor modification) worked 😄

Now the only thing missing is the URL disappeared from the console 🤔

image

@styfle
Copy link

styfle commented Feb 4, 2019

@chrisradek On second thought, there are still problems when calling capture more than once. You can use this repository to reproduce the issue https://github.com/styfle/aws-xray-bug

Would you like me to create a new issue here or is this thread fine?

@chrisradek
Copy link
Contributor

@styfle
Would you mind creating a new thread for this issue?

@dougmoscrop
Copy link

Hey @chrisradek - would it be helpful if we opened an issue with our enterprise support? Or is this thread enough?

@okonon
Copy link

okonon commented Apr 6, 2019

Also interested in async/await support. More and more of modern apps are switching to this pattern and reducing X-Ray's value proposition.

@awssandra
Copy link
Contributor

@okonon Agreed. Proper context propagation is still a difficult problem to solve in the JS world. The discussion is also present on the OpenTracing github page, and it seems zone.js will not fix the async/await issue.

A experimental branch is available, however, cls-hooked still has known issues, and does not appear to be in active development at this time.
For further reading: opentracing/opentracing-javascript#112

@alvis
Copy link

alvis commented Apr 12, 2019

I have the same issue as @dougmoscrop that the last subsegment is not transmitted until the next trace.

I tried both the ordinary and the experimental releases, and both of them have the same issue. My code is deployed as a lambda function, using the serverless framework for packaging and deployment.

@awssandra
Copy link
Contributor

awssandra commented Apr 23, 2019

@alvis

Sorry for the issues, this is a known issue with the Lambda Async handler. We are working internally to have it fixed.

@joebowbeer
Copy link
Contributor

Now that DataDog, Honeycomb, NewRelic, Stackdriver, and even Bluebird are all leveraging async_hooks, is it time to mainline the experimental branch?

@akleiber
Copy link

akleiber commented Jul 1, 2019

@awssandra any update on the async problem with not sending the last segment?
I tried to workaround with sending an extra segment "workaround" but I failed. The only way the last segment is sent, is if the lambda if the same container is called again.

:-( x-ray is amazing but is not useable this way in production environments

@okonon
Copy link

okonon commented Aug 22, 2019

Also interested in the update regarding this issue

@keyvhinng
Copy link

I wonder what's the current status of this. I'm very interested in this issue #4 (Koa support)

I see that some PRs were merged but I'm not sure if everything needed to support koa is already done

@willarmiros
Copy link
Contributor

This issue supporting tracing of async/await calls has been resolved using the cls-hooked library and is available in two different distribution tags:

# For stable release of SDK, version 2.5.0-experimental.1 
npm install --save aws-xray-sdk@experimental

# For prerelease of next major version of SDK, version 3.0.0-alpha.1
npm install --save aws-xray-sdk@alpha

This issue will be closed when 3.0.0 is released, which is tracked in issue #157. As for explicitly supporting the Koa framework, that has not been prioritized yet on our roadmap.

@jgcmarins
Copy link

Any updates here?
I am about to setup X-Ray on a Koa + GraphQL server.

@willarmiros
Copy link
Contributor

For Koa support, please post on #4. At this time we don't have the bandwidth to add support for the Koa framework ourselves, but are always open to PRs.

@willarmiros
Copy link
Contributor

Closing this issue since the use of cls-hooked was mainlined in 3.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests