Added auto resource detection proposal #111

james-bebbington · 2020-06-03T13:26:53Z

This OTEP proposes a mechanism to support automatic detection of resource information, including a default implementation that detects resource information from an environment variable. It is largely based on existing work in OpenCensus and on the current implementation in the OpenTelemetry JS SDK.

Related issues/PRs:

Resources: define string values and keys as urlencoded: opentelemetry-specification#505
Resource initialization from environment variable: opentelemetry-specification#535
Standardize environment variables between SDK implementations: opentelemetry-specification#572
Add Library Info to trace/metric provider: opentelemetry-go#587 (not included in this OTEP but closely related)

Note this is my first time creating an OTEP. I'm not aware of all the previous discussion on this topic (couldn't find any directly related previous work), but figured I'd go ahead and create an OTEP to start a discussion at least, as it seems like there is currently a gap in the specification.
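For illustration, the default environment-variable detector mentioned in the description could look roughly like the Go sketch below. The variable name `EXAMPLE_RESOURCE_LABELS`, the comma-separated `key=value` format, and the simplified `Resource` type are all assumptions made for this example, not something this PR fixes; URL-decoding of keys and values (see #505) is deliberately left out.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Resource is a simplified stand-in for the SDK resource type:
// a flat set of string labels.
type Resource map[string]string

// parseResourceLabels parses "k1=v1,k2=v2" into a Resource.
// The format is an assumption for this sketch; a pair without "="
// or with an empty key is reported as an error.
func parseResourceLabels(raw string) (Resource, error) {
	res := Resource{}
	if strings.TrimSpace(raw) == "" {
		return res, nil // nothing configured: empty resource, no error
	}
	for _, pair := range strings.Split(raw, ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 || strings.TrimSpace(kv[0]) == "" {
			return nil, fmt.Errorf("malformed resource label %q", pair)
		}
		res[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
	}
	return res, nil
}

// detectFromEnv reads the variable named envVar (a hypothetical name
// chosen by the caller) and parses its value.
func detectFromEnv(envVar string) (Resource, error) {
	return parseResourceLabels(os.Getenv(envVar))
}

func main() {
	os.Setenv("EXAMPLE_RESOURCE_LABELS", "service.name=checkout,host.name=web-1")
	res, err := detectFromEnv("EXAMPLE_RESOURCE_LABELS")
	if err != nil {
		fmt.Println("detection failed:", err)
		return
	}
	fmt.Println(res["service.name"], res["host.name"])
}
```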

text/0111-auto-resource-detection.md

jrcamp · 2020-06-04T00:11:12Z

text/0111-auto-resource-detection.md

+
+In order to apply auto-detected resource information to traces & metrics, a user will need to:
+
+- Configure which resource detector(s) they would like to run (e.g. AWS EC2 detector)


Is explicit configuration required? Could we run all cloud providers and take the first one that succeeds? If none succeed then the Detect() call fails.

Good question - I probably should have added this as an Open question.

I've suggested a similar approach to what was in the OpenCensus spec, which dictates that cloud-vendor-specific detection must not be included in the SDK (and thus would require some configuration). But that may not be necessary. As you point out below, the code to look up host / container metadata is usually trivial: look up one or more environment variables, or get a text response from a known endpoint.

If we do allow cloud vendor-specific detectors to be included in the SDK that raises a few related questions:

Do we want to enforce that vendor libraries cannot be included in detectors?

Are we open to any cloud providers adding detection code here as long as they follow the above rule? (other than just major ones)

Do we attempt detection for all cloud providers by default? (including non-major ones)

If yes, then are there any cases where we would need to let users override this?

Do we want to enforce that vendor libraries cannot be included in detectors?

Yes IMO

Are we open to any cloud providers adding detection code here as long as they follow the above rule? (other than just major ones)

Yes as long as they add it to OT libraries for all languages.

Do we attempt detection for all cloud providers by default? (including non-major ones)
If yes, then are there any cases where we would need to let users override this?

Yeah this does seem like it could be tricky given the number of cloud providers and that they're on the rise. Could limit it by some factors, like "must have X% market share" to be included in autodetection but any legitimate provider can still add a provider as a fallback if they don't meet that threshold?

I think ideally we would run all autodetectors in parallel with a relatively small time limit (we use 1s by default in SignalFx agent but I don't have any hard data supporting what the correct value should be).
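As a rough illustration of the "run everything in parallel under a small time limit" idea, here is a minimal Go sketch. The `Resource`, `Detector`, and `DetectorFunc` types, the `detectAll` helper, and the rule that an earlier detector wins on a key conflict are assumptions for this example rather than anything the OTEP specifies. The sketch relies on each detector honouring context cancellation; a detector that ignores the context would still block `detectAll`.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Resource is a simplified stand-in for the SDK resource type.
type Resource map[string]string

// Detector mirrors the shape of the interface discussed in this PR.
type Detector interface {
	Detect(ctx context.Context) (Resource, error)
}

// DetectorFunc adapts a plain function to the Detector interface.
type DetectorFunc func(ctx context.Context) (Resource, error)

func (f DetectorFunc) Detect(ctx context.Context) (Resource, error) { return f(ctx) }

// detectAll runs every detector concurrently under one shared deadline.
// Detectors that fail or time out are skipped; the remaining results are
// merged in the order the detectors were listed.
func detectAll(timeout time.Duration, detectors ...Detector) Resource {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	results := make([]Resource, len(detectors))
	var wg sync.WaitGroup
	for i, d := range detectors {
		wg.Add(1)
		go func(i int, d Detector) {
			defer wg.Done()
			if res, err := d.Detect(ctx); err == nil {
				results[i] = res // each goroutine writes only its own slot
			}
		}(i, d)
	}
	wg.Wait()

	merged := Resource{}
	for _, res := range results {
		for k, v := range res {
			if _, exists := merged[k]; !exists { // assumption: earlier detector wins
				merged[k] = v
			}
		}
	}
	return merged
}

// slow simulates a provider whose metadata lookup never answers in time.
func slow(ctx context.Context) (Resource, error) {
	select {
	case <-time.After(5 * time.Second):
		return Resource{"cloud.provider": "slow"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// fast simulates a provider that answers immediately.
func fast(ctx context.Context) (Resource, error) {
	return Resource{"cloud.provider": "fast"}, nil
}

func main() {
	res := detectAll(100*time.Millisecond, DetectorFunc(slow), DetectorFunc(fast))
	fmt.Println(res)
}
```

Collecting results into an indexed slice keeps the merge order deterministic even though the detectors finish in arbitrary order.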

I would start with requiring explicit enumeration of detectors by the user. It can always be extended to an auto-detector.

Note the overarching question here is:

"Is it okay for cloud-vendor specific detectors to live in the SDK directly as long as they don't include any third party libraries?"

I'll wait for a couple more opinions before updating the OTEP but if there is general consensus that this is okay, then I'll get rid of the notes about custom detectors living in separate packages.

text/0111-auto-resource-detection.md

MrAlias

I am in support of this proposal.

I think the details need to be ironed out, but I like the general approach.

text/0111-auto-resource-detection.md

MrAlias · 2020-06-05T19:58:03Z

text/0111-auto-resource-detection.md

+## Open questions
+
+- Does this interfere with any other upcoming specification changes related to resources?
+- If custom detectors need to live outside the core repo, what is the expectation regarding where they should be hosted?


I think this could follow the same model we have for vendor exporters, in that they are hosted in a vendor repository.

👍 I agree that would make sense. Will update based on the outcome of the discussion above.

text/0111-auto-resource-detection.md

AndrewAXue · 2020-06-23T16:57:34Z

text/0111-auto-resource-detection.md

+
+## Trade-offs and mitigations
+
+- In the case of an error at resource detection time, another alternative would be to start a background thread to retry following some strategy, but it's not clear that there would be much value in doing this, and it would add considerable unnecessary complexity.


I don't believe this will work, since you state that "the resources must be merged in the order the detectors were added". If there were a detector B after the one that failed, call it A, B would have to wait for A's detection to be complete. So there's no real value in making A do detection in a background thread if it blocks B anyways.


text/0111-auto-resource-detection.md

jkwatson · 2020-07-01T20:58:30Z

text/0111-auto-resource-detection.md

+
+```go
+type Detector interface {
+    Detect(ctx context.Context) (*Resource, error)


What is the context used for here?

This is just a Go-specific thing, where context always has to be passed around manually. For most other languages, ignore it. I should probably change these snippets to pseudo-code.

The telemetry context (which is indistinguishable from Go's context because it got re-used for that) is very meaningful for OpenTelemetry, so requiring it or not is an important distinction.

I suggest we remove it from here, as it seems to be a Go-specific thing ;)

The telemetry context (which is indistinguishable from Go's context because it got re-used for that) is very meaningful for OpenTelemetry, so requiring it or not is an important distinction.

That is a good point and I had not thought about that. I had only included the context in this code snippet for the regular Go usage of context. I don't think there's any requirement to pass the telemetry context into this function, but I'm happy to learn if there is a use case for that.

I suggest we remove it from here, as it seems to be a Go-specific thing ;)

I removed the code snippets (other than the Usage example) & replaced with slightly clearer wording.

jkwatson · 2020-07-01T21:33:26Z

FYI, in the java project, we have something similar for auto-detecting AWS resource: https://github.com/open-telemetry/opentelemetry-java/blob/master/sdk_extensions/aws_v1_support/src/main/java/io/opentelemetry/sdk/extensions/trace/aws/resource/AwsResource.java

carlosalberto · 2020-07-07T14:49:45Z

Like @MrAlias, I think this is great progress, and we can iron out the details as we go.

james-bebbington · 2020-07-09T08:12:11Z

This PR currently has three approvals. Before we merge this, I'd be keen to get more opinions on these important questions to inform how I write up the specification:

1. Is it okay for vendor-specific detectors to be included in the SDK?
2. Do we want to run any/all resource detection by default?

For number 2 I could see a strong case being made for either way. As per @jrcamp's comments above, for the majority of customers & vendors, I can imagine that running AWS + Azure + GCP detection by default on startup would be extremely useful. If we implement this, however, it would be difficult to ever remove it.

Oberon00 · 2020-07-09T10:00:13Z

I have a fundamental question for this PR: Is a new SDK-package-level API really needed for auto-detecting resources? I think resource detection should happen in contrib packages, so that you can do (pseudo-Java):

OpenTelemetry.setupTracing(AwsAutoDetector.detect().merge(K8SAutoDetector.detect()).merge(...));

It would be nice if K8SAutoDetector, AwsAutoDetector, etc. implemented some interface and were themselves autodetected, but then you should still be able to do this in a contrib package like this:

OpenTelemetry.setupTracing(ClasspathAutoDetector.detectAndMergeAll());

The only crucial SDK-level thing is to have some setupTracing/setupResources function that is comfortable to use.

Oberon00 · 2020-07-09T10:03:12Z

It would also be great if vendor-SDKs could profit from this, so I would immediately vote for moving the Resource data structure (only) back to the API level (from the SDK level).

Oberon00 · 2020-07-09T10:04:43Z

text/0111-auto-resource-detection.md

+
+## Internal details
+
+As described above, the following will be added to the Resource SDK


As stated in #111 (comment), I'm not sure if this should be added directly to the Resource SDK spec. I think language SIGs should be encouraged to implement this as a separate, optional package.

Oberon00 · 2020-07-09T10:07:45Z

text/0111-auto-resource-detection.md

+Resource.
+
+The `Detect` function should contain a mechanism to timeout and cancel the
+request. If a detector is not able to detect a resource, it must return an


the request

Which request? First time the word "request" appears in this OTEP. I think resource detection must not perform network I/O, as resources are immutable and thus initializing resources must happen before telemetry can be collected.

One way to lift this restriction would be to specify a way to do this asynchronously, and delay the first call to export() until resources are detected. But this warrants a whole new OTEP. For now, I'd say anything which could require a timeout is out of the question as input for resources.

I'm not sure that this is something that needs to be in the spec, but at least in the Go SDK this timeout/cancel mechanism would be provided by the context referenced in this comment. I think that's why it would be important to include the context even if it were not being used as an OpenTelemetry Context.

Which request? First time the word "request" appears in this OTEP. I think resource detection must not perform network I/O, as resources are immutable and thus initializing resources must happen before telemetry can be collected.

Changed the wording to "operation", since not all detectors would be creating a request of any kind.

However, I don't think it's unreasonable for detectors to perform network I/O synchronously if it is expected to be (relatively) quick, especially if a user specifically pulls in that package. Indeed I think this would not be uncommon when looking up resource information: the convention by major cloud vendors is to provide instance metadata via http://169.254.169.254

One way to lift this restriction would be to specify a way to do this asynchronously, and delay the first call to export() until resources are detected. But this warrants a whole new OTEP. For now, I'd say anything which could require a timeout is out of the question as input for resources.

I don't think this would be worth the complexity. We'd have to make sure the resource information is applied to any "pending" telemetry before being exported which would not be trivial.
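To make the synchronous-lookup-with-a-deadline option from this thread concrete, here is a hedged Go sketch of a detector that queries a metadata endpoint and gives up when its context expires. The `detectInstanceID` and `detectWithTimeout` helpers, the `host.id` key, and the fake instance ID are assumptions for this example; a local `httptest` server stands in for a real metadata service such as the one at http://169.254.169.254.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// Resource is a simplified stand-in for the SDK resource type.
type Resource map[string]string

// detectInstanceID fetches an instance ID from url and stops waiting as
// soon as ctx is cancelled or its deadline passes.
func detectInstanceID(ctx context.Context, url string) (Resource, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err // includes context deadline and cancellation errors
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("metadata endpoint returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return Resource{"host.id": strings.TrimSpace(string(body))}, nil
}

// detectWithTimeout runs detectInstanceID against a local stand-in server
// that waits for delay before answering.
func detectWithTimeout(delay, timeout time.Duration) (Resource, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
			fmt.Fprint(w, "i-0123456789\n") // made-up instance ID
		case <-r.Context().Done(): // client gave up; stop early
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return detectInstanceID(ctx, srv.URL)
}

func main() {
	res, err := detectWithTimeout(0, time.Second)
	fmt.Println(res, err)

	_, err = detectWithTimeout(2*time.Second, 50*time.Millisecond)
	fmt.Println("slow endpoint:", err)
}
```

In Go the deadline travels with the `context.Context` argument, which is the timeout/cancel mechanism referred to earlier in this thread; other languages would need their own equivalent.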

Good point about it being the user's choice. I just hope that most resources can be provided via things like environment variables or local files. For example, cloud instance metadata could probably be queried by the backend if the resource contains some sort of instance ID?

Oberon00 · 2020-07-09T10:20:32Z

text/0111-auto-resource-detection.md

+In the case of one or more detectors raising an error, there are two reasonable
+options:
+
+1. Ignore that detector, and continue with a warning (likely meaning we will


Please compare https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/error-handling.md

I don't think there's anything here that contradicts those guidelines given that detection code runs on initialization.

Usually we would "assume that users would prefer to lose telemetry data rather than have the library significantly change the behavior of the instrumented application", but given that this code runs on initialization, this seems to come under point no. 2: "The API or SDK may fail fast and cause the application to fail on initialization".

Having the detection code fail fast, and leaving it up to the user to handle this if desired seems reasonable?
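The two error-handling options weighed in this thread (continue with a warning, or fail fast and let the caller decide) might be sketched in Go as below. The `detect` helper, its `failFast` flag, the function-typed `Detector`, and the sample detectors are assumptions for this example only, as is the rule that an earlier detector wins on a key conflict.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Resource is a simplified stand-in for the SDK resource type.
type Resource map[string]string

// Detector is reduced to a plain function for this sketch.
type Detector func() (Resource, error)

// detect runs the detectors in order. With failFast set, the first error
// aborts detection and is returned to the caller; otherwise the failing
// detector is logged as a warning and skipped.
func detect(failFast bool, detectors ...Detector) (Resource, error) {
	merged := Resource{}
	for i, d := range detectors {
		res, err := d()
		if err != nil {
			if failFast {
				return nil, fmt.Errorf("detector %d: %w", i, err)
			}
			log.Printf("warning: detector %d skipped: %v", i, err)
			continue
		}
		for k, v := range res {
			if _, exists := merged[k]; !exists { // assumption: earlier detector wins
				merged[k] = v
			}
		}
	}
	return merged, nil
}

var errNoMetadata = errors.New("metadata unavailable")

// broken simulates a detector that cannot reach its data source.
func broken() (Resource, error) { return nil, errNoMetadata }

// host simulates a detector that succeeds.
func host() (Resource, error) { return Resource{"host.name": "web-1"}, nil }

func main() {
	if _, err := detect(true, broken, host); err != nil {
		// The application decides: abort startup, or fall back to an
		// empty resource and carry on.
		fmt.Println("fail-fast:", err)
	}

	res, _ := detect(false, broken, host)
	fmt.Println("lenient:", res)
}
```

Because detection runs once at initialization, returning the error leaves the choice between aborting and degrading with the application rather than the SDK.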

Sorry, that was a rather lazy and vague comment from me 😄 I guess what struck me as odd here is in what detail you describe error handling here. I think most of the specification does at most say "must not throw exceptions" and otherwise this document is the only thing that has more details.

james-bebbington · 2020-07-10T00:28:04Z

I have a fundamental question for this PR: Is a new SDK-package-level API really needed for auto-detecting resources? I think resource detection should happen in contrib packages, so that you can do (pseudo-Java):

...

That's a really good point. I have thought about this comment for quite a while now and can't think of any great counter arguments. I suppose the spec could still provide guidelines on how detect functions should be implemented? There is also still the question of which detectors the SDK should provide and/or run by default.

It would also be great if vendor-SDKs could profit from this, so I would immediately vote for moving the Resource data structure (only) back to the API level (from the SDK level).

On a related note, I notice that the Span & Metric Exporter interfaces exist in the SDK rather than the API so the OpenTelemetry exporters can't be re-used by vendor-SDKs. To me that seems like a similar, but larger, issue?

SergeyKanzhelev · 2020-07-14T23:47:30Z

Approvals from 3 companies and 2 TC members. Merging. @james-bebbington mentioned that some open comments might be addressed when merged into specification repo

james-bebbington requested review from arminru, bogdandrutu, c24t, carlosalberto, iredelmeier, jmacd, reyang, SergeyKanzhelev, tedsuo, tigrannajaryan and yurishkuro as code owners June 3, 2020 13:26

james-bebbington force-pushed the resource-autodetection branch 2 times, most recently from 9a6acca to c9e94b7 Compare June 3, 2020 13:30

fbogsany mentioned this pull request Jun 3, 2020

Add support for automated resource detection open-telemetry/opentelemetry-ruby#263

Merged

james-bebbington mentioned this pull request Jun 3, 2020

Auto detect the host resource and apply to metrics generated by receivers open-telemetry/opentelemetry-collector#871

Closed

nilebox reviewed Jun 3, 2020


jrcamp reviewed Jun 4, 2020


jrcamp mentioned this pull request Jun 4, 2020

Auto detect resource info open-telemetry/opentelemetry-go#785

Closed

Added auto resource detection proposal

57178d7

james-bebbington force-pushed the resource-autodetection branch from c9e94b7 to 57178d7 Compare June 4, 2020 01:46

yurishkuro reviewed Jun 4, 2020


MrAlias approved these changes Jun 5, 2020


Removed resource provider concept as per review comments

a829129

james-bebbington mentioned this pull request Jun 8, 2020

Monitoring: Metrics don't appear under any resource type GoogleCloudPlatform/opentelemetry-operations-js#79

Closed

jmacd approved these changes Jun 9, 2020


This was referenced Jun 10, 2020

Go exporter should detect GKE resource labels automatically census-ecosystem/opencensus-go-exporter-stackdriver#261

Open

Resource Detection processor open-telemetry/opentelemetry-collector-contrib#309

Merged

MrAlias mentioned this pull request Jun 18, 2020

Auto detection of GKE resource open-telemetry/opentelemetry-go-contrib#49

Closed

AndrewAXue reviewed Jun 23, 2020


AndrewAXue mentioned this pull request Jun 25, 2020

Resources prototype open-telemetry/opentelemetry-python#853

Merged

Changed the proposal back to separating resource detection from the t…

18155e6

…racer/meter providers, clarified default resource detection in more detail, and added more points to the trade-offs & mitigations section

Oberon00 reviewed Jun 30, 2020


AndrewAXue mentioned this pull request Jul 1, 2020

Seperate resource detection from the exporters into seperate packages GoogleCloudPlatform/opentelemetry-operations-python#12

Closed

jkwatson reviewed Jul 1, 2020


fbogsany mentioned this pull request Jul 2, 2020

SDK: should Resource initialization use env vars only? open-telemetry/opentelemetry-ruby#99

Closed

carlosalberto approved these changes Jul 7, 2020


james-bebbington force-pushed the resource-autodetection branch from 1480123 to 8fd8842 Compare July 9, 2020 08:03

james-bebbington requested a review from a team July 9, 2020 08:03

Oberon00 reviewed Jul 9, 2020


Wrap lines

7bbfeae

james-bebbington force-pushed the resource-autodetection branch from 8fd8842 to 7bbfeae Compare July 10, 2020 01:00

SergeyKanzhelev approved these changes Jul 14, 2020


Merge branch 'master' into resource-autodetection

0eed8a7

SergeyKanzhelev merged commit 5e5b2a2 into open-telemetry:master Jul 14, 2020

dyladan mentioned this pull request Jul 15, 2020

Resource auto-detection OTEP merged open-telemetry/opentelemetry-js#1317

Closed

dyladan mentioned this pull request Jul 29, 2020

Hosting vendor supported exporters for OpenTelemetry open-telemetry/opentelemetry-js#1335

Closed

TommyCpp mentioned this pull request Aug 12, 2020

Add resource detector open-telemetry/opentelemetry-rust#174

Merged

james-bebbington mentioned this pull request Aug 15, 2020

Added resource detection spec open-telemetry/opentelemetry-specification#811

Merged
