Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added auto resource detection proposal #111

Merged

Conversation

james-bebbington
Copy link
Member

@james-bebbington james-bebbington commented Jun 3, 2020

This OTEP proposes a mechanism to support automatic detection of resource information, including a default implementation to detect resource information from an environment variable. This is largely based on existing work in OpenCensus & the current implementation is OpenTelemetry JS SDK.

Related issues/PRs:

Note this is my first time creating an OTEP. I'm not aware of all the previous discussion on this topic (couldn't find any directly related previous work), but figured I'd go ahead and create an OTEP to start a discussion at least, as it seems like there is currently a gap in the specification.

text/0111-auto-resource-detection.md Outdated Show resolved Hide resolved
text/0111-auto-resource-detection.md Outdated Show resolved Hide resolved
text/0111-auto-resource-detection.md Outdated Show resolved Hide resolved

In order to apply auto-detected resource information to traces & metrics, a user will need to:

- Configure which resource detector(s) they would like to run (e.g. AWS EC2 detector)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is explicit configuration required? Could we run all cloud providers and take the first one that succeeds? If none succeed then the Detect() call fails.

Copy link
Member Author

@james-bebbington james-bebbington Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question - I probably should have added this as an Open question.

I've suggested a similar approach to what was in the OpenCensus spec which dicates that cloud-vendor specific detection must not be included in the SDK (and thus would require some configuration). But that may not be necessary. As you point out below, the code to lookup host / container metadata is usually trivial: lookup an environment variable(s) or get a text response from a known endpoint.

If we do allow cloud vendor-specific detectors to be included in the SDK that raises a few related questions:

  • Do we want to enforce that vendor libraries cannot be included in detectors?
  • Are we open to any cloud providers adding detection code here as long as they follow the above rule? (other than just major ones)
  • Do we attempt detection for all cloud providers by default? (including non-major ones)
  • If yes, then are there any cases where we would need to let users override this?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to enforce that vendor libraries cannot be included in detectors?

Yes IMO

Are we open to any cloud providers adding detection code here as long as they follow the above rule? (other than just major ones)

Yes as long as they add it to OT libraries for all languages.

Do we attempt detection for all cloud providers by default? (including non-major ones)
If yes, then are there any cases where we would need to let users override this?

Yeah this does seem like it could be tricky given the number of cloud providers and that they're on the rise. Could limit it by some factors, like "must have X% market share" to be included in autodetection but any legitimate provider can still add a provider as a fallback if they don't meet that threshold?

I think ideally we would run all autodetectors in parallel with a relatively small time limit (we use 1s by default in SignalFx agent but I don't have any hard data supporting what the correct value should be).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would start with requiring explicit enumeration of detectors by the user. It can always be extended to an auto-detector.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note the overarching question here is:

"Is it okay for cloud-vendor specific detectors to live in the SDK directly as long as they don't include any third party libraries?"

I'll wait for a couple more opinions before updating the OTEP but if there is general consensus that this is okay, then I'll get rid of the notes about custom detectors living in separate packages.

text/0111-auto-resource-detection.md Outdated Show resolved Hide resolved
text/0111-auto-resource-detection.md Show resolved Hide resolved
Copy link
Contributor

@MrAlias MrAlias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in support of this proposal.

I think the details need to be ironed out, but I like the general approach.

text/0111-auto-resource-detection.md Outdated Show resolved Hide resolved
text/0111-auto-resource-detection.md Outdated Show resolved Hide resolved
## Open questions

- Does this interfere with any other upcoming specification changes related to resources?
- If custom detectors need to live outside the core repo, what is the expectation regarding where they should be hosted?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could follow the same model we have for vendor exporters, in that they are hosted in a vendor repository.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I agree that would make sense. Will update based on the outcome of the discussion above.

text/0111-auto-resource-detection.md Outdated Show resolved Hide resolved
text/0111-auto-resource-detection.md Show resolved Hide resolved

## Trade-offs and mitigations

- In the case of an error at resource detection time, another alternative would be to start a background thread to retry following some strategy, but it's not clear that there would be much value in doing this, and it would add considerable unnecessary complexity.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this will work, since you state that "the resources must be merged in the order the detectors were added". If there were a detector B after the one that failed, call it A, B would have to wait for A's detection to be complete. So there's no real value in making A do detection in a background thread if it blocks B anyways.

…racer/meter providers, clarified default resource detection in more detail, and added more points to the trade-offs & mitigations section
text/0111-auto-resource-detection.md Outdated Show resolved Hide resolved
text/0111-auto-resource-detection.md Outdated Show resolved Hide resolved

```go
type Detector interface {
Detect(ctx context.Context) (*Resource, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the context used for here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a go specific thing, where context always has to be passed around manually. For most other languages, ignore it. I should probably change these snippets to psuedo-code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The telemetry context (which is indistinguishable from Go's context because it got re-used for that) is very meaningful for OpenTelemetry, so requiring it or not is an important distinction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we remove it from here, as it seems to be a Go-specific thing ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The telemetry context (which is indistinguishable from Go's context because it got re-used for that) is very meaningful for OpenTelemetry, so requiring it or not is an important distinction.

That is a good point and I had not thought about that. I had only included the context in this code snippet for the regular Go usage of context. I don't think there's any requirement to pass the telemetry context into this function, but I'm happy to learn if there is a use case for that.

I suggest we remove it from here, as it seems to be a Go-specific thing ;)

I removed the code snippets (other than the Usage example) & replaced with slightly clearer wording.

@jkwatson
Copy link
Contributor

jkwatson commented Jul 1, 2020

FYI, in the java project, we have something similar for auto-detecting AWS resource: https://github.com/open-telemetry/opentelemetry-java/blob/master/sdk_extensions/aws_v1_support/src/main/java/io/opentelemetry/sdk/extensions/trace/aws/resource/AwsResource.java

@carlosalberto
Copy link
Contributor

Just as @MrAlias I think this is great progress, and we can iron out the details as we go.

@james-bebbington
Copy link
Member Author

james-bebbington commented Jul 9, 2020

This PR currently has three approvals. Before we merge this, I'd be keen to get more opinions on these important questions to inform how I write up the specification:

  1. Is it okay for vendor specific detectors to be included in the SDK?
  2. Do we want to run any/all resource detection by default?

For number 2 I could see a strong case being made for either way. As per @jrcamp's comments above, for the majority of customers & vendors, I can imagine that running AWS + Azure + GCP detection by default on startup would be extremely useful. If we implement this, however, it would be difficult to ever remove it.

@Oberon00
Copy link
Member

Oberon00 commented Jul 9, 2020

I have a fundamental question for this PR: Is a new SDK-package-level API really needed for auto-detecting resources? I think resource detection should happen in contrib packages, so that you can do (pseudo-Java):

OpenTelemetry.setupTracing(AwsAutoDetector.detect().merge(K8SAutoDetector.detect()).merge(...));

It would be nice if K8SAutoDetector, AwsAutoDetector, etc. would implement some interface and were themselves autdetected, but then you should still be able to do this in a contrib package like this:

OpenTelemetry.setupTracing(ClasspathAutoDetector.detectAndMergeAll());

The only crucial SDK-level thing is to have some setupTracing/setupResources function that is comfortable to use.

@Oberon00
Copy link
Member

Oberon00 commented Jul 9, 2020

It would also be great if vendor-SDKs could profit from this, so I would immediately vote for moving the Resource data structure (only) back to the API level (from the SDK level).


## Internal details

As described above, the following will be added to the Resource SDK
Copy link
Member

@Oberon00 Oberon00 Jul 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As stated in #111 (comment), I'm not sure if this should be added directly to the Resource SDK spec. I think language SIGs should be encouraged to implement this as a separate, optional package.

Resource.

The `Detect` function should contain a mechanism to timeout and cancel the
request. If a detector is not able to detect a resource, it must return an
Copy link
Member

@Oberon00 Oberon00 Jul 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the request

Which request? First time the word "request" appears in this OTEP. I think resource detection must not perform network I/O, as resources are immutable and thus initializing resources must happen before telemetry can be collected.

One way to lift this restriction would be to specify a way to do this asynchronously, and delay the first call to export() until resources are detected. But this warrants a whole new OTEP. For now, I'd say anything which could require a timeout is out of the question as input for resources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is something that needs to be in the spec, but at least in the Go SDK this timeout/cancel mechanism would be provided by the context referenced in this comment. I think that's why it would be important to include the context even if it were not being used as an OpenTelemetry Context.

Copy link
Member Author

@james-bebbington james-bebbington Jul 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which request? First time the word "request" appears in this OTEP. I think resource detection must not perform network I/O, as resources are immutable and thus initializing resources must happen before telemetry can be collected.

Changed the wording to operation since not all detectors would not be creating any kind of request.

However, I don't think it's unreasonable for detectors to perform network I/O synchronously if it is expected to be (relatively) quick, especially if a user specifically pulls in that package. Indeed I think this would not be uncommon when looking up resource information: the convention by major cloud vendors is to provide instance metadata via http://169.254.169.254

One way to lift this restriction would be to specify a way to do this asynchronously, and delay the first call to export() until resources are detected. But this warrants a whole new OTEP. For now, I'd say anything which could require a timeout is out of the question as input for resources.

I don't think this would be worth the complexity. We'd have to make sure the resource information is applied to any "pending" telemetry before being exported which would not be trivial.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point with it being the users choice. I just hope that most resources can be provided via things like environment variables or local files. For example, cloud instance metadata could probably queried by the backend if the resource contains some sort of instance ID?

In the case of one or more detectors raising an error, there are two reasonable
options:

1. Ignore that detector, and continue with a warning (likely meaning we will
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's anything here that contradicts those guidelines given that detection code runs on initialization.

Usually we would "assume that users would prefer to lose telemetry data rather than have the library significantly change the behavior of the instrumented application", but given that this code runs on initialization, this seems to come under pont no. 2: "The API or SDK may fail fast and cause the application to fail on initialization".

Having the detection code fail fast, and leaving it up to the user to handle this if desired seems reasonable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that was a rather lazy and vague comment from me 😄 I guess what struck me as odd here is in what detail you describe error handling here. I think most of the specification does at most say "must not throw exceptions" and otherwise this document is the only thing that has more details.

@james-bebbington
Copy link
Member Author

I have a fundamental question for this PR: Is a new SDK-package-level API really needed for auto-detecting resources? I think resource detection should happen in contrib packages, so that you can do (pseudo-Java):

...

That's a really good point. I have thought about this comment for quite a while now and can't think of any great counter arguments. I suppose the spec could still provide guidelines on how detect functions should be implemented? There is also still the question of which detectors the SDK should provide and/or run by default.

It would also be great if vendor-SDKs could profit from this, so I would immediately vote for moving the Resource data structure (only) back to the API level (from the SDK level).

On a related note, I notice that the Span & Metric Exporter interfaces exist in the SDK rather than the API so the OpenTelemetry exporters can't be re-used by vendor-SDKs. To me that seems like a similar, but larger, issue?

@SergeyKanzhelev
Copy link
Member

Approvals from 3 companies and 2 TC members. Merging. @james-bebbington mentioned that some open comments might be addressed when merged into specification repo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.