conceptual issue with input.exec (interval vs timeout due to execution time) #2629

Closed
cha87de opened this issue Apr 5, 2017 · 9 comments

Comments

@cha87de

cha87de commented Apr 5, 2017

I'm trying to add metrics to the TICK stack. To avoid compile-time dependencies I'm using the input.exec plugin. The idea is to call an external binary every 10s; the binary makes observations, does some calculations, and prints its output in the influx data format.

Here comes my conceptual problem: if I want to avoid gaps in the monitoring, I have to run this binary every 10s, which observes for 10s and produces the output. Producing the output takes 0.1s. Hence, the binary runs for 10.1s and hence needs a timeout of 10.1s.

Unfortunately, telegraf doesn't allow me to set the timeout larger than the interval. So I probably have a conceptual mistake in the way I'm using the input.exec plugin. I understand that telegraf wants to avoid running one plugin twice for the same time frame, but I'm not sure how to avoid that.

So my question is: what is the intended way in telegraf to integrate an external monitoring tool which needs time to a) observe and b) compile its output, without leaving monitoring gaps?
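As a rough illustration of the kind of adaptor described above (not the author's actual tool), the sketch below observes for a fixed window and only then formats a summary. The function names, the measurement name `adaptor`, the `host=example` tag, and the sampling callback are all hypothetical. It shows why the total run time is the observation window plus the processing time, and so always exceeds a timeout that is capped at the interval.

```python
import time


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as InfluxDB line protocol (float fields only, no escaping)."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def run_once(sample, observe_seconds=10.0, sample_every=1.0, sleep=time.sleep):
    """Observe for observe_seconds, then summarize the collected samples."""
    values = []
    deadline = time.monotonic() + observe_seconds
    while time.monotonic() < deadline:
        values.append(sample())  # hypothetical observation callback
        sleep(sample_every)
    # Pre-processing happens only after the window has closed, so the
    # process lives for observe_seconds plus the time spent here.
    fields = {"mean": sum(values) / len(values), "max": max(values)}
    # Measurement and tag names are placeholders for illustration.
    return to_line_protocol("adaptor", {"host": "example"}, fields, time.time_ns())
```

Calling `print(run_once(read_counter))` with some `read_counter` function of your own would block for about 10 seconds before printing a single line, which is the behaviour that collides with the timeout limit.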

@danielnelson
Contributor

> Producing the output takes 0.1s. Hence, the binary runs for 10.1s and hence needs a timeout of 10.1s.

If the output takes 0.1s, then the binary runs for 0.1s. The timeout could be any number larger than this. I would set the timeout to 5s if you are targeting a 10s interval.

@phemmer
Contributor

phemmer commented Apr 6, 2017

@danielnelson The tool takes 10 seconds to gather data. After gathering, it only takes 0.1s to process the data and dump it out.

Also, see #2087 which would allow you to run a process which just stays running indefinitely, and emits whenever it wants to. Sounds exactly like what is being asked for.

@cha87de
Author

cha87de commented Apr 6, 2017

The conceptual thing I'm doing wrong is that I want to do some pre-processing instead of simply dumping raw data, like system counters, into the TICK stack. This pre-processing takes time (0.1s) and is based on data I first have to gather (over a 10s interval). So, @danielnelson, the processing takes 0.1s, but the total execution time of my binary (I'll call it the adaptor from now on) is, to repeat myself, 10.1s. The question for the telegraf developers is, IMHO, whether such pre-processing tasks are supported or should not happen at all, since the raw data can be post-processed via InfluxDB.

If telegraf wants to support such adaptors with pre-processing, I see two approaches:

  • Allow one adaptor to run twice at the same time so that runs can overlap, i.e. allow timeout > interval and start the adaptor when the interval has passed even though another instance of it is still running.
  • Allow unlimited execution times for adaptors, as suggested in #2087 (exec input should be able to handle long-running commands), and simply restart the adaptor if it stops.

@danielnelson
Contributor

On the question of whether you should preprocess data, Telegraf doesn't really have an opinion. It is your data and you know it best. That said, if you can do the processing with queries in InfluxDB, it might be more flexible; we usually favor this approach when writing input plugins.

We have some support for processors and aggregators, but there are very few currently implemented, and they run against all collected points.

Another great place to perform this type of action is in Kapacitor. You can position Kapacitor either before or after InfluxDB and it can perform advanced processing.

Otherwise, maybe you could split your executable into two stages, and pass information from the collector stage to the processor stage via file or other persistent storage.
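One way the two-stage split might look, as a minimal sketch only: a collector stage writes the raw samples of each finished window to a state file, and a short-lived processor stage (the part that would be run by the exec input) reads that file and summarizes it. The function names and the JSON file layout are assumptions made for illustration and are not part of Telegraf.

```python
import json
import os
import tempfile


def save_samples(path, samples):
    """Collector stage: replace the state file with the latest raw samples."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(samples, handle)
    # Write to a temp file first, then rename, so a reader never sees a
    # half-written file (the rename is atomic on POSIX filesystems).
    os.replace(tmp_path, path)


def process_samples(path):
    """Processor stage: summarize the last completed window, if there is one."""
    try:
        with open(path) as handle:
            samples = json.load(handle)
    except FileNotFoundError:
        return None  # the collector has not finished a window yet
    if not samples:
        return None
    return {"mean": sum(samples) / len(samples), "max": max(samples)}
```

With this split the processor finishes well inside any reasonable timeout, at the cost of the shared on-disk state between runs.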

I don't think we want to implement overlapping executions. #2087 sounds good but of course it's not implemented.

@cha87de
Author

cha87de commented Apr 10, 2017

I will have a look at Kapacitor. That said, my binary should not be limited to the TICK stack, which is why I personally think pre-processing inside an adaptor executed with input.exec should be supported.

A two-stage execution would require persisted, shared state between runs (e.g. on a file system), which is something I wanted to avoid in order to keep the logic simple and the required resources low.

Could you @danielnelson please elaborate a bit more on the statement "I don't think we want to implement overlapping executions."? Why do you think overlapping executions are not useful?

@phemmer
Contributor

phemmer commented Apr 10, 2017

While I can't speak for @danielnelson (and I'm not a project member, so take my opinion with a grain of salt), I would agree with the idea that we shouldn't support overlapping runs. I'll try to explain my thoughts, but it's kinda hard to articulate.

No plugin currently supports it, and just conceptually I think it would be a bad idea. The only use case I can think of for it is something like this, where the plugin has finished measuring but takes some time to process data. But telegraf can't know that. You'd essentially be saying it's OK for telegraf to gather metrics for the same period twice (even though that's not what this specific example is doing).

I think in this specific example, a long running execution is better. It's entirely possible that due to simple CPU scheduling jitter, each time the external app is launched, it starts gathering data just slightly before, or just slightly after the 10s mark. This would result in a tiny overlap, or a tiny gap. To properly solve this the plugin would need to stay running, and handle some sort of atomic cutoff internally. Some way of ensuring that one monitoring period starts at the exact moment the previous one ends. In your case the previous one can still stay running after the cutoff, doing the data processing and output, but at the same time it's already started monitoring for the new period.
There are various ways you might accomplish this, but I can't provide examples without knowing more details of what you're doing.
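To make the "atomic cutoff" idea a bit more concrete, here is a minimal sketch of a long-running adaptor loop. Everything in it is hypothetical (the class, the function names, and the assumption that a separate sampler thread calls `add`). The buffer swap happens under a lock, so the next window starts at the exact moment the previous one ends, and the slow processing runs after the swap while new samples are already being collected.

```python
import threading
import time


class WindowedCollector:
    """Accumulates samples; cut() closes one window and opens the next."""

    def __init__(self):
        self._lock = threading.Lock()
        self._samples = []

    def add(self, value):
        # Called by a sampler thread (not shown) whenever a value is observed.
        with self._lock:
            self._samples.append(value)

    def cut(self):
        """Swap in a fresh buffer under the lock and return the finished window."""
        with self._lock:
            finished, self._samples = self._samples, []
        return finished


def summarize(samples):
    """Placeholder for the pre-processing step described in the thread."""
    if not samples:
        return None
    return {"count": len(samples), "mean": sum(samples) / len(samples)}


def run(collector, emit, interval=10.0, windows=None,
        clock=time.monotonic, sleep=time.sleep):
    """Cut on a fixed schedule; windows=None means run until killed."""
    next_cut = clock() + interval
    done = 0
    while windows is None or done < windows:
        sleep(max(0.0, next_cut - clock()))
        finished = collector.cut()  # the new window is collecting from here on
        emit(summarize(finished))   # slow work no longer delays collection
        next_cut += interval        # schedule from the deadline to avoid drift
        done += 1
```

The moment of each cut can still jitter slightly with scheduling, but because the swap is atomic every sample lands in exactly one window, so there is neither overlap nor gap. In a real adaptor `emit` would print line protocol to stdout for whatever is reading the process output.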

@danielnelson
Contributor

Yeah, I think overlapping executions sounds complicated to explain and implement, and not something that many people would use. The long running subprocess idea makes more sense to me, and reminds me of FCGI a bit.

My way of thinking about Telegraf is that it is not meant to be an advanced data processor. We try to hit the basics that most people will need, but for more advanced operations we recommend Kapacitor, which excels at processing and can run a long-lived process[1].

[1] https://github.com/influxdata/kapacitor/tree/master/udf/agent/

@danielnelson
Contributor

@cha87de I hope one of the ideas we discussed will work for you, let me know if you have any more questions.

@cha87de
Author

cha87de commented Apr 12, 2017

Yes, thank you for your feedback!
