conceptual issue with input.exec (interval vs timeout due to execution time) #2629

cha87de · 2017-04-05T12:16:46Z

I'm trying to add metrics to the TICK stack. To avoid compile dependencies I'm using the input.exec plugin. The idea is to call an external binary every 10s; the binary makes observations, then finishes with some calculations and prints its output in the influx data format.

Here comes my conceptual problem: if I want to avoid gaps in the monitoring, I have to run this binary every 10s, and each run observes for 10s and then produces the output. Producing the output takes 0.1s. Hence the binary runs for 10.1s and needs a timeout of at least 10.1s.
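To make the timing concrete, here is a minimal sketch of such an adaptor in Python. It is not the poster's binary: the function names `observe` and `to_line_protocol`, the measurement name, and the sampled values are all hypothetical. The formatter assumes float fields only and does no escaping of spaces or commas.

```python
import time


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as `measurement,tag=v field=v timestamp`.

    Simplified assumption: float fields only, no escaping of special characters.
    """
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


def observe(sample, window_s, step_s, clock=time.monotonic, sleep=time.sleep):
    """Call `sample()` every `step_s` seconds until `window_s` has elapsed."""
    deadline = clock() + window_s
    values = []
    while clock() < deadline:
        values.append(sample())
        sleep(step_s)
    return values


def run_once(sample, window_s=10.0, step_s=1.0):
    """One exec-style run: observe for the whole window, then emit one line."""
    values = observe(sample, window_s, step_s)
    mean = sum(values) / len(values)
    return to_line_protocol("adaptor", {"source": "exec"}, {"mean": mean},
                            time.time_ns())
```

Because `run_once` blocks for the whole observation window before it formats anything, the process lifetime is always the window plus the output time, so it can never finish inside an interval of the same length.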

Unfortunately, telegraf doesn't allow me to set the timeout larger than the interval. So I probably have some conceptual mistake in my approach to using the input.exec plugin. I understand that telegraf wants to avoid one plugin running twice for the same time frame, but I'm not sure how to avoid that.

So, the question I have is: what is the way telegraf intends for integrating an external monitoring tool which needs time to a) observe and b) compile output in the correct format, without having monitoring gaps?

danielnelson · 2017-04-05T17:53:33Z

> Producing the output takes 0.1s. Hence, the binary runs for 10.1s and hence needs a timeout of 10.1s.

If the output takes 0.1s, then the binary runs for 0.1s. The timeout could be any number larger than this. I would time it out every 5s if you are targeting a 10s interval.

phemmer · 2017-04-06T00:40:34Z

@danielnelson The tool takes 10 seconds to gather data. After gathering, it only takes 0.1s to process the data and dump it out.

Also, see #2087 which would allow you to run a process which just stays running indefinitely, and emits whenever it wants to. Sounds exactly like what is being asked for.

cha87de · 2017-04-06T06:40:18Z

The conceptual thing I'm doing wrong is that I want to do some pre-processing instead of simply dumping raw data like system counters into the TICK stack. This pre-processing takes time (0.1s) and is based on data I first have to gather (10s interval). So @danielnelson, the processing takes 0.1s, but the total execution time of my binary (I will call it the adaptor from now on) is, to repeat myself, 10.1s. The question to the telegraf developers is, imho, whether such pre-processing tasks are allowed or should not happen at all, since the raw data can be post-processed via InfluxDB.

If telegraf wants to support such adaptors with pre-processing, I see two approaches:

1. Allow one adaptor to run twice at a time so that its runs can overlap. That means allowing timeout > interval and starting the adaptor when the interval has passed, even though another instance of this adaptor is still running.
2. Allow infinite execution times for adaptors, as suggested in #2087 (exec input should be able to handle long-running commands), and simply restart the adaptor in case it stops.
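The second approach can be sketched as a small supervisor loop. This is only an illustration of the idea, not how Telegraf or #2087 works: the function `supervise`, its `max_starts` limit, and the `handle_line` callback are hypothetical names introduced here.

```python
import subprocess


def supervise(cmd, handle_line, max_starts):
    """Run `cmd`, forward each stdout line, and restart it when it exits.

    `max_starts` bounds the number of launches so the sketch terminates;
    a real supervisor would more likely loop until asked to stop.
    """
    starts = 0
    while starts < max_starts:
        starts += 1
        # The context manager closes the pipe and waits for the child to exit.
        with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
            for line in proc.stdout:
                # The adaptor decides when to emit; each line is passed on
                # as soon as it arrives.
                handle_line(line.rstrip("\n"))
    return starts
```

With this shape the adaptor is free to observe for as long as it likes and to print whenever a window is complete, so interval and timeout no longer constrain its run time.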

danielnelson · 2017-04-06T18:16:31Z

On the topic of whether you should preprocess data, Telegraf doesn't really have an opinion. It is your data and you know it best. That said, if you can do the queries in InfluxDB it might be more flexible for querying; we usually favor this when writing input plugins.

We have some support for processors and aggregators, but there are very few currently implemented, and they run against all collected points.

Another great place to perform this type of action is in Kapacitor. You can position Kapacitor either before or after InfluxDB and it can perform advanced processing.

Otherwise, maybe you could split your executable into two stages, and pass information from the collector stage to the processor stage via file or other persistent storage.
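One possible reading of the two-stage idea, sketched in Python: every short run reads a cumulative counter, compares it with the snapshot the previous run left in a state file, and emits the rate for the window in between. The function name `run_once`, the JSON state layout, and the `adaptor rate=` output are assumptions made for illustration, and the sketch assumes the observed quantity is available as a cumulative counter.

```python
import json
import os
import tempfile


def run_once(state_path, counter_value, now_s):
    """Emit the rate since the previous run, then store the new snapshot."""
    previous = None
    try:
        with open(state_path) as fh:
            previous = json.load(fh)
    except FileNotFoundError:
        pass  # first run: nothing to compare against yet

    # Write the new snapshot to a temp file and rename it into place, so a
    # run that is killed mid-write does not leave a truncated state file.
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump({"value": counter_value, "time": now_s}, fh)
    os.replace(tmp_path, state_path)

    if previous is None:
        return None
    elapsed = now_s - previous["time"]
    if elapsed <= 0:
        return None
    rate = (counter_value - previous["value"]) / elapsed
    return f"adaptor rate={rate} {int(now_s * 1e9)}"
```

Each run finishes in well under the interval, and consecutive windows share their boundary snapshot, so there is no gap between them. The cost is the persisted state on disk.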

I don't think we want to implement overlapping executions. #2087 sounds good but of course it's not implemented.

cha87de · 2017-04-10T12:11:42Z

I will have a look at Kapacitor. Still, my binary should not be limited to the TICK stack, which is why I personally think pre-processing inside an adaptor executed with input.exec should be supported.

A two-stage execution would require persisted, shared state between runs (e.g. on a file system). That is something I wanted to avoid, to keep the logic simple and the required resources low.

Could you @danielnelson please elaborate a bit more on the statement "I don't think we want to implement overlapping executions."? Why do you think overlapping executions are not useful?

phemmer · 2017-04-10T12:26:02Z

While I can't speak for @danielnelson (and I'm not a project member, so take my opinion with a grain of salt), I would agree with the idea that we shouldn't support overlapping runs. I'll try to explain my thoughts, but it's kinda hard to articulate.

No plugin currently supports it, and just conceptually I think it would be a bad idea. The only use case I can think of for it is something like this, where the plugin has finished measuring but takes some time to process data. But telegraf can't know that. You'd essentially be saying it's OK for telegraf to gather metrics for the same period twice (even though that's not what this specific example is doing).

I think in this specific example, a long running execution is better. It's entirely possible that due to simple CPU scheduling jitter, each time the external app is launched, it starts gathering data just slightly before, or just slightly after the 10s mark. This would result in a tiny overlap, or a tiny gap. To properly solve this the plugin would need to stay running, and handle some sort of atomic cutoff internally. Some way of ensuring that one monitoring period starts at the exact moment the previous one ends. In your case the previous one can still stay running after the cutoff, doing the data processing and output, but at the same time it's already started monitoring for the new period.
There are various ways you might accomplish this, but I can't provide examples without knowing more details of what you're doing.
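One way the cutoff described above might look, as a hedged Python sketch: a long-running collector reads a cumulative counter exactly once per boundary, and that single reading both closes the previous window and opens the next one. Processing and output happen on a worker thread while the next window is already being measured. The names `run_collector`, `read_counter`, and `emit` are hypothetical, and the sketch assumes the observed quantity is a cumulative counter.

```python
import queue
import threading
import time


def run_collector(read_counter, interval, windows, emit,
                  clock=time.monotonic, sleep=time.sleep):
    """Measure `windows` back-to-back windows of length `interval` seconds."""
    jobs = queue.Queue()

    def worker():
        # Slow processing/output runs here and never delays the next cutoff.
        while True:
            job = jobs.get()
            if job is None:
                break
            start, end, before, after = job
            emit(start, end, after - before)

    thread = threading.Thread(target=worker)
    thread.start()

    origin = clock()
    previous = read_counter()
    for k in range(1, windows + 1):
        # Boundaries are computed from the origin, so scheduling jitter in
        # one window does not accumulate into later windows.
        start = origin + (k - 1) * interval
        end = origin + k * interval
        delay = end - clock()
        if delay > 0:
            sleep(delay)
        current = read_counter()  # ends window k and starts window k + 1
        jobs.put((start, end, previous, current))
        previous = current

    jobs.put(None)  # tell the worker to finish
    thread.join()
```

A call such as `run_collector(read_counter, 10.0, 6, emit)` would cover one minute in six contiguous windows; in a real adaptor `emit` would print one line of output per window.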

danielnelson · 2017-04-11T00:26:29Z

Yeah, I think overlapping executions sounds complicated to explain and implement, and not something that many people would use. The long running subprocess idea makes more sense to me, and reminds me of FCGI a bit.

My way of thinking about Telegraf is that it is not meant to be an advanced data processor. We try to hit the basics that most people will need, but for more advanced operations we recommend Kapacitor, which excels at processing and can run a long-lived process[1].

[1] https://github.com/influxdata/kapacitor/tree/master/udf/agent/

danielnelson · 2017-04-12T00:50:54Z

@cha87de I hope one of the ideas we discussed will work for you, let me know if you have any more questions.

cha87de · 2017-04-12T06:22:07Z

Yes, thank you for your feedback!

danielnelson closed this as completed Apr 12, 2017
