Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow the input plugin http_response to generate Prometheus friendly metrics #3803

Closed
mirath opened this issue Feb 17, 2018 · 5 comments
Closed
Assignees
Labels
feature request Requests for new plugin and for new features to existing plugins
Milestone

Comments

@mirath
Copy link
Contributor

mirath commented Feb 17, 2018

Feature Request

Right now, the http_response plugin sets the HTTP return code (200, 301, 502, etc) as a numeric field. This causes Prometheus (which doesn't have the concept of fields) to generate a new metric wich value is the integer value of the HTTP response.

This not only generates another metric to ingest by Prometheus, but also makes it really cumbersome to run queries against this data, as this type of information is better represented as a Tag in Prometheus

Additionally, when a timeout occurs, the plugin setups only one field, the string field result_type, which causes the Prometheus output plugin to transform it to a tag (as there are no string values in Prometheus metrics outside of tags). This results in a metric with no fields, which causes de Prometheus output plugin to drop the metric, as Prometheus metrics cannot exist without a value.

Proposal:

  1. Add an option to add the http_response_code and result_type fields as tags, so they can be easily aggregated over in Prometheus databases

  2. Add an option to always set the response_time field, so the plugin always returns a numeric value, with optional options to specify what value to set this field to in case of timeout (-1, 1000, actual time spent waiting for a response, etc)

Current behavior:

Currently metrics received from the http_response input plugin are not Prometheus friendly. HTTP return codes are difficult to aggregate in Prometheus and no metrics are received when a timeout occurs

Desired behavior:

Metric received from the http_response input plugin by a Prometheus database should be easy to aggregate by HTTP return codes, and metrics should still be received in case of a timeout

Use case: [Why is this important (helps with prioritizing requests)]

Prometheus is currently an important and popular TSDB, and measuring return codes and latency of specific URLs is an important and very useful black box monitoring technique. This feature would make it easier for Prometheus users to keep logs of latency, uptime and incidents of their services using Telegraf

@mirath
Copy link
Contributor Author

mirath commented Feb 17, 2018

If possible, I'm willing to work on this feature

@danielnelson
Copy link
Contributor

Maybe we add http_response_code as a status_code tag, we will want a new name anyway to ease the transition without needing to select tag vs field. Going with this we could add result_type as result tag.

I suggest not adding response_time always since it won't always have a value, instead how about we add result_code and define it as an integer representation of result_type?

There is a fair bit of field duplication going on if we do this, so ultimately we should deprecate the string fields in favor of the tag + int field. This just means we comment it in the README.

To further explain the reasoning above, we have been trying to figure out the best way to represent enumerations so that they will look good in Grafana/Chronograf and in multiple databases. I think what we are arriving at is to have a tag with a string description and a field with a numeric representation. The numeric representation is mostly for two reasons:

  • You must have at least one field value, so this fulfills this in the case that there are no other fields
  • Gives you a value that you can graph as a single series in Grafana/Chronograf

Let me know what you think of the scheme and how well it works out with prometheus.

@danielnelson danielnelson added the feature request Requests for new plugin and for new features to existing plugins label Feb 20, 2018
@mirath
Copy link
Contributor Author

mirath commented Feb 21, 2018

I'm having a hard time following your suggestion. Let me see if I got it right:

  • We would add a TAG status_code that would hold the HTTP status code as a string. This tag would exist together with the int field http_response_code
  • We would add a TAG result that would hold a string representation of what is currently the result_type field.
  • The field FIELD result_type would be transformed into a numeric representation of the newly added TAG result

In the end we would have a newly added result tag with a related result_type int field, and a status_code tag with a related http_response_code field

Did I understand you correctly?

@danielnelson
Copy link
Contributor

One minor correction, the result_type field would remain as is but we add result_code.

Here is a diff of what I'm thinking:

- http_response,method=GET,server=http://www.github.com http_response_code=200i,response_time=6.223266528,result_type="success" 1459419354977857955
+ http_response,method=GET,server=http://www.github.com,status_code=200,result=success http_response_code=200i,response_time=6.223266528,result_type="success",result_code=0i 1459419354977857955

@mirath
Copy link
Contributor Author

mirath commented Feb 21, 2018

Ok, I understand now.

As a whole, as you said, there is a lot of field duplication going on. I guess that is required for the time being to avoid breaking changes, and come 2.0 you will drop the string fields as part of your new philosophy regarding the representation of enumerations.

From a prometheus standpoint, the ideal would be only one field with everything else represented in tags, so the extra fields for enumerations (result_type and result_code) go unused even though they would be ingested. I'm not sure what would the "Prometheus way" of representing latency be, since latency can be "null" (in case of timeout or error), I guess I would use NaN when no response is received.

If possible, I'd like a way to deactivate the generation of these fields to reduce the amount of generated and ingested metrics by Prometheus, and be able to set a default numeric representation of "null" value for the latency. But I understand if this additional feature does not align with the project

Now, I haven't used Chronograf/InfluxDB, so take what I'm about to say with a grain of salt.

I don't see the usefulness of representing enumerations as an integer value. You mentioned two points:

  • You must have at least one field value, so this fulfills this in the case that there are no other fields: Requiring at least one field value strikes me as a good idea, but this raises a big issue: what about null values (such as in latency, what to return when there is a network error). Ideally there would be a null representation for each target database, but that might not always be the case. For databases without a reprentation of null, I would add a tag of of the form null=(yes|no), and let each output mantainer handle it the best way they deem correct, with the requirement that dropping the metric only be used as a last resort (dropping potentially useful information should only be done if no other option is available)

  • Gives you a value that you can graph as a single series in Grafana/Chronograf: I honestly don't see too much value here. In Prometheus we translate booleans to 0,1, and sometimes graphing them is useful when creating alerts in Grafana, but for the general case I don't see much value as much of the operations you do in a graph (sum, avg, max, etc) do not have much use when operating in a discrete space where states cannot be easily aggregated (ie. It does not make any sense to sum over 5min the HTTP status codes of a website). It is useful to look at some state changes over time, but I feel that to properly track that a custom abstraction is needed. However if, for now, this is the only way we have to track state changes on some metric, I guess dropping or limiting this use case does more harm than good.

PS. I'll get working on these changes this weekend

@mirath mirath mentioned this issue Feb 21, 2018
3 tasks
@danielnelson danielnelson added this to the 1.6.0 milestone Mar 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requests for new plugin and for new features to existing plugins
Projects
None yet
Development

No branches or pull requests

2 participants