
Prometheus support on v1/sys/metrics endpoint #5308

Merged: 12 commits from uepoch:prometheus into hashicorp:1.1-beta on Feb 14, 2019

Conversation

@uepoch (Contributor) commented Sep 10, 2018

Add support for a Prometheus sink and InMemSink, as described in #5223.

This adds support for a Prometheus sink in the telemetry config, enabled by specifying a prometheus_retention_time key in the telemetry configuration, similar to Consul or Nomad.

To pass the information through the config and the server command to the HTTP handler, it expands HandlerProps to carry the in-memory sink and the optional Prometheus retention time.
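
For illustration, the resulting telemetry stanza would look roughly like this (the retention value is just an example; disable_hostname is shown because of the go-metrics hostname issue mentioned below):

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
}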

I also made the following choices:

  • Parse the key from a duration string, since nanoseconds are hard for humans to read and typical Prometheus retention times are expressed in seconds.
  • Enable Prometheus in dev mode.
  • Vendor the Prometheus library (a large vendoring change).

This does not forward anything to the leader; each node has its own sinks.

@uepoch (Contributor, Author) commented Sep 24, 2018

@jefferai, sorry for the ping; do you think this is enough for #5223?
We've been using it in our infra for 1-2 weeks with no big problems, except for go-metrics hostname handling when you're not passing disable_hostname in the telemetry config (hashicorp/go-metrics#83).

@jefferai (Member):

I have not looked at the code, but one thing that was surfaced to me by a team member is that this uses an unauthenticated endpoint. The endpoint must be authenticated.

@uepoch (Contributor, Author) commented Sep 25, 2018

Prometheus does not support passing arbitrary headers (hence no X-Vault-Token).

After searching, I found that a PR has been made on Consul to comply with RFC 6750 and allow Authorization: Bearer headers to be used in place of X-Consul-Token:
hashicorp/consul#4502
Do you think it's something you'd want to support in the future?

@jefferai (Member):

Makes sense to me; @chrishoffman @briankassouf thoughts?

@jefferai (Member):

Note for the team: if we support this we need to be sure to blacklist Authorization from being a passthrough request header.

@briankassouf (Contributor):

Sounds good to me too

@uepoch (Contributor, Author) commented Nov 12, 2018

Hello again, small update:

So #5397 has been merged, so I guess it could be used.

However, when we discussed this at HashiConf, @jefferai, you mentioned that metrics information is sensitive and so should not be exposed to the wild, and we talked about maybe putting it behind a separate listener with CIDR limitations. Would that still be OK?

This would ease integration with third-party monitoring tools.

@jefferai (Member):

I think if it's behind a token additional CIDR restrictions aren't necessary, especially given you can pop CIDR restrictions on tokens as well. @briankassouf agree?

@briankassouf (Contributor):

Yep, now that it is behind a token we should be okay to allow these to be exported through the API

@uepoch (Contributor, Author) commented Dec 12, 2018

The point of binding to another listener was to avoid leaking potentially sensitive information (since one may consider Vault's metrics sensitive) while still being usable by a regular monitoring system without having to authenticate.
It's been a while since we talked about it, tbh, so I'm not 100% clear on the details, but it seemed acceptable to you to leave the endpoint unauthenticated if it were behind a separate listener for which the user can provide CIDR limitations.

It's not impossible for Prometheus to hit the endpoint with a Vault token, but it would complicate existing setups for all users, as Prometheus has no logic of its own to renew a provided token.

What do you think @jefferai @briankassouf ?

Review comments (resolved) on command/server.go, http/sys_metrics.go, and command/server/config.go.
@ncabatoff (Contributor):

Prometheus may not have a built-in way to renew a provided token, but it has the bearer_token_file config option, so something external like a cron job can handle renewal.

Why not proceed with the existing PR under the assumption that the token is required, and use another issue for the tokenless CIDR-restricted listener idea? That way we can at least get the work you've already done merged, rather than have it blocked by a more contentious extra feature.

I tried applying the changes on the tip of master and was able to monitor Vault using Prometheus with bearer token auth. As far as I'm concerned it looks almost ready to go.
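
For anyone wanting to reproduce that setup, a scrape configuration along these lines should work (the target address and token path are placeholders; the format=prometheus parameter matches the query parameter handled by this PR):

scrape_configs:
  - job_name: 'vault'
    metrics_path: '/v1/sys/metrics'
    params:
      format: ['prometheus']
    scheme: 'https'
    bearer_token_file: '/etc/prometheus/vault-token'
    static_configs:
      - targets: ['vault.example.com:8200']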

Thanks for doing this by the way, I love Prometheus and look forward to being able to monitor Vault with it!

@evanmcclure commented Dec 14, 2018

This is awesome! This feature has been on my personal wishlist for a while. I'm glad this is coming in soon.

@briankassouf (Contributor):

I think we should leave this behind token authentication. There currently isn't a good way to make sure an endpoint is only accessed via one of many server listener configs. Additionally, as @ncabatoff said, there are many helper libraries that can keep a token renewed for you (see Consul Template). Alternatively, you could create a token with a very long TTL and only give it ACL permissions for the metrics endpoint.
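
For illustration, such a narrowly scoped policy would look roughly like this (assuming the endpoint ends up at sys/metrics, per the PR title):

path "sys/metrics" {
  capabilities = ["read"]
}

A token created with only this policy and a long TTL (or periodic renewal) could then be handed to the scraper.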

The main change still required in this PR is to move the endpoint into the logical system backend. This will move it into an authenticated section of Vault's API. As it stands right now the endpoint is still unauthenticated. See vault/logical_system.go

@jefferai (Member):

Yeah, at HashiConf I was saying that putting this behind a token was a better idea if possible, which it seems it is.

@uepoch (Contributor, Author) commented Dec 18, 2018

Fine by me :)
Updating it to move the handler behind the token wall then.
(edit: nvm, found it)

@uepoch force-pushed the prometheus branch 2 times, most recently from 2ccd4d8 to f9817ff, on December 20, 2018
@uepoch (Contributor, Author) commented Dec 20, 2018

OK, it was nastier than I imagined to put it in the Sys backend, but it works as expected.

I added a passthrough header for "Accept" in the Sys backend to support the OpenMetrics format, as requested in the comments.
I think it makes the patch a bit too intrusive, but I'll let you decide whether or not we should remove it :)

@briankassouf added this to the 1.0.3 milestone on Jan 7, 2019
@hashicorp-cla commented Jan 15, 2019

CLA assistant check
All committers have signed the CLA.

@ncabatoff (Contributor):

Content-Length is automatically handled by Go's net/http handlers AFAIK; by default it's only added on small responses, but I may be wrong on this one.

I was asking for this based on comparison with promhttp.HandlerFor(), but it looks like they've stopped setting content length in recent versions. I retract my request.

@ncabatoff previously approved these changes on Feb 1, 2019

acceptHeaders := req.Headers["Accept"]
if format == "prometheus" || (len(acceptHeaders) > 0 && strings.HasPrefix(acceptHeaders[0], "application/openmetrics-text")) {
	if !b.Core.prometheusEnabled {
Review comment (Member):

What's the reason for having this guard? It's an ACL'd call so presumably if you're allowed to access it you should be allowed to get that data in whatever format you want.

@uepoch (Contributor, Author):

I used the same logic as the other telemetry backends: it's disabled by default unless you specify a valid retention in the telemetry config.

Do you feel we should add a default value and enable it for everybody? @jefferai @ncabatoff

Review comment (Member):

I used the same logic as the other telemetry backends: it's disabled by default unless you specify a valid retention in the telemetry config.

Right, but then unlike the other types you're plumbing a value all the way through core. Prometheus is just a format; I don't really see a reason to do this. If someone has access to the metrics, it seems like they ought to be able to fetch them in whatever format they want.

@uepoch (Contributor, Author):

Fair point, will modify to set a default value

@jefferai (Member) commented Feb 1, 2019

We're going to 1.0.3 a bit earlier than expected so we're moving this to the 1.1 beta, which should be in a couple of weeks.

@jefferai modified the milestones: 1.0.3, 1.1 on Feb 1, 2019
Review comments (resolved) on command/server.go and command/server/config.go.
Expiration: telConfig.PrometheusRetentionTime,
}

sink, err := prometheus.NewPrometheusSinkFrom(prometheusOpts)
Review comment (Member):

Wouldn't you only want to do this if the retention time is not zero? That way it'd gate it based on configuration existing, like the other types.

@uepoch (Contributor, Author):

Sorry, I may have misunderstood your previous comment then :(
You mentioned it might be better if users have all formats available with no gates, so right now specifying a 0 value for the Prometheus retention returns an error, with the default being 24h retention.

Would you prefer that the feature still be disabled when an explicit 0 is used?
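
To make the gating question concrete, here is a minimal sketch (not the PR's actual code) of wiring up the sink only when a retention time is configured, using the go-metrics API from the snippet above; the retention value is hardcoded for brevity:

package main

import (
	"log"
	"time"

	metrics "github.com/armon/go-metrics"
	"github.com/armon/go-metrics/prometheus"
)

func main() {
	// Retention as it would come from the parsed telemetry config;
	// zero would mean "not configured" under the gating being discussed.
	retention := 24 * time.Hour

	// The in-memory sink is always present, as in the PR description.
	inm := metrics.NewInmemSink(10*time.Second, time.Minute)
	fanout := metrics.FanoutSink{inm}

	// Only add the Prometheus sink when a retention time is configured,
	// mirroring how the other sink types are gated on their config keys.
	if retention != 0 {
		sink, err := prometheus.NewPrometheusSinkFrom(prometheus.PrometheusOpts{
			Expiration: retention,
		})
		if err != nil {
			log.Fatal(err)
		}
		fanout = append(fanout, sink)
	}

	if _, err := metrics.NewGlobal(metrics.DefaultConfig("vault"), fanout); err != nil {
		log.Fatal(err)
	}
}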

Review comments (resolved) on vault/logical_system.go.
@briankassouf changed the base branch from master to 1.1-beta on February 14, 2019
@briankassouf (Contributor) left a review:

Looks great! Thanks for all the hard work on this!

@briankassouf merged commit 5dd50ef into hashicorp:1.1-beta on Feb 14, 2019
seanmalloy added a commit to seanmalloy/vault that referenced this pull request on Mar 23, 2019:
Prometheus metrics were added as part of the Vault v1.1.0 release in PR hashicorp#5308, but no documentation was created. Adds the telemetry configuration docs and the API docs.
kalafut pushed a commit that referenced this pull request on Mar 23, 2019:
Prometheus metrics were added as part of the Vault v1.1.0 release in PR #5308, but no documentation was created. Adds the telemetry configuration docs and the API docs.
@martinssipenko (Contributor) commented Apr 1, 2019

@uepoch Are metrics shared in cluster mode or should one scrape all cluster instances?

@ncabatoff (Contributor):

@uepoch Are metrics shared in cluster mode or should one scrape all cluster instances?

Metrics are internal to each process and not shared, so one should scrape all cluster instances.

In OSS, most metrics on standby nodes will be irrelevant (notable exceptions being HA- and seal-related metrics). They should still be scraped; otherwise, if the active node fails and a standby becomes primary, your metrics will cease to be relevant. Similarly, if a standby stops responding to scrape requests, you know your HA solution has become less highly available.
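
In scrape-config terms, that just means listing every node under targets in the earlier example (hostnames below are placeholders):

    static_configs:
      - targets:
          - 'vault-0.example.com:8200'
          - 'vault-1.example.com:8200'
          - 'vault-2.example.com:8200'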

@martinssipenko (Contributor):

@ncabatoff I was not able to find any HA/Seal metrics in the output. Am I missing something?

@ncabatoff (Contributor):

@ncabatoff I was not able to find any HA/Seal metrics in the output. Am I missing something?

Seal metrics were only just added in #6478.

Unfortunately due to the way we write Prometheus metrics they're not persistent, so if nothing has happened recently in a given domain you won't see any metrics for that domain.

The HA summary metrics I'm thinking of include vault_core_step_down, vault_core_leadership_lost, and vault_core_leadership_setup_failed. Try killing your active node (in a test env!) and see what happens.
