
Prometheus support on v1/sys/metrics endpoint #5308

Merged: 12 commits from uepoch:prometheus into hashicorp:1.1-beta on Feb 14, 2019

Conversation

@uepoch (Contributor) commented Sep 10, 2018

Add support for a Prometheus sink and InMemSink, as described in #5223.

This adds support for a Prometheus sink in the telemetry config, enabled by specifying a prometheus_retention_time key in the telemetry configuration, similar to Consul or Nomad.

To pass the information through the config and the server command to the HTTP handler, it expands HandlerProps to carry the in-memory sink and the optional Prometheus retention time.
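
For illustration, the resulting telemetry stanza would look roughly like this (the retention value is just an example; disable_hostname is shown because of the go-metrics hostname issue mentioned below):

telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
}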

I also made the following choices:

  • Parse the key from a duration string, since nanoseconds are hard for humans to read and typical Prometheus retention times are expressed in seconds.
  • Enable Prometheus in dev mode.
  • Vendor the Prometheus library (a large vendoring change).

This does not forward anything to the leader; each node has its own sinks.

@uepoch (Contributor, Author) commented Sep 24, 2018

@jefferai, sorry for the ping; do you think this is enough for #5223?
We've been using it in our infra for 1-2 weeks with no big problems, except for go-metrics hostname handling when you're not passing disable_hostname in the telemetry config (hashicorp/go-metrics#83).

@jefferai (Member):

I have not looked at the code, but one thing that was surfaced to me by a team member is that this uses an unauthenticated endpoint. The endpoint must be authenticated.

@uepoch (Contributor, Author) commented Sep 25, 2018

Prometheus does not support passing arbitrary headers (hence no X-Vault-Token).

After searching, I found that a PR has been made on Consul to comply with RFC 6750 and allow Authorization: Bearer headers to be used in place of X-Consul-Token:
hashicorp/consul#4502
Do you think it's something you'd want to support in the future?

@jefferai (Member):

Makes sense to me; @chrishoffman @briankassouf thoughts?

@jefferai (Member):

Note for the team: if we support this we need to be sure to blacklist Authorization from being a passthrough request header.

@briankassouf (Contributor):

Sounds good to me too

@uepoch (Contributor, Author) commented Nov 12, 2018

Hello again, small update:

So #5397 has been merged, so I guess it could be used.

However, when we discussed this at HashiConf, @jefferai, you mentioned that metrics information is sensitive and so should not be exposed to the wild, and we talked about maybe putting it behind a separate listener with CIDR limitations. Would that still be OK?

This would ease integration with third-party monitoring tools.

@jefferai (Member):

I think if it's behind a token additional CIDR restrictions aren't necessary, especially given you can pop CIDR restrictions on tokens as well. @briankassouf agree?

@briankassouf (Contributor):

Yep, now that it is behind a token we should be okay to allow these to be exported through the API

@uepoch (Contributor, Author) commented Dec 12, 2018

The point of binding to another listener was to avoid leaking potentially sensitive information (since one may consider Vault's metrics sensitive) while still being usable by a regular monitoring system without having to authenticate.
It's been a while since we talked about it, tbh, so I'm not 100% clear on the details, but it seemed acceptable to you to leave the endpoint unauthenticated if it were behind a separate listener for which the user can provide CIDR limitations.

It's not impossible for Prometheus to hit the endpoint with a Vault token, but it would complicate existing setups for all users, as Prometheus has no logic of its own to renew a provided token.

What do you think @jefferai @briankassouf ?

Review comments (resolved) on command/server.go, http/sys_metrics.go, and command/server/config.go.
@ncabatoff (Contributor):

Prometheus may not have a built-in way to renew a provided token, but it has the bearer_token_file config option, so something external like a cron job can handle renewal.

Why not proceed with the existing PR under the assumption that the token is required, and use another issue for the tokenless CIDR-restricted listener idea? That way we can at least get the work you've already done merged, rather than have it blocked by a more contentious extra feature.

I tried applying the changes on the tip of master and was able to monitor Vault using Prometheus with bearer token auth. As far as I'm concerned it looks almost ready to go.
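
For anyone wanting to reproduce that setup, a scrape configuration along these lines should work (the target address and token path are placeholders; the format=prometheus parameter matches the query parameter handled by this PR):

scrape_configs:
  - job_name: 'vault'
    metrics_path: '/v1/sys/metrics'
    params:
      format: ['prometheus']
    scheme: 'https'
    bearer_token_file: '/etc/prometheus/vault-token'
    static_configs:
      - targets: ['vault.example.com:8200']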

Thanks for doing this by the way, I love Prometheus and look forward to being able to monitor Vault with it!

@evanmcclure commented Dec 14, 2018

This is awesome! This feature has been on my personal wishlist for a while. I'm glad this is coming in soon.

@briankassouf (Contributor):

I think we should leave this behind token authentication. There currently isn't a good way to make sure an endpoint is only accessed via one of many server listener configs. Additionally, as @ncabatoff said, there are many helper libraries that can keep a token renewed for you (see Consul Template). Alternatively, you could create a token with a very long TTL and only give it ACL permissions for the metrics endpoint.
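
For illustration, such a narrowly scoped policy would look roughly like this (assuming the endpoint ends up at sys/metrics, per the PR title):

path "sys/metrics" {
  capabilities = ["read"]
}

A token created with only this policy and a long TTL (or periodic renewal) could then be handed to the scraper.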

The main change still required in this PR is to move the endpoint into the logical system backend. This will move it into an authenticated section of Vault's API. As it stands right now the endpoint is still unauthenticated. See vault/logical_system.go

@jefferai (Member):

Yeah, at HashiConf I was saying that putting this behind a token was a better idea if possible, which it seems it is.

@uepoch (Contributor, Author) commented Dec 18, 2018

Fine by me :)
Updating it to move the handler behind the token wall then.
(edit: nvm, found it)

@uepoch force-pushed the prometheus branch 2 times, most recently from 2ccd4d8 to f9817ff, on December 20, 2018
@uepoch (Contributor, Author) commented Dec 20, 2018

OK, it was nastier than I imagined to put it in the Sys backend, but it works as expected.

I added a passthrough header for "Accept" in the Sys backend to support the OpenMetrics format, as requested in the comments.
I think it makes the patch a bit too intrusive, but I'll let you decide whether or not we should remove it :)

@briankassouf added this to the 1.0.3 milestone on Jan 7, 2019
@hashicorp-cla commented Jan 15, 2019

CLA assistant check
All committers have signed the CLA.

@ncabatoff (Contributor):

Content-Length is automatically handled by Go's net/http handlers AFAIK; by default it's only added on small responses, but I may be wrong on this one.

I was asking for this based on comparison with promhttp.HandlerFor(), but it looks like they've stopped setting content length in recent versions. I retract my request.

@ncabatoff previously approved these changes on Feb 1, 2019

acceptHeaders := req.Headers["Accept"]
if format == "prometheus" || (len(acceptHeaders) > 0 && strings.HasPrefix(acceptHeaders[0], "application/openmetrics-text")) {
	if !b.Core.prometheusEnabled {
Review comment (Member):

What's the reason for having this guard? It's an ACL'd call so presumably if you're allowed to access it you should be allowed to get that data in whatever format you want.

@uepoch (Contributor, Author):

I used the same logic as the other telemetry backends: it's disabled by default unless you specify a valid retention in the telemetry config.

Do you feel we should add a default value and enable it for everybody? @jefferai @ncabatoff

Review comment (Member):

I used the same logic as the other telemetry backends: it's disabled by default unless you specify a valid retention in the telemetry config.

Right, but then unlike the other types you're plumbing a value all the way through core. Prometheus is just a format; I don't really see a reason to do this. If someone has access to the metrics, it seems like they ought to be able to fetch them in whatever format they want.

@uepoch (Contributor, Author):

Fair point, will modify to set a default value

@jefferai (Member) commented Feb 1, 2019

We're going to 1.0.3 a bit earlier than expected so we're moving this to the 1.1 beta, which should be in a couple of weeks.

@jefferai modified the milestones: 1.0.3, 1.1 on Feb 1, 2019
Review comments (resolved) on command/server.go and command/server/config.go.
Expiration: telConfig.PrometheusRetentionTime,
}

sink, err := prometheus.NewPrometheusSinkFrom(prometheusOpts)
Review comment (Member):

Wouldn't you only want to do this if the retention time is not zero? That way it'd gate it based on configuration existing, like the other types.

@uepoch (Contributor, Author):

Sorry, I may have misunderstood your previous comment then :(
You mentioned it might be better if users have all formats available with no gates, so right now specifying a 0 value for the Prometheus retention returns an error, with the default being 24h retention.

Would you prefer that the feature still be disabled when an explicit 0 is used?
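
To make the gating question concrete, here is a minimal sketch (not the PR's actual code) of wiring up the sink only when a retention time is configured, using the go-metrics API from the snippet above; the retention value is hardcoded for brevity:

package main

import (
	"log"
	"time"

	metrics "github.com/armon/go-metrics"
	"github.com/armon/go-metrics/prometheus"
)

func main() {
	// Retention as it would come from the parsed telemetry config;
	// zero would mean "not configured" under the gating being discussed.
	retention := 24 * time.Hour

	// The in-memory sink is always present, as in the PR description.
	inm := metrics.NewInmemSink(10*time.Second, time.Minute)
	fanout := metrics.FanoutSink{inm}

	// Only add the Prometheus sink when a retention time is configured,
	// mirroring how the other sink types are gated on their config keys.
	if retention != 0 {
		sink, err := prometheus.NewPrometheusSinkFrom(prometheus.PrometheusOpts{
			Expiration: retention,
		})
		if err != nil {
			log.Fatal(err)
		}
		fanout = append(fanout, sink)
	}

	if _, err := metrics.NewGlobal(metrics.DefaultConfig("vault"), fanout); err != nil {
		log.Fatal(err)
	}
}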

Review comments (resolved) on vault/logical_system.go.
@briankassouf changed the base branch from master to 1.1-beta on February 14, 2019
@briankassouf (Contributor) left a review:

Looks great! Thanks for all the hard work on this!

@briankassouf merged commit 5dd50ef into hashicorp:1.1-beta on Feb 14, 2019
seanmalloy added a commit to seanmalloy/vault that referenced this pull request on Mar 23, 2019:
Prometheus metrics were added as part of the Vault v1.1.0 release in PR hashicorp#5308, but no documentation was created. Adds the telemetry configuration docs and the API docs.
kalafut pushed a commit that referenced this pull request on Mar 23, 2019:
Prometheus metrics were added as part of the Vault v1.1.0 release in PR #5308, but no documentation was created. Adds the telemetry configuration docs and the API docs.
@martinssipenko (Contributor) commented Apr 1, 2019

@uepoch Are metrics shared in cluster mode or should one scrape all cluster instances?

@ncabatoff (Contributor):

@uepoch Are metrics shared in cluster mode or should one scrape all cluster instances?

Metrics are internal to each process and not shared, so one should scrape all cluster instances.

In OSS, most metrics on standby nodes will be irrelevant (notable exceptions being HA- and seal-related metrics). They should still be scraped; otherwise, if the active node fails and a standby becomes primary, your metrics will cease to be relevant. Similarly, if a standby stops responding to scrape requests, you know your HA solution has become less highly available.
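
In scrape-config terms, that just means listing every node under targets in the earlier example (hostnames below are placeholders):

    static_configs:
      - targets:
          - 'vault-0.example.com:8200'
          - 'vault-1.example.com:8200'
          - 'vault-2.example.com:8200'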

@martinssipenko (Contributor):

@ncabatoff I was not able to find any HA/Seal metrics in the output. Am I missing something?

@ncabatoff (Contributor):

@ncabatoff I was not able to find any HA/Seal metrics in the output. Am I missing something?

Seal metrics were only just added in #6478.

Unfortunately due to the way we write Prometheus metrics they're not persistent, so if nothing has happened recently in a given domain you won't see any metrics for that domain.

The HA summary metrics I'm thinking of include vault_core_step_down, vault_core_leadership_lost, and vault_core_leadership_setup_failed. Try killing your active node (in a test env!) and see what happens.
