
Deadlock after a leader election. #11436

Closed
etherandrius opened this issue Apr 21, 2021 · 7 comments

Comments


etherandrius commented Apr 21, 2021

Describe the bug
A post-election task failed on a newly elected leader. This resulted in a cluster-wide outage, which did not self-resolve. The faulty leader had to be terminated manually.

To Reproduce
I was not able to reproduce the issue

Expected behavior
Either the post-election task should have self-recovered, or leadership should have been taken over by another Vault instance.

Environment:

$ vault status
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           5
Threshold              3
Version                1.6.3
Storage Type           postgresql
Cluster Name           vault-cluster-69fd2ba1
Cluster ID             19039e5d-99da-5d8b-bf4a-d8e9b2c31ead
HA Enabled             true
HA Cluster             https://10.0.1.90:8201
HA Mode                standby
Active Node Address    https://10.0.1.90:8200
$ vault version
Vault v1.6.3 (b540be4b7ec48d0dd7512c8d8df9399d6bf84d76)
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic

$ uname -m
x86_64

Vault server configuration file(s):

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/vault/config/cert.pem"
  tls_key_file  = "/vault/config/key.pem"
  tls_client_ca_file = "/vault/config/ca.pem"
}

cluster_addr = "https://10.0.1.90:8201"
api_addr     = "https://10.0.1.90:8200"

telemetry {
  dogstatsd_addr = "localhost:8125"
}

max_lease_ttl = "87600h"

plugin_directory = "/vault/config/plugins"

Additional context
Logs from vault leader

INFO acquired lock, enabling active operation
INFO post-unseal setup starting
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]

The number of goroutines was steadily increasing until the instance was terminated.

I was not able to run vault debug; it failed with the error Error during validation: unable to connect to server: context deadline exceeded.

I was not able to curl /v1/sys/pprof/goroutine; the connection hung for 20+ minutes before I canceled it.
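Roughly, the diagnostic attempts above amount to the following; the address is taken from the config above, and the exact flags and token handling are approximations rather than what was literally run:

# Attempted debug bundle against the hung active node; this failed with
# "Error during validation: unable to connect to server: context deadline exceeded"
vault debug -output=vault-debug.tar.gz

# Attempted goroutine dump directly from the pprof endpoint; this request hung
curl --header "X-Vault-Token: $VAULT_TOKEN" \
  https://10.0.1.90:8200/v1/sys/pprof/goroutine > goroutine.prof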

The issue seems similar to #11276 and #10456. However, we are running 1.6.3 and the bug was supposed to be fixed in 1.6.1.

So far we've only observed this once.

ncabatoff (Contributor) commented

Hi @etherandrius,

Thanks for reporting this. It's hard to say what happened without stack traces. If it happens again could you send a SIGUSR2 to the active node and find the stack traces in the logs?
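A minimal sketch of that, assuming the server binary is named vault and runs under systemd (both are assumptions about this environment):

# Ask the active Vault process to dump all goroutine stack traces into its log output
kill -USR2 "$(pidof vault)"

# If Vault runs under systemd, the traces can then be pulled from the journal
journalctl -u vault --since "10 minutes ago" > vault-goroutines.log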

etherandrius (Author) commented

@ncabatoff

I will.

As a follow-up to this, could we add functionality to send SIGUSR2 via vault debug in case the main routine fails?

ncabatoff (Contributor) commented

As a follow-up to this, could we add functionality to send SIGUSR2 via vault debug in case the main routine fails?

There's no reason to assume vault debug is running on the same host as the vault server; indeed it's typically the opposite in my experience.

vishalnayak (Member) commented

Closing the issue due to staleness.

Closing stale issues helps us keep the issue count down and the project healthy. Keeping the issue count under a manageable number helps us provide faster responses and better engagement with the community.

If you feel that the issue is still relevant, or if it is wrongly closed, please leave a comment and we'd be happy to reopen it.


s3than commented Jul 31, 2021

@vishalnayak I'd like to report that we've hit an exact replica of this error, running 1.7.2.


s3than commented Jul 31, 2021

To resolve it, we had to manually remove the instance with the increasing goroutine count, which cleared the deadlock.

ncabatoff (Contributor) commented

@s3than we're definitely interested in hearing more. I suggest you open a new bug and provide us with whatever details you have.
