
Deadlock after a leader election. #11436

Closed
etherandrius opened this issue Apr 21, 2021 · 7 comments

Comments


etherandrius commented Apr 21, 2021

Describe the bug
A post-election task failed on a newly elected leader. This resulted in a cluster-wide outage, which did not self-resolve. The faulty leader had to be terminated manually.

To Reproduce
I was not able to reproduce the issue

Expected behavior
Either the post-election task should have self-recovered, or leadership should have been taken over by another Vault instance.

Environment:

$ vault status
Key                    Value
---                    -----
Seal Type              shamir
Initialized            true
Sealed                 false
Total Shares           5
Threshold              3
Version                1.6.3
Storage Type           postgresql
Cluster Name           vault-cluster-69fd2ba1
Cluster ID             19039e5d-99da-5d8b-bf4a-d8e9b2c31ead
HA Enabled             true
HA Cluster             https://10.0.1.90:8201
HA Mode                standby
Active Node Address    https://10.0.1.90:8200
$ vault version
Vault v1.6.3 (b540be4b7ec48d0dd7512c8d8df9399d6bf84d76)
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic

$ uname -m
x86_64

Vault server configuration file(s):

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/vault/config/cert.pem"
  tls_key_file  = "/vault/config/key.pem"
  tls_client_ca_file = "/vault/config/ca.pem"
}

cluster_addr = "https://10.0.1.90:8201"
api_addr     = "https://10.0.1.90:8200"

telemetry {
  dogstatsd_addr = "localhost:8125"
}

max_lease_ttl = "87600h"

plugin_directory = "/vault/config/plugins"

Additional context
Logs from vault leader

INFO acquired lock, enabling active operation
INFO post-unseal setup starting
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]
WARN no TLS config found for ALPN: ALPN=[req_fw_sb-act_v1]

The number of goroutines was steadily increasing until the instance was terminated.

I was not able to run vault debug; it failed with the error Error during validation: unable to connect to server: context deadline exceeded.

I was not able to curl /v1/sys/pprof/goroutine; the connection hung for 20+ minutes before I canceled it.
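Roughly, the diagnostic attempts above amount to the following; the address is taken from the config above, and the exact flags and token handling are approximations rather than what was literally run:

# Attempted debug bundle against the hung active node; this failed with
# "Error during validation: unable to connect to server: context deadline exceeded"
vault debug -output=vault-debug.tar.gz

# Attempted goroutine dump directly from the pprof endpoint; this request hung
curl --header "X-Vault-Token: $VAULT_TOKEN" \
  https://10.0.1.90:8200/v1/sys/pprof/goroutine > goroutine.prof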

The issue seems similar to #11276 and #10456. However, we are running 1.6.3 and the bug was supposed to be fixed in 1.6.1.

So far we've only observed this once.

ncabatoff (Contributor) commented

Hi @etherandrius,

Thanks for reporting this. It's hard to say what happened without stack traces. If it happens again could you send a SIGUSR2 to the active node and find the stack traces in the logs?
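A minimal sketch of that, assuming the server binary is named vault and runs under systemd (both are assumptions about this environment):

# Ask the active Vault process to dump all goroutine stack traces into its log output
kill -USR2 "$(pidof vault)"

# If Vault runs under systemd, the traces can then be pulled from the journal
journalctl -u vault --since "10 minutes ago" > vault-goroutines.log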

etherandrius (Author) commented

@ncabatoff

I will.

As a follow-up to this, could we add functionality to send SIGUSR2 via vault debug in case the main routine fails?

ncabatoff (Contributor) commented

As a follow-up to this, could we add functionality to send SIGUSR2 via vault debug in case the main routine fails?

There's no reason to assume vault debug is running on the same host as the vault server; indeed it's typically the opposite in my experience.

vishalnayak (Member) commented

Closing the issue due to staleness.

Closing stale issues helps us keep the issue count down and the project healthy. Keeping the issue count under a manageable number helps us provide faster responses and better engagement with the community.

If you feel that the issue is still relevant, or if it is wrongly closed, please leave a comment and we'd be happy to reopen it.


s3than commented Jul 31, 2021

@vishalnayak I'd like to report that we've hit an exact replica of this error, running 1.7.2.


s3than commented Jul 31, 2021

To resolve it, we had to manually remove the instance with the increasing goroutine count, which cleared the deadlock.

ncabatoff (Contributor) commented

@s3than we're definitely interested in hearing more. I suggest you open a new bug and provide us with whatever details you have.
