
fix deadlock on core state lock #10456

Merged
merged 3 commits into hashicorp:master on Dec 10, 2020

Conversation

sbunce
Contributor

@sbunce sbunce commented Nov 25, 2020

The most important part of evaluating this change is to understand the invariant that if grabLockOrStop returns stopped=true then the lock is not supposed to be held. The original code violated that invariant for some concurrent executions. The included unit test demonstrates that it was possible for grabLockOrStop to return stopped=true while the lock was still held. You can see the unit test fail with the old code by copying the unit test and running it against the old code.

The first solution I tried was to add a sync.WaitGroup so the grabLockOrStop goroutine waits for the child goroutine to exit. This created an extra delay when the stop channel was closed which upset a lot of unit tests. I mention this because it helps to explain the required behavior of the grabLockOrStop function, which is subtle.

The second solution is to introduce a lock and some locked state around which the parent and child can handle the different concurrent executions.

Concurrent Executions (see the sketch after this list):

  1. The parent gets stopped (by stopCh) and returns before the child is done locking. To handle this, the parent sets parentWaiting=false. The child will see that the parent isn't waiting and unlocks the lock.
  2. The parent gets stopped (by stopCh) after the child has locked the lock. To handle this, the parent notices that the child set locked=true, so it knows that it has to unlock before returning stopped=true.
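
A minimal sketch of that coordination, assuming a grabLockOrStop(lockFunc, unlockFunc func(), stopCh chan struct{}) signature and the parentWaiting/locked names from the description above (the actual change in vault/ha.go may differ in detail):

// Requires "sync" from the standard library.
func grabLockOrStop(lockFunc, unlockFunc func(), stopCh chan struct{}) (stopped bool) {
	// doneCh is closed by the child goroutine once it has acquired the lock.
	doneCh := make(chan struct{})

	// l protects parentWaiting and locked so the parent and the child agree
	// on who is responsible for releasing the lock.
	var l sync.Mutex
	parentWaiting := true
	locked := false

	go func() {
		defer close(doneCh)
		lockFunc()
		l.Lock()
		defer l.Unlock()
		if parentWaiting {
			// Normal path: the parent (and ultimately the caller) will
			// release the lock.
			locked = true
		} else {
			// Execution 1: the parent already returned stopped=true, so
			// nobody else will ever unlock; release the lock here.
			unlockFunc()
		}
	}()

	select {
	case <-doneCh:
		// The lock was acquired before stopCh closed; the caller holds it.
		return false
	case <-stopCh:
		l.Lock()
		defer l.Unlock()
		parentWaiting = false
		if locked {
			// Execution 2: the child locked before we noticed stopCh, so
			// unlock here to preserve the invariant that stopped=true means
			// the lock is not held.
			unlockFunc()
		}
		return true
	}
}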

@hashicorp-cla

hashicorp-cla commented Nov 25, 2020

CLA assistant check
All committers have signed the CLA.

@ncabatoff
Contributor

A comment in grabLockOrStop indicates that the function is only used when the stateLock is held, but grabLockOrStop is
being used to acquire the stateLock.

I think you misread the comment. It says:

// If stopCh has been closed, which only happens while the
// stateLock is held, we have actually terminated, so we just
// instantly give up the lock, otherwise we notify that it's ready
// for consumption

Although in practice we only use this function for the stateLock, it could be used for others, so I would probably rewrite this comment as:

// stopCh should only be closed by the holder of a write lock on the underlying mutex
// the caller is trying to acquire a lock on (regardless of whether lockFunc is a read
// or write lock function). If stopCh is closed, we give up, release the lock if we managed
// to acquire it, and return true. Otherwise we block until we acquire the lock and return false.
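
For example, a caller wanting a cancelable read lock would look roughly like this (a sketch; exact names per vault/ha.go):

// Attempt a cancelable read lock on c.stateLock.
if stopped := grabLockOrStop(c.stateLock.RLock, c.stateLock.RUnlock, stopCh); stopped {
	// stopCh was closed first: we do not hold the lock and must not unlock it.
	return
}
// We hold the read lock and are responsible for releasing it.
defer c.stateLock.RUnlock()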

@sbunce sbunce marked this pull request as ready for review November 26, 2020 02:02
@sbunce
Contributor Author

sbunce commented Nov 30, 2020

@ncabatoff

I refactored the function comment to make it clearer. It looks like what the grabLockOrStop function accomplishes is making sync.Mutex.Lock cancelable. A better way to accomplish this would be to switch c.stateLock to be a channel buffered to 1, but that'd be a much larger change.
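
A rough sketch of that buffered-channel idea (illustrative only; not what this PR implements):

// A mutex built from a channel buffered to 1: lock = send, unlock = receive.
type chanLock chan struct{}

func newChanLock() chanLock { return make(chanLock, 1) }

// lockOrStop blocks until the lock is acquired or stopCh is closed, returning
// stopped=true (without holding the lock) in the latter case.
func (c chanLock) lockOrStop(stopCh <-chan struct{}) (stopped bool) {
	select {
	case c <- struct{}{}:
		return false
	case <-stopCh:
		return true
	}
}

func (c chanLock) unlock() { <-c }

This is inherently cancelable, but it gives up sync.RWMutex features such as shared read locks, which is part of why switching c.stateLock would be a much larger change.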

Our vault enterprise deployment has been getting deadlocked because c.stateLock has an exclusive lock acquired but never released. I'm not sure if this issue caused it, but the original grabLockOrStop code is incorrect because it can return stopped=true and leave the lock held randomly, depending on the timing of when stopCh is closed. It doesn't look like usages of grabLockOrStop on c.stateLock in the open source code would cause this issue. I wonder if there's code in vault enterprise that's acquiring c.stateLock with grabLockOrStop and closing stopCh concurrently?

@ncabatoff
Contributor

I wonder if there's code in vault enterprise that's acquiring c.stateLock with grabLockOrStop and closing stopCh concurrently?

Yeah, that's plausible - at least I can't rule it out yet. Do you by any chance have a goroutine dump you could share from when the deadlocking was happening?

@sbunce
Contributor Author

sbunce commented Nov 30, 2020

This is a 33MiB file that contains stack traces from our deadlocked vault enterprise 1.5.5 server.
https://drive.google.com/file/d/1V1GqHDOw2LTr8NHBLaCt0GEegZmclk5z/view?usp=sharing

Here are my main learnings from it.

There are thousands of goroutines blocked (for over 100 minutes) on this line trying to acquire a read lock on c.stateLock. These are coming from a request forwarding HTTP handler that's checking if the vault instance is a performance standby. So every HTTP request that came in leaked a goroutine because of this deadlock.

c.stateLock.RLock()

goroutine 12126102 [semacquire, 111 minutes]:
sync.runtime_SemacquireMutex(0xc0003e7a3c, 0xc037023300, 0x0)
	/goroot/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
	/goroot/src/sync/rwmutex.go:50
github.com/hashicorp/vault/vault.(*Core).PerfStandby(0xc0003e7800, 0xc000a79a40)
	/gopath/src/github.com/hashicorp/vault/vault/ha.go:69 +0x9f
github.com/hashicorp/vault/http.handleRequestForwarding.func1(0x44197a0, 0xc036e02380, 0xc036e84600)
	/gopath/src/github.com/hashicorp/vault/http/handler.go:665 +0x56
net/http.HandlerFunc.ServeHTTP(0xc0005b9120, 0x44197a0, 0xc036e02380, 0xc036e84600)
	/goroot/src/net/http/server.go:2041 +0x44
<truncated uninteresting stuff>

There is one goroutine blocked (for over 100 minutes) on this line trying to acquire a write lock on c.stateLock.

lockFunc()

goroutine 12123671 [semacquire, 111 minutes]:
sync.runtime_SemacquireMutex(0xc0003e7a38, 0x0, 0x0)
	/goroot/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).Lock(0xc0003e7a30)
	/goroot/src/sync/rwmutex.go:103 +0x88
github.com/hashicorp/vault/vault.grabLockOrStop.func1(0xc028ae8580, 0xc00046a7e0, 0xc028ae8590, 0xc00564f980)
	/gopath/src/github.com/hashicorp/vault/vault/ha.go:637 +0x27
created by github.com/hashicorp/vault/vault.grabLockOrStop
	/gopath/src/github.com/hashicorp/vault/vault/ha.go:635 +0x96

Because we have a bunch of goroutines blocked on c.stateLock.RLock() and one blocked on c.stateLock.Lock(), it implies that an exclusive lock is already held on c.stateLock. Because the goroutines have been blocked for 100+ minutes, it implies a deadlock.

I also searched for stack traces stuck in critical sections when c.stateLock.Lock was held to check for a deadlock caused by c.stateLock.Lock being called and then getting hung on something before it could unlock. I didn't find any stack traces hung in one of these critical sections though. This makes it look like there was a c.stateLock.Lock that wasn't matched by a c.stateLock.Unlock.

The race condition with grabLockOrStop could cause there to be a c.stateLock.Lock that's not matched by a c.stateLock.Unlock, if grabLockOrStop returns stopped=true after locking c.stateLock and not unlocking it (that's the bug). It doesn't look like any of the callers in the open source code would trigger this bug which makes me wonder if there's something in the vault enterprise code.

It'd also be good to look for a c.stateLock.Lock in the vault enterprise code to see if it's not matched with a c.stateLock.Unlock.

The bug is in the grabLockOrStop function. For specific concurrent
executions the grabLockOrStop function can return stopped=true when
the lock is still held. A comment in grabLockOrStop indicates that the
function is only used when the stateLock is held, but grabLockOrStop is
being used to acquire the stateLock. If there are concurrent goroutines
using grabLockOrStop then some concurrent executions result in
stopped=true being returned when the lock is acquired.

The fix is to add a lock and some state around which the parent and
child goroutine in the grabLockOrStop function can coordinate so that
the different concurrent executions can be handled.

This change includes a non-deterministic unit test which reliably
reproduces the problem before the fix.
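
A rough sketch of the shape of such a test (hypothetical names; the actual test is in vault/ha_test.go and differs in detail):

// Requires "math/rand", "sync", "testing" and "time".
func TestGrabLockOrStop(t *testing.T) {
	var lock sync.RWMutex
	for i := 0; i < 1000; i++ {
		stopCh := make(chan struct{})
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Close the stop channel roughly half the time so the stop path
			// races against the lock-acquired path.
			if rand.Int31()%2 == 0 {
				close(stopCh)
			}
		}()
		stopped := grabLockOrStop(lock.Lock, lock.Unlock, stopCh)
		wg.Wait()
		if !stopped {
			// We hold the lock; release it and continue.
			lock.Unlock()
			continue
		}
		// Invariant under test: stopped=true means the lock is NOT held.
		// Verify by re-acquiring with a timeout so a leaked lock fails the
		// test instead of hanging it.
		acquired := make(chan struct{})
		go func() {
			lock.Lock()
			lock.Unlock()
			close(acquired)
		}()
		select {
		case <-acquired:
		case <-time.After(time.Second):
			t.Fatal("grabLockOrStop returned stopped=true but left the lock held")
		}
	}
}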
vault/ha_test.go Outdated
go func() {
defer closerWg.Done()
// Close the stop channel half the time.
if time.Now().UnixNano() % 2 == 0 {
Contributor

When I tested this by adding an atomic to count how often the close path was taken, I found it was always taken. I suggest instead using something like rand.Int31().

@ncabatoff
Contributor

Hi @sbunce,

I've gone over the new code carefully and I agree that this should fix the issue you found. The test is good too, modulo a minor comment I added. Since this is such a critical func I'm going to have someone else look at it too. Thanks very much for the contribution!

@ncabatoff ncabatoff added this to the 1.6.1 milestone Dec 8, 2020
Using time.Now().UnixNano()%2 ends up being system dependent because
different operating systems and hardware have different clock
resolution. A lower resolution will return the same unix time for a
longer period of time.

It is better to avoid this issue by using a random number generator.
This change uses the rand package default random number generator. It's
generally good to avoid using the default random number generator,
because it creates extra lock contention. For a test it should be fine.
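
For reference, outside of tests one way to avoid that contention is to give each goroutine its own generator instead of the package-level one (a sketch, not part of this change; stopCh as in the test above):

// Each goroutine gets its own *rand.Rand, so there is no contention on the
// global source's internal mutex. A *rand.Rand is not safe for concurrent use,
// which is why it is kept per goroutine.
r := rand.New(rand.NewSource(time.Now().UnixNano()))
if r.Int31()%2 == 0 {
	close(stopCh)
}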
@sbunce
Contributor Author

sbunce commented Dec 8, 2020

On my system the close is only done roughly 1/2 the time. It looks like what's happening is that time.Now().UnixNano() returns the same value on a bunch of subsequent calls depending on the resolution of the clock, which is dependent on operating system and hardware.

Your idea to use a random number generator is definitely better because it does not have this problem. I switched to that and verified it worked by temporarily putting some counters in.

Good catch!

Contributor

@briankassouf briankassouf left a comment

The code changes look great to me, thanks for the in-depth description of the issue and tracking down this tricky deadlock.

It's nice that we don't have to worry about this restriction anymore, making the helper more useful overall:

If stopCh has been closed, which only happens while the stateLock is held

@ncabatoff ncabatoff merged commit 0998d51 into hashicorp:master Dec 10, 2020
ncabatoff pushed a commit that referenced this pull request Dec 10, 2020
ncabatoff added a commit that referenced this pull request Dec 10, 2020
ncabatoff pushed a commit that referenced this pull request Mar 31, 2021
mladlow added a commit that referenced this pull request Apr 12, 2021