-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed ControllerUnpublish error handling #165
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jsafrane The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (I might not be aware of all consequences of this change, though)
pkg/controller/csi_handler_test.go
Outdated
@@ -266,6 +266,7 @@ func TestCSIHandler(t *testing.T) { | |||
var success error | |||
var readWrite = false | |||
var readOnly = true | |||
var ignored = false // the vlaue is irrelevant for given call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: vlaue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -88,11 +85,8 @@ func (a *attacher) Detach(ctx context.Context, volumeID string, nodeID string, s | |||
Secrets: secrets, | |||
} | |||
|
|||
_, err = client.ControllerUnpublishVolume(ctx, &req) | |||
if err != nil { | |||
return isFinalError(err), err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried this will break detach for plugins that interpret the CSI spec as such:
CSI spec says sp must return NotFound if volume or node is not found: https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerunpublishvolume-errors.
I guess here we were assuming if NotFound was returned, the detach is actually irrelevant/successful. Two scenarios to consider:
-
SP automatically handles volume or node not found as detached: no Unpublish call is actually needed to clean up. Should the sp go against the CSI spec and return success?
-
SP needs ControllerUnpublish called to clean up its state. Still a question of if it should return success or NotFound error afterwards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added code that interprets NotFound
as success and filled container-storage-interface/spec#373 to clarify it in the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on CSI spec meeting this morning, it sounds like there is agreement to relax the NotFound error code so that plugins can decide if the unpublish needs to be retried. Based on that, I think we should just retry for any error.
And given the fact that all of the csi plugins we've looked at so far are handling this wrong already, we probably should add an "ACTION REQUIRED" to the release note.
76d3701
to
679ffb8
Compare
pkg/attacher/attacher.go
Outdated
// This is not gRPC error. | ||
return err | ||
} | ||
if st.Code() == codes.NotFound { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about a case where SP
temporarily cannot "find" the volume or the disk, but the disk is still attached. This should be NOT_FOUND
error that could be remedied by retrying - disk is not actually detached yet.
I would argue that the SP
should return OK
if it has determined that the volume/node is gone in a way that we can assume the volume is detached. This should be up to the SP
to decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're assuming all volume plugins automatically detach volumes when a node is deleted, and I'm not confident that we can assume that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a reasonable compromise until this behaviour is clarified in spec. I do not know any volume type where attachment outlives lifecycle of node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry ignore my last comment, I misread SP as CO. We're on the same page :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way I think we are asking plugin authors to reinterpret spec in a way which was previously not documented.
Esp. for node deletion case - do we have a solution in the meanwhile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think vsphere may have some interesting behaviors around node failure/deletion handling. cc @codenrhoden @vladimirvivien
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afaik - in vsphere too, when you delete a node in vcenter, the vmdk files are marked as detached.
I checked AWS, it most probably returns |
Cinder: it returns |
Filed kubernetes/cloud-provider-openstack#718 for OpenStack / Cinder |
Any error from ControllerUnpublish can mean that a volume could be still attached (or being detached).
11ac689
to
452b089
Compare
I removed special handling of |
And updated release-note. |
/lgtm @davidz627, I and a few others discussed this, and we came to the conclusion that this change will require a major version bump because we're significantly changing behavior that may require drivers to change. |
At the same time it fixes pretty serious bug that should be fixed in all supported branches. Volumes are marked as detached after "final error" (e.g. I could backport safer version of this patch, marking a volume as detached after |
The behavior of the external-attacher recently changed (kubernetes-csi/external-attacher#165) such that it now treats "not found" as real error. The effect was that some Kubernetes E2E tests (like "CSI mock volume CSI workload information using mock driver should not be passed when podInfoOnMount=false") sometimes ran for over 2 minutes, just waiting for detatch. That the test then proceeds without marking the test as failed is a bug in the test cleanup code which will be fixed. This slowdown is not deterministic: sometimes the detach is done early enough while the volume still exists. With this change, the same test completes in under 30 seconds.
Any error from ControllerUnpublish can mean that a volume could be still attached (or being detached).
What type of PR is this?
/kind bug
Which issue(s) this PR fixes:
Fixes #164
Does this PR introduce a user-facing change?:
cc @kubernetes-csi/csi-misc