Fixed ControllerUnpublish error handling #165

jsafrane · 2019-08-06T08:37:00Z

Any error from ControllerUnpublish can mean that a volume could be still attached (or being detached).

What type of PR is this?
/kind bug

Which issue(s) this PR fixes:
Fixes #164

Does this PR introduce a user-facing change?:

Action required: processing of ControllerUnpublish errors has changed. CSI drivers SHALL return success (0) when a deleted node or volume implies that the volume is detached from the node. The external attacher now treats a NotFound error like any other error and assumes that the volume may still be attached to the node. Please check the behavior of your CSI driver and fix it accordingly.

cc @kubernetes-csi/csi-misc

k8s-ci-robot · 2019-08-06T08:37:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jsafrane]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bertinatto · 2019-08-06T12:27:10Z

/retest

bertinatto

LGTM (I might not be aware of all consequences of this change, though)

bertinatto · 2019-08-06T12:33:21Z

pkg/controller/csi_handler_test.go

@@ -266,6 +266,7 @@ func TestCSIHandler(t *testing.T) {
 	var success error
 	var readWrite = false
 	var readOnly = true
+	var ignored = false // the value is irrelevant for the given call


msau42 · 2019-08-06T15:09:39Z

pkg/attacher/attacher.go

@@ -88,11 +85,8 @@ func (a *attacher) Detach(ctx context.Context, volumeID string, nodeID string, s
 		Secrets:  secrets,
 	}

-	_, err = client.ControllerUnpublishVolume(ctx, &req)
-	if err != nil {
-		return isFinalError(err), err


I'm worried this will break detach for plugins that interpret the CSI spec as such:

The CSI spec says the SP must return NotFound if the volume or node is not found: https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerunpublishvolume-errors.

I guess here we were assuming if NotFound was returned, the detach is actually irrelevant/successful. Two scenarios to consider:

1. The SP automatically handles a missing volume or node as detached: no Unpublish call is actually needed to clean up. Should the SP go against the CSI spec and return success?

2. The SP needs ControllerUnpublish to be called to clean up its state. It is still an open question whether it should return success or a NotFound error afterwards.

I added code that interprets NotFound as success and filed container-storage-interface/spec#373 to clarify it in the spec.

Based on CSI spec meeting this morning, it sounds like there is agreement to relax the NotFound error code so that plugins can decide if the unpublish needs to be retried. Based on that, I think we should just retry for any error.

And given the fact that all of the csi plugins we've looked at so far are handling this wrong already, we probably should add an "ACTION REQUIRED" to the release note.

davidz627 · 2019-08-06T18:08:05Z

pkg/attacher/attacher.go

+			// This is not gRPC error.
+			return err
+		}
+		if st.Code() == codes.NotFound {


What about a case where the SP temporarily cannot "find" the volume or the disk, but the disk is still attached? This would be a NOT_FOUND error that could be remedied by retrying, since the disk is not actually detached yet.

I would argue that the SP should return OK if it has determined that the volume/node is gone in a way that we can assume the volume is detached. This should be up to the SP to decide.

We're assuming all volume plugins automatically detach volumes when a node is deleted, and I'm not confident that we can assume that.

I think it is a reasonable compromise until this behaviour is clarified in the spec. I do not know of any volume type where an attachment outlives the lifecycle of its node.

Sorry ignore my last comment, I misread SP as CO. We're on the same page :-)

Either way, I think we are asking plugin authors to reinterpret the spec in a way that was not previously documented.

Especially for the node deletion case: do we have a solution in the meantime?

I think vsphere may have some interesting behaviors around node failure/deletion handling. cc @codenrhoden @vladimirvivien

AFAIK, in vSphere too, when you delete a node in vCenter, the VMDK files are marked as detached.

jsafrane · 2019-08-07T18:09:35Z

I checked AWS: it most probably returns an Internal error code when either the node or the volume does not exist. Filed kubernetes-sigs/aws-ebs-csi-driver#330.

jsafrane · 2019-08-07T18:11:59Z

Cinder: it returns NotFound when node does not exist: https://github.com/kubernetes/cloud-provider-openstack/blob/1f7b6810a83357b4629ba9eb769f7a27c44f43bf/pkg/csi/cinder/controllerserver.go#L203-L207

jsafrane · 2019-08-07T18:23:30Z

Filed kubernetes/cloud-provider-openstack#718 for OpenStack / Cinder

jsafrane · 2019-08-08T07:57:52Z

I removed the special handling of NotFound. It is now handled like any other error, i.e. the volume is assumed to be still attached.

jsafrane · 2019-08-08T08:02:02Z

And updated release-note.

msau42 · 2019-08-08T21:42:36Z

/lgtm

@davidz627, I and a few others discussed this, and we came to the conclusion that this change will require a major version bump because we're significantly changing behavior that may require drivers to change.

jsafrane · 2019-08-09T08:16:43Z

> this change will require a major version bump because we're significantly changing behavior that may require drivers to change.

At the same time, it fixes a pretty serious bug that should be fixed in all supported branches: volumes are marked as detached after a "final error" (e.g. Internal due to a rate limit on GCE).

I could backport a safer version of this patch that marks a volume as detached after NotFound, i.e. the same behavior as in the released supported versions. All other errors would lead to a retry.
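The "safer" backport variant proposed here could look roughly like the following classification. This is a sketch under assumptions, not the backported patch itself: `Code`, `Action`, and `BackportAction` are hypothetical names, and the integer codes merely stand in for gRPC status codes.

```go
package main

import "fmt"

// Code is a stand-in for gRPC status codes.
type Code int

const (
	OK Code = iota
	NotFound
	Internal
	DeadlineExceeded
)

// Action is what the attacher does after a ControllerUnpublishVolume call.
type Action int

const (
	MarkDetached Action = iota
	Retry
)

// BackportAction keeps the released behavior for NotFound (treat the volume
// as detached) and retries on every other error, so that errors such as
// Internal no longer mark the volume as detached.
func BackportAction(c Code) Action {
	switch c {
	case OK, NotFound:
		return MarkDetached
	default:
		return Retry
	}
}

func main() {
	for _, c := range []Code{OK, NotFound, Internal, DeadlineExceeded} {
		fmt.Println(c, BackportAction(c) == MarkDetached)
	}
}
```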

The behavior of the external-attacher recently changed (kubernetes-csi/external-attacher#165) such that it now treats "not found" as a real error. The effect was that some Kubernetes E2E tests (like "CSI mock volume CSI workload information using mock driver should not be passed when podInfoOnMount=false") sometimes ran for over 2 minutes, just waiting for detach. That the test then proceeds without being marked as failed is a bug in the test cleanup code, which will be fixed. This slowdown is not deterministic: sometimes the detach is done early enough, while the volume still exists. With this change, the same test completes in under 30 seconds.

k8s-ci-robot added the release-note and kind/bug labels Aug 6, 2019

k8s-ci-robot requested review from msau42 and sbezverk August 6, 2019 08:37

k8s-ci-robot added the approved, cncf-cla: yes, and size/L labels Aug 6, 2019

jsafrane mentioned this pull request Aug 6, 2019

Add e2e test for CSI volume limits kubernetes/kubernetes#80247

Merged

bertinatto reviewed Aug 6, 2019

msau42 reviewed Aug 6, 2019

jsafrane mentioned this pull request Aug 6, 2019

Install CSI driver on openshift-test start. openshift/origin#23560

Merged

jsafrane force-pushed the fix-detach-error branch from 76d3701 to 679ffb8 Compare August 6, 2019 15:54

davidz627 reviewed Aug 6, 2019

davidz627 mentioned this pull request Aug 7, 2019

Clarify that plugin may return OK for ControllerUnpublish if node or volume not found container-storage-interface/spec#375

Merged

Fixed ControllerUnpublish error handling

452b089

Any error from ControllerUnpublish can mean that a volume could be still attached (or being detached).

jsafrane force-pushed the fix-detach-error branch from 11ac689 to 452b089 Compare August 8, 2019 07:56

k8s-ci-robot added the release-note-action-required label and removed the release-note label Aug 8, 2019

k8s-ci-robot assigned msau42 Aug 8, 2019

k8s-ci-robot added the lgtm label Aug 8, 2019

k8s-ci-robot merged commit f304b8b into kubernetes-csi:master Aug 8, 2019

This was referenced Aug 12, 2019

1.2: Fixed ControllerUnpublish error handling #168

Merged

1.1: Fixed ControllerUnpublish error handling #169

Merged

1.0: Fixed ControllerUnpublish error handling #170

Merged

pohly mentioned this pull request Feb 13, 2020

mock: avoid "not found" error in ControllerUnpublishVolume kubernetes-csi/csi-test#250

Merged
