Fixed ControllerUnpublish error handling #165

Merged
merged 1 commit into from
Aug 8, 2019

Conversation

jsafrane
Contributor

@jsafrane jsafrane commented Aug 6, 2019

Any error from ControllerUnpublish can mean that a volume could be still attached (or being detached).

What type of PR is this?
/kind bug

Which issue(s) this PR fixes:
Fixes #164

Does this PR introduce a user-facing change?:

Action required: processing of ControllerUnpublish errors has changed. CSI drivers SHALL return success (0) when a deleted node or volume implies that the volume is detached from the node. The external attacher now treats a NotFound error like any other error and assumes that the volume may still be attached to the node. Please check the behavior of your CSI driver and fix it accordingly.
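
For driver authors, the release note above could translate into something like the following minimal sketch. It is not taken from any real driver: `backend`, `errGone`, `errBusy`, `fakeBackend`, and `controllerUnpublish` are hypothetical names, and a `nil` return stands in for the gRPC OK (0) status so that the example needs no dependency beyond the standard library.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors a storage backend might return.
var (
	errGone = errors.New("backend: node or volume does not exist")
	errBusy = errors.New("backend: rate limited")
)

// backend is a hypothetical storage API; it is not part of CSI.
type backend interface {
	Detach(volumeID, nodeID string) error
}

// controllerUnpublish sketches the decision a driver makes before replying.
// A nil result corresponds to the gRPC OK (0) status.
func controllerUnpublish(b backend, volumeID, nodeID string) error {
	err := b.Detach(volumeID, nodeID)
	if err == nil {
		return nil
	}
	if errors.Is(err, errGone) {
		// Assumption: on this backend a deleted node or volume implies
		// that the attachment is gone too, so report success.
		return nil
	}
	// Any other failure may leave the volume attached, so surface it and
	// let the external attacher retry.
	return fmt.Errorf("detach %s from %s: %w", volumeID, nodeID, err)
}

// fakeBackend returns a fixed error, for demonstration only.
type fakeBackend struct{ err error }

func (f fakeBackend) Detach(string, string) error { return f.err }

func main() {
	fmt.Println(controllerUnpublish(fakeBackend{err: errGone}, "vol-1", "node-1"))
	fmt.Println(controllerUnpublish(fakeBackend{err: errBusy}, "vol-1", "node-1"))
}
```

Whether a missing node or volume really implies a detached volume depends on the storage system, which is why the decision sits in the driver rather than in the attacher.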

cc @kubernetes-csi/csi-misc

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Aug 6, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 6, 2019
@bertinatto
Contributor

/retest

@bertinatto bertinatto left a comment

LGTM (I might not be aware of all consequences of this change, though)

@@ -266,6 +266,7 @@ func TestCSIHandler(t *testing.T) {
var success error
var readWrite = false
var readOnly = true
var ignored = false // the vlaue is irrelevant for given call
Contributor

nit: vlaue

Contributor Author

Fixed

@@ -88,11 +85,8 @@ func (a *attacher) Detach(ctx context.Context, volumeID string, nodeID string, s
		Secrets: secrets,
	}

	_, err = client.ControllerUnpublishVolume(ctx, &req)
	if err != nil {
		return isFinalError(err), err
Collaborator

I'm worried this will break detach for plugins that interpret the CSI spec as such:

The CSI spec says the SP must return NotFound if the volume or node is not found: https://github.com/container-storage-interface/spec/blob/master/spec.md#controllerunpublishvolume-errors.

I guess here we were assuming that if NotFound was returned, the detach is actually irrelevant/successful. Two scenarios to consider:

  1. The SP automatically treats a missing volume or node as detached: no Unpublish call is actually needed to clean up. Should the SP go against the CSI spec and return success?

  2. The SP needs ControllerUnpublish called to clean up its state. It is still an open question whether it should return success or a NotFound error afterwards.

Contributor Author

I added code that interprets NotFound as success and filed container-storage-interface/spec#373 to clarify it in the spec.

Collaborator

Based on CSI spec meeting this morning, it sounds like there is agreement to relax the NotFound error code so that plugins can decide if the unpublish needs to be retried. Based on that, I think we should just retry for any error.

And given the fact that all of the csi plugins we've looked at so far are handling this wrong already, we probably should add an "ACTION REQUIRED" to the release note.
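
The "retry for any error" rule proposed above can be sketched from the attacher's point of view as follows. This is an illustration only, not the external-attacher code: `detachResult`, `interpretUnpublish`, and the two sentinel errors are hypothetical, and the sentinels merely stand in for gRPC statuses to keep the example free of external dependencies.

```go
package main

import (
	"errors"
	"fmt"
)

// detachResult is a hypothetical summary of what the attacher concludes
// from one ControllerUnpublishVolume call.
type detachResult struct {
	detached bool // safe to mark the volume as detached
	retry    bool // call ControllerUnpublishVolume again later
}

// Stand-ins for gRPC statuses (assumption, to avoid the grpc dependency).
var (
	errNotFound = errors.New("NotFound")
	errInternal = errors.New("Internal")
)

// interpretUnpublish applies the rule from the discussion: only success
// means detached; every error, NotFound included, means the volume may
// still be attached and the call must be retried.
func interpretUnpublish(err error) detachResult {
	if err == nil {
		return detachResult{detached: true}
	}
	return detachResult{retry: true}
}

func main() {
	for _, err := range []error{nil, errNotFound, errInternal} {
		fmt.Printf("%v -> %+v\n", err, interpretUnpublish(err))
	}
}
```

The point of the design is that the attacher no longer guesses: a driver that knows the volume is gone has to say so by returning success.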

		// This is not gRPC error.
		return err
	}
	if st.Code() == codes.NotFound {
Contributor

What about a case where the SP temporarily cannot "find" the volume or the disk, but the disk is still attached? This should be a NOT_FOUND error that could be remedied by retrying, since the disk is not actually detached yet.

I would argue that the SP should return OK if it has determined that the volume/node is gone in a way that we can assume the volume is detached. This should be up to the SP to decide.

Collaborator

We're assuming all volume plugins automatically detach volumes when a node is deleted, and I'm not confident that we can assume that.

Contributor

I think it is a reasonable compromise until this behaviour is clarified in the spec. I do not know of any volume type where the attachment outlives the lifecycle of the node.

Collaborator

Sorry ignore my last comment, I misread SP as CO. We're on the same page :-)

Contributor

Either way I think we are asking plugin authors to reinterpret spec in a way which was previously not documented.

Esp. for node deletion case - do we have a solution in the meanwhile?

Collaborator

I think vsphere may have some interesting behaviors around node failure/deletion handling. cc @codenrhoden @vladimirvivien

Contributor

afaik - in vsphere too, when you delete a node in vcenter, the vmdk files are marked as detached.

@jsafrane
Contributor Author

jsafrane commented Aug 7, 2019

I checked AWS; it most probably returns an Internal error code when either the node or the volume does not exist. Filed kubernetes-sigs/aws-ebs-csi-driver#330

@jsafrane
Contributor Author

jsafrane commented Aug 7, 2019

Filed kubernetes/cloud-provider-openstack#718 for OpenStack / Cinder

Any error from ControllerUnpublish can mean that a volume could be still
attached (or being detached).
@jsafrane
Contributor Author

jsafrane commented Aug 8, 2019

I removed the special handling of NotFound; it is now handled like any other error, i.e. the volume is assumed to be still attached.

@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Aug 8, 2019
@jsafrane
Contributor Author

jsafrane commented Aug 8, 2019

And updated release-note.

@msau42
Collaborator

msau42 commented Aug 8, 2019

/lgtm

@davidz627, I and a few others discussed this, and we came to the conclusion that this change will require a major version bump because we're significantly changing behavior that may require drivers to change.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 8, 2019
@k8s-ci-robot k8s-ci-robot merged commit f304b8b into kubernetes-csi:master Aug 8, 2019
@jsafrane
Contributor Author

jsafrane commented Aug 9, 2019

this change will require a major version bump because we're significantly changing behavior that may require drivers to change.

At the same time, it fixes a pretty serious bug that should be fixed in all supported branches. Volumes are marked as detached after a "final error" (e.g. Internal due to a rate limit on GCE).

I could backport a safer version of this patch that marks a volume as detached after NotFound, i.e. the same behavior as in the released supported versions. All other errors would lead to a retry.
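
The backport variant described here differs from the merged change only in how NotFound is classified. A rough sketch of that classification, under the assumption that a local `code` type stands in for `google.golang.org/grpc/codes` (the type, its constants, and `markDetachedBackport` are all hypothetical names):

```go
package main

import "fmt"

// code is a local stand-in for gRPC status codes (assumption: real code
// would use google.golang.org/grpc/codes instead).
type code int

const (
	ok code = iota
	notFound
	internal
	deadlineExceeded
)

// markDetachedBackport reports whether the volume would be marked as
// detached under the proposed backport: success and NotFound count as
// detached, every other code leads to a retry.
func markDetachedBackport(c code) bool {
	switch c {
	case ok, notFound:
		return true
	default:
		return false
	}
}

func main() {
	for _, c := range []code{ok, notFound, internal, deadlineExceeded} {
		fmt.Printf("code=%d detached=%v\n", c, markDetachedBackport(c))
	}
}
```

Under the merged change, by contrast, only `ok` would yield `true`.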

pohly added a commit to pohly/csi-test that referenced this pull request Feb 13, 2020
The behavior of the external-attacher recently changed
(kubernetes-csi/external-attacher#165) such
that it now treats "not found" as a real error.

The effect was that some Kubernetes E2E tests (like "CSI mock volume
CSI workload information using mock driver should not be passed when
podInfoOnMount=false") sometimes ran for over 2 minutes, just waiting
for detach. That the test then proceeds without marking the test as
failed is a bug in the test cleanup code which will be
fixed.

This slowdown is not deterministic: sometimes the detach is done early
enough while the volume still exists.

With this change, the same test completes in under 30 seconds.
Development

Successfully merging this pull request may close these issues.

external-attacher marks volumes as detached after final error
6 participants