Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add command to clean the stale subvolume #206

Merged
merged 3 commits into from
Dec 11, 2023
Merged

Conversation

yati1998
Copy link
Contributor

@yati1998 yati1998 commented Nov 20, 2023

  1. Includes command to list and delete the stale subvolume
  2. documentation to understand the usage of the command

@yati1998
Copy link
Contributor Author

@subhamkrai can you please review the PR

@@ -0,0 +1,36 @@
# stale subvolume cleanup
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# stale subvolume cleanup
# Stale subvolume cleanup

## Example

```bash
kubectl rook-ceph subvolume-cleanup ls
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kubectl rook-ceph subvolume-cleanup ls
kubectl rook-ceph subvolume-cleanup ls

Why not just call it subvolume instead of subvolume-cleanup ?
The commands would be
subvolume ls
subvolume clean

Copy link
Collaborator

@subhamkrai subhamkrai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yati1998 could you follow the following doc former https://github.com/rook/kubectl-rook-ceph/blob/master/docs/rook.md to split up the section based on sub-commands.

docs/stale-subvolume.md Outdated Show resolved Hide resolved
docs/stale-subvolume.md Outdated Show resolved Hide resolved
docs/stale-subvolume.md Outdated Show resolved Hide resolved
This would consider all the cases where we can have stale subvolume
and delete them without impacting other resources and attached volumes.

Few cases are mentioned below:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these cases cause stale subvolumes? If there is a description of stale subvolumes already in another doc such as ceph docs, you could also link it instead of describing all the scenarios.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the explanation in above comment. that may have these corner cases where we can have stale subvolumes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous paragraph sounds good. How about if we just remove this list (lines 9-13)? Seems like we don't need to list all these scenarios.

@yati1998 yati1998 changed the title doc: design doc for stale volume cleanup [WIP]doc: design doc for stale volume cleanup Nov 27, 2023
Comment on lines 42 to 72
fsArgs := []string{
fmt.Sprintf("fs ls --format=json | jq --raw-output '.[].name'"),
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fsArgs := []string{
fmt.Sprintf("fs ls --format=json | jq --raw-output '.[].name'"),
}
fsArgs := []string{
fmt.Sprintf("fs ls --format=json | jq --raw-output '.[].name'"),
}

This is incorrect, do not use fmt.Sprintf() in []string{} as it will treat it as single string in bash. To achieve the desired output

fsArgs := []string{
		"fs", "ls", "--format", "json",
	}

get output of above and use json UnMarshal to get the name

This would consider all the cases where we can have stale subvolume
and delete them without impacting other resources and attached volumes.

Few cases are mentioned below:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous paragraph sounds good. How about if we just remove this list (lines 9-13)? Seems like we don't need to list all these scenarios.

## clean

```bash
kubectl rook-ceph subvolume clean csi-vol-427774b4-340b-11ed-8d66-0242ac110004
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cleaning" a volume is just deleting it, right? For clarity, IMO we should call it "delete".

Suggested change
kubectl rook-ceph subvolume clean csi-vol-427774b4-340b-11ed-8d66-0242ac110004
kubectl rook-ceph subvolume delete csi-vol-427774b4-340b-11ed-8d66-0242ac110004

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we can do that.

// Get all csi reference on k8s side for all cephfs pv
k8sRef := exec.ExecuteBashCommand(getpv)
if k8sRef == "" {
logging.Info("No k8s pvs found. Nothing to do.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to return here?

Suggested change
logging.Info("No k8s pvs found. Nothing to do.")
logging.Info("No k8s pvs found. Nothing to do.")
return

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, missed it.

)

var getpvlist = `
kubectl get pv -o jsonpath={.items[*].spec.csi.volumeAttributes.subvolumeName}`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting all PVs of all types in the cluster, not just rbd/cephfs PVs? Or does the subvolumeName in the jsonpath effectively help us filter out all others?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@subhamkrai In a separate PR, we need to factor out so kubectl can be replaced by another tool such as oc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting all PVs of all types in the cluster, not just rbd/cephfs PVs? Or does the subvolumeName in the jsonpath effectively help us filter out all others?

no it will not list all pvs, if you see the subvolumename is under csi, hence it will only list down the cephfs/rbd pvs.


func List(ctx context.Context, clientsets *k8sutil.Clientsets, clusterNamespace string) []string {

getpv := fmt.Sprintf(getpvlist)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability, I'd suggest just putting the command directly here, so the dev doesn't need to look above to find what is executed.

Suggested change
getpv := fmt.Sprintf(getpvlist)
getpv := `kubectl get pv -o jsonpath={.items[*].spec.csi.volumeAttributes.subvolumeName}`

fmt.Sprintf("fs ls --format=json | jq --raw-output '.[].name'"),
}

fsList := exec.RunCommandInToolboxPod(ctx, clientsets, "ceph", fsArgs, clusterNamespace, true, false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to run in the toolbox instead of the operator pod? Where possible, let's use the operator pod instead of the toolbox so it lessens the dependency on the toolbox.

Suggested change
fsList := exec.RunCommandInToolboxPod(ctx, clientsets, "ceph", fsArgs, clusterNamespace, true, false)
fsList := exec.RunCommandInOperatorPod(ctx, clientsets, "ceph", fsArgs, clusterNamespace, true, false)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing manually we execute into the toolbox pod, hence thought this function does the same. I will try with operator pod.

@@ -67,7 +67,7 @@ func init() {

RootCmd.PersistentFlags().StringVar(&KubeConfig, "kubeconfig", "", "kubernetes config path")
RootCmd.PersistentFlags().StringVar(&OperatorNamespace, "operator-namespace", "", "Kubernetes namespace where rook operator is running")
RootCmd.PersistentFlags().StringVarP(&CephClusterNamespace, "namespace", "n", "rook-ceph", "Kubernetes namespace where CephCluster is created")
RootCmd.PersistentFlags().StringVarP(&CephClusterNamespace, "namespace", "n", "openshift-storage", "Kubernetes namespace where CephCluster is created")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to revert

cmd/commands/subvolume.go Outdated Show resolved Hide resolved
cmd/commands/subvolume.go Outdated Show resolved Hide resolved
cmd/commands/subvolume.go Outdated Show resolved Hide resolved
cmd/commands/subvolume.go Outdated Show resolved Hide resolved
docs/stale-subvolume.md Outdated Show resolved Hide resolved
docs/stale-subvolume.md Outdated Show resolved Hide resolved
@yati1998 yati1998 force-pushed the stale-volume branch 3 times, most recently from 8e8d9e6 to 9541514 Compare November 28, 2023 06:06
@yati1998
Copy link
Contributor Author

Addressed all the comments, please do have a review. Meanwhile, I am adding another commit to delete the subvolumes.

cmd/commands/subvolume.go Outdated Show resolved Hide resolved
cmd/commands/subvolume.go Outdated Show resolved Hide resolved
cmd/commands/subvolume.go Outdated Show resolved Hide resolved
cmd/commands/subvolume.go Outdated Show resolved Hide resolved
docs/stale-subvolume.md Outdated Show resolved Hide resolved
pkg/subvolume/subvolume.go Outdated Show resolved Hide resolved
pkg/subvolume/subvolume.go Outdated Show resolved Hide resolved
pkg/subvolume/subvolume.go Outdated Show resolved Hide resolved
@yati1998 yati1998 force-pushed the stale-volume branch 3 times, most recently from f8a578f to aae100f Compare November 28, 2023 12:14
pkg/subvolume/subvolume.go Outdated Show resolved Hide resolved
docs/stale-subvolume.md Outdated Show resolved Hide resolved
pkg/subvolume/delete.go Outdated Show resolved Hide resolved
pkg/subvolume/delete.go Outdated Show resolved Hide resolved
pkg/subvolume/subvolume.go Outdated Show resolved Hide resolved
pkg/subvolume/subvolume.go Outdated Show resolved Hide resolved
pkg/subvolume/subvolume.go Outdated Show resolved Hide resolved
## ls

```bash
kubectl rook-ceph stale-subvolume ls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention to list stale subvolumes in all filesystems? That seems reasonable, but we might want an optional argument for the name of a specific filesystem. Also, seems like we need to show which filesystem the volume belongs to.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes the output shows the filesystem and subvolumegroup to which the subvolume belongs. But again giving users to add option to pass filesystem name is valid but would consider to take in as another PR/enhancement?

docs/stale-subvolume.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@subhamkrai subhamkrai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we only need to logs the result no need to show all the info stepswise

pkg/subvolume/subvolume.go Outdated Show resolved Hide resolved
## ls

```bash
kubectl rook-ceph stale-subvolume ls
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

pkg/subvolume/delete.go Outdated Show resolved Hide resolved
@yati1998 yati1998 force-pushed the stale-volume branch 2 times, most recently from 1f014fe to faba182 Compare December 11, 2023 13:43
@yati1998
Copy link
Contributor Author

.github/workflows/go-test.yaml Show resolved Hide resolved
cmd/commands/subvolume.go Outdated Show resolved Hide resolved
cmd/commands/subvolume.go Outdated Show resolved Hide resolved
cmd/commands/subvolume.go Outdated Show resolved Hide resolved
docs/subvolume.md Outdated Show resolved Hide resolved
pkg/filesystem/subvolume.go Outdated Show resolved Hide resolved
pkg/filesystem/subvolume.go Outdated Show resolved Hide resolved
pkg/filesystem/subvolume.go Outdated Show resolved Hide resolved
pkg/filesystem/subvolume.go Outdated Show resolved Hide resolved
pkg/filesystem/subvolume.go Outdated Show resolved Hide resolved

var listCmd = &cobra.Command{
Use: "ls",
Short: "Print the list of stale subvolumes no longer in use.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default it prints them all, so we can keep this simple

Suggested change
Short: "Print the list of stale subvolumes no longer in use.",
Short: "Print the list of subvolumes.",


subvolumeNames := getK8sRefSubvolume(ctx, clientsets)
listCephFSSubvolumes(ctx, clientsets, operatorNamespace, clusterNamespace, includeStale, subvolumeNames)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove blank line before a closing bracket

Suggested change

if includeStale {
fmt.Println("Filesystem Subvolume SubvolumeGroup")
} else {
fmt.Println("Filesystem Subvolume SubvolumeGroup State")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fmt.Println("Filesystem Subvolume SubvolumeGroup State")
fmt.Println("Filesystem Subvolume SubvolumeGroup Stale")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add it as state, it can be either in-use or stale.

staleSubvolumeNames := make(map[string]subVolumeInfo)
var svList string
if includeStale {
fmt.Println("Filesystem Subvolume SubvolumeGroup")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always print the "Stale" column, right? No need for this?

Suggested change
fmt.Println("Filesystem Subvolume SubvolumeGroup")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

printing stale cloumn here won't make sense, as we already pass on the --stale flag. if we do this, then there is no use of adding flag to ls. So IMHO, this is fine.

pkg/filesystem/subvolume.go Outdated Show resolved Hide resolved
pkg/filesystem/subvolume.go Show resolved Hide resolved
Copy link
Collaborator

@subhamkrai subhamkrai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In logging, I see fmt.Println is being used in many places, please try to use logging.Info where possible and same for errors and warning. Thank!

@yati1998
Copy link
Contributor Author

In logging, I see fmt.Println is being used in many places, please try to use logging.Info where possible and same for errors and warning. Thank!

As per @travisn comment's have used it and seems valid to me too. We don't want to log the info and just want to print the output.

docs/subvolume.md Outdated Show resolved Hide resolved
docs/subvolume.md Outdated Show resolved Hide resolved
this commit adds a design doc for including cleanup
of stale subvolumes.

Signed-off-by: yati1998 <ypadia@redhat.com>
Copy link
Collaborator

@subhamkrai subhamkrai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm, I'll let Travis merge the PR. Thanks for contibuting @yati1998

pkg/filesystem/subvolume.go Outdated Show resolved Hide resolved
pkg/filesystem/subvolume.go Show resolved Hide resolved
this commit adds command and script to
cleanup the stale subvolume.

Signed-off-by: yati1998 <ypadia@redhat.com>
this commit adds a ci to test the
subvolume ls and delete command

Signed-off-by: yati1998 <ypadia@redhat.com>
@travisn travisn merged commit 70ede27 into rook:master Dec 11, 2023
5 checks passed
Copy link
Member

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for leaving a comment on merged PR.

kubectl rook-ceph ceph fs subvolume create myfs test-subvol group-a
kubectl rook-ceph subvolume ls
kubectl rook-ceph subvolume ls --stale
kubectl rook-ceph subvolume delete test-subvol myfs group-a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be same as create arg fsname sub-volname groupname not the subvolname fsname and groupname

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take care

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Madhu-1 sorry just forgot, this is in subvolname fsname and groupname so that we can add multiple subvolname for same fsname and groupname which will make group deletion easy incase we have many subvols.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesnt seems right to me, we just stick to how we have for create API call, user can automate the CLI call if he /she thinks they need repetitive operations.

Copy link
Member

@Madhu-1 Madhu-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

posted few questions so that i dont forget it. we might need to discuss on those as well when we get some free time.

# Subvolume cleanup

The subvolume command is used to clean the stale subvolumes
which have no parent-pvc attached to them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we consider the external ceph cluster case to identify the stale subvolumes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we need the cluster setup to check it. I did ping ashish , waiting for him on that part.
#211 this is the issue created to complete te remaining scenarios and minor changes .

which have no parent-pvc attached to them.
The command would list out all such subvolumes which needs to be removed.
This would consider all the cases where we can have stale subvolume
and delete them without impacting other resources and attached volumes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this tool uses health check user for the external ceph cluster, AFAIK that user will not have all the required access to do subvolume operations.

## delete

```bash
kubectl rook-ceph subvolume delete csi-vol-427774b4-340b-11ed-8d66-0242ac110004 ocs-storagecluster csi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do still left with stale rados omap objects created for these subvolumes, how those will be cleaned up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, @riya-singhal31 will be working on that.
The plan is to get the list of objects and delete the omap objects of the respective subvolume. We will take it as enchancement.

```

```bash
kubectl rook-ceph subvolume delete csi-vol-427774b4-340b-11ed-8d66-0242ac110004,csi-vol-427774b4-340b-11ed-8d66-0242ac110005 ocs-storagecluster csi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will not be able to do the delete operation when the ceph cluster is full, Are we planning to handle stale resources cleanup in that case as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case has not been considered but yeah we can work on it.
cc @riya-singhal31

Comment on lines +55 to +57
if pv.Spec.CSI.VolumeAttributes["subvolumeName"] != "" {
subvolumeNames[pv.Spec.CSI.VolumeAttributes["subvolumeName"]] = subVolumeInfo{}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might panic if we have non CSI PV as CSI is a pointer. please check

Comment on lines +86 to +88
if ok || checkSnapshot(ctx, clientsets, operatorNamespace, clusterNamespace, fs.Name, sv.Name, svg.Name) {
// The volume is not stale if a PV was found, or it has a snapshot
stale = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can have a stale subvolume with a snapshot as well. we cannot consider if a subvolume has a snapshot as not a stale.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But deleting the subvolume if it has a snapshot can lead to data loss, hence the plan is to let them know that the subvolume has snapshot so it is good to delete the snapshot manually and then re-try.
What do you think should be an ideal case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can delete the subvolume with –retain-snapshots flag, this is take care of cleanup when the snapshot is deleted. this is what we do in cephcsi as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure will test and update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants