
Add user identity to cluster status to avoid leakage when switching account #1513

Merged: 87 commits, Dec 20, 2022

Conversation

@Michaelvll (Collaborator) commented Dec 12, 2022

Describe the changes in this PR:

With #1489, when a user switches accounts, sky status -r removes a previously launched cluster from the table because the cluster cannot be found under the new account, causing resource leakage. This can happen whenever a user switches accounts on any cloud. This PR fixes that problem.


IMPORTANT: A user with multiple identities for a cloud needs to do the following after upgrading to this PR:

  1. Switch to the account that was used to launch the clusters in the cluster table.
  2. Run sky status -r so that the owner information can be correctly updated in our cluster table cache.

A normal user will not be affected by this PR.
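
For illustration only, here is a minimal sketch of the kind of owner-identity check this PR describes. The helper names (get_current_identity, check_owner_identity), the dict-based cluster record, and the use of the AWS STS caller ARN are assumptions for this sketch, not SkyPilot's actual implementation; only the error semantics (owner account vs. currently activated account) come from this PR.

# Minimal sketch (not SkyPilot's actual code): record the cloud user
# identity with each cluster and verify it before operating on the cluster.
import boto3


class ClusterOwnerIdentityMismatchError(Exception):
    """Raised when the active cloud account does not own the cluster."""


def get_current_identity() -> str:
    # Hypothetical AWS-only helper: the STS caller identity identifies
    # the currently activated account.
    return boto3.client('sts').get_caller_identity()['Arn']


def check_owner_identity(record: dict) -> None:
    owner = record.get('owner')
    if owner is None:
        # Legacy record launched before this PR: owner unknown, skip.
        return
    current = get_current_identity()
    if owner != current:
        raise ClusterOwnerIdentityMismatchError(
            f'Cluster {record["name"]!r} is owned by account {owner!r}, '
            f'but the currently activated account is {current!r}.')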


Tested (run the relevant ones):

  • sky launch -c min --cloud aws; switch to another account; sky status -r


  • sky launch -c min --cloud aws; switch to another account; sky launch -c min; sky down min; sky stop min; sky start min
  • sky launch -c min --cloud gcp; switch to another account; sky status -r


  • sky launch -c min --cloud azure; switch to another account; sky status -r


@Michaelvll force-pushed the add-user-identity-to-cluster branch 2 times, most recently from a3c6295 to b0578e2, on December 12, 2022 at 07:37.
@Michaelvll (Collaborator, Author) commented Dec 18, 2022

Tested (d0f1888):

  • tests/run_smoke_tests.sh
  • tests/backward_compatibility_tests.sh

@concretevitamin (Member) left a comment:


Thanks; did a pass over the new changes.

Minor: Can we make the following errors for various CLI calls return exit code 1?

» AWS_PROFILE=AdministratorAccess-1234 sky start jump3                                       
Restarting 1 cluster: jump3. Proceed? [Y/n]:
Cluster 'jump3' (AWS) is owned by account 'yyyyy', but the currently activated account is 'xxxxx'.

Resolved review threads: sky/cli.py; sky/backends/backend_utils.py (outdated, 2 threads)
sky/core.py (outdated), comment on lines 363 to 373:

    except exceptions.ClusterNotUpError as e:
        with ux_utils.print_exception_no_traceback():
            e.message += (
                f'\n auto{option_str} can only be set/unset for '
                f'{global_user_state.ClusterStatus.UP.value} clusters.')
            raise e from None
    except exceptions.NotSupportedError as e:
        with ux_utils.print_exception_no_traceback():
            e.message += (f'\n auto{option_str} is only supported by backend: '
                          f'{backends.CloudVmRayBackend.NAME}')
            raise e from None
Member: Does it make sense to establish the convention that programmatic APIs should throw full stacktraces if possible (to ease debugging)? I think it's ok to defer to the future.

Collaborator (Author): That is a good point! In that case, we may want to change the behavior of ux_utils.print_exception_no_traceback based on the entrypoint (CLI or programmatic API).
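
A minimal sketch of that idea, assuming a hypothetical SKY_CLI_ENTRYPOINT environment flag set by the CLI entrypoint; this is not SkyPilot's actual mechanism, just one way an entrypoint-aware context manager could work:

# Sketch: suppress tracebacks only when running under the CLI.
# The SKY_CLI_ENTRYPOINT flag is an assumption for this sketch.
import contextlib
import os
import sys


@contextlib.contextmanager
def print_exception_no_traceback():
    if os.environ.get('SKY_CLI_ENTRYPOINT') == '1':
        # CLI entrypoint: hide the traceback so users see only the message.
        original = getattr(sys, 'tracebacklimit', 1000)
        sys.tracebacklimit = 0
        try:
            yield
        finally:
            sys.tracebacklimit = original
    else:
        # Programmatic API: keep the full stacktrace to ease debugging.
        yield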

Resolved review threads: sky/backends/backend_utils.py (outdated, 6 threads)
@concretevitamin (Member):

> The problem may not be related to this PR, as we don't change the generated ray yaml for the cluster. Do you think it is because the jump branch contains some additional settings in the ray yaml that differ from the one generated by the current branch, causing the hash difference and resource leakage?

That's right! Thanks.


Potential issue

(1) Intentionally passing a wrong profile string AdministratorAccess-1234

» AWS_PROFILE=AdministratorAccess-1234 sky queue
Fetching and parsing job queue...
getting the job queue cluster 'jump3' (status: STOPPED)... skipped.
getting the job queue cluster 'smoke' (status: STOPPED)... skipped.

W 12-18 17:53:52 backend_utils.py:1900] Failed to refresh the cluster status, it is not fatal, but getting the job queue cluster 'sky-spot-controller-xxx' might hang if the cluster is not up.
W 12-18 17:53:52 backend_utils.py:1900] Detailed reason: Failed to get AWS user identity with unknown exception: <class 'botocore.exceptions.ProfileNotFound'> The config profile (AdministratorAccess-1234) could not be found.
getting the job queue cluster 'sky-spot-controller-xxx' (status: STOPPED)... skipped.
  1. It's surprising this warning is printed for the spot controller but not for the other two clusters. Is it needed?
  2. (Related to an existing comment) "but getting the job queue cluster" - grammar is tricky.

(2) Passing the correct profile string <correct>

» AWS_PROFILE=<correct> sky queue                                                  1 ↵
Fetching and parsing job queue...
getting the job queue cluster 'jump3' (status: STOPPED)... skipped.
getting the job queue cluster 'smoke' (status: STOPPED)... skipped.

W 12-18 17:55:09 backend_utils.py:1900] Failed to refresh the cluster status, it is not fatal, but getting the job queue cluster 'sky-spot-controller-xxx' might hang if the cluster is not up.
W 12-18 17:55:09 backend_utils.py:1900] Detailed reason: Cluster 'sky-spot-controller-xxx' (AWS) is owned by account 'yyyy', but the currently activated account is 'zzzz'.
getting the job queue cluster 'sky-spot-controller-xxx' (status: STOPPED)... skipped.

Aside: I now think that carefully allowing certain operations on a non-owned cluster may not be worth the effort, because it's harder for us to get the logic right and for users to remember it. On the other hand, if the rule is clear-cut, the user doesn't need to learn new concepts: if one doesn't own a cluster, no operations are allowed. That said, let's ship it and see what user feedback says ;)

@Michaelvll (Collaborator, Author):

Thanks for the review @concretevitamin!

> Minor: Can we make the following errors for various CLI calls return exit code 1?

This is a good call, but it is a bit difficult to decide the boundary. Since our sky start can operate on multiple clusters, do you think we should set the exit code to 1 if any one of the clusters fails due to the identity problem?
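
For concreteness, a minimal sketch of one possible boundary, where any per-cluster identity failure makes the whole command exit non-zero; the helper names below are hypothetical stand-ins, not the PR's actual code:

# Sketch: aggregate per-cluster identity failures into exit code 1.
import sys


class ClusterOwnerIdentityMismatchError(Exception):
    pass


def start_one_cluster(name: str) -> None:
    """Hypothetical stand-in for the real per-cluster start logic."""


def start_clusters(names) -> None:
    failed = []
    for name in names:
        try:
            start_one_cluster(name)
        except ClusterOwnerIdentityMismatchError as e:
            print(e)
            failed.append(name)
    if failed:
        # Any identity failure on any cluster yields exit code 1.
        sys.exit(1)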

> It's surprising this warning is printed for spot controller but not for the other two clusters. Is it needed?

Is it because only the spot controller has autostop set up? We only check the identity when we need to query the cloud CLI. For clusters without autostop set up, there is no need to check or warn, since the operation should work. However, for a cluster with autostop set up, there is no guarantee that the subsequent code will work, i.e. the warning might be useful as a disclaimer. Wdyt?
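
To illustrate that boundary, a sketch with assumed record fields and stubbed helpers (not SkyPilot's actual internals): only clusters with autostop configured need a cloud query, and only a cloud query triggers the identity check.

def check_owner_identity(record: dict) -> None:
    """Stub: compare record['owner'] against the active cloud identity."""


def query_cloud_for_status(record: dict) -> str:
    """Stub: query the cloud provider for the cluster's live status."""
    return record['status']


def maybe_refresh_status(record: dict, force: bool = False) -> str:
    # autostop < 0 is assumed to mean autostop is not set; the cached
    # status can then be trusted, so no cloud query (and therefore no
    # identity check or warning) is needed.
    if not force and record.get('autostop', -1) < 0:
        return record['status']
    # Autostop is set (or refresh is forced): the cloud must be queried,
    # so verify the owner identity first.
    check_owner_identity(record)
    return query_cloud_for_status(record)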

> (Related to an existing comment) "but getting the job queue cluster" - grammar is tricky.

Modified the logs based on the suggestions in the previous comments. The logging for a non-UP cluster will look like the following:

sky logs min
sky.exceptions.ClusterNotUpError: Tailing logs: skipped for cluster 'min' (status: STOPPED). It is only allowed for UP clusters.

I agree that the user may find the boundary a bit complicated, and we can modify it based on feedback. One thing to note is that we are not making multi-account use officially supported with this PR, as that requires more detailed edge-case handling. Instead, I would regard this PR as a safeguard for a user who switches from one account to another entirely and stays with the new account. ; )

@Michaelvll (Collaborator, Author) commented Dec 19, 2022

As discussed offline, we decided to ban all operations when there is an identity mismatch for now. PTAL. The new behavior will be:

AWS_PROFILE=admin-user sky logs min
sky.exceptions.ClusterOwnerIdentityMismatchError: Cluster 'min' (AWS) is owned by account '679991763071', but the currently activated account is 'AROATWUGBKI524LKNFJEB:admin-user'.


@concretevitamin (Member) left a comment:

LGTM, thanks for the great work @Michaelvll!

Resolved review threads: sky/backends/backend_utils.py (2 threads, one outdated); sky/utils/common_utils.py; sky/spot/spot_utils.py (outdated); sky/core.py; sky/cli.py (outdated)
@Michaelvll (Collaborator, Author) commented Dec 20, 2022

Thanks for the excellent review @concretevitamin! Just tested again with the smoke tests and it works. (7874618)
Merging.
