
Add user identity to cluster status to avoid leakage when switching account #1513

Merged: 87 commits, Dec 20, 2022

Conversation

@Michaelvll (Collaborator) commented Dec 12, 2022

Describe the changes in this PR:

With #1489, when a user switches accounts, sky status -r removes a previously launched cluster from the table because the cluster cannot be found under the new account, causing resource leakage. This can happen whenever a user switches accounts on any cloud. This PR fixes that problem.


IMPORTANT: A user with multiple identities for a cloud needs to do the following after upgrading to this PR:

  1. Switch to the account that was used to launch the clusters in the cluster table.
  2. Run sky status -r so that the owner information can be correctly updated in our cluster table cache.

A normal user will not be affected by this PR.
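
For illustration only, here is a minimal sketch of the kind of owner-identity check this PR describes. The helper names (get_current_identity, check_owner_identity), the dict-based cluster record, and the use of the AWS STS caller ARN are assumptions for this sketch, not SkyPilot's actual implementation; only the error semantics (owner account vs. currently activated account) come from this PR.

# Minimal sketch (not SkyPilot's actual code): record the cloud user
# identity with each cluster and verify it before operating on the cluster.
import boto3


class ClusterOwnerIdentityMismatchError(Exception):
    """Raised when the active cloud account does not own the cluster."""


def get_current_identity() -> str:
    # Hypothetical AWS-only helper: the STS caller identity identifies
    # the currently activated account.
    return boto3.client('sts').get_caller_identity()['Arn']


def check_owner_identity(record: dict) -> None:
    owner = record.get('owner')
    if owner is None:
        # Legacy record launched before this PR: owner unknown, skip.
        return
    current = get_current_identity()
    if owner != current:
        raise ClusterOwnerIdentityMismatchError(
            f'Cluster {record["name"]!r} is owned by account {owner!r}, '
            f'but the currently activated account is {current!r}.')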


Tested (run the relevant ones):

  • sky launch -c min --cloud aws; switch to another account; sky status -r


  • sky launch -c min --cloud aws; switch to another account; sky launch -c min; sky down min; sky stop min; sky start min
  • sky launch -c min --cloud gcp; switch to another account; sky status -r


  • sky launch -c min --cloud azure; switch to another account; sky status -r


@Michaelvll force-pushed the add-user-identity-to-cluster branch 2 times, most recently from a3c6295 to b0578e2, on December 12, 2022 at 07:37.
@Michaelvll (Collaborator, Author) commented Dec 18, 2022

Tested (d0f1888):

  • tests/run_smoke_tests.sh
  • tests/backward_compatibility_tests.sh

@concretevitamin (Member) left a comment:


Thanks; did a pass over the new changes.

Minor: Can we make the following errors for various CLI calls return exit code 1?

» AWS_PROFILE=AdministratorAccess-1234 sky start jump3                                       
Restarting 1 cluster: jump3. Proceed? [Y/n]:
Cluster 'jump3' (AWS) is owned by account 'yyyyy', but the currently activated account is 'xxxxx'.

Resolved review threads: sky/cli.py; sky/backends/backend_utils.py (outdated, 2 threads)
sky/core.py (outdated), comment on lines 363 to 373:

    except exceptions.ClusterNotUpError as e:
        with ux_utils.print_exception_no_traceback():
            e.message += (
                f'\n auto{option_str} can only be set/unset for '
                f'{global_user_state.ClusterStatus.UP.value} clusters.')
            raise e from None
    except exceptions.NotSupportedError as e:
        with ux_utils.print_exception_no_traceback():
            e.message += (f'\n auto{option_str} is only supported by backend: '
                          f'{backends.CloudVmRayBackend.NAME}')
            raise e from None
Member: Does it make sense to establish the convention that programmatic APIs should throw full stacktraces if possible (to ease debugging)? I think it's ok to defer to the future.

Collaborator (Author): That is a good point! In that case, we may want to change the behavior of ux_utils.print_exception_no_traceback based on the entrypoint (CLI or programmatic API).
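
A minimal sketch of that idea, assuming a hypothetical SKY_CLI_ENTRYPOINT environment flag set by the CLI entrypoint; this is not SkyPilot's actual mechanism, just one way an entrypoint-aware context manager could work:

# Sketch: suppress tracebacks only when running under the CLI.
# The SKY_CLI_ENTRYPOINT flag is an assumption for this sketch.
import contextlib
import os
import sys


@contextlib.contextmanager
def print_exception_no_traceback():
    if os.environ.get('SKY_CLI_ENTRYPOINT') == '1':
        # CLI entrypoint: hide the traceback so users see only the message.
        original = getattr(sys, 'tracebacklimit', 1000)
        sys.tracebacklimit = 0
        try:
            yield
        finally:
            sys.tracebacklimit = original
    else:
        # Programmatic API: keep the full stacktrace to ease debugging.
        yield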

Resolved review threads: sky/backends/backend_utils.py (outdated, 6 threads)
@concretevitamin (Member):

> The problem may not be related to this PR, as we don't change the generated ray yaml for the cluster. Do you think it is because the jump branch contains some additional settings in the ray yaml that differ from the one generated by the current branch, causing the hash difference and resource leakage?

That's right! Thanks.


Potential issue

(1) Intentionally passing a wrong profile string AdministratorAccess-1234

» AWS_PROFILE=AdministratorAccess-1234 sky queue
Fetching and parsing job queue...
getting the job queue cluster 'jump3' (status: STOPPED)... skipped.
getting the job queue cluster 'smoke' (status: STOPPED)... skipped.

W 12-18 17:53:52 backend_utils.py:1900] Failed to refresh the cluster status, it is not fatal, but getting the job queue cluster 'sky-spot-controller-xxx' might hang if the cluster is not up.
W 12-18 17:53:52 backend_utils.py:1900] Detailed reason: Failed to get AWS user identity with unknown exception: <class 'botocore.exceptions.ProfileNotFound'> The config profile (AdministratorAccess-1234) could not be found.
getting the job queue cluster 'sky-spot-controller-xxx' (status: STOPPED)... skipped.
  1. It's surprising this warning is printed for the spot controller but not for the other two clusters. Is it needed?
  2. (Related to an existing comment) "but getting the job queue cluster" - grammar is tricky.

(2) Passing the correct profile string <correct>

» AWS_PROFILE=<correct> sky queue                                                  1 ↵
Fetching and parsing job queue...
getting the job queue cluster 'jump3' (status: STOPPED)... skipped.
getting the job queue cluster 'smoke' (status: STOPPED)... skipped.

W 12-18 17:55:09 backend_utils.py:1900] Failed to refresh the cluster status, it is not fatal, but getting the job queue cluster 'sky-spot-controller-xxx' might hang if the cluster is not up.
W 12-18 17:55:09 backend_utils.py:1900] Detailed reason: Cluster 'sky-spot-controller-xxx' (AWS) is owned by account 'yyyy', but the currently activated account is 'zzzz'.
getting the job queue cluster 'sky-spot-controller-xxx' (status: STOPPED)... skipped.

Aside: I now think that carefully allowing certain operations on a non-owned cluster may not be worth the effort, because it's harder for us to get the logic right and for users to remember it. On the other hand, if the rule is clear-cut, the user doesn't need to learn new concepts: if one doesn't own a cluster, no operations are allowed. That said, let's ship it and see what user feedback says ;)

@Michaelvll (Collaborator, Author):

Thanks for the review @concretevitamin!

> Minor: Can we make the following errors for various CLI calls return exit code 1?

This is a good call, but it is a bit difficult to decide the boundary. Since our sky start can operate on multiple clusters, do you think we should set the exit code to 1 if any one of the clusters fails due to the identity problem?
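
For concreteness, a minimal sketch of one possible boundary, where any per-cluster identity failure makes the whole command exit non-zero; the helper names below are hypothetical stand-ins, not the PR's actual code:

# Sketch: aggregate per-cluster identity failures into exit code 1.
import sys


class ClusterOwnerIdentityMismatchError(Exception):
    pass


def start_one_cluster(name: str) -> None:
    """Hypothetical stand-in for the real per-cluster start logic."""


def start_clusters(names) -> None:
    failed = []
    for name in names:
        try:
            start_one_cluster(name)
        except ClusterOwnerIdentityMismatchError as e:
            print(e)
            failed.append(name)
    if failed:
        # Any identity failure on any cluster yields exit code 1.
        sys.exit(1)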

> It's surprising this warning is printed for spot controller but not for the other two clusters. Is it needed?

Is it because only the spot controller has autostop set up? We only check the identity when we need to query the cloud CLI. For clusters without autostop set up, there is no need to check or warn, since the operation should work. However, for a cluster with autostop set up, there is no guarantee that the subsequent code will work, i.e. the warning might be useful as a disclaimer. Wdyt?
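
To illustrate that boundary, a sketch with assumed record fields and stubbed helpers (not SkyPilot's actual internals): only clusters with autostop configured need a cloud query, and only a cloud query triggers the identity check.

def check_owner_identity(record: dict) -> None:
    """Stub: compare record['owner'] against the active cloud identity."""


def query_cloud_for_status(record: dict) -> str:
    """Stub: query the cloud provider for the cluster's live status."""
    return record['status']


def maybe_refresh_status(record: dict, force: bool = False) -> str:
    # autostop < 0 is assumed to mean autostop is not set; the cached
    # status can then be trusted, so no cloud query (and therefore no
    # identity check or warning) is needed.
    if not force and record.get('autostop', -1) < 0:
        return record['status']
    # Autostop is set (or refresh is forced): the cloud must be queried,
    # so verify the owner identity first.
    check_owner_identity(record)
    return query_cloud_for_status(record)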

> (Related to an existing comment) "but getting the job queue cluster" - grammar is tricky.

Modified the logs based on the suggestions in the previous comments. The logging for a non-UP cluster will look like the following:

sky logs min
sky.exceptions.ClusterNotUpError: Tailing logs: skipped for cluster 'min' (status: STOPPED). It is only allowed for UP clusters.

I agree that the user may find the boundary a bit complicated, and we can modify it based on feedback. One thing to note is that we are not making multi-account use officially supported with this PR, as that requires more detailed edge-case handling. Instead, I would regard this PR as a safeguard for a user who switches from one account to another entirely and stays with the new account. ; )

@Michaelvll (Collaborator, Author) commented Dec 19, 2022

As discussed offline, we decided to ban all operations when there is an identity mismatch for now. PTAL. The new behavior will be:

AWS_PROFILE=admin-user sky logs min
sky.exceptions.ClusterOwnerIdentityMismatchError: Cluster 'min' (AWS) is owned by account '679991763071', but the currently activated account is 'AROATWUGBKI524LKNFJEB:admin-user'.


@concretevitamin (Member) left a comment:

LGTM, thanks for the great work @Michaelvll!

Resolved review threads: sky/backends/backend_utils.py (2 threads, one outdated); sky/utils/common_utils.py; sky/spot/spot_utils.py (outdated); sky/core.py; sky/cli.py (outdated)
@Michaelvll (Collaborator, Author) commented Dec 20, 2022

Thanks for the excellent review @concretevitamin! Just tested again with the smoke tests and it works. (7874618)
Merging.
