Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the file_mounts optimization and oslogin #1108

Merged
merged 7 commits into from
Aug 21, 2022

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Aug 21, 2022

This PR refactors the file_mounts optimization to make it lazy until all the file mounts are added. The file_mounts optimization will be applied to all clouds now.
It should also fix the oslogin issue mentioned in #1106 (comment) , as previously we place the backup file in the same directory as config_default whose parent directory will be overwritten by ray autoscaler, instead we now place the backup file in ~/.sky/.sky_gcp_config_default on the remote instance.

Hey @lhqing, would you like to switch to this branch git checkout refactor-file-mount-optimization; pip install -e ".[gcp]" and try if this solves the problem?

Tested:

  • ./tests/run_smoke_tests.sh

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice @Michaelvll! One question: is delaying the optimization necessary for correctness here (to fix the bug of oslogin)? Or is it just an implementation choice? It seems like if this PR does not add delaying, the oslogin bug would still have been fixed by using a separate path.


Results for for i in {1..3}; do time sky launch --cloud {cloud} -y ''; done:

This PR

sky launch --cloud gcp -y ''  13.29s user 3.21s system 12% cpu 2:11.69 total
sky launch --cloud gcp -y ''  14.60s user 2.99s system 13% cpu 2:14.45 total
sky launch --cloud gcp -y ''  11.05s user 2.17s system 10% cpu 2:06.05 total

sky launch --cloud aws -y ''  8.31s user 1.59s system 5% cpu 2:53.42 total
sky launch --cloud aws -y ''  9.59s user 2.10s system 7% cpu 2:42.28 total
sky launch --cloud aws -y ''  8.17s user 1.54s system 6% cpu 2:31.57 total

Master

sky launch --cloud gcp -y ''  12.90s user 2.89s system 11% cpu 2:20.54 total
sky launch --cloud gcp -y ''  10.51s user 1.98s system 10% cpu 2:04.03 total
sky launch --cloud gcp -y ''  10.89s user 2.08s system 10% cpu 2:08.76 total

sky launch --cloud aws -y ''  8.47s user 1.77s system 6% cpu 2:48.24 total
sky launch --cloud aws -y ''  8.64s user 1.80s system 6% cpu 2:49.91 total
sky launch --cloud aws -y ''  8.37s user 1.73s system 6% cpu 2:42.54 total

AWS doesn't benefit that much, interestingly.

sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
sky/backends/backend_utils.py Show resolved Hide resolved
sky/backends/backend_utils.py Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Aug 21, 2022

is delaying the optimization necessary for correctness here (to fix the bug of oslogin)?

yes, it will be important for the correctness, since the authentication step will add the new file into the file mount, but in our previous implementation, the files were already cp to the local tmp directory in the step before the authentication, ie the newly added file will not be uploaded to the cloud.

@Michaelvll Michaelvll merged commit fcfe289 into master Aug 21, 2022
@Michaelvll Michaelvll deleted the refactor-file-mount-optimization branch August 21, 2022 19:38
@lhqing
Copy link
Contributor

lhqing commented Aug 21, 2022

@Michaelvll I confirmed the new fix on a new spot controller. It works for me now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants