Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Autodown] Support for autodown #1217

Merged
merged 38 commits into from
Oct 14, 2022
Merged

[Autodown] Support for autodown #1217

merged 38 commits into from
Oct 14, 2022

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Oct 10, 2022

Closes #954.

Please refer to the following tests for the design of the CLI.
Main changes:

  1. Add an option --terminate forsky launch and sky start
  2. Add an option --terminate for sky autostop
  3. sky.launch replaces the teardown option with terminate.

Tested:

  • sky launch -c test-autodown -i 1 --terminate; sky status -r
  • sky cpunode -c test-autodown --cloud gcp; sky autostop --terminate test-autodown; sky status -r
  • sky cpunode -c test-autodown; sky stop test-autodown; sky start test-autodown -i 1 --terminate; sky status -r
  • tests/run_smoke_tests.sh test_autodown
  • sky launch -c test-autodown -i 2 --terminate; sky status
NAME             LAUNCHED        RESOURCES             STATUS   AUTOSTOP      COMMAND                         
test-autodown    a few secs ago  1x AWS(m6i.2xlarge)   UP       2 min (down)  sky launch -c test-autodown...  
smoke-test-zhwu  4 days ago      1x GCP(n1-highmem-8)  STOPPED  -             sky start smoke-test-zhwu      

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice @Michaelvll! Some comments about user-facing options/help strings.

We may need a smoke test for

  • sky launch --terminate
  • sky launch --terminate -i n

sky/cli.py Outdated
@@ -992,6 +994,15 @@ def cli():
'Setting this flag is equivalent to '
'running ``sky launch -d ...`` and then ``sky autostop -i <minutes>``'
'. If not set, the cluster will not be auto-stopped.'))
@click.option(
'--terminate',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about the alternative of --rm? (E.g., use this name in user interfaces; in the backend we can keep terminate)

Pros:

  • Popularized by docker: https://docs.docker.com/engine/reference/run/#clean-up---rm
  • For some reason --terminate to me makes sense only for serverful settings. I can imagine in the future sky launch may gain the ability to launch serverless jobs. In that case, "--terminate a job after it exits" is a bit confusing imo.
    • Feel free to push back as (1) this seems a weak argument (2) --rm may not be that much better.
  • Faster to type.

Another alternative is --down.

Pros:

  • Matches sky down.
  • Matches the colloquial "autodown".
  • Faster to type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to keep this consistent with the option in sky.launch, so that the user do not need to learn another set of function parameters when using the programetic API. I feel that the rm can be a bit confusing when appear in the sky.launch(..., rm=True, ...). Do you think replacing both the --terminate and sky.launch(..., terminate=True, ...) with --down and sky.launch(..., down=True, ...) will be better?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2c: - --down >> --terminate >> --rm

--down gets highest preference because its consistent with sky down. I'm a little apprehensive of --terminate mainly because it adds a new word terminate to user vocabulary, which has the same effect as sky down. Since down is a fundamental operation, I think it's a good idea to maintain consistency here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--down sounds great.

Minor: I see that some user-facing help strings get "terminate" replaced with "tear down". That seems okay but the former is much more common in cloud providers parlance (e.g., AWS console).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I found we are using both tear down and terminate in our docs. I would prefer tear down, since it is closer to the --down or sky down. I went through the docstr's, and tried to make them more consistent, but still keep some of the terminate. PTAL. : )

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
sky/execution.py Outdated Show resolved Hide resolved
sky/core.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/cli.py Outdated
@@ -992,6 +994,15 @@ def cli():
'Setting this flag is equivalent to '
'running ``sky launch -d ...`` and then ``sky autostop -i <minutes>``'
'. If not set, the cluster will not be auto-stopped.'))
@click.option(
'--terminate',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--down sounds great.

Minor: I see that some user-facing help strings get "terminate" replaced with "tear down". That seems okay but the former is much more common in cloud providers parlance (e.g., AWS console).

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a complete pass. A few comments.

tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Show resolved Hide resolved
sky/core.py Outdated Show resolved Hide resolved
sky/core.py Show resolved Hide resolved
sky/execution.py Outdated
Comment on lines 230 to 233
if stages is None or Stage.DOWN in stages:
if down and idle_minutes_to_autostop is None:
backend.teardown_ephemeral_storage(task)
backend.teardown(handle, terminate=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will clean up after the job fails, but if the cluster fails at the setup/filemount.
During adding this I found that it previously will not terminate the cluster when the user ctrl-c'ed during the logging for the execution. I just moved the teardown into the finally: ... block to make sure the clean up will happen after the ctrl-c. Wdyt?

Do you mean that by putting this block in this finally, we will tear down when users ctrl-c a running job's streaming log? That seems unintuitive. More intuitive thing to do seems to be terminate it whenever the job finishes.

By "but if the cluster fails at the setup/filemount." - do you mean if these happen, the cluster will remain up? This is important to clarify in all func docstrs/help strs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, my bad, I meant when the --down or sky.launch(..., down=True, ...) is specified, the cluster should be terminated if the user press ctrl-c during setup or file mounts. After thinking more about this, I think it should be fine to place it back in its original place.

sky/skylet/events.py Show resolved Hide resolved
sky/skylet/events.py Show resolved Hide resolved
@Michaelvll Michaelvll changed the title Support for autodown [Autostop] Support for autodown Oct 12, 2022
@Michaelvll Michaelvll changed the title [Autostop] Support for autodown [Autodown] Support for autodown Oct 12, 2022
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final nits -- mostly to ensure user-facing messages are clear.

sky/utils/cli_utils/status_utils.py Outdated Show resolved Hide resolved
tests/test_smoke.py Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/execution.py Show resolved Hide resolved
sky/core.py Outdated Show resolved Hide resolved
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a pass - mostly nits, some questions on semantics.

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/core.py Outdated Show resolved Hide resolved
sky/execution.py Show resolved Hide resolved
sky/execution.py Outdated
if isinstance(backend, backends.CloudVmRayBackend):
if down and idle_minutes_to_autostop is None:
# Use autostop(down) to terminate the cluster after the task is
# done. Otherwise, the cluster will be immediately terminated when
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wdym by "Otherwise, the cluster will be immediately terminated when # the user detach from the execution."? Can be omitted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By that, I mean without the logic here, if the detach_run is specified, the function will immediately terminate the cluster without waiting the job to finish.
It should be fine to omit it. Done. Thanks!

sky/execution.py Outdated
if idle_minutes_to_autostop is not None:
if idle_minutes_to_autostop == 0:
verb = 'torn down' if down else 'stopped'
logger.warning(f'{colorama.Fore.LIGHTBLACK_EX}Setting '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about printing this warning only for -i 0 but not for --down? The latter is a "normal" use case so shouldn't see a warning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason for showing this logging for --down is that it can help reduce the confusion caused by seeing autostop 1m (down) after using sky launch --down. Wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about logger.info and say something like "The cluster will be {verb}-ed after 1 minute of idleness (after all jobs finish)."?

This is to (1) avoid "warning" for a normal use case, and (2) not show the reason, as it's hard for users to understand.

I still think not showing this for --down is better as the UX is simpler. But up to you!

sky/execution.py Outdated
@@ -271,7 +288,7 @@ def launch(
auto-generate a name.
retry_until_up: whether to retry launching the cluster until it is
up.
idle_minutes_to_autostop: if provided, the cluster will be auto-stop
idle_minutes_to_autostop: if provided, the cluster will be autostop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep consistent with cli?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also added the docstr for the other argument as well. I think it may require another PR to consolidate the docstr's for the programmatic API.

sky/execution.py Outdated
verb = 'torn down' if down else 'stopped'
logger.warning(f'{colorama.Fore.LIGHTBLACK_EX}Setting '
'idle_minutes_to_autostop to 1, to avoid '
f'cluster being {verb} during task submission.'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would -i 0 have this issue of being torn down during task submission? Is that a bug in how the autostop event is implemented?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is that after we set the autostop in the PRE_EXEC stage with -i 0, it could be possible that the cluster immediately found itself have no task running and start the autostop/down process, before the task is submitted in the EXEC stage.
I believe the behavior is expected. Wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Can we copy this comment into the code where we reset the var to 1?

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @Michaelvll!

sky/execution.py Outdated
verb = 'torn down' if down else 'stopped'
logger.warning(f'{colorama.Fore.LIGHTBLACK_EX}Setting '
'idle_minutes_to_autostop to 1, to avoid '
f'cluster being {verb} during task submission.'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Can we copy this comment into the code where we reset the var to 1?

sky/execution.py Outdated
if idle_minutes_to_autostop is not None:
if idle_minutes_to_autostop == 0:
verb = 'torn down' if down else 'stopped'
logger.warning(f'{colorama.Fore.LIGHTBLACK_EX}Setting '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about logger.info and say something like "The cluster will be {verb}-ed after 1 minute of idleness (after all jobs finish)."?

This is to (1) avoid "warning" for a normal use case, and (2) not show the reason, as it's hard for users to understand.

I still think not showing this for --down is better as the UX is simpler. But up to you!

@Michaelvll
Copy link
Collaborator Author

Michaelvll commented Oct 14, 2022

Tested:

  • tests/run_smoke_tests.sh
  • sky launch -c test-autodown --down; sky status
NAME             LAUNCHED        RESOURCES             STATUS  AUTOSTOP   COMMAND                         
test-autodown    a few secs ago  1x AWS(m6i.2xlarge)   UP      1m (down)  sky launch -c test-autodown...  
smoke-test-zhwu  1 day ago       1x GCP(n1-highmem-8)  UP      -          sky start smoke-test-zhwu   

@Michaelvll Michaelvll merged commit 72e2837 into master Oct 14, 2022
@Michaelvll Michaelvll deleted the auto-down branch October 14, 2022 09:28
ewzeng pushed a commit to ewzeng/skypilot that referenced this pull request Oct 24, 2022
* Support for autodown

* Change API to terminate

* fix flag

* fix autostop

* fix comment

* address comment

* address comment

* format

* Rename terminate to down

* add smoke test

* fix autodown for multi-node

* format

* fix syntax

* use gcp for autodown test

* fix smoke test

* fix smoke test

* address comments

* Add comment

* Switch back to terminate

* fix comments

* Change back to tear down

* Change to tear down

* fix comment

* change the logic of --down to use auto-down by default

* Use autodown for --down and address comments

* fix comment

* fix ux

* Add test for cancel

* fix UX

* fix test_smoke

* address comments

* fix

* fix logging and comment

* fix environment variable overwrite

* fix smoke test

* print info
ewzeng pushed a commit to ewzeng/skypilot that referenced this pull request Oct 24, 2022
* Support for autodown

* Change API to terminate

* fix flag

* fix autostop

* fix comment

* address comment

* address comment

* format

* Rename terminate to down

* add smoke test

* fix autodown for multi-node

* format

* fix syntax

* use gcp for autodown test

* fix smoke test

* fix smoke test

* address comments

* Add comment

* Switch back to terminate

* fix comments

* Change back to tear down

* Change to tear down

* fix comment

* change the logic of --down to use auto-down by default

* Use autodown for --down and address comments

* fix comment

* fix ux

* Add test for cancel

* fix UX

* fix test_smoke

* address comments

* fix

* fix logging and comment

* fix environment variable overwrite

* fix smoke test

* print info
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Design auto-down, similar to auto-stop
3 participants