[Autodown] Support for autodown #1217

Merged on Oct 14, 2022 (38 commits)
Changes from 30 commits
Commits
c54ef6c
Support for autodown
Michaelvll Oct 10, 2022
10f41e6
Change API to terminate
Michaelvll Oct 10, 2022
420510f
fix flag
Michaelvll Oct 10, 2022
9a741aa
fix autostop
Michaelvll Oct 10, 2022
f057f3d
fix comment
Michaelvll Oct 10, 2022
b646909
address comment
Michaelvll Oct 10, 2022
a6ccef8
address comment
Michaelvll Oct 10, 2022
0660eb1
format
Michaelvll Oct 10, 2022
2a45671
Rename terminate to down
Michaelvll Oct 10, 2022
d89fe93
add smoke test
Michaelvll Oct 11, 2022
c2a4c4a
fix autodown for multi-node
Michaelvll Oct 11, 2022
6354619
format
Michaelvll Oct 11, 2022
18bc534
fix syntax
Michaelvll Oct 11, 2022
8658d3e
use gcp for autodown test
Michaelvll Oct 11, 2022
8280c9d
fix smoke test
Michaelvll Oct 11, 2022
5a08c84
fix smoke test
Michaelvll Oct 11, 2022
59ced1d
address comments
Michaelvll Oct 12, 2022
f3b357e
Add comment
Michaelvll Oct 12, 2022
5198b1a
Switch back to terminate
Michaelvll Oct 12, 2022
bce99fc
fix comments
Michaelvll Oct 12, 2022
c214028
Change back to tear down
Michaelvll Oct 12, 2022
ccdd792
Change to tear down
Michaelvll Oct 12, 2022
5425c21
fix comment
Michaelvll Oct 12, 2022
7e309b4
change the logic of --down to use auto-down by default
Michaelvll Oct 12, 2022
b625173
Use autodown for --down and address comments
Michaelvll Oct 13, 2022
306671d
fix comment
Michaelvll Oct 13, 2022
5aff9e4
fix ux
Michaelvll Oct 13, 2022
2cda239
Add test for cancel
Michaelvll Oct 13, 2022
787ac90
fix UX
Michaelvll Oct 13, 2022
b7596b7
fix test_smoke
Michaelvll Oct 13, 2022
e34f88e
address comments
Michaelvll Oct 14, 2022
faee1a0
fix
Michaelvll Oct 14, 2022
e012373
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Oct 14, 2022
f653c84
fix logging and comment
Michaelvll Oct 14, 2022
ca57e69
Merge branch 'master' of github.com:concretevitamin/sky-experiments i…
Michaelvll Oct 14, 2022
57343b8
fix environment variable overwrite
Michaelvll Oct 14, 2022
1d32197
fix smoke test
Michaelvll Oct 14, 2022
83dbdff
print info
Michaelvll Oct 14, 2022
4 changes: 3 additions & 1 deletion sky/backends/backend_utils.py
@@ -1589,7 +1589,9 @@ def _update_cluster_status_no_lock(
backend.set_autostop(handle, -1, stream_logs=False)
except (Exception, SystemExit): # pylint: disable=broad-except
logger.debug('Failed to reset autostop.')
-global_user_state.set_cluster_autostop_value(handle.cluster_name, -1)
+global_user_state.set_cluster_autostop_value(handle.cluster_name,
+                                             -1,
+                                             to_down=False)

# If the user starts part of a STOPPED cluster, we still need a status to
# represent the abnormal status. For spot cluster, it can also represent
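
The hunk above resets the recorded autostop value to -1 and clears the new `to_down` flag in the same call. The body of `global_user_state.set_cluster_autostop_value` is not part of this diff, so the following is only a minimal sketch of how such a record could be kept, assuming an SQLite table. The table layout and the helper names `create_state_db` and `get_cluster_autostop` are hypothetical, and the function here is a stand-in, not the project's implementation.

```python
import sqlite3
from typing import Tuple


def create_state_db() -> sqlite3.Connection:
    """Create an in-memory table (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE clusters ('
                 'name TEXT PRIMARY KEY, '
                 'autostop INTEGER DEFAULT -1, '
                 'to_down INTEGER DEFAULT 0)')
    return conn


def set_cluster_autostop_value(conn: sqlite3.Connection, cluster_name: str,
                               idle_minutes: int, to_down: bool) -> None:
    """Stand-in: store the idle timeout and the tear-down flag together."""
    conn.execute(
        'INSERT OR REPLACE INTO clusters (name, autostop, to_down) '
        'VALUES (?, ?, ?)', (cluster_name, idle_minutes, int(to_down)))
    conn.commit()


def get_cluster_autostop(conn: sqlite3.Connection,
                         cluster_name: str) -> Tuple[int, bool]:
    """Return (idle_minutes, to_down) for a cluster (hypothetical helper)."""
    row = conn.execute(
        'SELECT autostop, to_down FROM clusters WHERE name = ?',
        (cluster_name,)).fetchone()
    if row is None:
        raise KeyError(cluster_name)
    return row[0], bool(row[1])
```

Writing both values in one statement means a reset cannot leave a stale tear-down flag behind next to a disabled timeout, which seems to be the reason the reset passes `to_down=False` explicitly.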
22 changes: 19 additions & 3 deletions sky/backends/cloud_vm_ray_backend.py
@@ -1116,6 +1116,7 @@ def ray_up():
# different order from directly running in the console. The
# `--log-style` and `--log-color` flags do not work. To reproduce,
# `ray up --log-style pretty --log-color true | tee tmp.out`.
+
returncode, stdout, stderr = log_lib.run_with_log(
# NOTE: --no-restart solves the following bug. Without it, if
# 'ray up' (sky launch) twice on a cluster with >1 node, the
@@ -1133,7 +1134,15 @@ def ray_up():
line_processor=log_utils.RayUpLineProcessor(),
# Reduce BOTO_MAX_RETRIES from 12 to 5 to avoid long hanging
# time during 'ray up' if insufficient capacity occurs.
-env=dict(os.environ, BOTO_MAX_RETRIES='5'),
+env=dict(
+    BOTO_MAX_RETRIES='5',
+    # Use environment variables to disable the ray usage stats
+    # (to avoid the 10 second wait for usage collection
+    # confirmation), as the ray version on the user's machine
+    # may be lower version that does not support the
+    # `--disable-usage-stats` flag.
+    RAY_USAGE_STATS_ENABLED='0',
+    **os.environ),
require_outputs=True,
# Disable stdin to avoid ray outputs mess up the terminal with
# misaligned output when multithreading/multiprocessing are used
@@ -1333,10 +1342,16 @@ def _ensure_cluster_ray_started(self,
'of the local cluster. Check if ray[default]==1.13.0 '
'is installed or running correctly.')
backend.run_on_head(handle, 'ray stop', use_cached_head_ip=False)
+
log_lib.run_with_log(
['ray', 'up', '-y', '--restart-only', handle.cluster_yaml],
log_abs_path,
stream_logs=False,
+# Use environment variables to disable the ray usage collection
+# (avoid the 10 second wait for usage collection confirmation),
+# as the ray version on the user's machine may be lower version
+# that does not support the `--disable-usage-stats` flag.
+env=dict(RAY_USAGE_STATS_ENABLED='0', **os.environ),
# Disable stdin to avoid ray outputs mess up the terminal with
# misaligned output when multithreading/multiprocessing is used.
# Refer to: https://github.com/ray-project/ray/blob/d462172be7c5779abf37609aed08af112a533e1e/python/ray/autoscaler/_private/subprocess_output_util.py#L264 # pylint: disable=line-too-long
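
A note on the two `env=dict(..., **os.environ)` calls in the hunks above, which reflect the first 30 commits only. In Python, `dict(KEY='x', **mapping)` raises `TypeError` when `mapping` already contains `KEY`, so this form fails if the user's shell already exports `BOTO_MAX_RETRIES` or `RAY_USAGE_STATS_ENABLED`. The later commit 57343b8 ("fix environment variable overwrite") may address this, but its diff is not shown here. The sketch below shows the pitfall and one collision-safe way to merge; `build_ray_env` and `naive_ray_env` are hypothetical names, not functions from the PR.

```python
import os
from typing import Dict, Mapping, Optional


def naive_ray_env(base: Mapping[str, str]) -> Dict[str, str]:
    """Mirrors the pattern in the diff; fails on duplicate keys."""
    return dict(BOTO_MAX_RETRIES='5', RAY_USAGE_STATS_ENABLED='0', **base)


def build_ray_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the base environment, then apply overrides so they always win."""
    env = dict(os.environ if base is None else base)
    env.update({
        # Fewer retries to avoid a long hang on insufficient capacity.
        'BOTO_MAX_RETRIES': '5',
        # Disable Ray usage stats without relying on a CLI flag.
        'RAY_USAGE_STATS_ENABLED': '0',
    })
    return env


if __name__ == '__main__':
    preset = {'PATH': '/usr/bin', 'BOTO_MAX_RETRIES': '12'}
    try:
        naive_ray_env(preset)
    except TypeError as e:
        print(f'naive merge failed: {e}')
    print(build_ray_env(preset))
```

The original `dict(os.environ, BOTO_MAX_RETRIES='5')` form, with the mapping passed positionally, also avoids the collision, because keyword arguments override keys from the positional mapping.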
@@ -2608,10 +2623,11 @@ def post_teardown_cleanup(self,
def set_autostop(self,
handle: ResourceHandle,
idle_minutes_to_autostop: Optional[int],
+down: bool = False,
stream_logs: bool = True) -> None:
if idle_minutes_to_autostop is not None:
code = autostop_lib.AutostopCodeGen.set_autostop(
-idle_minutes_to_autostop, self.NAME)
+idle_minutes_to_autostop, self.NAME, down)
returncode, _, stderr = self.run_on_head(handle,
code,
require_outputs=True,
@@ -2622,7 +2638,7 @@ def set_autostop(self,
stderr=stderr,
stream_logs=stream_logs)
global_user_state.set_cluster_autostop_value(
-handle.cluster_name, idle_minutes_to_autostop)
+handle.cluster_name, idle_minutes_to_autostop, down)

# TODO(zhwu): Refactor this to a CommandRunner class, so different backends
# can support its own command runner.
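
In the `set_autostop` hunk above, the new `down` flag is threaded through two places: the code string generated for the head node and the local state record. `AutostopCodeGen` itself is not part of this diff, so the following is only a rough sketch of the general code-generation pattern, assuming the generated command is a quoted `python3 -c` one-liner. The function `build_set_autostop_command` and the payload it prints are hypothetical.

```python
import shlex
from typing import Optional


def build_set_autostop_command(idle_minutes: Optional[int],
                               backend_name: str,
                               down: bool) -> Optional[str]:
    """Return a shell command carrying the autostop settings, or None.

    Hypothetical stand-in for a code generator: the real generated code
    would update the autostop config on the head node rather than print.
    """
    if idle_minutes is None:
        # Mirrors the `is not None` guard in the diff: nothing to run.
        return None
    program = ('import json; '
               f'print(json.dumps(dict(idle_minutes={idle_minutes!r}, '
               f'backend={backend_name!r}, down={down!r})))')
    # Quote the whole program so it survives the remote shell intact.
    return f'python3 -u -c {shlex.quote(program)}'


if __name__ == '__main__':
    print(build_set_autostop_command(10, 'cloudvmray', down=True))
```

Embedding values with `!r` keeps `True`/`False` and strings valid as Python literals inside the generated program, and `shlex.quote` keeps the embedded quotes from being interpreted by the shell.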