[Serve] Support manually terminating a replica and with purge option #4032
Thanks for adding this feature, @andylizf! It would be really helpful. Left some discussion :))
@@ -4367,23 +4379,38 @@ def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool):
        raise click.UsageError(
            'Can only specify one of SERVICE_NAMES or --all. '
            f'Provided {argument_str!r}.')
    replica_id_is_defined = replica_id is not None
    if replica_id_is_defined and len(service_names) != 1:
Let's also check that the `all` option is False.
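A minimal sketch of the check suggested here, written as a standalone helper. The name `validate_replica_args`, the `all_services` parameter name, and the use of `ValueError` are illustrative assumptions; the PR itself raises `click.UsageError` inside `serve_down`.

```python
from typing import List, Optional


def validate_replica_args(service_names: List[str], all_services: bool,
                          replica_id: Optional[int]) -> None:
    """Reject argument combinations that conflict with a replica ID.

    Hypothetical helper: names and error type are illustrative only.
    """
    if replica_id is None:
        # No replica targeted, so nothing extra to validate here.
        return
    if all_services:
        raise ValueError('--replica-id cannot be combined with --all.')
    if len(service_names) != 1:
        raise ValueError('--replica-id requires exactly one service name; '
                         f'got {len(service_names)}.')
```

Checking `all_services` first gives a more specific message than the length check alone, since `--all` is normally passed with no service names.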
if replica_info.status not in serve_state.ReplicaStatus.failed_statuses(
):
    return {
        'message': f'No purging for replica {replica_id} since '
                   f'the replica does not have a failed status.'
    }
Should we allow `--purge` to terminate a healthy replica as well, to align with the semantics of `sky serve down`?
Good point. I thought "purge" meant cleaning up failed instances in the project.
@@ -4331,9 +4331,15 @@ def serve_status(all: bool, endpoint: bool, service_names: List[str]):
              default=False,
              required=False,
              help='Skip confirmation prompt.')
@click.option('--replica-id',
              '-r',
Let's not use this abbreviation, as it conflicts with `--refresh`.
    self._replica_manager.scale_down(replica_id)
    return {'message': f'Success terminating replica {replica_id}.'}

except Exception as e:  # pylint: disable=broad-except
What kind of error is this `except` meant to catch?
@@ -85,6 +87,37 @@ def _run_autoscaler(self):
            logger.error(f' Traceback: {traceback.format_exc()}')
        time.sleep(self._autoscaler.get_decision_interval())

    def _purge_replica(
This function seems relatively shallow. Should we merge it into its caller?
@@ -325,8 +328,7 @@ def update(
         'Service controller is stopped. There is no service to update. '
         f'To spin up a new service, use {backend_utils.BOLD}'
         f'sky serve up{backend_utils.RESET_BOLD}',
-        non_existent_message='Service does not exist. '
-        'To spin up a new service, '
+        non_existent_message='To spin up a new service, '
Why change this?
non_existent_message='To spin up a new service, '
f'use {backend_utils.BOLD}sky serve up{backend_utils.RESET_BOLD}',
- non_existent_message='To spin up a new service, '
- f'use {backend_utils.BOLD}sky serve up{backend_utils.RESET_BOLD}',
+ non_existent_message='No service is running now. Please spin up a service first.',
replica_id = request_data.get('replica_id')
if replica_id is None:
    return {
        'code': 400,
        'message': 'Error: replica ID is not specified.'
    }
purge = request_data.get('purge')
if purge is None:
    return {
        'code': 400,
        'message': 'Error: purge is not specified.'
    }
Can we assert that these variables are not None, since this endpoint is only accessed by our own code?
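A rough sketch of what the assertion-based version might look like, assuming the payload is a plain dict. The helper name `parse_terminate_request` and its return shape are hypothetical and not taken from the PR.

```python
from typing import Any, Dict, Tuple


def parse_terminate_request(request_data: Dict[str, Any]) -> Tuple[int, bool]:
    """Extract replica_id and purge from an internal request payload.

    Hypothetical helper: it treats a missing field as a programming error
    in the caller rather than as a client error worth a 400 response.
    """
    replica_id = request_data.get('replica_id')
    assert replica_id is not None, 'replica_id must be set by the caller'
    purge = request_data.get('purge')
    assert purge is not None, 'purge must be set by the caller'
    return replica_id, purge
```

One trade-off to keep in mind: `assert` statements are skipped when Python runs with `-O`, so this only suits invariants that internal callers are expected to uphold.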
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Fixes #3135
Continuing the work from #3179, this PR implements the cluster cleanup functionality mentioned in the comments. Main changes:
- Added a `purge: bool` parameter to the `ReplicaManager.scale_down()` method
- When `purge=True`, the method now terminates the associated cluster in addition to removing the replica record

This addresses the remaining feedback from the previous PR regarding complete cleanup of failed replicas.
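To illustrate the behaviour described in the list above, here is a toy in-memory sketch. `ToyReplicaManager`, its `records` and `terminated_clusters` attributes, and the default `purge=False` are all assumptions made for illustration; the real `ReplicaManager` talks to actual clusters and persistent state.

```python
from typing import Dict, List


class ToyReplicaManager:
    """In-memory stand-in for a replica manager (hypothetical)."""

    def __init__(self) -> None:
        # Maps replica_id -> name of the cluster backing that replica.
        self.records: Dict[int, str] = {}
        # Stand-in for real teardown: names of clusters we "terminated".
        self.terminated_clusters: List[str] = []

    def scale_down(self, replica_id: int, purge: bool = False) -> None:
        # Always remove the replica record; raises KeyError if unknown.
        cluster = self.records.pop(replica_id)
        if purge:
            # With purge, also clean up the associated cluster.
            self.terminated_clusters.append(cluster)
```

For example, after registering replicas 1 and 2, calling `scale_down(1)` only drops the record, while `scale_down(2, purge=True)` drops the record and also marks the cluster as terminated.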
Tested:
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh