Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Support manually terminating a replica and with purge option #4032

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

andylizf
Copy link
Contributor

@andylizf andylizf commented Oct 4, 2024

Fixes #3135

Continuing the work from #3179, this PR implements the cluster cleanup functionality mentioned in the comments. Main changes:

  • Added a purge: bool parameter to ReplicaManager.scale_down() method
  • When purge=True, the method now terminates the associated cluster in addition to removing the replica record
  • Updated the controller endpoint to utilize this new purge functionality

This addresses the remaining feedback from the previous PR regarding complete cleanup of failed replicas.

Tested:

  • Manual tests with purging failed and ready replicas
  • Verified cluster termination and record removal

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this feature @andylizf ! It would be really helpful. Left some discussion :))

sky/cli.py Outdated Show resolved Hide resolved
@@ -4367,23 +4379,38 @@ def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool):
raise click.UsageError(
'Can only specify one of SERVICE_NAMES or --all. '
f'Provided {argument_str!r}.')
replica_id_is_defined = replica_id is not None
if replica_id_is_defined and len(service_names) != 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets also check the all option is False.

Comment on lines +96 to +101
if replica_info.status not in serve_state.ReplicaStatus.failed_statuses(
):
return {
'message': f'No purging for replica {replica_id} since '
f'the replica does not have a failed status.'
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we allow --purge to terminate a healthy replica as well, to align the semantic of sky serve down?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I thought "purge" means cleaning failed instances in the project.

@@ -4331,9 +4331,15 @@ def serve_status(all: bool, endpoint: bool, service_names: List[str]):
default=False,
required=False,
help='Skip confirmation prompt.')
@click.option('--replica-id',
'-r',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not use this abbreviation as it conflicts with --refresh.

self._replica_manager.scale_down(replica_id)
return {'message': f'Success terminating replica {replica_id}.'}

except Exception as e: # pylint: disable=broad-except
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what kind of error is this except for?

@@ -85,6 +87,37 @@ def _run_autoscaler(self):
logger.error(f' Traceback: {traceback.format_exc()}')
time.sleep(self._autoscaler.get_decision_interval())

def _purge_replica(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems relatively shallow. Should we merge it to its caller?

@@ -325,8 +328,7 @@ def update(
'Service controller is stopped. There is no service to update. '
f'To spin up a new service, use {backend_utils.BOLD}'
f'sky serve up{backend_utils.RESET_BOLD}',
non_existent_message='Service does not exist. '
'To spin up a new service, '
non_existent_message='To spin up a new service, '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change this?

Comment on lines +524 to +525
non_existent_message='To spin up a new service, '
f'use {backend_utils.BOLD}sky serve up{backend_utils.RESET_BOLD}',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
non_existent_message='To spin up a new service, '
f'use {backend_utils.BOLD}sky serve up{backend_utils.RESET_BOLD}',
non_existent_message='No service is running now. Please spin up a service first.',

Comment on lines +182 to +193
replica_id = request_data.get('replica_id')
if replica_id is None:
return {
'code': 400,
'message': 'Error: replica ID is not specified.'
}
purge = request_data.get('purge')
if purge is None:
return {
'code': 400,
'message': 'Error: purge is not specified.'
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can assert those vars is not None as this endpoint is only accessed by our code?

Co-authored-by: Tian Xia <cblmemo@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Serve] Support for terminating one of the replica manually
3 participants