[Serve] Support manually terminating a replica and with purge option #4032

andylizf · 2024-10-04T04:55:09Z

Continuing the work from #3179, this PR implements the cluster cleanup functionality mentioned in the comments. Main changes:

Added a purge: bool parameter to ReplicaManager.scale_down() method
When purge=True, the method now terminates the associated cluster in addition to removing the replica record
Updated the controller endpoint to utilize this new purge functionality

This addresses the remaining feedback from the previous PR regarding complete cleanup of failed replicas.
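
For readers skimming the description, the effect of `purge` can be sketched roughly as follows. This is a toy illustration that reads the bullet points above literally, not the code in this PR: `ToyReplicaManager`, its in-memory `records` dict, and `_terminate_cluster` are hypothetical stand-ins for the real `ReplicaManager` internals, and the actual behavior without `purge` may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyReplicaManager:
    """Hypothetical stand-in for ReplicaManager; not SkyPilot code."""
    # replica_id -> cluster name (plays the role of the replica record).
    records: Dict[int, str] = field(default_factory=dict)
    # Names of clusters terminated so far, kept only for inspection.
    terminated: List[str] = field(default_factory=list)

    def _terminate_cluster(self, cluster_name: str) -> None:
        # Assumption: the real code would tear down cloud resources here.
        self.terminated.append(cluster_name)

    def scale_down(self, replica_id: int, purge: bool = False) -> None:
        cluster_name = self.records[replica_id]
        if purge:
            # With purge=True the associated cluster is terminated as well.
            self._terminate_cluster(cluster_name)
        # In both cases the replica record is removed.
        del self.records[replica_id]


manager = ToyReplicaManager(records={1: 'svc-1', 2: 'svc-2'})
manager.scale_down(1)
manager.scale_down(2, purge=True)
```

After both calls the toy manager holds no records, and only the cluster of replica 2 has been terminated.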

Tested:

Manual tests with purging failed and ready replicas
Verified cluster termination and record removal

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

cblmemo

Thanks for adding this feature, @andylizf! It will be really helpful. I left some comments for discussion :))

cblmemo · 2024-10-04T18:11:37Z

sky/cli.py

@@ -4367,23 +4379,38 @@ def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool):
        raise click.UsageError(
            'Can only specify one of SERVICE_NAMES or --all. '
            f'Provided {argument_str!r}.')
+    replica_id_is_defined = replica_id is not None
+    if replica_id_is_defined and len(service_names) != 1:


Let's also check that the all option is False.
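
One way to express the combined check is sketched below, without click so that it runs standalone. The function name `validate_serve_down_args`, the `all_services` parameter name, the error messages, and the use of `ValueError` instead of `click.UsageError` are assumptions for illustration; only `service_names`, `all`, and `replica_id` come from the diff.

```python
from typing import List, Optional


def validate_serve_down_args(service_names: List[str], all_services: bool,
                             replica_id: Optional[int]) -> None:
    """Raise ValueError on an invalid combination of arguments."""
    if replica_id is None:
        # Nothing replica-specific to validate.
        return
    # A replica ID only makes sense for exactly one named service.
    if len(service_names) != 1:
        raise ValueError(
            'The replica ID option requires exactly one service name.')
    # It also cannot be combined with the all option.
    if all_services:
        raise ValueError(
            'The replica ID option cannot be used together with all.')
```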

cblmemo · 2024-10-04T18:15:08Z

sky/serve/controller.py

+        if replica_info.status not in serve_state.ReplicaStatus.failed_statuses(
+        ):
+            return {
+                'message': f'No purging for replica {replica_id} since '
+                           f'the replica does not have a failed status.'
+            }


Should we allow --purge to terminate a healthy replica as well, to align with the semantics of sky serve down?

andylizf · 2024-10-04

Good point. I thought "purge" meant cleaning up failed instances in the project.

cblmemo · 2024-10-04T18:22:15Z

sky/cli.py

@@ -4331,9 +4331,15 @@ def serve_status(all: bool, endpoint: bool, service_names: List[str]):
              default=False,
              required=False,
              help='Skip confirmation prompt.')
+@click.option('--replica-id',
+              '-r',


Let's not use this abbreviation, as it conflicts with --refresh.

cblmemo · 2024-10-04T18:24:42Z

sky/serve/controller.py

+                self._replica_manager.scale_down(replica_id)
+                return {'message': f'Success terminating replica {replica_id}.'}
+
+            except Exception as e:  # pylint: disable=broad-except


What kind of error is this except meant to catch?

cblmemo · 2024-10-04T18:25:01Z

sky/serve/controller.py

@@ -85,6 +87,37 @@ def _run_autoscaler(self):
                    logger.error(f'  Traceback: {traceback.format_exc()}')
            time.sleep(self._autoscaler.get_decision_interval())

+    def _purge_replica(


This function seems relatively shallow. Should we merge it into its caller?

cblmemo · 2024-10-04T18:25:21Z

sky/serve/core.py

@@ -325,8 +328,7 @@ def update(
        'Service controller is stopped. There is no service to update. '
        f'To spin up a new service, use {backend_utils.BOLD}'
        f'sky serve up{backend_utils.RESET_BOLD}',
-        non_existent_message='Service does not exist. '
-        'To spin up a new service, '
+        non_existent_message='To spin up a new service, '


Why change this?

cblmemo · 2024-10-04T18:26:46Z

sky/serve/core.py

+        non_existent_message='To spin up a new service, '
+        f'use {backend_utils.BOLD}sky serve up{backend_utils.RESET_BOLD}',


Suggested change

-        non_existent_message='To spin up a new service, '
-        f'use {backend_utils.BOLD}sky serve up{backend_utils.RESET_BOLD}',
+        non_existent_message='No service is running now. Please spin up a service first.',

cblmemo · 2024-10-04T18:29:09Z

sky/serve/controller.py

+                replica_id = request_data.get('replica_id')
+                if replica_id is None:
+                    return {
+                        'code': 400,
+                        'message': 'Error: replica ID is not specified.'
+                    }
+                purge = request_data.get('purge')
+                if purge is None:
+                    return {
+                        'code': 400,
+                        'message': 'Error: purge is not specified.'
+                    }


Can we assert that those variables are not None, since this endpoint is only accessed by our own code?
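
A minimal sketch of the assertion-based variant is shown below. The helper name `parse_terminate_request` and the assertion messages are hypothetical; the keys `replica_id` and `purge` are taken from the diff above.

```python
from typing import Any, Dict, Tuple


def parse_terminate_request(request_data: Dict[str, Any]) -> Tuple[int, bool]:
    """Extract replica_id and purge, assuming a trusted internal caller."""
    replica_id = request_data.get('replica_id')
    assert replica_id is not None, 'replica_id must be set by the caller.'
    purge = request_data.get('purge')
    assert purge is not None, 'purge must be set by the caller.'
    return replica_id, purge
```

One trade-off to keep in mind: assert statements are skipped when Python runs with the -O flag, so this form only suits invariants that internal callers are expected to uphold.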

David Tran and others added 21 commits February 17, 2024 23:27

define replica id param in cli

0819a2c

create endpoint on controller

a66a28b

call controller endpoint to scale down replica

c74765d

Merge branch 'master' into serve/manually-terminate-replica

cfbe884

add classmethod decorator

fc516fb

add handler methods for readability in cli

2786c25

update docstr and error msg, and inline in cli

1799c2f

update log and return err msg

41e6389

add docstr, catch and reraise err, add stopped and nonexistent message

82626ee

inline constant to avoid circular import

27eff42

fix error statement and return encoded str

53c0f32

add purge feature

b838b1f

add purge replica usage in docstr

10d43af

use .get to handle unexpected packages

f4acaa7

Merge branch 'master' into serve/manually-terminate-replica

65db0b8

fix: diff terminate replica when failed/purging or not

8f1d7e5

fix: stay up to date for is_controller_accessible

39719f6

revert

003ba21

up to date with current APIs

3555459

error handling

9a87084

when purged remove record in the main loop

1512cba

cblmemo reviewed Oct 4, 2024

Update sky/cli.py

d708a82

Co-authored-by: Tian Xia <cblmemo@gmail.com>
