
[Spot] Show FAILED_CONTROLLER when controller exits abnormally #1143

Merged
3 commits merged into master on Sep 1, 2022

Conversation

@Michaelvll (Collaborator) commented Aug 31, 2022

Fixes #1141

Previously, if the controller failed abnormally, the spot status of the job would only be updated when the user ran sky spot cancel <job_id>.

Now we move the status update to the skylet, so that the spot status will be updated automatically.
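
A rough sketch of the idea follows (not the actual skylet event code; the helper names get_nonterminal_spot_jobs, controller_alive, and set_spot_job_failed_controller are hypothetical stand-ins for the real spot-state utilities): the skylet periodically checks whether each non-terminal spot job still has a live controller process, and marks it FAILED_CONTROLLER otherwise.

```python
# Minimal sketch of the mechanism described above, under assumed helper names.
import time

POLL_INTERVAL_SECONDS = 60  # assumed polling period


def update_dead_controllers(get_nonterminal_spot_jobs,
                            controller_alive,
                            set_spot_job_failed_controller):
    """Mark spot jobs whose controller process died as FAILED_CONTROLLER."""
    for job_id in get_nonterminal_spot_jobs():
        if not controller_alive(job_id):
            # The controller exited before reaching a terminal state, so the
            # skylet updates the status instead of waiting for the user to
            # run `sky spot cancel <job_id>`.
            set_spot_job_failed_controller(job_id)


def skylet_event_loop(*callbacks):
    # In the real skylet this would be one of its periodic events.
    while True:
        update_dead_controllers(*callbacks)
        time.sleep(POLL_INTERVAL_SECONDS)
```
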

Tested:

  • sky spot launch -n test-status 'echo hi; sleep 100000'; ssh sky-spot-controller-<hash>; ray job stop --address http://127.0.0.1:8265 <job_id>-ubuntu; then check sky spot status and ~/.sky/skylet.log after a while.


@concretevitamin (Member) left a comment


Thanks for the quick fix @Michaelvll!

Review threads: sky/skylet/skylet.py (resolved); sky/spot/spot_utils.py (4 threads, outdated, resolved)
try:
    backend.teardown(handle, terminate=True)
except RuntimeError:
    logger.error('Failed to tear down the spot cluster '
Member


Should we avoid setting L103 if we hit this termination error, so that potential cluster leakage is easier to spot?

Collaborator Author


I feel that FAILED_CONTROLLER is a stronger alert for the user to check what is going on with the job than leaving the status at an unchanged non-terminal value such as RUNNING. An alternative would be to add a new spot job status, e.g. FAILED_TERMINATION.

Member


My understanding is that

  - FAILED_CONTROLLER: The job failed due to an unexpected error in the spot controller.

suggests the spot job failed, but the cluster could have been leaked if termination fails. Should we treat non-terminal statuses as "cluster potentially alive" and terminal statuses as "cluster definitely down"?
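
To make the suggestion concrete, here is a hypothetical sketch of that semantics (FAILED_TERMINATION does not exist in this PR; it is only the alternative being discussed, and the enum/function names are illustrative): a terminal status is used only when the cluster is known to be down, while a failed teardown keeps the job in a status that signals possible leakage.

```python
# Illustrative sketch of the proposed status semantics, not code from the PR.
import enum


class SpotStatus(enum.Enum):
    RUNNING = 'RUNNING'
    FAILED_CONTROLLER = 'FAILED_CONTROLLER'    # terminal: controller crashed, cluster torn down
    FAILED_TERMINATION = 'FAILED_TERMINATION'  # hypothetical: teardown failed, cluster may be leaked


def status_after_controller_failure(teardown_succeeded: bool) -> SpotStatus:
    if teardown_succeeded:
        return SpotStatus.FAILED_CONTROLLER
    # The cluster may still be alive; surface that instead of a
    # "definitely down" terminal status so leakage is easier to spot.
    return SpotStatus.FAILED_TERMINATION
```
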

if handle is not None:
    backend = backend_utils.get_backend_from_handle(handle)
    try:
        backend.teardown(handle, terminate=True)
Member


Should we add a retry (or refactor this out to share with the normal spot cluster termination)? OK to do it later too.

Collaborator Author


Good point! Added a retry. Will try to refactor it later. :)
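
For reference, a minimal sketch of what such a retry around teardown could look like (the retry count, sleep interval, and helper name are assumptions, not the exact code added in the PR):

```python
# Sketch of a bounded retry around spot cluster teardown.
import time

_MAX_TEARDOWN_RETRIES = 3   # assumed retry count
_RETRY_GAP_SECONDS = 5      # assumed gap between attempts


def teardown_with_retry(backend, handle, logger) -> bool:
    """Try to tear down the spot cluster, retrying on RuntimeError."""
    for attempt in range(_MAX_TEARDOWN_RETRIES):
        try:
            backend.teardown(handle, terminate=True)
            return True
        except RuntimeError:
            logger.error('Failed to tear down the spot cluster '
                         f'(attempt {attempt + 1}/{_MAX_TEARDOWN_RETRIES}).')
            time.sleep(_RETRY_GAP_SECONDS)
    # All attempts failed; the caller decides how to reflect this in the
    # spot job status.
    return False
```
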


@concretevitamin (Member) left a comment


LGTM to merge. We can discuss the status semantics after this PR.

@Michaelvll merged commit b36bbcf into master on Sep 1, 2022
@Michaelvll deleted the populate-failure-to-spot branch on September 1, 2022 17:08
Development

Successfully merging this pull request may close these issues.

[Spot] Populate the FAILED job status to the spot job table