
[Spot] Show FAILED_CONTROLLER when controller exits abnormally #1143

Merged
3 commits merged into master on Sep 1, 2022

Conversation

@Michaelvll (Collaborator) commented Aug 31, 2022

Fixes #1141

Previously, if the controller failed abnormally, the spot status of the job would only be updated when the user ran sky spot cancel <job_id>.

Now we move the status update to the skylet, so that the spot status will be updated automatically.
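
A rough sketch of the idea follows (not the actual skylet event code; the helper names get_nonterminal_spot_jobs, controller_alive, and set_spot_job_failed_controller are hypothetical stand-ins for the real spot-state utilities): the skylet periodically checks whether each non-terminal spot job still has a live controller process, and marks it FAILED_CONTROLLER otherwise.

```python
# Minimal sketch of the mechanism described above, under assumed helper names.
import time

POLL_INTERVAL_SECONDS = 60  # assumed polling period


def update_dead_controllers(get_nonterminal_spot_jobs,
                            controller_alive,
                            set_spot_job_failed_controller):
    """Mark spot jobs whose controller process died as FAILED_CONTROLLER."""
    for job_id in get_nonterminal_spot_jobs():
        if not controller_alive(job_id):
            # The controller exited before reaching a terminal state, so the
            # skylet updates the status instead of waiting for the user to
            # run `sky spot cancel <job_id>`.
            set_spot_job_failed_controller(job_id)


def skylet_event_loop(*callbacks):
    # In the real skylet this would be one of its periodic events.
    while True:
        update_dead_controllers(*callbacks)
        time.sleep(POLL_INTERVAL_SECONDS)
```
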

Tested:

  • sky spot launch -n test-status 'echo hi; sleep 100000'; ssh sky-spot-controller-<hash>; ray job stop --address http://127.0.0.1:8265 <job_id>-ubuntu; then check sky spot status and ~/.sky/skylet.log after a while.


@concretevitamin (Member) left a comment


Thanks for the quick fix @Michaelvll!

Review threads: sky/skylet/skylet.py (resolved); sky/spot/spot_utils.py (4 threads, outdated, resolved)
try:
    backend.teardown(handle, terminate=True)
except RuntimeError:
    logger.error('Failed to tear down the spot cluster '
Member


Should we avoid setting L103 if we hit this termination error, so that potential cluster leakage is easier to spot?

Collaborator Author


I feel that FAILED_CONTROLLER is a stronger alert for the user to check what is going on with the job than leaving the status at an unchanged non-terminal value such as RUNNING. An alternative would be to add a new spot job status, e.g. FAILED_TERMINATION.

Member


My understanding is that

  - FAILED_CONTROLLER: The job failed due to an unexpected error in the spot controller.

suggests the spot job failed, but the cluster could have been leaked if termination fails. Should we treat non-terminal statuses as "cluster potentially alive" and terminal statuses as "cluster definitely down"?
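
To make the suggestion concrete, here is a hypothetical sketch of that semantics (FAILED_TERMINATION does not exist in this PR; it is only the alternative being discussed, and the enum/function names are illustrative): a terminal status is used only when the cluster is known to be down, while a failed teardown keeps the job in a status that signals possible leakage.

```python
# Illustrative sketch of the proposed status semantics, not code from the PR.
import enum


class SpotStatus(enum.Enum):
    RUNNING = 'RUNNING'
    FAILED_CONTROLLER = 'FAILED_CONTROLLER'    # terminal: controller crashed, cluster torn down
    FAILED_TERMINATION = 'FAILED_TERMINATION'  # hypothetical: teardown failed, cluster may be leaked


def status_after_controller_failure(teardown_succeeded: bool) -> SpotStatus:
    if teardown_succeeded:
        return SpotStatus.FAILED_CONTROLLER
    # The cluster may still be alive; surface that instead of a
    # "definitely down" terminal status so leakage is easier to spot.
    return SpotStatus.FAILED_TERMINATION
```
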

if handle is not None:
    backend = backend_utils.get_backend_from_handle(handle)
    try:
        backend.teardown(handle, terminate=True)
Member


Should we add a retry (or refactor this out to share with the normal spot cluster termination)? OK to do it later too.

Collaborator Author


Good point! Added a retry. Will try to refactor it later. :)
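
For reference, a minimal sketch of what such a retry around teardown could look like (the retry count, sleep interval, and helper name are assumptions, not the exact code added in the PR):

```python
# Sketch of a bounded retry around spot cluster teardown.
import time

_MAX_TEARDOWN_RETRIES = 3   # assumed retry count
_RETRY_GAP_SECONDS = 5      # assumed gap between attempts


def teardown_with_retry(backend, handle, logger) -> bool:
    """Try to tear down the spot cluster, retrying on RuntimeError."""
    for attempt in range(_MAX_TEARDOWN_RETRIES):
        try:
            backend.teardown(handle, terminate=True)
            return True
        except RuntimeError:
            logger.error('Failed to tear down the spot cluster '
                         f'(attempt {attempt + 1}/{_MAX_TEARDOWN_RETRIES}).')
            time.sleep(_RETRY_GAP_SECONDS)
    # All attempts failed; the caller decides how to reflect this in the
    # spot job status.
    return False
```
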


@concretevitamin (Member) left a comment


LGTM to merge. We can discuss the status semantics after this PR.

@Michaelvll merged commit b36bbcf into master on Sep 1, 2022
@Michaelvll deleted the populate-failure-to-spot branch on September 1, 2022 17:08
Development

Successfully merging this pull request may close these issues.

[Spot] Populate the FAILED job status to the spot job table