[Spot] Show FAILED_CONTROLLER when controller exits abnormally #1143
Conversation
Thanks for the quick fix @Michaelvll!
sky/spot/spot_utils.py
Outdated
try:
    backend.teardown(handle, terminate=True)
except RuntimeError:
    logger.error('Failed to tear down the spot cluster '
Should we avoid setting L103 if we hit this termination error, so that potential leakage is easier to spot?
I feel like FAILED_CONTROLLER is a stronger signal for the user to check what is going on with the job than an unchanged non-terminal status such as RUNNING. An alternative would be adding a new spot job status, e.g. FAILED_TERMINATION.
My understanding is that
- FAILED_CONTROLLER: The job failed due to an unexpected error in the spot controller.

suggests the spot job failed, but the cluster could have leaked if termination failed. Should we treat non-terminal statuses as "cluster potentially alive" and terminal statuses as "cluster definitely down"?
sky/spot/spot_utils.py
Outdated
if handle is not None:
    backend = backend_utils.get_backend_from_handle(handle)
    try:
        backend.teardown(handle, terminate=True)
Should we add a retry (or something to be refactored out together with the normal spot cluster termination)? OK to do it later too.
Good point! Added a retry. Will try to refactor it later. :)
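The retry discussed here can be sketched as a small loop around the teardown call. This is only an illustrative sketch, not the PR's actual code: `terminate_with_retry`, the retry count, and the backoff are assumptions.

```python
import time

def terminate_with_retry(backend, handle, max_retries=3):
    """Try to tear down the spot cluster, retrying on RuntimeError.

    `backend`/`handle` mirror the objects in the snippet above; the
    retry count and backoff are illustrative values, not the PR's.
    Returns True on success, False if all attempts failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            backend.teardown(handle, terminate=True)
            return True
        except RuntimeError:
            # Termination can fail transiently (e.g. cloud API errors);
            # back off briefly before the next attempt.
            if attempt < max_retries:
                time.sleep(0.5 * attempt)
    return False
```

A False return would then be the point to surface the potential leakage (e.g. via the status semantics discussed above).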
LGTM to merge. We can discuss the status semantics after this PR.
Fixes #1141
Previously, if the controller failed abnormally, the spot status of the job would only be updated when the user ran `sky spot cancel <job_id>`. Now we move the status update to the skylet, so the spot status is updated automatically.
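The skylet-side update described above can be sketched as a periodic check that marks jobs whose controller process has died. This is a hypothetical sketch: `SpotStatus`, `update_failed_controller_jobs`, and the dict-based state are illustrative stand-ins, not SkyPilot's actual internals.

```python
import enum

class SpotStatus(enum.Enum):
    # Simplified stand-in for SkyPilot's spot job statuses.
    RUNNING = 'RUNNING'
    FAILED_CONTROLLER = 'FAILED_CONTROLLER'

def update_failed_controller_jobs(jobs, controller_alive):
    """Mark non-terminal jobs as FAILED_CONTROLLER when their
    controller process has exited abnormally.

    `jobs` maps job_id -> SpotStatus; `controller_alive` maps
    job_id -> bool (whether its controller process still runs).
    Both are simplified stand-ins for the spot-state tables.
    """
    updated = {}
    for job_id, status in jobs.items():
        if status is SpotStatus.RUNNING and not controller_alive.get(job_id, False):
            # Controller died while the job was non-terminal: alert the user.
            updated[job_id] = SpotStatus.FAILED_CONTROLLER
        else:
            updated[job_id] = status
    return updated
```

Run periodically by the skylet, this removes the need for the user to trigger the update via `sky spot cancel`.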
Tested:
- `sky spot launch -n test-status 'echo hi; sleep 100000'`
- `ssh sky-spot-controller-<hash>`
- `ray job stop --address http://127.0.0.1:8265 <job_id>-ubuntu`
- Check the `sky spot status` and the `~/.sky/skylet.log` after a while.