[Spot] Make get_job_timestamp fetching more robust #1148

Merged: 2 commits, Sep 1, 2022

@Michaelvll Michaelvll commented Sep 1, 2022

Previously, start_at for each job defaulted to NULL. When skylet sets a job to FAILED, the get_job_timestamp call below fails because None cannot be converted to a float. start_at now defaults to -1, so a numeric value is always available.

launch_time = spot_utils.get_job_timestamp(self.backend,

The problem was caught by @concretevitamin in spot jobs:

(zongheng-107 pid=467225) I 09-01 18:39:30 recovery_strategy.py:147] Failed to launch the spot cluster with error: Command rsync -Pavz --filter='dir-merge,- .gitignore' -e "ssh -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ConnectTimeout=30s -o ForwardAgent=yes -o ControlMaster=auto -o ControlPath=/tmp/skypilot_ssh_ubuntu/985053b910/%C -o ControlPersist=300s" /tmp/sky_app_gvkun05j gcpuser@34.162.255.162:~/.sky/sky_app/sky_job_1 failed with return code 3.
(zongheng-107 pid=467225) I 09-01 18:39:30 recovery_strategy.py:147] Failed to rsync up: /tmp/sky_app_gvkun05j -> ~/.sky/sky_app/sky_job_1

ValueError: could not convert string to float: 'None\n'
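
The failure above can be reproduced in isolation. The following is a minimal sketch, not the code in sky/skylet/job_lib.py: the one-column `jobs` table and the helpers `fetch_start_at` and `parse_timestamp` are hypothetical stand-ins, and the assumption is that the timestamp is printed on the remote side and parsed from text by the caller.

```python
import sqlite3


def fetch_start_at(default_clause: str) -> str:
    """Return the text printed for a job whose start time was never recorded."""
    conn = sqlite3.connect(':memory:')
    # Hypothetical, simplified schema (assumption); only the default matters here.
    conn.execute('CREATE TABLE jobs ('
                 'job_id INTEGER PRIMARY KEY, '
                 f'start_at FLOAT {default_clause})')
    # Insert a job without setting start_at, as for a job that never started.
    conn.execute('INSERT INTO jobs (job_id) VALUES (1)')
    (start_at,) = conn.execute(
        'SELECT start_at FROM jobs WHERE job_id = 1').fetchone()
    conn.close()
    # Mimic a value that is printed remotely and read back as stdout.
    return f'{start_at}\n'


def parse_timestamp(stdout: str) -> float:
    """Parse the printed timestamp; raises ValueError on 'None\\n'."""
    return float(stdout)


if __name__ == '__main__':
    try:
        parse_timestamp(fetch_start_at(''))  # NULL default
    except ValueError as e:
        print(f'NULL default fails: {e}')
    # With a numeric default the parse succeeds.
    print(parse_timestamp(fetch_start_at('DEFAULT -1')))
```

With the NULL default the column reads back as None, which is printed as the string 'None' and cannot be parsed; with DEFAULT -1 the parse always succeeds.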

Note: log_utils.readable_time_duration already treats start_at < 0 the same way as start_at being None:

if start is None or start < 0:
    return '-'
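
To show how the guard above makes the -1 sentinel harmless for display, here is a small sketch. `readable_duration_sketch` is a hypothetical stand-in written for illustration; its signature and output format are assumptions and do not reproduce log_utils.readable_time_duration.

```python
import time
from typing import Optional


def readable_duration_sketch(start: Optional[float],
                             end: Optional[float] = None) -> str:
    """Format the elapsed time between start and end (hypothetical helper)."""
    # A job that never started is shown as '-' whether its start time is
    # stored as None (old default) or as the -1 sentinel (new default).
    if start is None or start < 0:
        return '-'
    if end is None:
        end = time.time()
    seconds = max(int(end - start), 0)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f'{hours}h {minutes}m {seconds}s'


if __name__ == '__main__':
    print(readable_duration_sketch(None))
    print(readable_duration_sketch(-1))
    print(readable_duration_sketch(0, 3661))
```

Because both cases take the same early return, changing the stored default from NULL to -1 does not change what is displayed for a job that never started.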

@concretevitamin concretevitamin left a comment

Great catch!

sky/skylet/job_lib.py (review thread resolved)
@Michaelvll Michaelvll merged commit 39cfaa7 into master Sep 1, 2022
@Michaelvll Michaelvll deleted the fix-job-start-at branch September 1, 2022 22:53