
[TPU/Spot] TPU pods fail to be launched after preempted #1468

Closed
Michaelvll opened this issue Nov 29, 2022 · 3 comments · Fixed by #1470
Assignees
Labels
bug Something isn't working P0

Comments

@Michaelvll
Collaborator

Our user reported that when a TPU pod is preempted and the spot controller tries to launch it again, the launch fails.
(screenshot of the failure attached)

For additional context: recovery works fine if I manually delete the TPU instances. The spot controller then has no trouble detecting the preemption and creating and running a new instance. The only case where it fails is when the TPU instance is preempted (it goes into a red state on the TPU dashboard). Perhaps the old preempted instance is not being deleted properly?

@infwinston
Member

infwinston commented Dec 8, 2022

According to our user, the bug still exists.
Reason: during preemption, we expect GCP to change the VM state from READY to PREEMPTED, as described in the documentation.
However, this does not always happen. GCP sometimes sets the state to something other than PREEMPTED, which makes SkyPilot treat the cluster as being in the INIT state and fail to clean up the resources.

...
12-08 04:43:16 controller.py:118] Cluster is preempted (status: INIT). Recovering...
12-08 04:43:16 spot_state.py:134] === Recovering... ===
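The failure mode described above can be sketched as a state-classification problem: only an expected terminal state (PREEMPTED) can be recovered cleanly, and anything unexpected falls through to INIT. The snippet below is a hypothetical illustration of that logic; the function and state names are illustrative, not SkyPilot's actual code.

```python
# Hypothetical sketch: why an unexpected GCP instance state leaves the
# cluster stuck in INIT. Names are illustrative, not SkyPilot's real code.

def classify_cluster(instance_statuses):
    """Map raw GCP instance states to a coarse cluster status.

    Only all-READY maps to UP, and PREEMPTED maps to a recoverable
    PREEMPTED status. Any unexpected state (e.g. STOPPING) falls through
    to INIT, which the spot controller cannot recover from cleanly.
    """
    if all(s == 'READY' for s in instance_statuses):
        return 'UP'
    if any(s == 'PREEMPTED' for s in instance_statuses):
        return 'PREEMPTED'
    # GCP reported something unexpected: the controller sees INIT
    # and fails to clean up the leftover resources.
    return 'INIT'
```

For example, `classify_cluster(['STOPPING'])` returns `'INIT'`, matching the log line `Cluster is preempted (status: INIT)` above.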

@infwinston infwinston reopened this Dec 8, 2022
@infwinston
Member

#1500 proposes a solution
https://github.com/skypilot-org/skypilot/pull/1500/files#diff-6749e0638b4e0e0bf9e5b2e0be361b6394a82458a729e81c2ff2ca6dcd6a653aR315
that explicitly terminates the cluster before launching another one, since refreshing the status (`sky status -r`) may not detect the preemption.
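The terminate-before-relaunch order proposed above can be sketched as follows. This is a minimal illustration of the recovery ordering only; `terminate` and `launch` are injected stand-ins for the real provisioning calls, not SkyPilot's actual implementation.

```python
# Hypothetical sketch of the fix direction in #1500: unconditionally tear
# down the old cluster before relaunching, instead of relying on a status
# refresh to detect and clean up the preempted instances.

def recover(cluster, terminate, launch):
    """Recover a preempted cluster by terminating it first.

    `terminate` and `launch` are injected callables standing in for the
    real provisioning logic; only the ordering matters here.
    """
    terminate(cluster)      # remove any half-dead TPU instances first
    return launch(cluster)  # then launch fresh on a clean slate
```

Terminating first avoids the INIT dead end entirely: even if GCP reported an unexpected state, the stale instances are gone before the new launch begins.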

@infwinston
Member

Fixed by #1500.
