Hotfix for spot TPU pod recovery #1470
Changes from all commits
```diff
@@ -2621,9 +2621,12 @@ def teardown_no_lock(self,
                 # check if gcloud includes TPU VM API
                 backend_utils.check_gcp_cli_include_tpu_vm()

+                # Excluding preempted VMs is safe as they are already
+                # terminated and do not charge.
                 query_cmd = (
                     f'gcloud compute tpus tpu-vm list --filter='
-                    f'\\(labels.ray-cluster-name={cluster_name}\\) '
+                    f'"(labels.ray-cluster-name={cluster_name} AND '
+                    f'state!=PREEMPTED)" '
                     f'--zone={zone} --format=value\\(name\\)')
                 terminate_cmd = (
                     f'gcloud compute tpus tpu-vm delete --zone={zone}'
```

Reviewer: What happens to preempted TPUs? Is it true that they require no further cleanup actions?

Author: Good point, I just updated some comments. According to the reference, a TPU VM's disk is not persistent unless it is manually specified. However, I agree we should try to clean up the preempted VMs if possible, which requires changing the logic in the spot controller to tear down the VM first. For now I just wanted to ship this to Wilson asap; I'll create another PR to fix this.
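For context, here is a minimal standalone sketch of what the patched query and teardown amount to. The `subprocess` wiring and the placeholder `cluster_name`/`zone` values are illustrative assumptions, not SkyPilot's actual teardown code; only the two `gcloud` commands come from the diff above.

```python
# Hedged sketch: run the patched TPU VM query and delete the results.
# `cluster_name` and `zone` are placeholders; the subprocess wiring is
# an assumption for illustration, not SkyPilot's teardown path.
import subprocess

cluster_name = 'sky-spot-tpu-pod'  # placeholder
zone = 'us-central1-b'             # placeholder

# Preempted TPU VMs are excluded by `state!=PREEMPTED`: GCP has already
# terminated them and they no longer bill, so there is nothing to delete.
query_cmd = (f'gcloud compute tpus tpu-vm list '
             f'--filter="(labels.ray-cluster-name={cluster_name} AND '
             f'state!=PREEMPTED)" '
             f'--zone={zone} --format="value(name)"')
result = subprocess.run(query_cmd, shell=True, capture_output=True,
                        text=True, check=True)
for name in result.stdout.split():
    # One delete per remaining (non-preempted) TPU VM in the cluster.
    subprocess.run(f'gcloud compute tpus tpu-vm delete {name} '
                   f'--zone={zone} --quiet', shell=True, check=True)
```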
Reviewer: I think this makes sense. A minor thing that may not be good to fix in this PR: at L1150 we directly return the IPs without checking whether the number of nodes in the cluster matches the expected number, as we do at L1209. Can we raise exceptions.FetchIPError(exceptions.FetchIPError.Reason.HEAD) instead of handling it separately below?

skypilot/sky/backends/backend_utils.py, lines 1620 to 1622 in 2abcde3

Also, get_node_ips checks that the Ray cluster is running correctly on the cluster. Is it possible for a TPU VM to have multiple nodes? If so, we may want to check the health of the Ray cluster as well.
Author: Ah yes, we should make the behavior consistent; part of the issue is related to #1185. I'll make sure that PR fixes this and merge it asap.
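A hedged sketch of the check the reviewer is suggesting. The helper `_validate_tpu_vm_ips` and its surroundings are hypothetical; only exceptions.FetchIPError and its Reason.HEAD value are taken from the comment above, and the exact constructor signature is an assumption.

```python
# Hedged sketch of the reviewer's suggestion: validate the fetched IPs
# against the expected node count and raise FetchIPError, instead of
# returning early and handling the mismatch separately below.
# `_validate_tpu_vm_ips` is a hypothetical helper, not SkyPilot code.
from typing import List

from sky import exceptions


def _validate_tpu_vm_ips(ips: List[str],
                         expected_num_nodes: int) -> List[str]:
    if len(ips) != expected_num_nodes:
        # Same signal as the multi-node path at L1209, so callers can
        # retry or fail uniformly instead of special-casing this path.
        raise exceptions.FetchIPError(
            exceptions.FetchIPError.Reason.HEAD)
    return ips
```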