Hotfix for spot TPU pod recovery #1470

Merged 2 commits on Nov 30, 2022
13 changes: 10 additions & 3 deletions sky/backends/backend_utils.py
@@ -1220,8 +1220,11 @@ def _get_tpu_vm_pod_ips(ray_config: Dict[str, Any],

     cluster_name = ray_config['cluster_name']
     zone = ray_config['provider']['availability_zone']
+    # Excluding preempted VMs is safe as they are already terminated and
+    # do not charge.
     query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
-                 f'\\(labels.ray-cluster-name={cluster_name}\\) '
+                 f'"(labels.ray-cluster-name={cluster_name} AND '
+                 f'state!=PREEMPTED)" '
                  f'--zone={zone} --format=value\\(name\\)')
Comment on lines +1226 to 1228
Collaborator:
I think this makes sense.

A minor thing that may not be good to fix in this PR: in L1150 we directly return the IPs without checking whether the number of nodes in the cluster matches the expected number, as we do in L1209. Can we raise exceptions.FetchIPError(reason=exceptions.FetchIPError.Reason.HEAD) instead of handling it separately below?

if len(ips) == 0:
    raise exceptions.FetchIPError(
        reason=exceptions.FetchIPError.Reason.HEAD)

Also, get_node_ips checks that the ray cluster is running correctly on the cluster. Is it possible for a TPU VM to have multiple nodes? If so, we may want to check the health of the ray cluster as well.
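
A minimal sketch of the consistency check suggested in this comment. It assumes a `FetchIPError.Reason.WORKER` value exists alongside `HEAD`, and `expected_num_nodes` and `_check_fetched_ips` are hypothetical names used only for illustration:

```python
from sky import exceptions


def _check_fetched_ips(ips, expected_num_nodes):
    """Sketch: validate fetched IPs instead of handling mismatches separately."""
    if len(ips) == 0:
        # Same behavior as the existing check: no head IP could be fetched.
        raise exceptions.FetchIPError(
            reason=exceptions.FetchIPError.Reason.HEAD)
    if len(ips) != expected_num_nodes:
        # Assumed: a WORKER reason is available for partially fetched clusters.
        raise exceptions.FetchIPError(
            reason=exceptions.FetchIPError.Reason.WORKER)
    return ips
```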

@infwinston (Member, Author), Nov 30, 2022:
Ah yes, we should make the behavior consistent; part of the issue is related to #1185.
I'll make sure that PR fixes this and merge it asap.

     if not get_internal_ips:
         tpuvm_cmd = (f'gcloud compute tpus tpu-vm describe $({query_cmd})'
@@ -1242,10 +1245,14 @@ def _get_tpu_vm_pod_ips(ray_config: Dict[str, Any],
                            '**** STDOUT ****\n'
                            '{stdout}\n'
                            '**** STDERR ****\n'
-                           '{stderr}')
+                           '{stderr}\n'
+                           '**** CMD ****\n'
+                           '{tpuvm_cmd}')
         with ux_utils.print_exception_no_traceback():
             raise RuntimeError(
-                failure_massage.format(stdout=stdout, stderr=stderr))
+                failure_massage.format(stdout=stdout,
+                                       stderr=stderr,
+                                       tpuvm_cmd=tpuvm_cmd))
     all_ips = re.findall(IP_ADDR_REGEX, stdout)
     return all_ips

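As a quick illustration (not part of the PR), here is what the updated query_cmd evaluates to, using placeholder values for cluster_name and zone; the added state!=PREEMPTED clause drops preempted TPU VMs from the listing:

```python
# Placeholder values for illustration only.
cluster_name = 'sky-tpu-demo'
zone = 'us-central1-b'

query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
             f'"(labels.ray-cluster-name={cluster_name} AND '
             f'state!=PREEMPTED)" '
             f'--zone={zone} --format=value\\(name\\)')
print(query_cmd)
# Printed on one line:
#   gcloud compute tpus tpu-vm list
#   --filter="(labels.ray-cluster-name=sky-tpu-demo AND state!=PREEMPTED)"
#   --zone=us-central1-b --format=value\(name\)
```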
5 changes: 4 additions & 1 deletion sky/backends/cloud_vm_ray_backend.py
@@ -2621,9 +2621,12 @@ def teardown_no_lock(self,
                     # check if gcloud includes TPU VM API
                     backend_utils.check_gcp_cli_include_tpu_vm()

+                    # Excluding preempted VMs is safe as they are already
+                    # terminated and do not charge.
                     query_cmd = (
                         f'gcloud compute tpus tpu-vm list --filter='
-                        f'\\(labels.ray-cluster-name={cluster_name}\\) '
+                        f'"(labels.ray-cluster-name={cluster_name} AND '
+                        f'state!=PREEMPTED)" '
Member:
What happens to preempted TPUs? Is it true that they require no further cleanup actions -- i.e., gcloud compute tpus tpu-vm delete -- from us (e.g., disks)? Worth a comment!

Member Author:
Good point, I just updated some comments.

According to the reference, a TPU VM's disk is not persistent unless manually specified.
So a preempted TPU VM will cost zero and should be safe to ignore from a cost perspective. (IIRC, GCP will clean them up after a while.)

However, I agree we should try to clean up the preempted VMs if possible, which requires changing the spot controller logic to tear down the VM first and then sky launch again.

For now I just wanted to ship to wilson asap. I'll create another PR to fix this.

                         f'--zone={zone} --format=value\\(name\\)')
                     terminate_cmd = (
                         f'gcloud compute tpus tpu-vm delete --zone={zone}'
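The rest of terminate_cmd is truncated above, so the following composition is an assumption modeled on the $({query_cmd}) substitution used for the describe call earlier in this PR: the teardown presumably feeds the filtered listing into the delete command. The --quiet flag and the placeholder cluster_name and zone are illustrative, not taken from the PR:

```python
# Hedged sketch, not necessarily the PR's exact code: delete whatever
# (non-preempted) TPU VMs the filtered query returns.
cluster_name = 'sky-tpu-demo'   # placeholder
zone = 'us-central1-b'          # placeholder

query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
             f'"(labels.ray-cluster-name={cluster_name} AND '
             f'state!=PREEMPTED)" '
             f'--zone={zone} --format=value\\(name\\)')
# Assumed composition: shell command substitution passes the TPU names
# returned by query_cmd as arguments to the delete command.
terminate_cmd = (f'gcloud compute tpus tpu-vm delete --zone={zone}'
                 f' --quiet $({query_cmd})')
print(terminate_cmd)
```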