Add host VM - GPU compatibility checks for GCP #989

Merged: 26 commits, Aug 31, 2022

@WoosukKwon (Collaborator) commented Jul 18, 2022

This PR checks compatibility between GCP host VMs and accelerators. For example, GPUs (except A100) can only be attached to N1 machines, and each GPU type limits the number of vCPUs and the amount of CPU memory that its host VM can have. This PR hard-codes this information in gcp_catalog.py and tells users when their requests are invalid.

Tested:

  • sky gpunode --instance-type n1-highmem-16 --gpus K80 -c test (invalid)
  • sky gpunode --instance-type n1-highmem-16 --gpus K80:2 -c test (valid)
  • sky gpunode --instance-type n1-highmem-16 --gpus A10G -c test (invalid)
  • sky gpunode --instance-type a2-highgpu-1g -c test (invalid)
  • sky gpunode --instance-type a2-highgpu-1g --gpus A100:2 -c test (invalid)
  • sky gpunode --instance-type n1-highcpu-16 --gpus A100 -c test (invalid)
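
The vCPU and memory part of the check described above can be sketched roughly as follows. This is a minimal illustration, not the code in gcp_catalog.py: the names MAX_CPU_AND_MEMORY, HOST_SPECS, and check_vm_limits are hypothetical, and the numbers in both tables are assumed values that should be verified against the GCP documentation.

```python
from typing import Dict, Tuple

# Illustrative limits: accelerator -> count -> (max vCPUs, max memory in GB).
# Assumed values for this sketch; verify them against the GCP documentation.
MAX_CPU_AND_MEMORY: Dict[str, Dict[int, Tuple[int, float]]] = {
    'K80': {1: (8, 52), 2: (16, 104), 4: (32, 208), 8: (64, 416)},
    'V100': {1: (12, 78), 2: (24, 156), 4: (48, 312), 8: (96, 624)},
}

# Illustrative host specs: instance type -> (vCPUs, memory in GB).
HOST_SPECS: Dict[str, Tuple[int, float]] = {
    'n1-highmem-8': (8, 52),
    'n1-highmem-16': (16, 104),
}


def check_vm_limits(instance_type: str, acc_name: str, acc_count: int) -> None:
    """Raise ValueError if the host VM exceeds the accelerator's limits."""
    if acc_name not in MAX_CPU_AND_MEMORY:
        raise ValueError(f'{acc_name} is not in the limits table.')
    limits = MAX_CPU_AND_MEMORY[acc_name]
    if acc_count not in limits:
        raise ValueError(f'{acc_name}:{acc_count} is not a supported count.')
    max_vcpus, max_memory = limits[acc_count]
    vcpus, memory = HOST_SPECS[instance_type]
    if vcpus > max_vcpus or memory > max_memory:
        raise ValueError(
            f'{instance_type} ({vcpus} vCPUs, {memory} GB) exceeds the limit '
            f'for {acc_name}:{acc_count} '
            f'({max_vcpus} vCPUs, {max_memory} GB).')
```

Under these assumed limits, n1-highmem-16 with one K80 is rejected and the same host with two K80s is accepted, which matches the first two test commands in the list above.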

@concretevitamin (Member) left a comment

Very nice @WoosukKwon - now users don't need to go through the failover loop observing the HttpError's. Some comments.

tests/test_optimizer_random_dag.py (outdated, resolved)
sky/clouds/service_catalog/__init__.py (outdated, resolved)
sky/clouds/service_catalog/gcp_catalog.py (outdated, resolved)
sky/clouds/service_catalog/gcp_catalog.py (outdated, resolved)
sky/clouds/service_catalog/gcp_catalog.py (outdated, resolved)
@WoosukKwon (Collaborator, Author) commented Jul 30, 2022

@concretevitamin Thanks for your review! While I addressed all of your comments, I found that this PR breaks sky exec and sky launch -c existing-cluster. For existing clusters, we only need to check if the resource request is less demanding than what the cluster has. Thus, the check_host_accelerator_availability function should be called only when a new cluster is launched.

I also found that a similar compatibility check is needed for other clouds, and filed issue #1025.

@concretevitamin (Member) left a comment

Thanks!

@concretevitamin added the "do not merge" label Aug 24, 2022
@WoosukKwon removed the "do not merge" label Aug 29, 2022
@WoosukKwon (Collaborator, Author) commented

@concretevitamin I changed the compatibility check so that it is invoked by the optimizer. This PR no longer breaks sky launch and sky exec on existing clusters. A slight downside of this implementation is that, in sky spot launch, the compatibility check is not made until the spot controller runs the optimizer. I think we can address this in a future PR. PTAL.

@concretevitamin (Member) left a comment

Nice @WoosukKwon! Consider running the smoke tests before merging.

Comment on lines 362 to 367
    # Check maximum vCPUs and memory.
    if acc_name not in _NUM_ACC_TO_MAX_CPU_AND_MEMORY:
        with ux_utils.print_exception_no_traceback():
            raise exceptions.ResourcesUnavailableError(
                f'{acc_name} is not available in GCP. '
                'See \'sky show-gpus --cloud gcp\'')
Member

Should this be assert acc_name in _NUM_ACC_TO_MAX_CPU_AND_MEMORY?

It should have been caught earlier, outside this function. E.g., on this branch:

$ sky launch --cloud gcp --gpus M60 ''
I 08-29 15:02:52 optimizer.py:879] No resource satisfying {'M60': 1} on [GCP].
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd. To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

Collaborator Author

That's a good point. Actually, --cloud gcp --gpus M60 '' and --instance-type n1-highmem-8 --gpus M60 will raise different error messages:

$ sky launch --cloud gcp --gpus M60 ''
I 08-30 22:06:54 optimizer.py:875] No resource satisfying {'M60': 1} on [GCP].
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd. To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.
$ sky launch --instance-type n1-highmem-8 --gpus M60 ''
sky.exceptions.ResourcesUnavailableError: M60 is not available in GCP. See 'sky show-gpus --cloud gcp'

In the first case, the optimizer searches for an instance type itself and finds that GCP does not support M60. In the second case, the optimizer checks whether M60 can be attached to n1-highmem-8, and the new checks added in gcp_catalog find the error. Since the two cases take different paths, the error messages are different.

@WoosukKwon (Collaborator, Author) commented

I changed the implementation substantially. The PR now consists of two new functions: check_host_accelerator_compatibility and check_accelerator_attachable_to_host.

The first function, check_host_accelerator_compatibility, is invoked when Resources objects are created. It only checks that accelerators are used with N1 machines; it does NOT check the accelerator's maximum vCPU count and maximum memory limits, because any Resources such as GCP(n1-highmem-64, {'V100': 0.01}) is allowed for sky exec.

The second function, check_accelerator_attachable_to_host, checks the CPU and memory limits. It is invoked by the optimizer, so sky exec will not execute it.
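
The split between the two checks can be illustrated with the sketch below. It is a simplified model and not the merged code: check_family, check_limits, exec_on_existing_cluster, and launch_new_cluster are hypothetical names, the V100 vCPU caps are assumed values, and the rule that A100 pairs with A2 machines is an assumption added for the sketch.

```python
from typing import Dict, Optional

# Illustrative per-count vCPU caps for V100 (assumed values).
V100_MAX_VCPUS = {1: 12, 2: 24, 4: 48, 8: 96}


def check_family(instance_type: str,
                 accelerators: Optional[Dict[str, float]]) -> None:
    """Stage 1 (request creation): only the machine family is checked."""
    if not accelerators:
        return
    for acc_name in accelerators:
        # Assumption: A100 goes with A2 machines, other GPUs with N1.
        required = 'a2-' if acc_name == 'A100' else 'n1-'
        if not instance_type.startswith(required):
            raise ValueError(
                f'{acc_name} cannot be used with {instance_type}.')


def check_limits(host_vcpus: int, acc_count: float) -> None:
    """Stage 2 (new launches only): enforce the V100 vCPU cap."""
    max_vcpus = V100_MAX_VCPUS.get(acc_count)
    if max_vcpus is None or host_vcpus > max_vcpus:
        raise ValueError(
            f'{host_vcpus} vCPUs not allowed with V100:{acc_count}.')


def exec_on_existing_cluster(instance_type: str,
                             accelerators: Dict[str, float]) -> str:
    # Fractional requests are fine here, so only stage 1 runs.
    check_family(instance_type, accelerators)
    return 'ok'


def launch_new_cluster(instance_type: str, host_vcpus: int,
                       accelerators: Dict[str, float]) -> str:
    # A new VM must satisfy both the family rule and the limits.
    check_family(instance_type, accelerators)
    for acc_count in accelerators.values():
        check_limits(host_vcpus, acc_count)
    return 'ok'
```

In this model, a request like n1-highmem-64 with {'V100': 0.01} passes the exec path because only the family rule applies, while the launch path rejects any host whose vCPU count exceeds the cap for the requested GPU count.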

@concretevitamin Could you please take another look?

@concretevitamin (Member) left a comment

LGTM @WoosukKwon with a minor question. Reminder to rerun smoke tests before merging.

sky/optimizer.py Outdated
@@ -887,4 +883,10 @@ def _fill_in_launchable_resources(
launchable[resources] = _filter_out_blocked_launchable_resources(
    launchable[resources], blocked_launchable_resources)

for r in launchable[resources]:
Member

Q: why move it here, rather than after L849? I was thinking that checking resources in that loop makes more sense, as it represents a validation of the user-requested resources. Here, it may be possible that launchable[resources] has more than 1 "expanded" resource, and throwing an error on these may be unexpected?

Collaborator Author

What do you mean by the "expanded" resources? I thought this check should be applied to every case, as the max cpu and memory limits must be respected to launch an instance on GCP.

Member

I meant that here launchable[resources] may have more than 1 element. Can some of them pass the check while others fail? In such cases it may make sense to remove the candidates that fail rather than raising an error for the whole program.

Collaborator Author

OK. That makes sense. I've rolled back the change.
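
The reviewer's suggestion in this thread, dropping failing candidates instead of aborting, could look roughly like the sketch below. It illustrates the idea only and is not what was merged (the change was rolled back): Candidate, is_attachable, and filter_attachable are hypothetical names, the vCPU caps are assumed values, and raising when no candidate survives is an added assumption.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for one expanded launchable resource."""
    instance_type: str
    vcpus: int
    acc_name: str
    acc_count: int


# Illustrative vCPU caps per accelerator count (assumed values).
MAX_VCPUS = {'V100': {1: 12, 2: 24, 4: 48, 8: 96}}


def is_attachable(c: Candidate) -> bool:
    """Return True if the host's vCPU count is within the assumed cap."""
    cap = MAX_VCPUS.get(c.acc_name, {}).get(c.acc_count)
    return cap is not None and c.vcpus <= cap


def filter_attachable(
        candidates: List[Candidate],
        predicate: Callable[[Candidate], bool] = is_attachable,
) -> List[Candidate]:
    """Keep candidates that pass; fail only if none of them do."""
    kept = [c for c in candidates if predicate(c)]
    if not kept:
        raise ValueError('No candidate satisfies the host VM limits.')
    return kept
```

With this shape, one incompatible candidate among several no longer stops the whole program; an error surfaces only when every candidate fails.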

        clouds: CloudFilter = None) -> None:
-    """GCP only: Check if host VM type is compatible with the accelerators."""
+    """GCP only: Check if host VM type is compatible with the accelerators.
Member

Can we add #989 (comment) to this func and the next func (L207+)? It's a great explanation of why these two funcs are structured this way.

Collaborator Author

Added.

@WoosukKwon (Collaborator, Author) commented

I've checked that this PR does not break any smoke test.

@WoosukKwon (Collaborator, Author) commented

@concretevitamin If you don't have any more concerns about this PR, I'll merge it.

@concretevitamin (Member) commented

Let’s ship it!

@WoosukKwon WoosukKwon merged commit 3c8b5e2 into master Aug 31, 2022
@WoosukKwon WoosukKwon deleted the gcp-host-vm branch August 31, 2022 23:14
@concretevitamin (Member) commented Oct 11, 2022 via email
