Add host VM - GPU compatibility checks for GCP #989

WoosukKwon · 2022-07-18T06:22:46Z

This PR checks compatibility between GCP host VMs and accelerators. For example, GPUs (except A100) can only be attached to N1 machines, and each GPU limits the number of vCPUs and the amount of CPU memory that its host VM can have. This PR hard-codes this information in gcp_catalog.py and tells users when their requests are invalid.

Tested:

sky gpunode --instance-type n1-highmem-16 --gpus K80 -c test (invalid)
sky gpunode --instance-type n1-highmem-16 --gpus K80:2 -c test (valid)
sky gpunode --instance-type n1-highmem-16 --gpus A10G -c test (invalid)
sky gpunode --instance-type a2-highgpu-1g -c test (invalid)
sky gpunode --instance-type a2-highgpu-1g --gpus A100:2 -c test (invalid)
sky gpunode --instance-type n1-highcpu-16 --gpus A100 -c test (invalid)
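
A minimal sketch of the kind of check described above, not the code in gcp_catalog.py. The table name `MAX_CPU_AND_MEMORY`, the function `check_host_vm`, and the K80 limits are assumptions, chosen only to be consistent with the first two test commands (n1-highmem-16 has 16 vCPUs and 104 GB of memory). The real limits should be taken from GCP's GPU documentation.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: accelerator -> {count: (max vCPUs, max memory in GB)}.
# The K80 numbers are illustrative; verify them against GCP's documentation.
MAX_CPU_AND_MEMORY: Dict[str, Dict[int, Tuple[int, float]]] = {
    'K80': {1: (8, 52), 2: (16, 104), 4: (32, 208), 8: (64, 416)},
}


def check_host_vm(instance_type: str, vcpus: int, memory_gb: float,
                  acc_name: str, acc_count: int) -> Optional[str]:
    """Returns an error message if the request is invalid, else None."""
    if acc_name == 'A100':
        # A100 is the exception: it is not attached to N1 machines.
        if instance_type.startswith('n1-'):
            return 'A100 cannot be attached to N1 machines.'
        return None
    if not instance_type.startswith('n1-'):
        return f'{acc_name} can only be attached to N1 machines.'
    limits = MAX_CPU_AND_MEMORY.get(acc_name, {}).get(acc_count)
    if limits is None:
        return f'{acc_name}:{acc_count} is not in the table.'
    max_vcpus, max_memory_gb = limits
    if vcpus > max_vcpus or memory_gb > max_memory_gb:
        return (f'{instance_type} exceeds the limits for '
                f'{acc_name}:{acc_count} ({max_vcpus} vCPUs, '
                f'{max_memory_gb} GB).')
    return None
```

Under these assumed limits, n1-highmem-16 with one K80 is rejected (16 vCPUs exceeds 8), while the same host with two K80s passes.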

concretevitamin

Very nice @WoosukKwon - now users don't need to go through the failover loop observing the HttpErrors. Some comments.

tests/test_optimizer_random_dag.py

sky/clouds/service_catalog/__init__.py

sky/clouds/service_catalog/gcp_catalog.py

WoosukKwon · 2022-07-30T07:54:16Z

@concretevitamin Thanks for your review! While I addressed all of your comments, I found that this PR breaks sky exec and sky launch -c existing-cluster. For existing clusters, we only need to check if the resource request is less demanding than what the cluster has. Thus, the check_host_accelerator_availability function should be called only when a new cluster is launched.
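
The rule for existing clusters can be illustrated with a small sketch. The `Request` class and the `less_demanding_than` helper here are hypothetical stand-ins, not SkyPilot's Resources API. The point is only that a request against an existing cluster is compared with what the cluster already has, not with GCP's attachment limits.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Request:
    """Hypothetical stand-in for a request or a cluster's resources."""
    instance_type: str
    accelerators: Dict[str, float] = field(default_factory=dict)


def less_demanding_than(request: Request, cluster: Request) -> bool:
    """True if every accelerator asked for fits in what the cluster has."""
    if request.instance_type != cluster.instance_type:
        return False
    return all(
        count <= cluster.accelerators.get(name, 0)
        for name, count in request.accelerators.items())
```

With this comparison, a fractional request such as {'V100': 0.5} is accepted on a cluster that has one V100, even though no such VM could be launched from scratch.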

I found that such a compatibility check is also needed for other clouds and filed the issue #1025.

sky/clouds/service_catalog/gcp_catalog.py

concretevitamin

Thanks!

WoosukKwon · 2022-08-29T21:39:02Z

@concretevitamin I changed the compatibility check so that it is invoked by the optimizer. This PR no longer breaks sky launch and sky exec on existing clusters. A slight downside of this implementation is that in sky spot launch the compatibility check is not made until the spot controller runs the optimizer. I think we can address this in a future PR. PTAL.

concretevitamin

Nice @WoosukKwon! Consider running the smoke tests before merging.

concretevitamin · 2022-08-29T22:03:27Z

sky/clouds/service_catalog/gcp_catalog.py

+    # Check maximum vCPUs and memory.
+    if acc_name not in _NUM_ACC_TO_MAX_CPU_AND_MEMORY:
+        with ux_utils.print_exception_no_traceback():
+            raise exceptions.ResourcesUnavailableError(
+                f'{acc_name} is not available in GCP. '
+                'See \'sky show-gpus --cloud gcp\'')


Should this be assert acc_name in _NUM_ACC_TO_MAX_CPU_AND_MEMORY?

It should've been caught outside. E.g., under this branch

» sky launch --cloud gcp --gpus M60 ''
I 08-29 15:02:52 optimizer.py:879] No resource satisfying {'M60': 1} on [GCP].
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd.
To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

That's a good point. Actually, --cloud gcp --gpus M60 '' and --instance-type n1-highmem-8 --gpus M60 will raise different error messages:

$ sky launch --cloud gcp --gpus M60 ''
I 08-30 22:06:54 optimizer.py:875] No resource satisfying {'M60': 1} on [GCP].
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd.
To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
      'sky check' to check the enabled clouds.

$ sky launch --instance-type n1-highmem-8 --gpus M60 ''
sky.exceptions.ResourcesUnavailableError: M60 is not available in GCP. See 'sky show-gpus --cloud gcp'

In the first case, the optimizer has to choose an instance type itself and finds that GCP does not support M60. In the second case, the optimizer checks whether M60 can be attached to n1-highmem-8, and the new checks added in gcp_catalog find the error. Since the two cases take different paths, the error messages are different.

WoosukKwon · 2022-08-31T05:27:45Z

I changed the implementation substantially. The PR now consists of two new functions: check_host_accelerator_compatibility and check_accelerator_attachable_to_host.

The first function, check_host_accelerator_compatibility, is invoked when Resources objects are created. It only checks that accelerators are used with N1 machines. It does NOT check the accelerator's maximum vCPU count and maximum memory limits, because sky exec must accept any Resources such as GCP(n1-highmem-64, {'V100': 0.01}).

The second function, check_accelerator_attachable_to_host, checks the CPU and memory limits. It is invoked by the optimizer, so sky exec will not execute this function.
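
A rough sketch of this two-stage split, under stated assumptions: the function names below are shortened stand-ins rather than the PR's actual signatures, the `MAX_VCPUS` table and its V100 values are illustrative, and the A100 exception and the memory limit are omitted for brevity.

```python
from typing import Dict, Optional

# Hypothetical per-accelerator vCPU caps, keyed by accelerator count.
# Values are illustrative only.
MAX_VCPUS: Dict[str, Dict[int, int]] = {'V100': {1: 12, 2: 24, 4: 48, 8: 96}}


def check_compatibility(instance_type: str,
                        accelerators: Optional[Dict[str, float]]) -> None:
    """Stage 1 (at Resources creation): host family only, no limits."""
    if not accelerators:
        return
    if not instance_type.startswith('n1-'):
        raise ValueError(f'{accelerators} requires an N1 host VM.')


def check_attachable(vcpus: int,
                     accelerators: Optional[Dict[str, float]]) -> None:
    """Stage 2 (at launch, via the optimizer): enforce the vCPU cap."""
    if not accelerators:
        return
    for name, count in accelerators.items():
        # Fractional counts truncate to 0 and so have no table entry.
        cap = MAX_VCPUS.get(name, {}).get(int(count))
        if cap is None or vcpus > cap:
            raise ValueError(f'{name}:{count} cannot be attached to a '
                             f'{vcpus}-vCPU host.')
```

In this sketch, GCP(n1-highmem-64, {'V100': 0.01}) passes stage 1, which is what sky exec needs, and would only be rejected by stage 2 when a new VM is about to be launched.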

@concretevitamin Could you please take another look?

concretevitamin

LGTM @WoosukKwon with a minor question. Reminder to rerun smoke tests before merging.

concretevitamin · 2022-08-31T05:48:10Z

sky/optimizer.py

@@ -887,4 +883,10 @@ def _fill_in_launchable_resources(
        launchable[resources] = _filter_out_blocked_launchable_resources(
            launchable[resources], blocked_launchable_resources)

+        for r in launchable[resources]:


Q: why move it here, rather than after L849? I was thinking that checking resources in that loop makes more sense, as it represents a validation of the user-requested resources. Here, it may be possible that launchable[resources] has more than one "expanded" resource, and throwing an error on these may be unexpected?

What do you mean by the "expanded" resources? I thought this check should be applied to every case, as the max cpu and memory limits must be respected to launch an instance on GCP.

I meant that here launchable[resources] may have more than one element. Can some of them pass the check while others fail? In those cases it may make sense to remove the candidates that fail rather than raise an error for the whole program.

OK. That makes sense. I've rolled back the change.
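
For illustration, the alternative raised in this thread (dropping only the failing candidates) could look like the sketch below. The thread ends with the change rolled back, so this is not what was merged. `filter_attachable` and its `check` callback are hypothetical names, and the callback is assumed to signal failure by raising ValueError.

```python
from typing import Callable, List, TypeVar

T = TypeVar('T')


def filter_attachable(candidates: List[T],
                      check: Callable[[T], None]) -> List[T]:
    """Keeps candidates for which `check` does not raise ValueError."""
    kept = []
    for candidate in candidates:
        try:
            check(candidate)
        except ValueError:
            # Drop only this candidate; the others may still be launchable.
            continue
        kept.append(candidate)
    return kept
```

The caller would then raise only if the filtered list is empty, instead of failing on the first incompatible candidate.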

concretevitamin · 2022-08-31T05:55:21Z

sky/clouds/service_catalog/__init__.py

                                         clouds: CloudFilter = None) -> None:
-    """GCP only: Check if host VM type is compatible with the accelerators."""
+    """GCP only: Check if host VM type is compatible with the accelerators.


Can we add #989 (comment) to this func and the next func (L207+)? It's a great explanation of why these two funcs are structured this way.

WoosukKwon · 2022-08-31T19:29:20Z

I've checked that this PR does not break any smoke test.

WoosukKwon · 2022-08-31T21:30:59Z

@concretevitamin If you don't have any more concerns about this PR, I'll merge it.

concretevitamin · 2022-08-31T22:57:25Z

Let’s ship it!

WoosukKwon added 5 commits July 17, 2022 23:12

Add host VM - GPU compatibility check for GCP

3f50915

Minor fix

4d1b142

Merge branch 'master' into gcp-host-vm

8319e23

Fix test_spot

9adb053

Fix optimizer test

a5f27e2

concretevitamin reviewed Jul 30, 2022

WoosukKwon added 6 commits July 29, 2022 22:59

Merge branch 'master' into gcp-host-vm

a3fd8d0

Fix optimizer test

6a4b345

Minor bugfix + Add reference URL in error message

0cd3bf9

Minor fix in docstring

e5e5f9d

Get memory size from catalog

135fa71

Add TODO

940827c

concretevitamin reviewed Jul 30, 2022

concretevitamin approved these changes Jul 30, 2022

concretevitamin added the "do not merge" label Aug 24, 2022

WoosukKwon added 6 commits August 28, 2022 22:41

Merge branch 'master' into gcp-host-vm

737cb71

Resolve merge conflicts & Address TODOs

fc23709

Merge branch 'master' into gcp-host-vm

b8e5864

Move compatibility check to optimizer

c9ad421

ValueError -> ResourcesMismatchError

02d4159

Fix TPU error msg

974ad50

WoosukKwon removed the "do not merge" label Aug 29, 2022

Minor bugfix

5040460

WoosukKwon requested a review from concretevitamin August 29, 2022 21:39

concretevitamin approved these changes Aug 29, 2022

WoosukKwon added 3 commits August 29, 2022 18:37

Move compatibility check to resources & Add attachability check in optimizer

cc32bcd

yapf

102f0c9

Consider accelerators == None

b182e76

WoosukKwon added 3 commits August 29, 2022 18:58

Consider accelerators == None

eaa28ff

Merge branch 'master' into gcp-host-vm

b2a19e4

Add comments

309081d

WoosukKwon requested a review from concretevitamin August 31, 2022 05:29

concretevitamin approved these changes Aug 31, 2022

Add comments

50bbd91

Address comments

b90549b

WoosukKwon merged commit 3c8b5e2 into master Aug 31, 2022

WoosukKwon deleted the gcp-host-vm branch August 31, 2022 23:14

WoosukKwon mentioned this pull request Sep 12, 2022

Fix GCP A100 launch error #1166

Merged
