-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add host VM - GPU compatibility checks for GCP #989
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice @WoosukKwon - now users don't need to go through the failover loop observing the HttpError's. Some comments.
@concretevitamin Thanks for your review! While I addressed all of your comments, I found that this PR breaks I found that such a compatibility check is also needed for other clouds and filed the issue #1025. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@concretevitamin I made the compatibility check invoked by the optimizer. Now this PR does not break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice @WoosukKwon! Consider running the smoke tests before merging.
# Check maximum vCPUs and memory. | ||
if acc_name not in _NUM_ACC_TO_MAX_CPU_AND_MEMORY: | ||
with ux_utils.print_exception_no_traceback(): | ||
raise exceptions.ResourcesUnavailableError( | ||
f'{acc_name} is not available in GCP. ' | ||
'See \'sky show-gpus --cloud gcp\'') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be assert acc_name in _NUM_ACC_TO_MAX_CPU_AND_MEMORY
?
It should've been caught outside. E.g., under this branch
» sky launch --cloud gcp --gpus M60 '' 1 ↵
I 08-29 15:02:52 optimizer.py:879] No resource satisfying {'M60': 1} on [GCP].
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd. To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
'sky check' to check the enabled clouds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Actually, --cloud gcp --gpus M60 ''
and --instance-type n1-highmem-8 --gpus M60
will raise different error messages:
$ sky launch --cloud gcp --gpus M60 ''
I 08-30 22:06:54 optimizer.py:875] No resource satisfying {'M60': 1} on [GCP].
sky.exceptions.ResourcesUnavailableError: No launchable resource found for task sky-cmd. To fix: relax its resource requirements.
Hint: 'sky show-gpus --all' to list available accelerators.
'sky check' to check the enabled clouds.
$ sky launch --instance-type n1-highmem-8 --gpus M60 ''
sky.exceptions.ResourcesUnavailableError: M60 is not available in GCP. See 'sky show-gpus --cloud gcp'
In the first case, the optimizer asks itself which instance to choose, and finds that GCP does not support M60. On the other hand, in the second case, the optimizer checks whether M60 can be attached to n1-highmem-8, and the new checks added in gcp_catalog
finds the error. Since the two cases take different paths, the error messages are different.
I changed the implementation substantially. The PR now consists of two new functions The first The second @concretevitamin Could you please take another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @WoosukKwon with a minor question. Reminder to rerun smoke tests before merging.
sky/optimizer.py
Outdated
@@ -887,4 +883,10 @@ def _fill_in_launchable_resources( | |||
launchable[resources] = _filter_out_blocked_launchable_resources( | |||
launchable[resources], blocked_launchable_resources) | |||
|
|||
for r in launchable[resources]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: why move it to here, rather than after L849? Was thinking checking resources
in that loop makes more sense, as it represents a validation of the user-requested resources. Here, it may be possible than launchable[resources]
has more than 1 "expanded" resources, and throwing an error on these may be unexpected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by the "expanded" resources? I thought this check should be applied to every case, as the max cpu and memory limits must be respected to launch an instance on GCP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that here launchable[resources] may have more than 1 element, - can some of them pass the check, while some fail? In these cases it may make sense to remove the candidates that fail rather than raising an error to the whole program.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. That makes sense. I've rolled back the change.
clouds: CloudFilter = None) -> None: | ||
"""GCP only: Check if host VM type is compatible with the accelerators.""" | ||
"""GCP only: Check if host VM type is compatible with the accelerators. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add #989 (comment) to this func and the next func (L207+)? It's great explanation on why these two funcs are structured this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
I've checked that this PR does not break any smoke test. |
@concretevitamin If you don't have any more concern about this PR, I'll merge it. |
Let’s ship it! |
Let’s ship it!
…On Wed, Aug 31, 2022 at 14:31 Woosuk Kwon ***@***.***> wrote:
@concretevitamin <https://github.com/concretevitamin> If you don't have
any more concern about this PR, I'll merge it.
—
Reply to this email directly, view it on GitHub
<#989 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEQWHUJYT5XIY3CKBCLAH3V37FJ5ANCNFSM533DAXVQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
This PR checks compatibility between GCP host VMs and accelerators. For example, GPUs (except A100) can be only attached to N1 machines, and each GPU has limitations on the number of vCPUs and amount of CPU memory that its host VM can have. This PR hard-codes such information in
gcp_catalog.py
and lets users know when their requests are invalid.Tested:
sky gpunode --instance-type n1-highmem-16 --gpus K80 -c test
(invalid)sky gpunode --instance-type n1-highmem-16 --gpus K80:2 -c test
(valid)sky gpunode --instance-type n1-highmem-16 --gpus A10G -c test
(invalid)sky gpunode --instance-type a2-highgpu-1g -c test
(invalid)sky gpunode --instance-type a2-highgpu-1g --gpus A100:2 -c test
(invalid)sky gpunode --instance-type n1-highcpu-16 --gpus A100 -c test
(invalid)