-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GCP: make sky check
enable the 3 required APIs & optional TPU API.
#1197
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this and all the careful tests! Left some minor comments.
apis = ( | ||
('compute', 'Compute Engine'), | ||
('cloudresourcemanager', 'Cloud Resource Manager'), | ||
('iam', 'Identity and Access Management (IAM)'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also enable TPU API for users?
https://console.cloud.google.com/apis/library/tpu.googleapis.com?_ga=2.67630993.1652250969.1665079665-1061309080.1630272145
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch. I verified that a new project doesn't have tpu enabled. After sky check
, a sky tpunode -y
will be stuck forever in
....
I 10-06 11:39:18 cloud_vm_ray_backend.py:954] Provisioning TPU on GCP us-central1 (us-central1-b)
⠴ Provisioning TPU sky-tpunode-zongheng
Then, manually running
> .../google-cloud-sdk-372.0.0-0/lib/gcloud.py compute tpus create sky-tpunode-zongheng --zone=us-central1-b --version=2.5.0 --accelerator-type=v2-8
API [tpu.googleapis.com] not enabled on project [929236553705]. Would you like to enable and retry (this will take a few minutes)? (y/N)?
we see this prompt is why it's stuck.
Changed sky check
to enable tpu.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks for verifying this!
suffix = ' (may take a minute)' | ||
else: | ||
suffix = '' | ||
print(f'\nEnabling {display_name} API{suffix}...') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we say this doesn't incur any cost? as last time a Stanford user was concerning about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, done.
f'{dim}{match.group(3)}{reset}') | ||
sys.exit(1) | ||
else: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to raise Runtime error with more details printed if different error happens? also to catch possible message pattern change in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note (1) this statement will already print details of the exception, and (2) there's a @retry
decorator on this func, which will be retried 3 times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah now I see. thanks for clarification.
# Ex: | ||
# 'Compute Engine API has not been used in project 123456 before | ||
# or it is disabled. Enable it by visiting | ||
# https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=123456 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use
https://console.developers.google.com/apis/api/compute.googleapis.com/overview
without project=123456
. GCP will redirect us to the correct project. with 123456
it'd say no such project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example message meant for reminding us developers. Changed the comment a bit to be clearer.
|
||
try: | ||
project = compute.projects().get(project=project_id).execute() | ||
except googleapiclient.errors.HttpError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to add another except
at the end to catch other unexpected errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should catch blindly. We should let those unexpected errors surface so that we know what's up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure that makes sense. was thinking about printing some additional msgs but I don't think it's necessary as users can't easily fix this themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @infwinston. PTAL.
Tested on a new project:
» sky tpunode -y
Enabling Compute Engine API (free of charge; this may take a minute)...
Done. Took 57.6 secs.
Enabling Cloud Resource Manager API (free of charge)...
Done. Took 1.9 secs.
Enabling Identity and Access Management (IAM) API (free of charge)...
Done. Took 4.9 secs.
Enabling Cloud TPU API (free of charge)...
Done. Took 14.5 secs.
Hint: Enabled GCP API(s) may take a few minutes to take effect. If any SkyPilot commands/calls failed, retry after some time.
I 10-06 12:09:28 optimizer.py:605] == Optimizer ==
I 10-06 12:09:28 optimizer.py:628] Estimated cost: $5.0 / hour
...
- TPU related smoke tests
apis = ( | ||
('compute', 'Compute Engine'), | ||
('cloudresourcemanager', 'Cloud Resource Manager'), | ||
('iam', 'Identity and Access Management (IAM)'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch. I verified that a new project doesn't have tpu enabled. After sky check
, a sky tpunode -y
will be stuck forever in
....
I 10-06 11:39:18 cloud_vm_ray_backend.py:954] Provisioning TPU on GCP us-central1 (us-central1-b)
⠴ Provisioning TPU sky-tpunode-zongheng
Then, manually running
> .../google-cloud-sdk-372.0.0-0/lib/gcloud.py compute tpus create sky-tpunode-zongheng --zone=us-central1-b --version=2.5.0 --accelerator-type=v2-8
API [tpu.googleapis.com] not enabled on project [929236553705]. Would you like to enable and retry (this will take a few minutes)? (y/N)?
we see this prompt is why it's stuck.
Changed sky check
to enable tpu.
suffix = ' (may take a minute)' | ||
else: | ||
suffix = '' | ||
print(f'\nEnabling {display_name} API{suffix}...') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, done.
f'{dim}{match.group(3)}{reset}') | ||
sys.exit(1) | ||
else: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note (1) this statement will already print details of the exception, and (2) there's a @retry
decorator on this func, which will be retried 3 times.
# Ex: | ||
# 'Compute Engine API has not been used in project 123456 before | ||
# or it is disabled. Enable it by visiting | ||
# https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=123456 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example message meant for reminding us developers. Changed the comment a bit to be clearer.
|
||
try: | ||
project = compute.projects().get(project=project_id).execute() | ||
except googleapiclient.errors.HttpError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should catch blindly. We should let those unexpected errors surface so that we know what's up.
sky check
automatically enable the 3 required APIs.sky check
enable the 3 required APIs & optional TPU API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now. thanks!
f'{dim}{match.group(3)}{reset}') | ||
sys.exit(1) | ||
else: | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah now I see. thanks for clarification.
apis = ( | ||
('compute', 'Compute Engine'), | ||
('cloudresourcemanager', 'Cloud Resource Manager'), | ||
('iam', 'Identity and Access Management (IAM)'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thanks for verifying this!
|
||
try: | ||
project = compute.projects().get(project=project_id).execute() | ||
except googleapiclient.errors.HttpError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure that makes sense. was thinking about printing some additional msgs but I don't think it's necessary as users can't easily fix this themselves.
Fixes #1019 by making
sky check
automatically enable the 3 required APIs & optional TPU API.Tested:
Create a new project on GCP console.
Run
gcloud init
to choose the new project. Rungcloud auth application-default login
.sky cpunode --cloud gcp
would trigger a hint:First run
At this point, after a few minutes
sky cpunode --cloud gcp -c dbg -y
succeeds.Second run
If we disabled the APIs (
gcloud services disable compute.googleapis.com cloudresourcemanager.googleapis.com iam.googleapis.com
) just a few seconds a go,sky check
would correctly show an error (due to GCP error) and keep GCP as disabled: