Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCP: make sky check enable the 3 required APIs & optional TPU API. #1197

Merged
merged 3 commits into from
Oct 7, 2022

Conversation

concretevitamin
Copy link
Member

@concretevitamin concretevitamin commented Oct 5, 2022

Fixes #1019 by making sky check automatically enable the 3 required APIs & optional TPU API.

Tested:

Create a new project on GCP console.

Run gcloud init to choose the new project. Run gcloud auth application-default login.

sky cpunode --cloud gcp would trigger a hint:

Screen Shot 2022-10-04 at 18 15 38

First run

» sky check                                                                              
Checking credentials to enable clouds for SkyPilot.
  AWS: enabled
  Azure: enabled
  Checking GCP...
Enabling Compute Engine API (may take a minute)...
Done. Took 54.8 secs.

Enabling Cloud Resource Manager API...
Done. Took 1.8 secs.

Enabling Identity and Access Management (IAM) API...
Done. Took 4.2 secs.
Hint: Enabled GCP API(s) may take a few minutes to take effect. If any SkyPilot commands/calls failed, retry after some time.
  GCP: enabled

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.

At this point, after a few minutes sky cpunode --cloud gcp -c dbg -y succeeds.

Second run

» sky check
Checking credentials to enable clouds for SkyPilot.
  AWS: enabled
  Azure: enabled
  GCP: enabled

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.

If we disabled the APIs (gcloud services disable compute.googleapis.com cloudresourcemanager.googleapis.com iam.googleapis.com) just a few seconds a go, sky check would correctly show an error (due to GCP error) and keep GCP as disabled:

» sky check                                                                               
Checking credentials to enable clouds for SkyPilot.
  AWS: enabled
  Azure: enabled
  Checking GCP...
Enabling Compute Engine API (may take a minutes)...
Failed; please manually enable the API. Log:
ERROR: (gcloud.services.enable) The operation "operations/acf.p2-1059686784451-81d94d84-7747-4115-84c3-9d9b2ee49c89" resulted in a failure "[Error in service 'compute.googleapis.com': The service is currently being deactivated and deactivation must complete before activation can occur.
Help Token: AWzfkCOcwCQn6uEAOnI6vTCyA_X0NHZTLZaeW0q52SKVF5u5WH-WPmn-6IuTHB1pL6-7wxM84Noij7OEP2SXhM-grw1HwfW_QwUfd1QXzDPr-7TU] with failed services [compute.googleapis.com]".
Details: "[<DetailsValueListEntry
 additionalProperties: [<AdditionalProperty
 key: '@type'
 value: <JsonValue
 string_value: 'type.googleapis.com/google.rpc.PreconditionFailure'>>, <AdditionalProperty
 key: 'violations'
 value: <JsonValue
 array_value: <JsonArray
 entries: [<JsonValue
 object_value: <JsonObject
 properties: [<Property
 key: 'type'
 value: <JsonValue
 string_value: 'googleapis.com'>>, <Property
 key: 'subject'
 value: <JsonValue
 string_value: '160008'>>]>>]>>>]>]".

  GCP: disabled
    Reason: Compute Engine API is disabled. Please retry `sky check` in a few minutes, or manually enable it.

SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.

Copy link
Member

@infwinston infwinston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this and all the careful tests! Left some minor comments.

apis = (
('compute', 'Compute Engine'),
('cloudresourcemanager', 'Cloud Resource Manager'),
('iam', 'Identity and Access Management (IAM)'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. I verified that a new project doesn't have tpu enabled. After sky check, a sky tpunode -y will be stuck forever in

....
I 10-06 11:39:18 cloud_vm_ray_backend.py:954] Provisioning TPU on GCP us-central1 (us-central1-b)
⠴ Provisioning TPU sky-tpunode-zongheng

Then, manually running

> .../google-cloud-sdk-372.0.0-0/lib/gcloud.py compute tpus create sky-tpunode-zongheng --zone=us-central1-b --version=2.5.0 --accelerator-type=v2-8

API [tpu.googleapis.com] not enabled on project [929236553705]. Would you like to enable and retry (this will take a few minutes)? (y/N)?

we see this prompt is why it's stuck.

Changed sky check to enable tpu.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for verifying this!

suffix = ' (may take a minute)'
else:
suffix = ''
print(f'\nEnabling {display_name} API{suffix}...')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we say this doesn't incur any cost? as last time a Stanford user was concerning about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, done.

f'{dim}{match.group(3)}{reset}')
sys.exit(1)
else:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to raise Runtime error with more details printed if different error happens? also to catch possible message pattern change in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note (1) this statement will already print details of the exception, and (2) there's a @retry decorator on this func, which will be retried 3 times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah now I see. thanks for clarification.

# Ex:
# 'Compute Engine API has not been used in project 123456 before
# or it is disabled. Enable it by visiting
# https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=123456
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use

https://console.developers.google.com/apis/api/compute.googleapis.com/overview

without project=123456. GCP will redirect us to the correct project. with 123456 it'd say no such project.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example message meant for reminding us developers. Changed the comment a bit to be clearer.


try:
project = compute.projects().get(project=project_id).execute()
except googleapiclient.errors.HttpError as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to add another except at the end to catch other unexpected errors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should catch blindly. We should let those unexpected errors surface so that we know what's up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure that makes sense. was thinking about printing some additional msgs but I don't think it's necessary as users can't easily fix this themselves.

Copy link
Member Author

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @infwinston. PTAL.

Tested on a new project:

» sky tpunode -y

Enabling Compute Engine API (free of charge; this may take a minute)...
Done. Took 57.6 secs.

Enabling Cloud Resource Manager API (free of charge)...
Done. Took 1.9 secs.

Enabling Identity and Access Management (IAM) API (free of charge)...
Done. Took 4.9 secs.

Enabling Cloud TPU API (free of charge)...
Done. Took 14.5 secs.

Hint: Enabled GCP API(s) may take a few minutes to take effect. If any SkyPilot commands/calls failed, retry after some time.
I 10-06 12:09:28 optimizer.py:605] == Optimizer ==
I 10-06 12:09:28 optimizer.py:628] Estimated cost: $5.0 / hour
...
  • TPU related smoke tests

apis = (
('compute', 'Compute Engine'),
('cloudresourcemanager', 'Cloud Resource Manager'),
('iam', 'Identity and Access Management (IAM)'),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. I verified that a new project doesn't have tpu enabled. After sky check, a sky tpunode -y will be stuck forever in

....
I 10-06 11:39:18 cloud_vm_ray_backend.py:954] Provisioning TPU on GCP us-central1 (us-central1-b)
⠴ Provisioning TPU sky-tpunode-zongheng

Then, manually running

> .../google-cloud-sdk-372.0.0-0/lib/gcloud.py compute tpus create sky-tpunode-zongheng --zone=us-central1-b --version=2.5.0 --accelerator-type=v2-8

API [tpu.googleapis.com] not enabled on project [929236553705]. Would you like to enable and retry (this will take a few minutes)? (y/N)?

we see this prompt is why it's stuck.

Changed sky check to enable tpu.

suffix = ' (may take a minute)'
else:
suffix = ''
print(f'\nEnabling {display_name} API{suffix}...')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, done.

f'{dim}{match.group(3)}{reset}')
sys.exit(1)
else:
raise
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note (1) this statement will already print details of the exception, and (2) there's a @retry decorator on this func, which will be retried 3 times.

# Ex:
# 'Compute Engine API has not been used in project 123456 before
# or it is disabled. Enable it by visiting
# https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=123456
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example message meant for reminding us developers. Changed the comment a bit to be clearer.


try:
project = compute.projects().get(project=project_id).execute()
except googleapiclient.errors.HttpError as e:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should catch blindly. We should let those unexpected errors surface so that we know what's up.

@concretevitamin concretevitamin changed the title GCP: make sky check automatically enable the 3 required APIs. GCP: make sky check enable the 3 required APIs & optional TPU API. Oct 6, 2022
Copy link
Member

@infwinston infwinston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now. thanks!

f'{dim}{match.group(3)}{reset}')
sys.exit(1)
else:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah now I see. thanks for clarification.

apis = (
('compute', 'Compute Engine'),
('cloudresourcemanager', 'Cloud Resource Manager'),
('iam', 'Identity and Access Management (IAM)'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for verifying this!


try:
project = compute.projects().get(project=project_id).execute()
except googleapiclient.errors.HttpError as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure that makes sense. was thinking about printing some additional msgs but I don't think it's necessary as users can't easily fix this themselves.

@concretevitamin concretevitamin merged commit bbfa0de into master Oct 7, 2022
@concretevitamin concretevitamin deleted the merge-local branch October 7, 2022 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UX: handle GCP APIs "not been used" messages gracefully
2 participants