Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCP: make sky check enable the 3 required APIs & optional TPU API. #1197

Merged
merged 3 commits into from
Oct 7, 2022
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 36 additions & 3 deletions sky/authentication.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,13 +4,17 @@
import hashlib
import os
import pathlib
import re
import subprocess
import sys
import time
import uuid

import colorama
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.hazmat.backends import default_backend
import googleapiclient

from sky import clouds
from sky import sky_logging
Expand All @@ -19,8 +23,7 @@
from sky.utils import subprocess_utils
from sky.utils import ux_utils

logger = sky_logging.init_logger(__name__)

colorama.init()
logger = sky_logging.init_logger(__name__)

# TODO: Should tolerate if gcloud is not installed. Also,
Expand Down Expand Up @@ -195,7 +198,37 @@ def setup_gcp_authentication(config):
credentials=None,
cache_discovery=False)
user = config['auth']['ssh_user']
project = compute.projects().get(project=project_id).execute()

try:
project = compute.projects().get(project=project_id).execute()
except googleapiclient.errors.HttpError as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to add another except at the end to catch other unexpected errors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should catch blindly. We should let those unexpected errors surface so that we know what's up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure that makes sense. was thinking about printing some additional msgs but I don't think it's necessary as users can't easily fix this themselves.

# Can happen for a new project where Compute Engine API is disabled.
# Ex:
# 'Compute Engine API has not been used in project 123456 before
# or it is disabled. Enable it by visiting
# https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=123456
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use

https://console.developers.google.com/apis/api/compute.googleapis.com/overview

without project=123456. GCP will redirect us to the correct project. with 123456 it'd say no such project.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example message meant for reminding us developers. Changed the comment a bit to be clearer.

# then retry. If you enabled this API recently, wait a few minutes for
# the action to propagate to our systems and retry.'
if ' API has not been used in project' in e.reason:
match = re.fullmatch(r'(.+)(https://.*project=\d+) (.+)', e.reason)
if match is None:
raise # This should not happen.
yellow = colorama.Fore.YELLOW
reset = colorama.Style.RESET_ALL
bright = colorama.Style.BRIGHT
dim = colorama.Style.DIM
logger.error(
f'{yellow}Certain GCP APIs are disabled for the GCP project '
f'{project_id}.{reset}')
logger.error(f'{yellow}Enable them by running:{reset} '
f'{bright}sky check{reset}')
logger.error('Details:')
logger.error(f'{dim}{match.group(1)}{reset}\n'
f'{dim} {match.group(2)}{reset}\n'
f'{dim}{match.group(3)}{reset}')
sys.exit(1)
else:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to raise Runtime error with more details printed if different error happens? also to catch possible message pattern change in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note (1) this statement will already print details of the exception, and (2) there's a @retry decorator on this func, which will be retried 3 times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah now I see. thanks for clarification.


project_oslogin = next(
(item for item in project['commonInstanceMetadata'].get('items', [])
Expand Down
51 changes: 51 additions & 0 deletions sky/clouds/gcp.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
import json
import os
import subprocess
import time
import typing
from typing import Dict, Iterator, List, Optional, Tuple

Expand Down Expand Up @@ -61,6 +62,16 @@ def _run_output(cmd):
return proc.stdout.decode('ascii')


def is_api_disabled(endpoint: str, project_id: str) -> bool:
proc = subprocess.run((f'gcloud services list --project {project_id} '
f' | grep {endpoint}.googleapis.com'),
check=False,
shell=True,
stderr=subprocess.PIPE,
stdout=subprocess.PIPE)
return proc.returncode != 0


@clouds.CLOUD_REGISTRY.register
class GCP(clouds.Cloud):
"""Google Cloud Platform."""
Expand Down Expand Up @@ -347,6 +358,46 @@ def check_credentials(self) -> Tuple[bool, Optional[str]]:
'For more info: '
'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html' # pylint: disable=line-too-long
)

# Check APIs.
project_id = self.get_project_id()
apis = (
('compute', 'Compute Engine'),
('cloudresourcemanager', 'Cloud Resource Manager'),
('iam', 'Identity and Access Management (IAM)'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. I verified that a new project doesn't have tpu enabled. After sky check, a sky tpunode -y will be stuck forever in

....
I 10-06 11:39:18 cloud_vm_ray_backend.py:954] Provisioning TPU on GCP us-central1 (us-central1-b)
⠴ Provisioning TPU sky-tpunode-zongheng

Then, manually running

> .../google-cloud-sdk-372.0.0-0/lib/gcloud.py compute tpus create sky-tpunode-zongheng --zone=us-central1-b --version=2.5.0 --accelerator-type=v2-8

API [tpu.googleapis.com] not enabled on project [929236553705]. Would you like to enable and retry (this will take a few minutes)? (y/N)?

we see this prompt is why it's stuck.

Changed sky check to enable tpu.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for verifying this!

)
enabled_api = False
for endpoint, display_name in apis:
if is_api_disabled(endpoint, project_id):
# For 'compute': ~55-60 seconds for the first run. If already
# enabled, ~1s. Other API endpoints take ~1-5s to enable.
if endpoint == 'compute':
suffix = ' (may take a minute)'
else:
suffix = ''
print(f'\nEnabling {display_name} API{suffix}...')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we say this doesn't incur any cost? as last time a Stanford user was concerning about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, done.

t1 = time.time()
proc = subprocess.run(
f'gcloud services enable {endpoint}.googleapis.com '
f'--project {project_id}',
check=False,
shell=True,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT)
if proc.returncode == 0:
enabled_api = True
print(f'Done. Took {time.time() - t1:.1f} secs.')
else:
print('Failed; please manually enable the API. Log:')
print(proc.stdout.decode())
return False, (
f'{display_name} API is disabled. Please retry '
'`sky check` in a few minutes, or manually enable it.')
if enabled_api:
print('\nHint: Enabled GCP API(s) may take a few minutes to take '
'effect. If any SkyPilot commands/calls failed, retry after '
'some time.')

return True, None

def get_credential_file_mounts(self) -> Dict[str, str]:
Expand Down