
[Spot] Fix oslogin username in clusters #1106

Merged: Michaelvll merged 7 commits into master from fix-oslogin-username on Aug 20, 2022


Collaborator

@Michaelvll Michaelvll commented Aug 20, 2022

On a GCP cluster launched by ray up, the GCP credentials file we upload, ~/.config/gcloud/configurations/config_default, is reset to empty, probably because ray needs to use the service account instead of the user's account. This affects our spot controller, which reads that file to decide the username for the spot clusters when oslogin is enabled in the user's project.

    project_oslogin = next(
        (item for item in project['commonInstanceMetadata'].get('items', [])
         if item['key'] == 'enable-oslogin'), {}).get('value', 'False')
    if project_oslogin.lower() == 'true':
        # project.
        logger.info(
            f'OS Login is enabled for GCP project {project_id}. Running '
            'additional authentication steps.')
        config_path = os.path.expanduser(GCP_CONFIGURE_PATH)
        if not os.path.exists(config_path):
            with ux_utils.print_exception_no_traceback():
                raise RuntimeError(
                    'GCP authentication failed, as the oslogin is enabled but '
                    f'the file {config_path} is not found.')
        with open(config_path, 'r') as infile:
            for line in infile:
                if line.startswith('account'):
                    account = line.split('=')[1].strip()
                    break
            else:
                with ux_utils.print_exception_no_traceback():
                    raise RuntimeError(
                        'GCP authentication failed, as the oslogin is enabled '
                        f'but the file {config_path} does not contain the '
                        'account information.')
        config['auth']['ssh_user'] = account.replace('@', '_').replace('.', '_')
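
The account-to-username step in the snippet above can be tried in isolation. The following is a minimal sketch, not the code in authentication.py: the helper names `account_from_config` and `oslogin_username` and the sample file contents are made up for illustration, and the mapping only mirrors the two `replace` calls shown above.

```python
from typing import Optional


def account_from_config(text: str) -> Optional[str]:
    """Return the value of the first line starting with 'account', or None."""
    for line in text.splitlines():
        if line.startswith('account'):
            # partition() avoids an IndexError if the line has no '='.
            _, sep, value = line.partition('=')
            if sep:
                return value.strip()
    return None


def oslogin_username(account: str) -> str:
    """Mirror the snippet above: replace '@' and '.' with '_'."""
    return account.replace('@', '_').replace('.', '_')


# Hypothetical file contents, for illustration only.
sample = '[core]\naccount = user@example.com\nproject = my-project\n'

account = account_from_config(sample)
username = oslogin_username(account) if account else None
empty_result = account_from_config('')  # an emptied file yields no account
```

With an emptied file there is no `account` line at all, which corresponds to the "does not contain the account information" error path above.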

To solve the problem, we make a backup of that file in the same folder, so that it is uploaded along with the other credentials. SkyPilot then reads the backup file instead.
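
The backup idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `backup_config` and `read_account` are hypothetical helpers, the backup name `.sky_config_default` is taken from the directory listing later in this thread, and a temporary directory stands in for ~/.config/gcloud/configurations.

```python
import os
import shutil
import tempfile
from typing import Optional

# Backup file name as it appears in the listing later in this thread.
BACKUP_NAME = '.sky_config_default'


def backup_config(config_path: str) -> str:
    """Copy the config file to a sibling backup file and return its path."""
    config_path = os.path.expanduser(config_path)
    backup_path = os.path.join(os.path.dirname(config_path), BACKUP_NAME)
    shutil.copyfile(config_path, backup_path)
    return backup_path


def read_account(path: str) -> Optional[str]:
    """Return the account value from the file, or None if there is none."""
    with open(path, 'r') as infile:
        for line in infile:
            if line.startswith('account'):
                _, sep, value = line.partition('=')
                if sep:
                    return value.strip()
    return None


# Demo in a temporary directory (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, 'config_default')
    with open(original, 'w') as f:
        f.write('[core]\naccount = user@example.com\n')
    backup = backup_config(original)
    # Simulate the original file being reset to empty on the cluster.
    open(original, 'w').close()
    account_from_original = read_account(original)
    account_from_backup = read_account(backup)
    backup_name = os.path.basename(backup)
```

The sketch only helps if the backup keeps its contents after the original is reset; whatever empties the whole directory would defeat it as well.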

Hey @lhqing, please check whether this fixes your problem with sky spot launch when you get time. ; )

@concretevitamin
Member

LGTM

@Michaelvll Michaelvll merged commit 9e70d15 into master Aug 20, 2022
@Michaelvll Michaelvll deleted the fix-oslogin-username branch August 20, 2022 22:59
@lhqing
Contributor

lhqing commented Aug 21, 2022

@Michaelvll Thank you for the quick fix!

However, I tried again after updating SkyPilot with your fix, and the same problem still exists.

Provisioning still fails because there is no account in '~/.config/gcloud/configurations/config_default':

# provision log on the spot controller, from "sky logs sky-spot-controller-f2f7302d"
...
(manual1 pid=489) I 08-21 02:23:35 authentication.py:208] OS Login is enabled for GCP project prod-635e. Running additional authentication steps.
(manual1 pid=489) I 08-21 02:23:35 authentication.py:208] OS Login is enabled for GCP project prod-635e. Running additional authentication steps.
(manual1 pid=489) W 08-21 02:23:35 common_utils.py:184] Caught GCP authentication failed, as the oslogin is enabled but the file /home/hanliu_salk_edu/.config/gcloud/configurations/config_default does not contain the account information.. Retrying.
...
# and eventually, this provision failed

When I ssh into the controller, I can see that the updated sky now creates a .sky_config_default backup file in ~/.config/gcloud/configurations. However, both this file and the config_default file in the controller's .config are empty:

hanliu_salk_edu@ray-sky-spot-controller-f2f7302d-head-ae53055b-compute:~/.config/gcloud$ ls configurations/
.sky_config_default  config_default
hanliu_salk_edu@ray-sky-spot-controller-f2f7302d-head-ae53055b-compute:~/.config/gcloud$ cat configurations/.sky_config_default # empty
hanliu_salk_edu@ray-sky-spot-controller-f2f7302d-head-ae53055b-compute:~/.config/gcloud$ cat configurations/config_default # empty

I also checked ~/.config/gcloud/configurations/config_default and .sky_config_default on my local computer again, and both are normal.

$ cat ~/.config/gcloud/configurations/.sky_config_default
# has correct content
$ cat ~/.config/gcloud/configurations/config_default
# has correct content

My summary is:

  1. Sky creates a .sky_config_default on my local computer, and both it and config_default have the correct content.
  2. Sky's provision.log indicates that both files are copied to the spot controller.
  3. However, by the time authentication.py tries to read the account info from the controller's ~/.config/gcloud/configurations/, both config files in that directory are empty.

Something between steps 2 and 3 clears the contents of the files in that directory, although the files themselves still exist.
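
To tell apart the cases that come up in this thread (file missing, file present but empty, file present without an account line), a small check like the one below could be run on the controller. It is only a sketch: `config_state` is a hypothetical helper, not part of SkyPilot, and the demo uses a temporary directory rather than the real gcloud paths.

```python
import os
import tempfile


def config_state(path: str) -> str:
    """Classify a gcloud config file as missing, empty, no-account, or ok."""
    path = os.path.expanduser(path)
    if not os.path.exists(path):
        return 'missing'
    if os.path.getsize(path) == 0:
        return 'empty'
    with open(path, 'r') as infile:
        for line in infile:
            if line.startswith('account'):
                return 'ok'
    return 'no-account'


# Demo with made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    absent = os.path.join(tmp, 'absent')
    empty = os.path.join(tmp, 'config_default')
    partial = os.path.join(tmp, 'partial')
    good = os.path.join(tmp, '.sky_config_default')
    open(empty, 'w').close()
    with open(partial, 'w') as f:
        f.write('[core]\nproject = my-project\n')
    with open(good, 'w') as f:
        f.write('[core]\naccount = user@example.com\n')
    states = [config_state(p) for p in (absent, empty, partial, good)]
```

On the controller described above, both files would be expected to report `empty`, matching the `cat` output.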

Logs:
