Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage][Spot] Resolving failure to exclude local files for managed spot jobs using git --ignored outputs #2018

Conversation

landscapepainter
Copy link
Collaborator

@landscapepainter landscapepainter commented Jun 3, 2023

This PR closes #1898 and opposed to implementing our own .gitignore filter in #1939 and using rsync from #1993 , this harness git -C /path/to/source/dir status --ignored -s to get a list of files/directories to be excluded from syncing sky storage.

Added the testing structure and the files supposed to be ignored from syncing divided into git_info_exclude_test(corresponding to .git/info/exclude) and gitignore_test(.gitignore).

Tested (run the relevant ones):

  • Manually ran sky spot launch and checked the workdir and local file_mounts src from the spot instance
  • Relevant individual smoke tests: pytest tests/test_smoke.py::TestStorageWithCredentials --cloudflare
  • Relevant individual smoke tests: pytest tests/test_smoke.py --managed-spot
  • Additional tests for managed spot jobs added to test_smoke.py/test_excluded_file_cloud_storage_upload_copy

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @landscapepainter! Left some comments. Still reading.

sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
# This command outputs a list to be excluded according to .gitignore
# and .git/info/exclude
filter_cmd = f'git -C {expand_src_dir_path} status --ignored -s'
excluded_list: List[str] = ['.git/*', '.gitignore']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may not want to exclude .gitignore by default, since that is often a part of the repository.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Why do we not apply the same logic to .git/*? .git/* was excluded by default even before this update.

sky/data/storage.py Outdated Show resolved Hide resolved
filter_cmd = f'git -C {expand_src_dir_path} status --ignored -s'
excluded_list: List[str] = ['.git/*', '.gitignore']

# pylint: disable=W1510
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a strong reason to disable W1510? Instead, should we add check=True for L907, L911, L925 and L930? Good to explicitly handle any errors that may occur when running them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Origianally, I thought the commands were causing inevitable errors, but found that it actually had some issues. They are currently fixed and removed # pylint: disable=W1510 as well.

sky/exceptions.py Outdated Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
# when the SRC_DIR_PATH is not a git repo
if e.returncode == exceptions.GIT_UNINITIALIZED_CODE:
init_cmd = f'git -C {expand_src_dir_path} init'
subprocess.run(init_cmd,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may fail due to different reasons (e.g., user has read-only permissions in the dir). In that case, can we raise a warning and gracefully fall back to not excluding any files from .gitignore?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be printing a logger.wanring() message and return the default excluded_list with .git/* in it.

stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True)
if git_exclude_exists:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking if git exclude already existed, can we instead check if the .git directory already existed? If it did, we should leave it untouched (i.e., do not remove any files or directories). This is to avoid inadvertently deleting anything we shouldn't be deleting.

Copy link
Collaborator Author

@landscapepainter landscapepainter Jun 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If .git alreday existed, this if git_exclude_exists: statement will not be entered. The outer if e.returncode == exceptions.GIT_UNINITIALIZED_CODE: would be entered when .git didn't exist in this directory. I'll leave a more detailed comment to note that this exception is caught only when .git didn't exist(not a git repo).

        except subprocess.CalledProcessError as e:
            # when the SRC_DIR_PATH is not a git repo and .git
            # does not exist in it
            if e.returncode == exceptions.GIT_UNINITIALIZED_CODE:

tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
tests/test_smoke.py Outdated Show resolved Hide resolved
up_dir_output = subprocess.check_output(up_dir_cmd, shell=True)

# GCS ls and aws s3 list-objects outputs in a different way
# Only 'included.*' files should exist in the cloud object storage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this comment and why separate handling for GCS is required.. can you include some example of why this is needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated cli_count_name_in_bucket for GCS by adding -r option in the command so that GCS doesn't need separate handling. The issue was, original gsutil ls command was not counting the objects in the subdirectory, include_dir/included.log, so I was handling them separately, but this was able to be fixed with -r option.

tests/test_smoke.py Outdated Show resolved Hide resolved
landscapepainter and others added 4 commits June 8, 2023 20:14
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
@landscapepainter
Copy link
Collaborator Author

@romilbhardwaj Ready for another look!

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @landscapepainter! Left two comments, should be good to go after that!

sky/data/storage_utils.py Outdated Show resolved Hide resolved
sky/data/storage_utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator Author

@landscapepainter landscapepainter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romilbhardwaj Resolved the comments! Ready for another look.

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @landscapepainter - this is ready to go!

@landscapepainter landscapepainter merged commit f1bc2c6 into skypilot-org:master Jul 26, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spot launch doesn't respect .gitignore
2 participants