Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage][Spot] Resolving failure to exclude local files for managed spot jobs #1939

Conversation

landscapepainter
Copy link
Collaborator

@landscapepainter landscapepainter commented May 7, 2023

This PR closes #1898

Normally, when you launch a cluster with Skypilot, workdir and local file_mounts sources are uploaded directly to the remote VM with rsync. However, for managed spot jobs, those are uploaded to Cloud Object Storages with aws or gsutil CLI. Due to this difference, managed spot jobs were failing to ignore the files either in .gitignore or .git/info/exclude. This PR fixes the issue by implementing a feature to excluding the files from being uploaded to the Cloud Storages.

Adding a test to see if the process of local source (workdir and local file_mounts sources in this case) being uploaded to Cloud Object Storage with COPY mode respects the .gitignore and .git/info/exclude to exclude the files from being uploaded.

+update:
There are some discrepancies between how git handles the strings in .gitignore or .git/info/exclude and how --exclude filter from aws s3 sync or -x filter from gsutil rsync handles the passed in strings. Hence, in order to resolve the discrepancy, I followed how git handles strings read from .gitignore based on this doc.

Following is the structure of the directories I used to test the updated implementation:

Screenshot 2023-05-24 at 11 09 02 PM

Following is the .gitignore file I included in the test-spot-exclusive directory:

# This test should upload only include.txt, included.log, and included/included.l$
**/double_asterisk_excluded
**/double_asterisk_excluded_dir
**/parent/*.txt
**/parent/child/*.txt
*.log
exp-*/
!included.log
!included/included.*
excluded.*
/front_slash_excluded
question_mark/excluded?.txt
square_bracket/excluded[0-9].txt
square_bracket_single/excluded[01].txt
square_bracket_excla/excluded[!01].txt
square_bracket_alpha/excluded[a-z].txt
excluded_dir/
nested_double_asterisk/**/excluded.txt
nested_wildcard_dir/*day/excluded.txt

Current update successfully understands the format given in .gitignore above besides two exceptions relating to !:
For s3 and r2(aws s3 sync):
Items preceding with ! in .gitignore overrides other items to be excluded. For an instance, using the example from the doc,
Screenshot 2023-05-16 at 8 28 35 PM
git would understand the given order in .gitignore as to exclude important/trace.log as trace.* appears after !important/*.log. However, for our implementation, any item preceding with ! will override the items to be excluded regardless of the order. In other words, important/trace.log will be included. This is due to the how --include from aws s3 sync behaves. --include overrides every files passed to --exclude.

For gs(gsutil rsync):
! used to include(such as !included.log and not square_bracket_excla/excluded[!01].log) is not supported at all. For aws s3 sync, we can pass in --include to implement !, but gsutil rsync does not have any functionality to specify which file to include(ref). It only has -x which is to specify the files to be excluded.

With the file structure above and the .gitignore file, the current implementation for aws s3 sync uploads include.txt, included.log, and included/included.log. And gsutil rsync uploads included.txt.

Tested (run the relevant ones):

  • Manually ran sky spot launch and checked the workdir and local file_mounts src from the spot instance
  • Relevant individual smoke tests: pytest tests/test_smoke.py::TestStorageWithCredentials --cloudflare
  • Relevant individual smoke tests: pytest tests/test_smoke.py --managed-spot
  • Additional tests for managed spot jobs added to test_smoke.py/test_excluded_file_cloud_storage_upload_copy

@landscapepainter
Copy link
Collaborator Author

landscapepainter commented May 10, 2023

@concretevitamin @romilbhardwaj Ready for a look!

One thing I was unsure was where to keep get_excluded_list(). The implementation for this function is exactly the same for three different classes, S3Store, GcsStore, and R2Store. However, it's only called within the nested function of batch_aws_rsync() and batch_gsutil_rsync(), get_dir_sync_command(). So I just kept get_excluded_list() as a nested function as well, but if you think it's unnecessarily adding more lines to the code base, I'm open to suggestions!

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I tried running

sky spot launch --workdir=. --cloud gcp echo hi

with a fairly typical .gitignore:

*.egg-info
__pycache__
wandb
*.out
**/.ipynb_checkpoints
data/
exp/
exp-*/
.git

This throws an error

Task from command: echo hi
Launching a new spot task 'sky-da93-zongheng'. Proceed? [Y/n]:
I 05-12 09:50:09 execution.py:708] Translating file_mounts with local source paths to SkyPilot Storage...
I 05-12 09:50:09 execution.py:733] Workdir '.' will be synced to cloud storage 'skypilot-workdir-zongheng-4d3146cf'.
I 05-12 09:50:09 execution.py:805] Uploading sources to cloud storage. See: sky storage ls
I 05-12 09:50:12 storage.py:1653] Created GCS bucket skypilot-workdir-zongheng-4d3146cf in US-CENTRAL1 with storage class STANDARD
W 05-12 09:50:12 storage.py:793] '.git' directory under '.' is excluded during sync.
⠙ Syncing . to gs://skypilot-workdir-zongheng-4d3146cf/E 05-12 09:50:13 data_utils.py:224] CommandException: Invalid exclude filter ((.git/*|.gitignore|*.egg-info|__pycache__|wandb|*.out|**/.ipynb_checkpoints|data/|exp/|exp-*/|.git))
E 05-12 09:50:13 data_utils.py:224]
E 05-12 09:50:13 storage.py:805] Could not upload . to store name skypilot-workdir-zongheng-4d3146cf.
sky.exceptions.StorageUploadError: Upload to bucket failed for store skypilot-workdir-zongheng-4d3146cf. Please check the logs.

Looks like we need to robustify the parsing / handling?

sky/data/storage.py Show resolved Hide resolved
sky/data/storage.py Outdated Show resolved Hide resolved
@romilbhardwaj romilbhardwaj added this to the v0.3 milestone May 16, 2023
@landscapepainter
Copy link
Collaborator Author

landscapepainter commented May 17, 2023

@concretevitamin Ready for another look! Please refer to the updated comment at the top.

The original get_excluded_list() is now different between [s3,r2] and gs. Also, they are divided in to serveral parts. So I have not put it as a module level function. Please let me know if you think it needs any updates on the structure.

@romilbhardwaj
Copy link
Collaborator

Awesome work @landscapepainter! I'll take a look. This feature is likely quite involved since we're implementing our own gitignore parsing. Can you share a script/tar of the dir structure you used (the one in the screenshot above) for testing, and ideally add that to smoke tests?

Copy link
Collaborator Author

@landscapepainter landscapepainter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note on !:

For .gitignore, git comprehends a preceding ! as an exception. For example, if you have the followings in the .gitignore,

*.log
!important/*.log

it ignores every file that ends with .log in the directory and subdirectories, besides the ones in important subdirectory.

And there are differences on the behavior of ! either it being prefixed to a file or a directory. Consider the following examples:

*.log
!important.log

Above .gitignore file wants to ignore every files that ends with .log besides important.log, and this works fine. On the other hand, for directories, consider the following example:

logs/
!logs/important.log

You may think every items in logs directory would be ignored besides important.log, but logs/important.log gets ignored as well since negating file is impossible if it matches the pattern for a directory. This seems like pattern matched for a directory has higher priority than pattern matched for a file.

Following is just regurgitating what I mentioned at the top.

*.log
!important/*.log
trace.*

Above .gitignore file wants to ignore every files that ends with .log besides the ones in important directory. However, important/trace.log will be ignored as trace.* was added after !important/*.log and this overrides it.

I'm not sure how often people use !, but if we are going to implement this exactly as how git interacts with .gitignore, it seems necessary for data_utils.parallel_upload() to run gsutil rsync and aws s3 sync commands variable number of times by breaking down the directories/files into several pieces as --include from aws s3 sync overrides every --excluded items, and gsutil rsync does not have an option like --include and it only has excluding option, -x.

}

@staticmethod
def create_dir_structure(base_path, structure):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a function and a structure that can be used to create more robust testing structure in temporary destination with with tempfile.TemporaryDirectory() as tempdir: Will update with a complete test soon.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add '.gitignore' corresponding to the structure as part of the test as well

Copy link
Collaborator Author

@landscapepainter landscapepainter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romilbhardwaj Mostly done with the test using complex structure for robust implementation. But we need to decide what to do with '!' to complete the test, so I can set the standards for either or not the test has passed.

@landscapepainter
Copy link
Collaborator Author

implementing with git instead #2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spot launch doesn't respect .gitignore
3 participants