Pod 823/cache #1245

Merged · 26 commits · Sep 12, 2024

Conversation

@bkneis (Contributor) commented Sep 3, 2024

This PR contains the changes to facilitate remote caching via a registry to speed up build times. It supports Docker and Kubernetes (via kaniko) and uses the context option REGISTRY_CACHE as the registry URL. Further caching options are not currently exposed, in order to ensure the expected functionality.

The PR was tested using the following workflow:

Without cache

  1. Build examples/build 2m35s
  2. Add ghcr.io/devcontainers/features/github-cli:1 as a feature to devcontainer.json 2m45s
  3. Add files causing the build context to change (echo "test" > examples/build/app/test) 2m52s

With cache

  1. Build examples/build 2m13s
  2. Add ghcr.io/devcontainers/features/github-cli:1 as a feature to devcontainer.json 1m25s
  3. Add files causing the build context to change (echo "test" > examples/build/app/test) 13s

As you can see, step 1 takes almost the same time, since no cache exists yet. Step 2, however, is interesting: we saved around 50% of the build time thanks to the cache, but 1m16s was spent uploading the cache back to the registry. This shows that it is important for us to either omit the --cache-to parameter for up (but not build), or to defer pushing the cache manifest to the background at the end of the command, once the workspace / IDE has already launched.

Also note that I am using my local docker / kind cluster with a remote registry; when the registry is closer to the devcontainer, I would expect even greater savings in build time due to shorter download times.

Lastly, step 3 simulates a common problematic workflow: a user updates the build context and then brings a workspace up. Here we see significant savings, since only the last layer needs to be rebuilt rather than the entire image after a cache miss.

EDIT: I have now implemented a boolean ExportCache to toggle the --cache-to parameter, so we only upload the cache when running build, not up. This gives up even faster workspace start times, as we no longer wait for the cache to be uploaded.
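To illustrate the idea, here is a minimal sketch of that toggle, assuming the REGISTRY_CACHE value and an ExportCache flag are passed through the build options; the helper and its names are illustrative rather than the actual code in this PR:

```go
package build

// cacheFlags sketches how a registry cache ref could map onto builder flags:
// the cache is always consumed via --cache-from, but only pushed back via
// --cache-to when exportCache is set, i.e. on `devpod build`, not `devpod up`.
// (The kubernetes/kaniko path would use its own cache flags, e.g. a cache
// repository, instead of the buildx-style flags shown here.)
func cacheFlags(registryCache string, exportCache bool) []string {
	if registryCache == "" {
		return nil
	}
	args := []string{"--cache-from", "type=registry,ref=" + registryCache}
	if exportCache {
		args = append(args, "--cache-to", "type=registry,ref="+registryCache+",mode=max")
	}
	return args
}
```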

Also note that for machine providers I needed to enable the containerd snapshotter flag in the docker daemon; this is done during initWorkspace. For non-machine providers like local docker, we expect the user to enable this themselves. When trying to use the remote cache without it, docker prints a WARNING.
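For reference, a rough sketch of what enabling that flag amounts to, assuming the default /etc/docker/daemon.json path and the containerd-snapshotter feature key; this is not necessarily the exact code used in initWorkspace:

```go
package agent

import (
	"encoding/json"
	"os"
)

// enableContainerdSnapshotter merges {"features": {"containerd-snapshotter": true}}
// into the docker daemon config so buildx can export an image cache to a registry.
// dockerd then has to be reloaded (e.g. via SIGHUP) to pick the change up.
func enableContainerdSnapshotter(path string) error {
	cfg := map[string]interface{}{}
	if raw, err := os.ReadFile(path); err == nil {
		_ = json.Unmarshal(raw, &cfg) // keep any existing settings
	}
	features, _ := cfg["features"].(map[string]interface{})
	if features == nil {
		features = map[string]interface{}{}
	}
	features["containerd-snapshotter"] = true
	cfg["features"] = features

	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, out, 0644)
}
```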

NOTE: We need to merge loft-sh/dockerless#27 first and update the dockerless release tag in single.go

Lastly, I have updated the CalculatePrebuildHash function to traverse the parsed Dockerfile and extract any file paths that could affect the build. These paths are then used as an "include" filter, so only matching files are added to the hashed contents. This should cause fewer cache misses and allow developers to reuse similar images.
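As a sketch of that approach, COPY/ADD source paths can be pulled out of a parsed Dockerfile with BuildKit's parser and used as an include list for hashing; this is illustrative and may differ in detail from the actual CalculatePrebuildHash changes:

```go
package hash

import (
	"os"
	"strings"

	"github.com/moby/buildkit/frontend/dockerfile/parser"
)

// dockerfileIncludePaths collects the source paths of COPY/ADD instructions
// from a Dockerfile. A prebuild hash can then be restricted to these paths
// (plus the Dockerfile itself) instead of hashing the whole build context.
func dockerfileIncludePaths(dockerfilePath string) ([]string, error) {
	f, err := os.Open(dockerfilePath)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	res, err := parser.Parse(f)
	if err != nil {
		return nil, err
	}

	var paths []string
	for _, node := range res.AST.Children {
		cmd := strings.ToLower(node.Value)
		if cmd != "copy" && cmd != "add" {
			continue
		}
		// COPY --from=<stage> sources come from another stage, not the context
		fromOtherStage := false
		for _, flag := range node.Flags {
			if strings.HasPrefix(flag, "--from=") {
				fromOtherStage = true
			}
		}
		if fromOtherStage {
			continue
		}
		// the arguments form a linked list: src... dest; drop the destination
		var args []string
		for n := node.Next; n != nil; n = n.Next {
			args = append(args, n.Value)
		}
		if len(args) > 1 {
			paths = append(paths, args[:len(args)-1]...)
		}
	}
	return paths, nil
}
```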

@bkneis bkneis marked this pull request as draft September 3, 2024 14:47
@bkneis bkneis marked this pull request as ready for review September 5, 2024 08:35
Review thread on cmd/agent/workspace/up.go (outdated, resolved):

return nil
// reload docker daemon
return exec.CommandContext(ctx, "pkill", "-HUP", "dockerd").Run()
bkneis (author):
Is this the best way to reload the docker daemon? I was surprised docker does not have a command for this, and I didn't think I could assume systemd was being used to manage docker.

Contributor:

this looks scary 😬

Contributor:

Can't think of a better way to do it though, unless we'd start this build container with the config already mounted? Is that even possible?

bkneis (author):

I agree; maybe I could try to detect how dockerd is managed (systemd, systemctl, etc.) using uname and, if that's not possible, use this as a last resort? Yes, that is possible; I thought of the mounted config, but it would be provider-specific and would need to be implemented by each machine provider (gcloud, aws, digital ocean, etc.) :/ At least with HUP it's a reload and not a restart, but it's a bit "under the hood".
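Something like the following could capture that fallback, assuming a check for systemd is good enough and everything else drops to the SIGHUP path; this is a hypothetical helper, not code from this PR:

```go
package agent

import (
	"context"
	"os"
	"os/exec"
)

// reloadDockerd prefers a managed reload when systemd is available and falls
// back to signalling dockerd directly. Both paths only reload the config
// (SIGHUP); they do not restart the daemon.
func reloadDockerd(ctx context.Context) error {
	// systemd is running if /run/systemd/system exists
	if _, err := os.Stat("/run/systemd/system"); err == nil {
		// docker.service defines ExecReload as a HUP to the main pid
		return exec.CommandContext(ctx, "systemctl", "reload", "docker").Run()
	}
	// last resort: SIGHUP makes dockerd re-read /etc/docker/daemon.json
	return exec.CommandContext(ctx, "pkill", "-HUP", "dockerd").Run()
}
```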

@janekbaraniewski (Contributor)

LGTM!

Other resolved review threads:

- cmd/agent/container/setup.go
- examples/build/README.md
- pkg/config/context.go
- pkg/devcontainer/build.go
- cmd/agent/workspace/up.go
- pkg/devcontainer/single.go
@pascalbreuninger (Member)

@bkneis approved with comments, feel free to ignore if not applicable

@bkneis (author) commented Sep 10, 2024

@pascalbreuninger fab, thanks! Just finishing testing with a remote k8s cluster on GKE, then it should be ready to merge. Do I need to update any file to reference the new version of the kubernetes driver, i.e. loft-sh/devpod-provider-kubernetes#55?

@pascalbreuninger (Member) commented Sep 10, 2024

@bkneis nope, that's done by releasing a new version of the kubernetes provider over in the other repo 👍
We'll need to wait until we've released it though, right?

@bkneis (author) commented Sep 11, 2024

@pascalbreuninger just finished testing with GKE 🥳 It was my Autopilot cluster giving me issues and killing workspaces with return code 137 (OOM). Using the new GKE cluster, workspaces spin up without problems and use the cache. In the end I didn't need to make any changes to the kubernetes driver, only to dockerless, so this PR is good to go IMO.

@bkneis bkneis merged commit 2b47efa into loft-sh:main Sep 12, 2024
17 of 23 checks passed