Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure under gVisor: failed to get filesystem from image: chtimes /bin/bash: invalid argument #1020

Closed
ddgenome opened this issue Feb 4, 2020 · 27 comments · Fixed by #1074
Assignees
Labels
gvisor in progress priority/p0 Highest priority. Break user flow. We are actively looking at delivering it. regression

Comments

@ddgenome
Copy link

ddgenome commented Feb 4, 2020

Actual behavior

A build that succeeds under v0.16.0 fails under v0.17.0 with the error:

error building image: error building stage: failed to get filesystem from image: chtimes /bin/bash: invalid argument

Expected behavior

I expect the build to complete successfully.

To Reproduce
Steps to reproduce the behavior:

  1. Create a package.json and package-lock.json using a command like

    npm install --save @atomist/automation-client
    
  2. Run the kaniko build with the following arguments

    --context=dir:///where/you/mount/the/dir --destination=where/you/can:push --dockerfile=Dockerfile --force
    

Additional Information

  • Dockerfile
FROM atomist/sdm-base:0.4.1

COPY package.json package-lock.json ./
RUN npm ci \
    && npm cache clean --force

COPY . ./

USER atomist:atomist
  • Build Context
    Please provide or clearly describe any files needed to build the Dockerfile (ADD/COPY commands)

Any package.json and package-lock.json should suffice.

  • Kaniko Image (fully qualified with digest) gcr.io/kaniko-project/executor:v0.17.0 sha256:c65c64d157bb6b1f15278e8ee28b02184e83e39340ddc25d346f18396c24da1d

Triage Notes for the Maintainers

I am seeing the failure when the build is being run in Kubernetes running on Container-optimized OS with the containerd runtime. The error occurs both when caching is being used and when it is not.

Here are the debug logs without caching:

DEBU[0000] Getting source context from dir:///atm/home
DEBU[0000] Build context located at /atm/home
DEBU[0000] Copying file /atm/home/Dockerfile to /kaniko/Dockerfile
WARN[0000] kaniko is being run outside of a container. This can have dangerous effects on your system
WARN[0000] Error detecting container runtime. Using default keychain: container runtime could not be found
DEBU[0000] Skip resolving path /kaniko/Dockerfile
DEBU[0000] Skip resolving path /atm/home
DEBU[0000] Skip resolving path /cache
DEBU[0000] Skip resolving path
DEBU[0000] Skip resolving path
DEBU[0000] Skip resolving path
INFO[0000] Resolved base name atomist/sdm-base:0.4.1 to atomist/sdm-base:0.4.1
INFO[0000] Using dockerignore file: /atm/home/.dockerignore
INFO[0000] Resolved base name atomist/sdm-base:0.4.1 to atomist/sdm-base:0.4.1
INFO[0000] Retrieving image manifest atomist/sdm-base:0.4.1
DEBU[0000] No file found for cache key sha256:555a4f6fb9b0fb9d180e2e9a0bfcd04c44128cb76eca26516ccb7ba7dd304b5c stat /cache/sha256:555a4f6fb9b0fb9d180e2e9a0bfcd04c44128cb76eca26516ccb7ba7dd304b5c: no such file or directory
DEBU[0000] Image atomist/sdm-base:0.4.1 not found in cache
INFO[0000] Retrieving image manifest atomist/sdm-base:0.4.1
INFO[0000] Built cross stage deps: map[]
INFO[0000] Retrieving image manifest atomist/sdm-base:0.4.1
DEBU[0001] No file found for cache key sha256:555a4f6fb9b0fb9d180e2e9a0bfcd04c44128cb76eca26516ccb7ba7dd304b5c stat /cache/sha256:555a4f6fb9b0fb9d180e2e9a0bfcd04c44128cb76eca26516ccb7ba7dd304b5c: no such file or directory
DEBU[0001] Image atomist/sdm-base:0.4.1 not found in cache
INFO[0001] Retrieving image manifest atomist/sdm-base:0.4.1
INFO[0001] Unpacking rootfs as cmd COPY package.json package-lock.json ./ requires it.
DEBU[0001] Mounted directories: [{/kaniko false} {/etc/mtab false} {/tmp/apt-key-gpghome true} {/dev false} {/dev/pts false} {/proc false} {/sys false} {/dev/termination-log false} {/atm/home false} {/atm/input false} {/atm/output false} {/etc/hosts false} {/etc/hostname false} {/etc/resolv.conf false} {/kaniko/.docker/config.json false} {/var/run/secrets/kubernetes.io/serviceaccount false} {/tmp false}]
error building image: error building stage: failed to get filesystem from image: chtimes /bin/bash: invalid argument
Description Yes/No
Please check if this a new feature you are proposing
Please check if the build works in docker but not in kaniko
Please check if this error is seen when you use --cache flag
Please check if your dockerfile is a multistage dockerfile
@tejal29 tejal29 self-assigned this Feb 4, 2020
@tejal29 tejal29 added the norepro core team can't replicate this issue, require more info label Feb 4, 2020
@tejal29
Copy link
Member

tejal29 commented Feb 4, 2020

hi @ddgenome i wasnt able to reproduce this on my local docker container.
When you say "Kubernetes running on Container-optimized OS with the containerd runtime" do you mean GKE node?

@ddgenome
Copy link
Author

ddgenome commented Feb 4, 2020

Yes, it is GKE. The cluster is running Kubernetes 1.15.7-gke.23. The nodes are running cos_containerd under gVisor.

The issue seems similar to #1021 .

@tejal29
Copy link
Member

tejal29 commented Feb 4, 2020

This could be due to #940 which went in this release.

@ddgenome
Copy link
Author

ddgenome commented Feb 4, 2020

Well, the error seems to align with that hypothesis. So presumably this line is returning an error:

https://github.com/GoogleContainerTools/kaniko/pull/981/files#diff-d36eb675aa49a7b471e3a2be77005b18R253

Now, since the previous calls to setFilePermissions on the path, which is /bin/bash, worked, the problem is probably not with the path but rather one or both of the times, hdr.AccessTime, hdr.ModTime.

@cvgw
Copy link
Contributor

cvgw commented Feb 4, 2020

I haven't had much time to look into this but I wonder if we need to some sanity checks on the header times. I agree that it seems most likely one of the times is bad. I noticed that golang time.Time and the args expected by Chtimes's underlying syscall are a bit different. Might be that the zero value or some other case don't line up right.

http://man7.org/linux/man-pages/man2/utimensat.2.html
https://golang.org/src/time/time.go?s=6278:7279#L117

@cvgw
Copy link
Contributor

cvgw commented Feb 4, 2020

I'm not able to repro the bug using the executor:v0.17.0 container on my local machine

Dockerfile

FROM atomist/sdm-base:0.4.1

COPY package.json package-lock.json ./
RUN npm ci \
    && npm cache clean --force

COPY . ./

USER atomist:atomist

package.json

{
  "name": "foo",
  "version": "2.0.0",
  "description": "Don't forget to sanitize your inputs",
  "author": "Little Bobby Tables",
  "private": false,
  "devDependencies": {},
  "scripts": {},
  "license": "MIT",
  "dependencies": {}
}

package-lock.json

{
  "name": "foo",
  "version": "2.0.0",
  "lockfileVersion": 1,
  "requires": true,
  "dependencies": {}
}

kaniko command

docker run \
  -v $HOME/.config/gcloud:/root/.config/gcloud  \
  -v `pwd`/context:/workspace \
  gcr.io/kaniko-project/executor:v0.17.0  \
  --context dir:///workspace/  \
  --dockerfile Dockerfile \
  --destination $MY_REPO:1020 \
  --force \
  -v info

@ddgenome
Copy link
Author

ddgenome commented Feb 4, 2020

The issue might be a capability, perhaps CAP_SETFCAP, that os.Chtimes requires to do its work that is not available in the container. I'll investigate that.

@ddgenome
Copy link
Author

ddgenome commented Feb 5, 2020

Troubleshooting this issue further it seems the issue is with kaniko running under gVisor, GKE Sandbox for the specific case I have tested.

I created a permissive pod security policy to eliminate that variable and ran the builds under that policy. With kaniko 0.17.1 I was able to successfully complete the build on a "normal" GKE node but the build failed in the same manner, i.e., error building image: error building stage: failed to get filesystem from image: chtimes /bin/bash: invalid argument, when running on a GKE sandbox/gVisor node.

I can confirm that I can also successfully complete the build using Docker locally using the standard Docker runtime, not runsc.

@ddgenome ddgenome changed the title Regression in v0.17.0: failed to get filesystem from image: chtimes /bin/bash: invalid argument Failure under gVisor: failed to get filesystem from image: chtimes /bin/bash: invalid argument Feb 5, 2020
@ddgenome
Copy link
Author

ddgenome commented Feb 5, 2020

I guess the question now is whether running under gVisor is a goal of kaniko exectutor. If not, I guess this issue can be closed.

@tejal29
Copy link
Member

tejal29 commented Feb 6, 2020

@ddgenome We do want to support running under gVisor and do not want to break that support.
Unfortunately we don't have integration tests to ensure this.

I will fix this and verify manually before next release.

Thanks
Tejal

@tejal29 tejal29 added the gvisor label Feb 6, 2020
@tejal29
Copy link
Member

tejal29 commented Feb 7, 2020

@ddgenome i was looking at gvisor capabilities https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/
Looks like utimes is supported which is what os.Chtimes is.
Which version of gvisor are you on?

@tejal29
Copy link
Member

tejal29 commented Feb 7, 2020

The issue might be a capability, perhaps CAP_SETFCAP, that os.Chtimes requires to do its work that is not available in the container. I'll investigate that.

nvm, you mentioned before you will need to set capability CAP_SETFCAP

@tejal29 tejal29 mentioned this issue Feb 7, 2020
2 tasks
@ddgenome
Copy link
Author

ddgenome commented Feb 7, 2020

Sorry for the confusion, but the capabilities investigation turned out to be a red herring, see this comment where that issue was eliminated. As such, I am not sure whether #1035 will help the issue.

Here is the requested gVisor version information:

# /home/containerd/usr/local/sbin/runsc -version
runsc version google-281502745
spec: 1.0.1-dev

I thought that perhaps os.Chtimes was calling some system call that was unimplemented in gVisor, but looking at the gVisor logs I see no "Unsupported syscall" log messages. I do see a nine "Uncaught signal: "killed" (9), PID: X, TID: 12, fault addr: 0x0" from the pod at the same millisecond when kaniko reports "error building image: error building stage: failed to get filesystem from image: chtimes /bin/bash: invalid argument". The containerd/kubelet logs do not seem to indicate anything is amiss.

Might it be acceptable to just ignore any failure from os.Chtimes, simply debug log the error and move on?

@ddgenome
Copy link
Author

ddgenome commented Feb 7, 2020

If, however you did want to check if the required capabilities were available, I have found you need at least these set:

  • CAP_CHOWN
  • CAP_DAC_OVERRIDE
  • CAP_FOWNER
  • CAP_SETFCAP
  • CAP_SETGID
  • CAP_SETUID

However, building more complicated images may require other capabilities.

@ddgenome
Copy link
Author

I am able to reproduce this locally running the latest release of runsc and the same package.json and package-lock.json you use above with kaniko executor 0.17.1:

# /usr/local/bin/runsc --version
runsc version release-20200127.0
spec: 1.0.1-dev
# docker run --runtime=runsc -it --rm -v /root/.docker:/kaniko/.docker -v `pwd`/context:/workspace gcr.io/kaniko-project/executor:v0.17.1 --dockerfile=Dockerfile --context=/workspace --destination=$DEST --force -v=debug
DEBU[0000] Copying file /workspace/Dockerfile to /kaniko/Dockerfile
WARN[0000] kaniko is being run outside of a container. This can have dangerous effects on your system
WARN[0000] Error detecting container runtime. Using default keychain: container runtime could not be found
DEBU[0000] Skip resolving path /kaniko/Dockerfile
DEBU[0000] Skip resolving path /workspace
DEBU[0000] Skip resolving path /cache
DEBU[0000] Skip resolving path
DEBU[0000] Skip resolving path
DEBU[0000] Skip resolving path
INFO[0000] Resolved base name atomist/sdm-base:0.4.1 to atomist/sdm-base:0.4.1
INFO[0000] Resolved base name atomist/sdm-base:0.4.1 to atomist/sdm-base:0.4.1
INFO[0000] Retrieving image manifest atomist/sdm-base:0.4.1
DEBU[0001] No file found for cache key sha256:555a4f6fb9b0fb9d180e2e9a0bfcd04c44128cb76eca26516ccb7ba7dd304b5c stat /cache/sha256:555a4f6fb9b0fb9d180e2e9a0bfcd04c44128cb76eca26516ccb7ba7dd304b5c: no such file or directory
DEBU[0001] Image atomist/sdm-base:0.4.1 not found in cache
INFO[0001] Retrieving image manifest atomist/sdm-base:0.4.1
INFO[0002] Built cross stage deps: map[]
INFO[0002] Retrieving image manifest atomist/sdm-base:0.4.1
DEBU[0002] No file found for cache key sha256:555a4f6fb9b0fb9d180e2e9a0bfcd04c44128cb76eca26516ccb7ba7dd304b5c stat /cache/sha256:555a4f6fb9b0fb9d180e2e9a0bfcd04c44128cb76eca26516ccb7ba7dd304b5c: no such file or directory
DEBU[0002] Image atomist/sdm-base:0.4.1 not found in cache
INFO[0002] Retrieving image manifest atomist/sdm-base:0.4.1
INFO[0003] Unpacking rootfs as cmd COPY package.json package-lock.json ./ requires it.
DEBU[0003] Mounted directories: [{/kaniko false} {/etc/mtab false} {/tmp/apt-key-gpghome true} {/var/run false} {/dev false} {/dev/pts false} {/proc false} {/sys false} {/workspace false} {/kaniko/.docker false} {/etc/resolv.conf false} {/etc/hostname false} {/etc/hosts false} {/tmp false}]
error building image: error building stage: failed to get filesystem from image: chtimes /bin/bash: invalid argument

If you replace :v0.17.1 with :v0.16.0, the build succeeds.

What version of runsc are you using?

@ddgenome
Copy link
Author

To more fully test the variable space, I ran the following command:

$ for r in runc runsc runsc-nightly runsc-head; do \
    for k in v0.16.0 v0.17.1; do \
      e=gcr.io/kaniko-project/executor:$k; \
      echo "$e"; $r --version; \
      docker run --runtime=$r -it --rm -v $HOME/.docker:/kaniko/.docker -v /vagrant:/workspace \
        "$e" --dockerfile=Dockerfile --context=/workspace \
        --destination="$REGISTRY/kgtest:0.0.0-$(date -u +%Y%m%d%H%M%S)" \
        --force -v=info; \
      echo; \
    done; \
  done

where

  • docker is version 18.09.7, both client and server
  • runc is version spec 1.0.1-dev
  • runsc is configured to the latest release, release-20200127.0
  • runsc-nightly is the current nightly build, release-20200127.0-96-gca30dfa065f5
  • runsc-head is the latest master HEAD build, release-20200127.0-127-g0dd9ee0d1e08
  • gcr.io/kaniko-project/executor:v0.16.0 is gcr.io/kaniko-project/executor@sha256:f116560a308ead98708bd66b80fe06ab19a70355d590060aa4a01703d8b17203
  • gcr.io/kaniko-project/executor:v0.17.1 is gcr.io/kaniko-project/executor@sha256:565d31516f9bb91763dcf8e23ee161144fd4e27624b257674136c71559ce4493

Here are the results:

runc runsc runsc-nightly runsc-head
kaniko 0.16.0 success success success success
kaniko 0.17.1 success fail fail fail

All of the failures have the familiar message:

error building image: error building stage: failed to get filesystem from image: chtimes /bin/bash: invalid argument

Are you able to reproduce running under gVisor (this issue has the norepro label)?

Would you be open to a PR that logs any error from os.Chtimes and continues?

@cvgw
Copy link
Contributor

cvgw commented Feb 13, 2020

@tejal29 have you tested this using runc? I have not (haven't been able to set it up yet). Probably should remove the norepro flag unless we've tried it with runc.

Would you be open to a PR that logs any error from os.Chtimes and continues?

@ddgenome I'm sorry, I don't understand your question. Are you asking if we can open a PR so you can test with the extra logging? Is a PR enough? New PRs don't result in any new images (until the PR is merged)

@ddgenome
Copy link
Author

@cvgw ,

I've tested using runc. Both kaniko executor 0.16.0 and 0.17.1 succeed under runc.

Would you be open to a PR that logs any error from os.Chtimes and continues?

Are you asking if we can open a PR so you can test with the extra logging? Is a PR enough? New PRs don't result in any new images (until the PR is merged)

I was asking if you would consider ignoring any error from os.Chtimes as an acceptable fix for this issue. I was not asking for additional testing, but as a solution to the current breakage. The reason I ask is that I would perfectly understand the answer "no", since it is not clear where the issue lies amongst kaniko, gVisor, and Go's os.Chtimes.

@cvgw
Copy link
Contributor

cvgw commented Feb 13, 2020

@cvgw ,

I've tested using runc. Both kaniko executor 0.16.0 and 0.17.1 succeed under runc.

Would you be open to a PR that logs any error from os.Chtimes and continues?

Are you asking if we can open a PR so you can test with the extra logging? Is a PR enough? New PRs don't result in any new images (until the PR is merged)

I was asking if you would consider ignoring any error from os.Chtimes as an acceptable fix for this issue. I was not asking for additional testing, but as a solution to the current breakage. The reason I ask is that I would perfectly understand the answer "no", since it is not clear where the issue lies amongst kaniko, gVisor, and Go's os.Chtimes.

I think that could be a reasonable fix, but ya I would want to better understand what the issue is before making any changes. I'm gonna try to get runc setup on my machine this week so that I can debug this.

@tejal29 tejal29 added priority/p0 Highest priority. Break user flow. We are actively looking at delivering it. regression labels Feb 13, 2020
@cvgw
Copy link
Contributor

cvgw commented Feb 23, 2020

I was able to repro this after getting gvisor installed.

After some debugging I saw that the value of accesstime is the zero value for time.Time which I have a hunch is not compatible with os.Chtimes due to the requirements of the underlying syscall.

Working on a fix.

@cvgw
Copy link
Contributor

cvgw commented Feb 23, 2020

It does appear to be an issue with the zero value for time.Time.

When UnixNano is called on a zero time.Time, a negative number is returned. This is not a valid argument for the syscall that os.Chtimes hits.

https://play.golang.org/p/-e6FdyWcqp1

I believe the correct fix here would be to convert any zero time.Time to 1970-01-01 00:00:00Z as this will be the zero value for unix epoch

https://play.golang.org/p/SWoRZ9VTDtr

cvgw added a commit to cvgw/kaniko that referenced this issue Feb 23, 2020
The zero value of time.Time is not a valid argument to os.Chtimes
because of the syscall that os.Chtimes calls. Instead we update the zero
value of time.Time to 1970-01-01 00:00:00Z as this is the zero value of
Unix Epoch
cvgw added a commit to cvgw/kaniko that referenced this issue Feb 24, 2020
The zero value of time.Time is not a valid argument to os.Chtimes
because of the syscall that os.Chtimes calls. Instead we update the zero
value of time.Time to 1970-01-01 00:00:00Z as this is the zero value of
Unix Epoch
cvgw added a commit to cvgw/kaniko that referenced this issue Feb 24, 2020
The zero value of time.Time is not a valid argument to os.Chtimes
because of the syscall that os.Chtimes calls. Instead we update the zero
value of time.Time to the zero value of Unix Epoch
@cvgw cvgw added in progress and removed norepro core team can't replicate this issue, require more info labels Feb 24, 2020
cvgw added a commit to cvgw/kaniko that referenced this issue Feb 24, 2020
The zero value of time.Time is not a valid argument to os.Chtimes
because of the syscall that os.Chtimes calls. Instead we update the zero
value of time.Time to the zero value of Unix Epoch
tejal29 added a commit that referenced this issue Feb 25, 2020
@ddgenome
Copy link
Author

Thanks!

@cvgw
Copy link
Contributor

cvgw commented Feb 25, 2020

@ddgenome would you mind giving either tag 16e60cd6d498231751818a83cf3f7cfc45345ca7 or debug-16e60cd6d498231751818a83cf3f7cfc45345ca7 a try to confirm that we've got this fixed?

@bimargulies-google
Copy link

Would I be hitting this if I got...

INFO[0007] Checking for cached layer gcr.io/span-cloud-testing/alpine-emulator/cache:e1623f508fe5c859de003b35ddfb5ed68a56e1f0320c214d84cc39e778985f5d... 
INFO[0008] No cached layer found for cmd RUN gcc --version 
INFO[0008] Unpacking rootfs as cmd RUN gcc --version requires it. 
error building image: error building stage: failed to get filesystem from image: mkdir /usr/lib/jvm/default-jvm: file exists

@cvgw
Copy link
Contributor

cvgw commented Feb 26, 2020

@bimargulies-google no, you would see almost the exact same error as original reported except with maybe a different file path. That error looks more like #830

@ddgenome
Copy link
Author

@cvgw here are the results you requested:

runc runsc runsc-nightly runsc-head
kaniko 0.16.0 success success success success
kaniko 0.17.1 success fail fail fail
kaniko 16e60cd success success success success
kaniko debug-16e60cd success success success success

runsc is now release-20200219.0, runsc-nightly is release-20200211.0-39-g8dae8a10f01b, and runsc-head is release-20200219.0-46-ga92087f0f8fe.

The new version looks good, thanks again!

So it sounds like the difference is that the gVisor implementation of the system call underlying os.Chtimes does not handle negative values, while the same Linux system call must.

@ddgenome
Copy link
Author

Here is the script I used to test the various versions of kaniko against the various versions of gVisor: https://gist.github.com/ddgenome/fa1da223569fdb602592a87249c1bf10 , if that helps anyone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gvisor in progress priority/p0 Highest priority. Break user flow. We are actively looking at delivering it. regression
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants