Using gha cache with mode=max fails with 400 error #841

Closed
hertzg opened this issue Nov 14, 2021 · 39 comments

hertzg commented Nov 14, 2021

Hello, I have recently started using cache-to "type=gha,mode=max,scope=..." to cache all the layers, and the following error seemed to persist only for specific builds (consistently failing on the same ones). After removing mode=max the issue went away, but obviously not everything is cached.

Failing builds with mode=max:

Passing build after removing mode=max

#130 exporting cache
#130 preparing build cache for export
#130 preparing build cache for export 121.0s done
#130 writing layer sha256:06be072867b08bb9aef2e469533d0d0d6a85f97c2aabcaf5103d78c217977918
#130 writing layer sha256:06be072867b08bb9aef2e469533d0d0d6a85f97c2aabcaf5103d78c217977918 0.1s done
#130 writing layer sha256:08980434c63c0751ea69086b9c8282a5b2784f78643934cf62036ae07e94b943
#130 writing layer sha256:08980434c63c0751ea69086b9c8282a5b2784f78643934cf62036ae07e94b943 0.1s done
...
redacted to keep it short
...
#130 writing layer sha256:d91065fb02477295b9823f300711ac850313d3b6a81a6ca1f8d214f8700b8b2e
#130 writing layer sha256:d91065fb02477295b9823f300711ac850313d3b6a81a6ca1f8d214f8700b8b2e 4.0s done
#130 ERROR: error writing layer blob: error committing cache 37887: failed to parse error response 400: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
<HTML><HEAD><TITLE>Bad Request</TITLE>
<META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
<BODY><h2>Bad Request</h2>
<hr><p>HTTP Error 400. The request is badly formed.</p>
</BODY></HTML>
: invalid character '<' looking for beginning of value
------
 > exporting cache:
------
error: failed to solve: error writing layer blob: error committing cache 37887: failed to parse error response 400: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">
<HTML><HEAD><TITLE>Bad Request</TITLE>
<META HTTP-EQUIV="Content-Type" Content="text/html; charset=us-ascii"></HEAD>
<BODY><h2>Bad Request</h2>
<hr><p>HTTP Error 400. The request is badly formed.</p>
</BODY></HTML>
: invalid character '<' looking for beginning of value
Error: buildx failed with: : invalid character '<' looking for beginning of value

hertzg commented Nov 16, 2021

Maybe @tonistiigi & @crazy-max have any hints or suggestions? I'd be happy to provide more info if required.

tonistiigi (Member) commented


hertzg commented Nov 18, 2021

@tonistiigi I could, yes, though I'm not sure exactly how. I assume I need a buildkit image built from that commit so I could use it with docker buildx create .... --driver-opt image=...? If so, could you point me to the buildkit image for that commit that I could pull?


hertzg commented Nov 23, 2021

@tonistiigi Gentle bump 😊

tonistiigi (Member) commented

@hertzg I pushed an image with additional logs: tonistiigi/buildkit:gha-logs moby/buildkit@master...tonistiigi:gha-request-logs . Set it as the buildkit image and report back. You also need to enable debug: if you use Docker's actions, just enable debug mode in GitHub Actions. If you call buildx create manually, you need to set --buildkitd-flags '--debug'.


hertzg commented Nov 24, 2021

Thanks for the instructions. I've set up a branch with those changes (incl. actions debug); the build is running, and it might take an hour to finish.

I'll report back with the results.


hertzg commented Nov 24, 2021

First attempt: failed with timeouts. I've restarted it to see if it's flakiness; it seems to have passed on the 4th retry.


hertzg commented Nov 25, 2021

@tonistiigi I managed to reproduce it again; I don't see any new error messages so far.

After manually adding --debug to --buildkitd-flags, logs appeared in the post-setup action.

2021-11-25T11:07:28.4664930Z 2021/11/25 11:07:26 > request POST https://artifactcache.actions.githubusercontent.com/7UerGt9E8dgwAyL2ogc3QGsbFOJ4WODvM1ysTtaU0T2efb087j/_apis/artifactcache/caches/49370 , err: <nil>
2021-11-25T11:07:28.4672333Z 2021/11/25 11:07:26 < response 429 Too Many Requests map[Activityid:[3db07705-7b00-43ef-b1bf-d9d6ce0d8a44] Content-Length:[315] Content-Type:[application/json] Date:[Thu, 25 Nov 2021 11:07:26 GMT] Retry-After:[60] Strict-Transport-Security:[max-age=2592000] X-Cache:[CONFIG_NOCACHE] X-Frame-Options:[SAMEORIGIN] X-Msedge-Ref:[Ref A: A66EBD35FE94468991D34EC956C93CBB Ref B: ASHEDGE1519 Ref C: 2021-11-25T11:07:26Z] X-Ratelimit-Limit:[493] X-Ratelimit-Remaining:[0] X-Ratelimit-Reset:[1637838743] X-Ratelimit-Resource:[ArtifactCache/Short] X-Tfs-Processid:[2ef82cec-56f3-4c18-b269-d6d60d120383] X-Tfs-Serviceerror:[Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27.] X-Tfs-Session:[3db07705-7b00-43ef-b1bf-d9d6ce0d8a44] X-Vss-E2eid:[3db07705-7b00-43ef-b1bf-d9d6ce0d8a44] X-Vss-Senderdeploymentid:[5f814983-29e7-4f0f-d4b8-8016a66b281b]]
2021-11-25T11:07:28.4679020Z 2021/11/25 11:07:27 > request POST https://artifactcache.actions.githubusercontent.com/7UerGt9E8dgwAyL2ogc3QGsbFOJ4WODvM1ysTtaU0T2efb087j/_apis/artifactcache/caches/49370 , err: <nil>
2021-11-25T11:07:28.4681951Z 2021/11/25 11:07:27 < response 400 Bad Request map[Content-Length:[311] Content-Type:[text/html; charset=us-ascii] Date:[Thu, 25 Nov 2021 11:07:27 GMT]]
2021-11-25T11:07:28.4684755Z time="2021-11-25T11:07:27Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = error writing layer blob: error committing cache 49370: failed to parse error response 400: <!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\"\"http://www.w3.org/TR/html4/strict.dtd\">\r\n<HTML><HEAD><TITLE>Bad Request</TITLE>\r\n<META HTTP-EQUIV=\"Content-Type\" Content=\"text/html; charset=us-ascii\"></HEAD>\r\n<BODY><h2>Bad Request</h2>\r\n<hr><p>HTTP Error 400. The request is badly formed.</p>\r\n</BODY></HTML>\r\n: invalid character '<' looking for beginning of value\n"

Looks like maybe it's some kind of rate-limiting issue? 🤔 /cc @dhadka @chrispat

tonistiigi (Member) commented

@dhadka @chrispat Could it be that after the POST request gets hit by rate-limiting, the retry receives a 400? Could it at least not return an HTML error? The response seems to be missing all the API headers, as if some kind of proxy responded with the 400.


hertzg commented Dec 1, 2021

@dhadka @chrispat gentle bump 😊


chrispat commented Dec 1, 2021

@kmkumaran can you have someone take a look here?


dhadka commented Dec 1, 2021

Is the retry logic safe? From what I can tell, it seems like the request is getting interrupted before getting to the cache API (we're seeing a Client Disconnected error).

I'm not a Go developer, but I do see it's creating a reader for the JSON body (link to code):

req, err := http.NewRequest("POST", c.url(fmt.Sprintf("caches/%d", id)), bytes.NewReader(dt))

Does it need to reset that Reader to the start of the content or make sure it's not closed?

My hunch is that it's not sending the correct JSON body on the retry and the request is subsequently getting interrupted. This is also probably why the 400 error says The request is badly formed.
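
To illustrate the pattern being described, here is a minimal Go sketch of one way to keep the POST body replayable across retries: rebuild the reader from the original byte slice on every attempt, so a retry never reuses a reader the previous request already drained. The function name, backoff, and retry policy are hypothetical, not the actual go-actions-cache code (the real fix is in the PR linked below).

    // Hypothetical sketch: keep the JSON body replayable across retries.
    // Not the actual go-actions-cache implementation.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"
    )

    func commitWithRetry(client *http.Client, url string, dt []byte, attempts int) (*http.Response, error) {
        var lastErr error
        for i := 0; i < attempts; i++ {
            // Rebuild the reader on every attempt; a reader created once outside
            // the loop would already be consumed (or closed) by the first request.
            req, err := http.NewRequest("POST", url, bytes.NewReader(dt))
            if err != nil {
                return nil, err
            }
            req.Header.Set("Content-Type", "application/json")

            resp, err := client.Do(req)
            if err == nil && resp.StatusCode < 400 {
                return resp, nil
            }
            if err != nil {
                lastErr = err
            } else {
                resp.Body.Close()
                lastErr = fmt.Errorf("unexpected status %s", resp.Status)
            }
            time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
        }
        return nil, lastErr
    }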

tonistiigi (Member) commented

@dhadka That could be the issue indeed. Thanks for the pointer. I'll update that logic.

tonistiigi (Member) commented

tonistiigi/go-actions-cache#12


hertzg commented Dec 3, 2021

@tonistiigi Any builds you'd like me to try out?

tonistiigi (Member) commented

moby/buildkit#2506

test build tonistiigi/buildkit:pr2506


hertzg commented Dec 4, 2021

I am no longer able to reproduce the issue with the tonistiigi/buildkit:pr2506 buildkit image. 🎉


hertzg commented Dec 6, 2021

@tonistiigi I see that the PR has been approved and merged. Unfortunately I'm not familiar with buildkit's release cycle; could you point me in the right direction to find out when to expect the next release that will include this change? 🙌

tonistiigi (Member) commented

@hertzg We will probably not do another patch release just for this unless something else comes up as well. You can use the master branch image (or pin it to a digest for safety). cc @crazy-max

tonistiigi (Member) commented

fixed in moby/buildkit#2506


hertzg commented Dec 6, 2021

Thanks to everyone involved for all the great work 🙌

For general reference, this is the digest I'm going with:

moby/buildkit@sha256:b8d8458bbab41b575614801826d3e0f50e93d8730aaf68431713e3bdb4378744


hertzg commented Dec 9, 2021

Since I started using the pinned digest with the changes, I've noticed a lot of timeout errors.

With messages similar to:

buildx failed with: error: failed to solve: error writing layer blob: error committing cache 66034: maximum timeout reached
buildx failed with: error: failed to solve: maximum timeout reached

Is this an issue on the cache side or ... ?


dhadka commented Dec 9, 2021

@hertzg Yes, this is on the cache side. I'm seeing requests getting rate limited. There's a limit on the number of caches that can be created in every 5-minute window, and this action can exceed that if the number of layers and/or matrix builds is large. We're looking to increase that limit as long as it doesn't impact the health of the cache service (which I don't expect it to, but we just want to be careful 😄).

tonistiigi (Member) commented

error committing cache 66034: maximum timeout reached

At the moment the cache library will retry with a backoff, but if the service does not return within 5 minutes you get an error like this.
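
For illustration, a rough Go sketch of that behaviour, assuming a 5-minute overall deadline: retry with an increasing backoff, honour the Retry-After header on a 429, and give up with a "maximum timeout reached" style error once the deadline passes. The helper and constants are assumptions, not the actual go-actions-cache implementation.

    // Hypothetical sketch of retry-with-backoff under an overall deadline.
    package main

    import (
        "context"
        "errors"
        "fmt"
        "net/http"
        "strconv"
        "time"
    )

    const maxTimeout = 5 * time.Minute // assumed overall deadline

    var errMaxTimeout = errors.New("maximum timeout reached")

    func doWithBackoff(ctx context.Context, doOnce func(context.Context) (*http.Response, error)) (*http.Response, error) {
        ctx, cancel := context.WithTimeout(ctx, maxTimeout)
        defer cancel()

        delay := time.Second
        for {
            resp, err := doOnce(ctx)
            if err == nil && resp.StatusCode != http.StatusTooManyRequests {
                return resp, nil
            }
            wait := delay
            if err == nil {
                // Respect the Retry-After header that comes with 429 responses.
                if s, convErr := strconv.Atoi(resp.Header.Get("Retry-After")); convErr == nil {
                    wait = time.Duration(s) * time.Second
                }
                resp.Body.Close()
            }
            select {
            case <-ctx.Done():
                // Mirrors the "maximum timeout reached" failure seen above.
                return nil, fmt.Errorf("error committing cache: %w", errMaxTimeout)
            case <-time.After(wait):
            }
            delay *= 2 // exponential backoff between attempts
        }
    }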


hertzg commented Dec 10, 2021

@dhadka Thanks for the detailed explanation. What would be the recommended solution here, other than reducing the number of jobs?

@tonistiigi As I understand it now, the whole build is getting marked as failed because the rate-limiting window and the total timeout are very close to each other (both 5 min). Would increasing the timeout help, so the build doesn't fail?


hertzg commented Dec 14, 2021

As things stand, I had to disable it, as it just fails the builds and always needs restarts 😞

bishal-pdMSFT commented

@hertzg we will be looking into relaxing the rate limit. But we need to carefully evaluate the extra load it will bring on the system. I will update you with an ETA.


hertzg commented Jan 9, 2022

Hello again in the new year, (hopefully) after all the festivities :)
@bishal-pdMSFT Where can we track the progress of the rate-limit increase evaluation?

bishal-pdMSFT commented

@hertzg apologies for the delay, but the good news is that we are rolling out a change to increase the rate-limit threshold. This should allow the buildx action to cache layers with a higher level of parallelization. cc @t-dedah to keep you in the loop when the rollout is complete.

Would love to hear back on whether this helps reduce the 429s.


t-dedah commented Mar 11, 2022

@hertzg We have increased the RU limit from 450 to 700 on all the rings. This should reduce the number of 429s significantly.

We saw a significant reduction in 429s after increasing the RU limit, as shown in the before and after graphs.


hertzg commented Mar 12, 2022

@t-dedah I'm seeing timeouts and builds still failing right after enabling mode=max

https://github.com/hertzg/rtl_433_docker/runs/5523758504?check_suite_focus=true


t-dedah commented Mar 15, 2022

Hi @hertzg, there is still a limit on the number of requests that can be made to seal and reserve caches. We have almost doubled that limit, but if any action breaches it, it will still get a 429.
The current limit is set to 700 for seal and reserve cache combined, so 350 each for seal and for reserve. If you can help with a number which you think will be enough for mode=max, we can test and see how close we can get to that limit.

The reasoning behind the new 700 limit


tonistiigi commented Mar 15, 2022

@t-dedah @dhadka I'll try to give some more background on where I think our requests come from.

Let's say the user is doing a build that touches, for example, 50 blobs/layers in total (with the GitHub cache, users usually export all intermediate layers as well, not only the final result layers, so this can grow fast). If that build is mostly cached, for example only 1 layer was updated, we will create a new "manifest" blob with links to these 50 layers. We will push the manifest and the new layer, but most importantly we still need to make a request for all of the 49 old blobs just to check that they still exist. We make a request for each of them, GitHub answers that the record exists, and we can continue. This needs to happen for all builds, and some repositories run a lot of builds in a single workflow, or very complex (multi-platform) builds with lots of layers. Even if lots of builds mostly share the same layers, we need to check them all on each build.

If there were an endpoint we could use just to check that a cache key exists, one that is not rate-limited (or has a much higher limit), it would be much less likely to hit these limits. That endpoint does not need to provide a download link or reserve a key like the current requests do. It could also be a batch endpoint that checks multiple keys together. Maybe even an endpoint to list all current keys would be manageable, as keys should be small and not take much room even if there are a lot of them.

Another way would be for us to somehow remember that a cache key existed and not check it again within some timeframe. But for that we would need some kind of external storage/database where we could keep these records.
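
To make that request pattern concrete, here is a hypothetical Go sketch of what the exporter has to do per build; the interface and names are invented for illustration and are not buildkit's actual code.

    // Hypothetical sketch of the per-build request pattern described above.
    package main

    import "context"

    type layer struct {
        digest string
    }

    type cacheClient interface {
        // One API request per layer just to learn whether it is already stored.
        Exists(ctx context.Context, key string) (bool, error)
        // Reserve + upload is only needed for layers that are actually missing.
        ReserveAndUpload(ctx context.Context, key string, l layer) error
    }

    func exportLayers(ctx context.Context, c cacheClient, layers []layer) error {
        for _, l := range layers {
            ok, err := c.Exists(ctx, l.digest) // e.g. 49 of 50 hits on a mostly cached build
            if err != nil {
                return err
            }
            if ok {
                continue // already cached, but the check still consumed a request
            }
            if err := c.ReserveAndUpload(ctx, l.digest, l); err != nil {
                return err
            }
        }
        return nil
    }

With 50 layers per build and several builds per workflow, the existence checks alone can approach the per-window request budget even when almost nothing new is uploaded.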

The reasoning behind the new 700 limit

Looks like this is a private repository.


dhadka commented Mar 17, 2022

If there would be an endpoint we could use just to check that a cache key exists that would not be rate-limited(or with a much higher limit) it would be much less likely to hit these limits.

If you're using the cache toolkit module, then reads and writes are handled in separate calls (restoreCache vs saveCache). But restoreCache checks whether the cache exists and downloads the content.

Inside restoreCache, it makes a call to cacheHttpClient.getCacheEntry that can be used to just check if the record exists:

  const cacheEntry = await cacheHttpClient.getCacheEntry(keys, paths, {
    compressionMethod
  })
  if (!cacheEntry?.archiveLocation) {
    // Cache not found
    return undefined
  }

This is getting into some of the internal implementation, though; perhaps we need to add another top-level function alongside restoreCache and saveCache that just checks whether the cache exists.

This GET operation also has a higher rate limit (900), although now that the seal/reserve limit was increased (300 -> 700), it's not much higher 😄 @t-dedah can we please look into increasing this limit?

tonistiigi (Member) commented

I think this is the same https://github.com/tonistiigi/go-actions-cache/blob/master/api.md#get-cache call we already use. We don't actually pull the cache in the case described above, but the response to that request contains a download link. I'm not sure whether including that download link is the reason you need to rate-limit this request.


t-dedah commented May 12, 2022

Hi @tonistiigi, we already have a GET call which is also used in the Load function, so would it be possible for you to use that to check whether the layer exists and only make a reserve call if necessary? We can also increase the RU limit for GET to handle the increased load from this change.

tonistiigi (Member) commented

I believe this is the same call I linked earlier, and I think it is the one that most likely gets the limit error, as these are the most common requests we send.

The way to avoid these requests would be to add an endpoint that returns all existing cache keys for a scope, so we don't need to check them individually. Or add another GET-style request that wouldn't return a download link and would instead allow higher limits (as such an endpoint should be safer, since it can't cause excess traffic).

Additional question: Are the limits per workflow step or more global? I assume they are for the whole workflow/repo/user. If they are not, we might be able to save some requests by caching the "exists" results locally in case there are repeated builds inside the same workflow step.
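
A minimal sketch of that last idea, assuming the limits really are global to the repo: memoize positive "exists" answers in memory for some TTL so repeated builds inside the same workflow step don't re-check the same keys against the API. The names and TTL below are illustrative only, not an actual buildkit or go-actions-cache feature.

    // Hypothetical sketch: memoize positive "exists" lookups inside one workflow step.
    package main

    import (
        "context"
        "sync"
        "time"
    )

    type existsFunc func(ctx context.Context, key string) (bool, error)

    type existsCache struct {
        mu   sync.Mutex
        seen map[string]time.Time
        ttl  time.Duration
        next existsFunc // the real API call
    }

    func newExistsCache(next existsFunc, ttl time.Duration) *existsCache {
        return &existsCache{seen: map[string]time.Time{}, ttl: ttl, next: next}
    }

    func (c *existsCache) Exists(ctx context.Context, key string) (bool, error) {
        c.mu.Lock()
        t, ok := c.seen[key]
        c.mu.Unlock()
        if ok && time.Since(t) < c.ttl {
            return true, nil // answered locally, no API request spent
        }
        found, err := c.next(ctx, key)
        if err != nil || !found {
            return found, err // only remember positive answers
        }
        c.mu.Lock()
        c.seen[key] = time.Now()
        c.mu.Unlock()
        return true, nil
    }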


t-dedah commented May 12, 2022

I believe this is the same call I linked earlier and I think this is the one that most likely gets the limit error as these are the most common requests we send.

The logs shared above show that the 429 is due to the POST call.

Are the limits per workflow step or more global?

The limit is for a particular user, which would mean it's linked to the repo rather than to a workflow run.

The way to avoid these requests would be to add an endpoint that would return all existing cache keys for a scope so we don't need to check them individually. Or add another GET style request that wouldn't return a download link and would instead allow higher limits(as this endpoint should be safer as it can't cause excess traffic).

The addition of a new API might take some time, but we can try increasing the limit for the GET call. I'll get back to you on this one after discussing it with my team.


t-dedah commented May 17, 2022

Hi @tonistiigi

Summary:

  1. You can use the GET API to check whether the cache exists, so you don't have to make multiple Reserve calls. The limit for GET is 900, which should be enough, and we can also increase it further if needed. If the cache key doesn't exist, then you can reserve that layer only.

  2. In the future we will also have a LIST API, which will further reduce the number of REST calls needed. This API is in the planning stage at the moment.

  3. We are also increasing the limit for the Reserve API to 700.

mnapoli added a commit to brefphp/aws-lambda-layers that referenced this issue Jan 17, 2023
phoenix933 pushed a commit to phoenix933/aws-lambda-layers that referenced this issue Dec 3, 2023