storage: High spontaneous latencies #753

Closed
lhecker opened this issue Sep 8, 2017 · 15 comments
Labels: api: bigtable, api: storage, type: question

Comments

lhecker (Contributor) commented Sep 8, 2017

We're currently facing high latencies on GCS operations for 10-30 minutes at a time, multiple times a day, causing timeouts on our APIs. Surprisingly, most issues seem to happen with delete operations, even though those should be the easiest to process.

  • 99th percentile: [screenshot]
  • 95th percentile: [screenshot]
  • 90th percentile: [screenshot]
  • 50th percentile (for comparison): [screenshot]

Our upload/delete code is similar to this:

func Upload(ctx context.Context, path string, src []byte) error {
	wc := s.bucket.Object(path).NewWriter(ctx)
	wc.ContentType = contentType
	wc.CacheControl = "max-age=31556926" // 1 year

	if _, err := wc.Write(src); err != nil {
		wc.Close() // best-effort cleanup; the Write error is the one to return
		return err
	}
	return wc.Close()
}

func Delete(ctx context.Context, path string) error {
	return s.bucket.Object(path).Delete(ctx)
}
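
(For reference, a caller applies the request deadline through the context, roughly like the hypothetical sketch below; the 5-second value is only illustrative. A REFUSED_STREAM retry storm then surfaces as a "context deadline exceeded" error rather than as the underlying HTTP/2 error.)

// Hypothetical caller of the Delete helper above.
func deleteWithDeadline(path string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return Delete(ctx, path)
}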

Are you aware of anyone else having these troubles? Or do you have a solution for this?
I personally suspect these issues stem from my request in #701: because of the retrying on REFUSED_STREAM, those requests end up with these high latencies (tenfold, even!). Could that be the case? And if so, could it be solved somehow?

jba (Contributor) commented Sep 8, 2017

No one else has mentioned this problem (yet).

Could you instrument shouldRetry to see if your theory is correct?
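
(For context: "instrumenting shouldRetry" can be as simple as counting which errors actually trigger a retry. The hook below is hypothetical and assumes the usual log, strings, and sync/atomic imports; the real shouldRetry lives inside the storage package and its shape may differ.)

// refusedStreams counts retries that were triggered by REFUSED_STREAM.
var refusedStreams int64

// noteRetry is a hypothetical hook, called from a vendored copy of the
// client's retry predicate, to log how often REFUSED_STREAM is the reason.
func noteRetry(err error) {
	if err != nil && strings.Contains(err.Error(), "REFUSED_STREAM") {
		log.Printf("retrying after REFUSED_STREAM (#%d): %v",
			atomic.AddInt64(&refusedStreams, 1), err)
	}
}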

lhecker (Contributor, Author) commented Sep 11, 2017

@jba I just deployed our service with an instrumented shouldRetry function. I'll keep you posted. 🙂

lhecker (Contributor, Author) commented Sep 12, 2017

@jba As suspected, it is indeed caused by spontaneous, excessive amounts of REFUSED_STREAM errors. In our case 40% of uploads and 10% of all deletes fail that way, or roughly 25 and 5 requests per second respectively.
The retrying that was added sadly does not solve the problem - it only prolongs it until the deadline runs out. For instance, the deletes above run with a deadline of 25s, which is still easily exceeded.
Our applications are, by the way, running within Google Container Engine.

Do you have ideas as to why that is the case, or how it could be solved?
Could it be an infrastructure problem on your side, such as your GCS load balancer/front proxy being overloaded because HTTP/2 uses fewer connections than HTTP/1?
Either way, this is a very uncomfortable problem, since sometimes 40% of all our requests fail to complete within an arguably generous deadline of 5s. 😕

jba (Contributor) commented Sep 12, 2017

Bringing it up internally. Will keep you posted.

chasmastin commented Sep 15, 2017

We are seeing similar problems with uploads: a high failure rate that can spike up to 30% (but typically hovers around 1-5%). Our timeout is long, 2 minutes, but we still see failures. Our code looks something like this:

ctx, _ := context.WithTimeout(context.Background(), time.Duration(config.GcsTimedContextSec)*time.Second)
writer := object.NewWriter(ctx)
writer.ChunkSize = 0
for i := 0; i < config.GcsWriteRetries; i++ {
	if _, err := writer.Write(body); err != nil {
		log.WithFields(
			logrus.Fields{
				"reason":     err.Error(),
				"uri":        uri,
				"bodySize":   len(body),
				"retries":    config.GcsWriteRetries,
				"bucketName": config.HlsBucket,
				"attempt":    i,
			},
		).Error(errGcsWrite)
		continue
	} else {
		log.WithFields(
			logrus.Fields{
				"reason":     "wrote to gcs",
				"uri":        uri,
				"bodySize":   len(body),
				"retries":    config.GcsWriteRetries,
				"bucketName": config.HlsBucket,
				"attempt":    i,
			},
		).Info(errGcsWrite)
	}
	return
}
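
(For comparison, a sketch of the same retry loop written so that each attempt gets its own writer and Close is checked - the GCS Go client only commits the object once Close returns nil, so a successful Write alone doesn't mean the upload finished. obj, body, the 2-minute timeout, and the retry count stand in for the values above.)

import (
	"context"
	"time"

	"cloud.google.com/go/storage"
)

// uploadWithRetry creates a fresh writer for every attempt and always checks
// Close, since the object is only committed once Close returns nil.
func uploadWithRetry(ctx context.Context, obj *storage.ObjectHandle, body []byte, retries int) error {
	var lastErr error
	for i := 0; i < retries; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
		w := obj.NewWriter(attemptCtx)
		w.ChunkSize = 0 // upload in a single request, as in the snippet above

		_, werr := w.Write(body)
		cerr := w.Close()
		cancel()
		if werr == nil && cerr == nil {
			return nil
		}
		if werr != nil {
			lastErr = werr
		} else {
			lastErr = cerr
		}
	}
	return lastErr
}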

anthmgoogle added the type: question label Nov 10, 2017
chasmastin commented

Is there any update or progress on this? We are still seeing a very high rate of uploads failing with REFUSED_STREAM, and very high latencies, as described above.

lhecker (Contributor, Author) commented Dec 11, 2017

@chasmastin We "solved" the issue in the interim by keeping a running average of failed GCS requests. If that value gets too high, we forcefully instantiate a new connection pool. You can do so by providing the GCS library with your own http.Transport via the option.WithHTTPClient() option each time, which causes a new HTTP connection pool to be created and used.
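
A rough sketch of that workaround, assuming default application credentials (supplying your own client via option.WithHTTPClient means the credentials have to be wired up manually; the transport settings are illustrative):

import (
	"context"
	"net/http"

	"cloud.google.com/go/storage"
	"golang.org/x/oauth2"
	"golang.org/x/oauth2/google"
	"google.golang.org/api/option"
)

// newStorageClient returns a GCS client backed by a brand-new HTTP
// transport, abandoning the old (possibly misbehaving) connection pool.
func newStorageClient(ctx context.Context) (*storage.Client, error) {
	ts, err := google.DefaultTokenSource(ctx, storage.ScopeReadWrite)
	if err != nil {
		return nil, err
	}
	hc := &http.Client{
		Transport: &oauth2.Transport{
			Base:   &http.Transport{}, // fresh transport => fresh connection pool
			Source: ts,
		},
	}
	return storage.NewClient(ctx, option.WithHTTPClient(hc))
}

The service then swaps its client for a fresh one whenever the rolling failure rate crosses the threshold.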

Of course, this did not solve the underlying issue.

jonomacd commented Mar 6, 2018

We have seen somewhat similar things with the bigtable library, which uses gRPC. We have been struggling with occasional ~30-minute periods of latency in which our upper percentiles time out. We often see this across client hosts, but have occasionally seen it against one particular host.

We have been heavily instrumenting the connection logic and the request logic, but it is very hard to pinpoint what is actually going on. For bigtable, our fear was either large rows or hot keys, but that does not appear to be the case. Our current main suspect is the gRPC connection. I wonder if we are seeing similar issues manifested in different APIs.

jeanbza added and then removed the api: storage label Mar 6, 2018
jba assigned jeanbza and unassigned jba Sep 14, 2018
jcgregorio commented

We are seeing the same behavior in Bigtable; in the graph below you can see the periods of high latency. We have a 1-minute timeout set, so it never exceeds 60s, but these periods of high latency occur fairly often. Note that all of the requests plotted here are Bigtable reads, not writes.

[latency graph]

jba added the api: bigtable and api: storage labels Oct 15, 2018
jba (Contributor) commented Oct 15, 2018

Interesting fact: storage uses JSON/HTTP, while bigtable uses gRPC. So if these are related, they are at a deeper level—the Go networking stack, perhaps, or something in Google's network infrastructure.

It would be useful to know if these same issues crop up in other languages' clients.

jeanbza (Member) commented Oct 15, 2018

Off-the-cuff observation: if the BigTable client maintains long-lived connections, then those spikes are probably the GFE killing connections every hour (WAI).

redacted: gRPC should handle connection closures gracefully and with minimal downtime. Nevermind. :)

lhecker (Contributor, Author) commented Oct 15, 2018

@jba At my company we solved the initial issue above by recreating the GCS client, and with it its http.Transport, whenever the error rate exceeds 10% over the last 100 requests.
This simple heuristic triggers around 1-2 times per hour per instance on average, but it has mostly solved the perceived issue in the meantime.
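
A minimal sketch of that heuristic, assuming a fixed window over the last 100 results (the window size and the 10% threshold are the ones mentioned above; everything else is illustrative):

import "sync"

// errorWindow tracks the outcome of the last 100 requests and reports
// whether more than 10% of them failed.
type errorWindow struct {
	mu      sync.Mutex
	results [100]bool // true = request failed
	next    int       // index of the slot to overwrite next
	filled  int       // how many slots hold real results
	failed  int       // how many of those results are failures
}

// Record stores one result and returns true once the window is full and
// the failure rate exceeds 10% - the signal to rebuild the GCS client.
func (w *errorWindow) Record(failed bool) bool {
	w.mu.Lock()
	defer w.mu.Unlock()

	if w.filled == len(w.results) && w.results[w.next] {
		w.failed-- // the oldest result is being evicted
	}
	w.results[w.next] = failed
	w.next = (w.next + 1) % len(w.results)
	if w.filled < len(w.results) {
		w.filled++
	}
	if failed {
		w.failed++
	}
	return w.filled == len(w.results) && w.failed*10 > w.filled
}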

We do in fact have more issues like that with other services, though, for which we haven't found an easy solution yet. For instance, the PubSub clients of our low-traffic services sometimes simply stop receiving any messages at all for several minutes at a time - far longer than the client-side timeout of 60s.

(That's probably not related, and I hope I'm not mistaken, but... every single day, at some point, each database in our Spanner instances increases its CPU usage. That is somewhat understandable, since they're probably running some kind of DB compaction etc. What's interesting is that every single day, at some other consistently random time, Spanner just "decides" to have a 3-10x (not a typo) increase in latency for a few minutes, without any increase in OPS or CPU usage. Strange stuff. 😄)

sduskis commented Dec 6, 2018

At a high level, I feel like these issues are best discussed with the product teams via their support channels. Latency is pretty hard to debug, and I don't think the problems are specific to Go. We should have good, centralized documentation about who to talk to about performance issues. I do think we ought to close this issue, since there isn't anything actionable for the client team.

@jadekler, we did some research on the Cloud Bigtable side. There's an interesting client-side issue with a subtle interaction between connection pooling, low-QPS clients, connection refreshes, and retries with exponential backoff. That issue ought to be discussed offline, in the context of performance testing and fixes.

jeanbza assigned enocom and unassigned jeanbza Dec 10, 2018
enocom (Member) commented Dec 10, 2018

I suspect this issue is another manifestation of golang/go#27208, which we discovered in #1211. In short, it's an http/2 bug, which has since been fixed and will arrive in Go 1.11.3.

I'm going to close this. If the problem appears again with Go 1.11.3, feel free to re-open.

enocom closed this as completed Dec 10, 2018
lhecker (Contributor, Author) commented Dec 15, 2018

FYI: The (supposed) fix for this issue was recently released in Go 1.11.4 instead. 🙂
