storage: High spontaneous latencies #753

Closed
lhecker opened this issue Sep 8, 2017 · 15 comments
Labels: api: bigtable, api: storage, type: question

Comments

lhecker (Contributor) commented Sep 8, 2017

We're currently facing high latencies on GCS operations for 10-30 minutes at a time, multiple times a day, causing timeouts on our APIs. Surprisingly, most issues seem to happen with delete operations, even though those should be the easiest to process.

  • 99th percentile: [screenshot]
  • 95th percentile: [screenshot]
  • 90th percentile: [screenshot]
  • 50th percentile (for comparison): [screenshot]

Our upload/delete code is similar to this:

func Upload(ctx context.Context, path string, src []byte) error {
	wc := s.bucket.Object(path).NewWriter(ctx)
	wc.ContentType = contentType
	wc.CacheControl = "max-age=31556926" // 1 year

	if _, err := wc.Write(src); err != nil {
		wc.Close() // best-effort cleanup; the Write error is the one to return
		return err
	}
	return wc.Close()
}

func Delete(ctx context.Context, path string) error {
	return s.bucket.Object(path).Delete(ctx)
}
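
(For reference, a caller applies the request deadline through the context, roughly like the hypothetical sketch below; the 5-second value is only illustrative. A REFUSED_STREAM retry storm then surfaces as a "context deadline exceeded" error rather than as the underlying HTTP/2 error.)

// Hypothetical caller of the Delete helper above.
func deleteWithDeadline(path string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return Delete(ctx, path)
}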

Are you aware of anyone else having these troubles? Or do you have a solution for this?
I personally suspect these issues stem from my request in #701: because of the retrying on REFUSED_STREAM, those requests end up with these high latencies (tenfold, even!). Could that be the case? And if so, could it be solved somehow?

jba (Contributor) commented Sep 8, 2017

No one else has mentioned this problem (yet).

Could you instrument shouldRetry to see if your theory is correct?
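
(For context: "instrumenting shouldRetry" can be as simple as counting which errors actually trigger a retry. The hook below is hypothetical and assumes the usual log, strings, and sync/atomic imports; the real shouldRetry lives inside the storage package and its shape may differ.)

// refusedStreams counts retries that were triggered by REFUSED_STREAM.
var refusedStreams int64

// noteRetry is a hypothetical hook, called from a vendored copy of the
// client's retry predicate, to log how often REFUSED_STREAM is the reason.
func noteRetry(err error) {
	if err != nil && strings.Contains(err.Error(), "REFUSED_STREAM") {
		log.Printf("retrying after REFUSED_STREAM (#%d): %v",
			atomic.AddInt64(&refusedStreams, 1), err)
	}
}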

lhecker (Contributor, Author) commented Sep 11, 2017

@jba I just deployed our service with an instrumented shouldRetry function. I'll keep you posted. 🙂

lhecker (Contributor, Author) commented Sep 12, 2017

@jba As suspected, it is indeed caused by spontaneous, excessive amounts of REFUSED_STREAM errors. In our case 40% of uploads and 10% of all deletes fail that way, or roughly 25 and 5 requests per second respectively.
The retrying that was added sadly does not solve the problem - it only prolongs it until the deadline runs out. For instance, the deletes above run with a deadline of 25s, which is still easily exceeded.
Our applications are, by the way, running within Google Container Engine.

Do you have ideas as to why that is the case, or how it could be solved?
Could it be an infrastructure problem on your side, such as your GCS load balancer/front proxy being overloaded because HTTP/2 uses fewer connections than HTTP/1?
Either way, this is a very uncomfortable problem, since sometimes 40% of all our requests fail to complete within an arguably generous deadline of 5s. 😕

jba (Contributor) commented Sep 12, 2017

Bringing it up internally. Will keep you posted.

chasmastin commented Sep 15, 2017

We are seeing similar problems with uploads: a high failure rate that can spike up to 30% (but typically hovers around 1-5%). Our timeout is long, 2 minutes, but we still see failures. Our code looks something like this:

ctx, _ := context.WithTimeout(context.Background(), time.Duration(config.GcsTimedContextSec)*time.Second)
writer := object.NewWriter(ctx)
writer.ChunkSize = 0
for i := 0; i < config.GcsWriteRetries; i++ {
	if _, err := writer.Write(body); err != nil {
		log.WithFields(
			logrus.Fields{
				"reason":     err.Error(),
				"uri":        uri,
				"bodySize":   len(body),
				"retries":    config.GcsWriteRetries,
				"bucketName": config.HlsBucket,
				"attempt":    i,
			},
		).Error(errGcsWrite)
		continue
	} else {
		log.WithFields(
			logrus.Fields{
				"reason":     "wrote to gcs",
				"uri":        uri,
				"bodySize":   len(body),
				"retries":    config.GcsWriteRetries,
				"bucketName": config.HlsBucket,
				"attempt":    i,
			},
		).Info(errGcsWrite)
	}
	return
}
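
(For comparison, a sketch of the same retry loop written so that each attempt gets its own writer and Close is checked - the GCS Go client only commits the object once Close returns nil, so a successful Write alone doesn't mean the upload finished. obj, body, the 2-minute timeout, and the retry count stand in for the values above.)

import (
	"context"
	"time"

	"cloud.google.com/go/storage"
)

// uploadWithRetry creates a fresh writer for every attempt and always checks
// Close, since the object is only committed once Close returns nil.
func uploadWithRetry(ctx context.Context, obj *storage.ObjectHandle, body []byte, retries int) error {
	var lastErr error
	for i := 0; i < retries; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
		w := obj.NewWriter(attemptCtx)
		w.ChunkSize = 0 // upload in a single request, as in the snippet above

		_, werr := w.Write(body)
		cerr := w.Close()
		cancel()
		if werr == nil && cerr == nil {
			return nil
		}
		if werr != nil {
			lastErr = werr
		} else {
			lastErr = cerr
		}
	}
	return lastErr
}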

anthmgoogle added the type: question label Nov 10, 2017
chasmastin commented

Is there any update or progress on this? We are still seeing a very high rate of uploads failing with REFUSED_STREAM, and very high latencies, as described above.

lhecker (Contributor, Author) commented Dec 11, 2017

@chasmastin We "solved" the issue in the interim by keeping a running average of failed GCS requests. If that value gets too high, we forcefully instantiate a new connection pool. You can do so by providing the GCS library with your own http.Transport via the option.WithHTTPClient() option each time, which causes a new HTTP connection pool to be created and used.
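
A rough sketch of that workaround, assuming default application credentials (supplying your own client via option.WithHTTPClient means the credentials have to be wired up manually; the transport settings are illustrative):

import (
	"context"
	"net/http"

	"cloud.google.com/go/storage"
	"golang.org/x/oauth2"
	"golang.org/x/oauth2/google"
	"google.golang.org/api/option"
)

// newStorageClient returns a GCS client backed by a brand-new HTTP
// transport, abandoning the old (possibly misbehaving) connection pool.
func newStorageClient(ctx context.Context) (*storage.Client, error) {
	ts, err := google.DefaultTokenSource(ctx, storage.ScopeReadWrite)
	if err != nil {
		return nil, err
	}
	hc := &http.Client{
		Transport: &oauth2.Transport{
			Base:   &http.Transport{}, // fresh transport => fresh connection pool
			Source: ts,
		},
	}
	return storage.NewClient(ctx, option.WithHTTPClient(hc))
}

The service then swaps its client for a fresh one whenever the rolling failure rate crosses the threshold.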

Of course, this did not solve the underlying issue.

jonomacd commented Mar 6, 2018

We have seen somewhat similar things with the bigtable library, which uses gRPC. We have been struggling with occasional ~30-minute periods of latency in which our upper percentiles time out. We often see this across client hosts, but have occasionally seen it against one particular host.

We have been heavily instrumenting the connection logic and the request logic, but it is very hard to pinpoint what is actually going on. For bigtable, our fear was either large rows or hot keys, but that does not appear to be the case. Our current main suspect is the gRPC connection. I wonder if we are seeing similar issues manifested in different APIs.

jeanbza added and then removed the api: storage label Mar 6, 2018
jba assigned jeanbza and unassigned jba Sep 14, 2018
jcgregorio commented

We are seeing the same behavior in Bigtable; in the graph below you can see the periods of high latency. We have a 1-minute timeout set, so it never exceeds 60s, but these periods of high latency occur fairly often. Note that all of the requests plotted here are Bigtable reads, not writes.

[latency graph]

jba added the api: bigtable and api: storage labels Oct 15, 2018
jba (Contributor) commented Oct 15, 2018

Interesting fact: storage uses JSON/HTTP, while bigtable uses gRPC. So if these are related, they are at a deeper level—the Go networking stack, perhaps, or something in Google's network infrastructure.

It would be useful to know if these same issues crop up in other languages' clients.

jeanbza (Member) commented Oct 15, 2018

Off-the-cuff observation: if the BigTable client maintains long-lived connections, then those spikes are probably the GFE killing connections every hour (WAI).

redacted: gRPC should handle connection closures gracefully and with minimal downtime. Nevermind. :)

lhecker (Contributor, Author) commented Oct 15, 2018

@jba At my company we solved the initial issue above by recreating the GCS client, and with it its http.Transport, whenever the error rate exceeds 10% over the last 100 requests.
This simple heuristic triggers around 1-2 times per hour per instance on average, but it has mostly solved the perceived issue in the meantime.
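
A minimal sketch of that heuristic, assuming a fixed window over the last 100 results (the window size and the 10% threshold are the ones mentioned above; everything else is illustrative):

import "sync"

// errorWindow tracks the outcome of the last 100 requests and reports
// whether more than 10% of them failed.
type errorWindow struct {
	mu      sync.Mutex
	results [100]bool // true = request failed
	next    int       // index of the slot to overwrite next
	filled  int       // how many slots hold real results
	failed  int       // how many of those results are failures
}

// Record stores one result and returns true once the window is full and
// the failure rate exceeds 10% - the signal to rebuild the GCS client.
func (w *errorWindow) Record(failed bool) bool {
	w.mu.Lock()
	defer w.mu.Unlock()

	if w.filled == len(w.results) && w.results[w.next] {
		w.failed-- // the oldest result is being evicted
	}
	w.results[w.next] = failed
	w.next = (w.next + 1) % len(w.results)
	if w.filled < len(w.results) {
		w.filled++
	}
	if failed {
		w.failed++
	}
	return w.filled == len(w.results) && w.failed*10 > w.filled
}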

We do in fact have more issues like that with other services, though, for which we haven't found an easy solution yet. For instance, the PubSub clients of our low-traffic services sometimes simply stop receiving any messages at all for several minutes at a time - far longer than the client-side timeout of 60s.

(That's probably not related, and I hope I'm not mistaken, but... every single day, at some point, each database in our Spanner instances increases its CPU usage. That is somewhat understandable, since they're probably running some kind of DB compaction etc. What's interesting is that every single day, at some other consistently random time, Spanner just "decides" to have a 3-10x (not a typo) increase in latency for a few minutes, without any increase in OPS or CPU usage. Strange stuff. 😄)

sduskis commented Dec 6, 2018

At a high level, I feel like these issues are best discussed with the product teams via their support channels. Latency is pretty hard to debug, and I don't think the problems are specific to Go. We should have good, centralized documentation about who to talk to about performance issues. I do think we ought to close this issue, since there isn't anything actionable for the client team.

@jadekler, we did some research on the Cloud Bigtable side. There's an interesting client-side issue with a subtle interaction between connection pooling, low-QPS clients, connection refreshes, and retries with exponential backoff. That issue ought to be discussed offline, in the context of performance testing and fixes.

jeanbza assigned enocom and unassigned jeanbza Dec 10, 2018
enocom (Member) commented Dec 10, 2018

I suspect this issue is another manifestation of golang/go#27208, which we discovered in #1211. In short, it's an http/2 bug, which has since been fixed and will arrive in Go 1.11.3.

I'm going to close this. If the problem appears again with Go 1.11.3, feel free to re-open.

enocom closed this as completed Dec 10, 2018
lhecker (Contributor, Author) commented Dec 15, 2018

FYI: The (supposed) fix for this issue was recently released in Go 1.11.4 instead. 🙂
