
backoff/timeout does not take into account actual GRPC connection established, causes denial of service. #954

Closed
cdesousa opened this issue Oct 28, 2016 · 23 comments · Fixed by #1340

Comments

@cdesousa

cdesousa commented Oct 28, 2016

When I connect a gRPC client to an IP and port that has no listener, the client properly tries to reconnect and respects the dial timeout and backoff.

However, if I run socat -v TCP-LISTEN:8002,fork SYSTEM:'echo hi' and then point my grpc-go client at it, the client keeps reconnecting at a fast rate in what looks like an infinite loop, effectively DoS'ing the host it is running on, since the closed TCP connections linger in TIME_WAIT for the tcp_fin_timeout duration.

WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
[infinite]

This can cause problems when using a TCP load balancer such as ELB, which always keeps its TCP listen port open regardless of whether any backend instances are available: the connection is accepted and then immediately closed.

Even without a load balancer, if the server the client connects to misbehaves for any reason, the same thing happens.

Note: I'm using grpc.WithTimeout and grpc.WithBlock.
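
For reference, a minimal sketch of the kind of client setup involved, using the dial options mentioned above (the address stands in for the socat/ELB endpoint, and WithInsecure is only for a local repro):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Dial an endpoint that accepts TCP connections and immediately closes
	// them (the socat listener above); the address is a placeholder.
	conn, err := grpc.Dial("127.0.0.1:8002",
		grpc.WithInsecure(),             // plaintext, local repro only
		grpc.WithBlock(),                // block until the connection is considered up
		grpc.WithTimeout(5*time.Second), // dial timeout
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	select {} // keep the process alive to observe the reconnect loop
}
```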

@cdesousa cdesousa changed the title backoff/timeout does not take into account actual GRPC connection established. backoff/timeout does not take into account actual GRPC connection established, causes denial of service. Oct 28, 2016
@jellevandenhooff

jellevandenhooff commented Oct 31, 2016

This issue is the same as #120, but @cdesousa's description here is much better.

@gm42

gm42 commented Oct 31, 2016

@jellevandenhooff in #120 the main topic is how annoying the messages are; here, instead, I see that:

  • grpc-go has "smart" functionality to handle reconnection with goroutines
  • such functionality does not cover the case of an (unarguably misbehaving) server that accepts and immediately closes connections

Since grpc-go wants to automate reconnection, I think it should also cover this type of misbehaving server to prevent self-DoS scenarios. It should be possible to cover not only the dialing stage but also the reconnection stage (think of IRC's "reconnecting too fast" error message).

@gm42

gm42 commented Oct 31, 2016

I have dug into this a bit; it turns out that the tight-looping (DoS-ing) behaviour lives in the transportMonitor() goroutine, which does not take into account a server that serves an immediate EOF.

Steps to reproduce the problem:

  1. start a gRPC example server on port 8081
  2. connect the client
  3. stop the server (e.g. Ctrl+C)
  4. start a dummy EOF-server in place of it (a Go equivalent is sketched below): socat -v TCP-LISTEN:8081,fork SYSTEM:'echo hi'

The client will generate an avalanche of reconnections, quickly exhausting all file descriptors.
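
If socat isn't handy, a rough Go equivalent of the dummy EOF-server in step 4 (a sketch, not part of the original report):

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Listen on the gRPC port and close every connection immediately,
	// mimicking the socat command above.
	ln, err := net.Listen("tcp", ":8081")
	if err != nil {
		log.Fatal(err)
	}
	for {
		c, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		c.Close() // the gRPC client sees an immediate EOF
	}
}
```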

@jellevandenhooff

@iamqizhao, what do you think of adding logic to the transport monitor to back off before attempting to reconnect again if the first read in the reader goroutine fails?
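
To make that concrete, a purely illustrative sketch of such a policy (hypothetical names; this is not grpc-go's actual transportMonitor code):

```go
package main

import "time"

// reconnectLoop sketches the idea: if the transport dies before its first
// successful read, back off exponentially before redialing, and reset the
// delay once a connection has proven healthy.
func reconnectLoop(connect func() bool, base, max time.Duration) {
	delay := base
	for {
		if connect() { // connect dials and reports whether the first read succeeded
			delay = base // transport was healthy at least once: reset backoff
			continue
		}
		time.Sleep(delay) // transport broke before its first read: back off
		if delay *= 2; delay > max {
			delay = max
		}
	}
}

func main() {} // wiring this into a real dialer is omitted
```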

@ahmetb

ahmetb commented Feb 22, 2017

+1, this is still an issue. I have a very basic gRPC use case, and the log output is flooded with this message every 4 minutes while the client sits idle. It is confusing for someone who has just started using gRPC, and it prompts a Google search that eventually lands people here. I am still wondering whether I'm doing something wrong.

2017/02/22 09:37:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:41:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:45:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:49:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:53:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:57:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.

@MakMukhi
Contributor

MakMukhi commented Feb 23, 2017 via email

@ahmetb

ahmetb commented Feb 23, 2017

> seems like the underlying transport breaks due to something

I think we need to dig deeper into this one. I understand the logger work is in progress, but this one might be pointing to an actual problem.

This is basically just me running multiple gRPC services on my laptop; it starts happening a while (about 2 minutes) after the services stop receiving requests:

DEBU[2017-02-23T12:06:08-08:00] request completed                             elapsed=1.512785ms method=POST path="/coffee" service=web

2017/02/23 12:08:24 transport: http2Client.notifyError got notified that the client transport was broken EOF.

@jba

jba commented Mar 17, 2017

@MakMukhi Do you have an ETA and/or gRPC version number for when the log spam will cease?

@dfawley dfawley assigned dfawley and menghanl and unassigned dfawley Mar 20, 2017
@montanaflynn

I'm getting these logs when using the gcloud vision client (cloud.google.com/go/vision)

transport: http2Client.notifyError got notified that the client transport was broken EOF.

@dmonay

dmonay commented May 17, 2017

I am also getting these logs when using Google's Cloud Speech and Cloud Translate APIs.

Is there an update on this?

@ahmetb

ahmetb commented Jun 12, 2017

I have an easy repro for this: I initialized one of the cloud.google.com/go libraries, made no requests, created an interactive prompt that reads from the user, and just waited for a couple of minutes at the prompt.

Simplest repro:

package main

import (
	"bufio"
	"context"
	"fmt"
	"os"

	iamadmin "cloud.google.com/go/iam/admin/apiv1"
)

func main() {
	ctx := context.Background()
	client, err := iamadmin.NewIamClient(ctx)
	if err != nil { panic(err) }
	defer client.Close()
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() { fmt.Println(scanner.Text()) } // <-- do not hit Return when you run the program
	if err := scanner.Err(); err != nil { panic(err) }
}

Shortly after, you'll start seeing:

2017/06/12 12:37:39 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/06/12 12:41:39 transport: http2Client.notifyError got notified that the client transport was broken EOF.
...

@menghanl
Contributor

The logging API changes are in #922. We will get it merged by the end of next week.

After this, we will change the gRPC code to use the correct log level.
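
In the meantime (and assuming the LoggerV2 API proposed in #922), callers can redirect or silence this output themselves via the grpclog package; a sketch:

```go
package main

import (
	"io/ioutil"
	"os"

	"google.golang.org/grpc/grpclog"
)

func main() {
	// Send INFO-level output (where this message should land once its level is
	// corrected) to ioutil.Discard; keep warnings and errors on stderr.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(ioutil.Discard, os.Stderr, os.Stderr))
}
```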

@jellevandenhooff

@menghanl any updates on the zero-delay reconnect logic?

@menghanl
Contributor

@jellevandenhooff
If I remember correctly, the zero-delay reconnect issue only happens with a misbehaving server. It should not be an issue in a correctly configured environment.
Also, in the current gRPC code structure, it's not trivial to add backoff in this situation.
Considering the effort needed to fix this and the severity of the problem, we are putting it at a low priority.

The logging issue is more annoying for users, so we will get that fixed soon.

@ahmetb

ahmetb commented Jun 19, 2017

Let me know if you need a repro program. I am assuming the fix will get rid of this error message without any change in the caller code. If that's not the case, we're not fixing it right.

@jellevandenhooff

Hi @menghanl, I don't think #1340 fixes this issue. Though it helps with the log messages, the file descriptor exhaustion described in @gm42's comment still happens.

@menghanl
Contributor

Sorry, reopened.

@menghanl menghanl reopened this Jul 13, 2017
@BernhardValenti

2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42024 - 405 0B in 24.254µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42026 - 405 0B in 12.321µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42028 - 405 0B in 50.646µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42030 - 405 0B in 20.908µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42032 - 405 0B in 10.143µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42034 - 405 0B in 16.86µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42036 - 405 0B in 14.035µs

I'm using the chi framework, and I see a flood of these when I enable `router.Use(middleware.Logger)` in my server and a backend gRPC service it uses is offline.

@dfawley
Member

dfawley commented Jul 17, 2017

We discussed this a bit today and weren't sure exactly what the right behavior is in this case. As mentioned above, backoff applies only when the server does not respond, not when it can be reached but then immediately fails. Relatedly, the pick-first balancer, which restarts from the top of the backend list upon any backend failure, would keep picking this same problem server if it is at the top of the list. We will need to do some more investigation.
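
For illustration, the dial backoff that does exist is already tunable, e.g. via grpc.WithBackoffMaxDelay, but as described above it only governs failed connection attempts, not a transport that is accepted and then immediately torn down (a sketch; the address is a placeholder):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Caps the exponential backoff between failed *dial* attempts; it does not
	// kick in when the server accepts and then immediately drops the transport.
	conn, err := grpc.Dial("backend.example:50051",
		grpc.WithInsecure(),
		grpc.WithBackoffMaxDelay(30*time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```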

@vtubati

vtubati commented Sep 16, 2017

A side effect of these rapid retries, which leave hundreds of lingering TCP sessions when the remote server misbehaves (in our case because of a misconfiguration pointing at the wrong server), is that GC gets triggered every 15 seconds or so and takes up to 70% of a core for about 15 seconds. Below is a snippet of the perf report. Is there any way to expedite a fix for this issue?
Samples: 49K of event 'cycles', Event count (approx.): 21149591915
29.95% emsd emsd [.] runtime.scanobject
21.11% emsd emsd [.] runtime.memclrNoHeapPointers
10.52% emsd emsd [.] runtime.greyobject
9.67% emsd emsd [.] runtime.heapBitsForObject
2.76% emsd emsd [.] runtime.gcDrain
1.52% emsd emsd [.] runtime.scanblock
[snip]

@dfawley
Member

dfawley commented Sep 18, 2017

This and similar problems have come up a few times lately, so we would like to do something soon. The tentative plan is:

  1. Add a dial option that lets the user choose to wait until a settings frame is received from the server before considering the connection established. This is what the Java client does, and it will prevent this problem when connecting to a server that is accepting connections but isn't a gRPC endpoint or isn't healthy enough to send settings frames.

  2. Make this option opt-out after one release, instead of opt-in.

  3. Remove this option after one more release.

If a server is healthy enough to send a settings frame, and then disconnects, and that happens repeatedly...I'm not sure there's much we can do about that without changes to the gRPC spec.

One incompatibility with this behavior that I'm aware of is cmux -- there is a workaround, but it would require a small code change. If anyone knows of other possible problems with this approach, please let us know.
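
For reference, a sketch of the cmux workaround being referred to, based on cmux's documented gRPC matcher (the actual gRPC and HTTP servers are elided):

```go
package main

import (
	"net"

	"github.com/soheilhy/cmux"
)

func main() {
	l, err := net.Listen("tcp", ":50051")
	if err != nil {
		panic(err)
	}
	m := cmux.New(l)

	// Match gRPC with a MatchWriter that sends the HTTP/2 SETTINGS frame itself,
	// so a client that waits for SETTINGS before proceeding is satisfied.
	grpcL := m.MatchWithWriters(
		cmux.HTTP2MatchHeaderFieldSendSettings("content-type", "application/grpc"))
	httpL := m.Match(cmux.Any())

	_, _ = grpcL, httpL // serve the gRPC server on grpcL and the HTTP server on httpL
	m.Serve()
}
```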

@vtubati

vtubati commented Sep 18, 2017

Thanks for the details of the fix being considered. I wonder whether you are in a position to share when the patch is likely to be available. Thanks.

@dfawley
Member

dfawley commented Sep 19, 2017

Let's consolidate this with #1444.

@dfawley dfawley closed this as completed Sep 19, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018