
backoff/timeout does not take into account actual GRPC connection established, causes denial of service. #954

Closed
cdesousa opened this issue Oct 28, 2016 · 23 comments · Fixed by #1340

Comments

@cdesousa

cdesousa commented Oct 28, 2016

When I connect a gRPC client to an IP and port that has no listener, the client properly tries to reconnect and respects the dial timeout and backoff.

However, if I run socat -v TCP-LISTEN:8002,fork SYSTEM:'echo hi' and then point my grpc-go client at it, the client keeps reconnecting at a fast rate in what looks like an infinite loop, effectively DoS'ing the host it is running on, since the closed TCP connections linger in TIME_WAIT for the tcp_fin_timeout duration.

WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
WARN[0008] transport: http2Client.notifyError got notified that the client transport was broken unexpected EOF.
[infinite]

This can cause problems when using a TCP load balancer such as ELB, which always keeps its TCP listen port open regardless of whether any backend instances are available: the connection is accepted and then immediately closed.

Even without a load balancer, if the server the client connects to misbehaves for any reason, the same thing happens.

Note: I'm using grpc.WithTimeout and grpc.WithBlock.
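
For reference, a minimal sketch of the kind of client setup involved, using the dial options mentioned above (the address stands in for the socat/ELB endpoint, and WithInsecure is only for a local repro):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Dial an endpoint that accepts TCP connections and immediately closes
	// them (the socat listener above); the address is a placeholder.
	conn, err := grpc.Dial("127.0.0.1:8002",
		grpc.WithInsecure(),             // plaintext, local repro only
		grpc.WithBlock(),                // block until the connection is considered up
		grpc.WithTimeout(5*time.Second), // dial timeout
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
	select {} // keep the process alive to observe the reconnect loop
}
```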

@cdesousa cdesousa changed the title backoff/timeout does not take into account actual GRPC connection established. backoff/timeout does not take into account actual GRPC connection established, causes denial of service. Oct 28, 2016
@jellevandenhooff

jellevandenhooff commented Oct 31, 2016

This issue is the same as #120, but @cdesousa's description here is much better.

@gm42

gm42 commented Oct 31, 2016

@jellevandenhooff in #120 the main topic is how annoying the messages are; here, instead, I see that:

  • grpc-go has "smart" functionality to handle reconnection with goroutines
  • such functionality does not cover the case of an (unarguably misbehaving) server that accepts and immediately closes connections

Since grpc-go wants to automate reconnection, I think it should also cover this type of misbehaving server to prevent self-DoS scenarios. It should be possible to cover not only the dialing stage but also the reconnection stage (think of IRC's "reconnecting too fast" error message).

@gm42

gm42 commented Oct 31, 2016

I have dug into this a bit; it turns out that the tight-looping (DoS-ing) behaviour lives in the transportMonitor() goroutine, which does not take into account a server that serves an immediate EOF.

Steps to reproduce the problem:

  1. start a gRPC example server on port 8081
  2. connect the client
  3. stop the server (e.g. Ctrl+C)
  4. start a dummy EOF-server in place of it (a Go equivalent is sketched below): socat -v TCP-LISTEN:8081,fork SYSTEM:'echo hi'

The client will generate an avalanche of reconnections, quickly exhausting all file descriptors.
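
If socat isn't handy, a rough Go equivalent of the dummy EOF-server in step 4 (a sketch, not part of the original report):

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Listen on the gRPC port and close every connection immediately,
	// mimicking the socat command above.
	ln, err := net.Listen("tcp", ":8081")
	if err != nil {
		log.Fatal(err)
	}
	for {
		c, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		c.Close() // the gRPC client sees an immediate EOF
	}
}
```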

@jellevandenhooff

@iamqizhao, what do you think of adding logic to the transport monitor to back off before attempting to reconnect again if the first read in the reader goroutine fails?
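
To make that concrete, a purely illustrative sketch of such a policy (hypothetical names; this is not grpc-go's actual transportMonitor code):

```go
package main

import "time"

// reconnectLoop sketches the idea: if the transport dies before its first
// successful read, back off exponentially before redialing, and reset the
// delay once a connection has proven healthy.
func reconnectLoop(connect func() bool, base, max time.Duration) {
	delay := base
	for {
		if connect() { // connect dials and reports whether the first read succeeded
			delay = base // transport was healthy at least once: reset backoff
			continue
		}
		time.Sleep(delay) // transport broke before its first read: back off
		if delay *= 2; delay > max {
			delay = max
		}
	}
}

func main() {} // wiring this into a real dialer is omitted
```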

@ahmetb

ahmetb commented Feb 22, 2017

+1, this is still an issue. I have a very basic gRPC use case, and the log output is flooded with this message every 4 minutes while the client sits idle. It is confusing for someone who has just started using gRPC, and it prompts a Google search that eventually lands people here. I am still wondering whether I'm doing something wrong.

2017/02/22 09:37:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:41:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:45:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:49:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:53:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/02/22 09:57:14 transport: http2Client.notifyError got notified that the client transport was broken EOF.

@MakMukhi
Contributor

MakMukhi commented Feb 23, 2017 via email

@ahmetb

ahmetb commented Feb 23, 2017

> seems like the underlying transport breaks due to something

I think we need to dig deeper into this one. I understand the logger work is in progress, but this one might be pointing to an actual problem.

This is basically just me running multiple gRPC services on my laptop; it starts happening a while (about 2 minutes) after the services stop receiving requests:

DEBU[2017-02-23T12:06:08-08:00] request completed                             elapsed=1.512785ms method=POST path="/coffee" service=web

2017/02/23 12:08:24 transport: http2Client.notifyError got notified that the client transport was broken EOF.

@jba

jba commented Mar 17, 2017

@MakMukhi Do you have an ETA and/or gRPC version number for when the log spam will cease?

@dfawley dfawley assigned dfawley and menghanl and unassigned dfawley Mar 20, 2017
@montanaflynn

I'm getting these logs when using the gcloud vision client (cloud.google.com/go/vision)

transport: http2Client.notifyError got notified that the client transport was broken EOF.

@dmonay

dmonay commented May 17, 2017

I am also getting these logs when using Google's Cloud Speech and Cloud Translate APIs.

Is there an update on this?

@ahmetb

ahmetb commented Jun 12, 2017

I have an easy repro for this: I initialized one of the cloud.google.com/go libraries, made no requests, created an interactive prompt that reads from the user, and just waited for a couple of minutes at the prompt.

Simplest repro:

package main

import (
	"bufio"
	"context"
	"fmt"
	"os"

	iamadmin "cloud.google.com/go/iam/admin/apiv1"
)

func main() {
	ctx := context.Background()
	client, err := iamadmin.NewIamClient(ctx)
	if err != nil { panic(err) }
	defer client.Close()
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() { fmt.Println(scanner.Text()) } // <-- do not hit Return when you run the program
	if err := scanner.Err(); err != nil { panic(err) }
}

Shortly after, you'll start seeing:

2017/06/12 12:37:39 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/06/12 12:41:39 transport: http2Client.notifyError got notified that the client transport was broken EOF.
...

@menghanl
Contributor

The logging API changes are in #922. We will get it merged by the end of next week.

After this, we will change the gRPC code to use the correct log level.
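
In the meantime (and assuming the LoggerV2 API proposed in #922), callers can redirect or silence this output themselves via the grpclog package; a sketch:

```go
package main

import (
	"io/ioutil"
	"os"

	"google.golang.org/grpc/grpclog"
)

func main() {
	// Send INFO-level output (where this message should land once its level is
	// corrected) to ioutil.Discard; keep warnings and errors on stderr.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2(ioutil.Discard, os.Stderr, os.Stderr))
}
```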

@jellevandenhooff

@menghanl any updates on the zero-delay reconnect logic?

@menghanl
Contributor

@jellevandenhooff
If I remember correctly, the zero-delay reconnect issue only happens with a misbehaving server. It should not be an issue in a correctly configured environment.
Also, in the current gRPC code structure, it's not trivial to add backoff in this situation.
Considering the effort needed to fix this and the severity of the problem, we are putting it at a low priority.

The logging issue is more annoying for users, so we will get that fixed soon.

@ahmetb

ahmetb commented Jun 19, 2017

Let me know if you need a repro program. I am assuming the fix will get rid of this error message without any change in the caller code. If that's not the case, we're not fixing it right.

@jellevandenhooff

Hi @menghanl, I don't think #1340 fixes this issue. Though it helps with the log messages, the file descriptor exhaustion described in @gm42's comment still happens.

@menghanl
Contributor

Sorry, reopened.

@menghanl menghanl reopened this Jul 13, 2017
@BernhardValenti

2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42024 - 405 0B in 24.254µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42026 - 405 0B in 12.321µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42028 - 405 0B in 50.646µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42030 - 405 0B in 20.908µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42032 - 405 0B in 10.143µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42034 - 405 0B in 16.86µs
2017/07/17 15:12:28 stdout: "PRI http://* HTTP/2.0" from 127.0.0.1:42036 - 405 0B in 14.035µs

I'm using the chi framework, and I see a flood of these when I enable `router.Use(middleware.Logger)` in my server and a backend gRPC service it uses is offline.

@dfawley
Member

dfawley commented Jul 17, 2017

We discussed this a bit today and weren't sure exactly what the right behavior is in this case. As mentioned above, backoff applies only when the server does not respond, not when it can be reached but then immediately fails. Relatedly, the pick-first balancer, which restarts from the top of the backend list upon any backend failure, would keep picking this same problem server if it is at the top of the list. We will need to do some more investigation.
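
For illustration, the dial backoff that does exist is already tunable, e.g. via grpc.WithBackoffMaxDelay, but as described above it only governs failed connection attempts, not a transport that is accepted and then immediately torn down (a sketch; the address is a placeholder):

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Caps the exponential backoff between failed *dial* attempts; it does not
	// kick in when the server accepts and then immediately drops the transport.
	conn, err := grpc.Dial("backend.example:50051",
		grpc.WithInsecure(),
		grpc.WithBackoffMaxDelay(30*time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```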

@vtubati

vtubati commented Sep 16, 2017

A side effect of these rapid retries, which leave hundreds of lingering TCP sessions when the remote server misbehaves (in our case because of a misconfiguration pointing at the wrong server), is that GC gets triggered every 15 seconds or so and takes up to 70% of a core for about 15 seconds. Below is a snippet of the perf report. Is there any way to expedite a fix for this issue?
Samples: 49K of event 'cycles', Event count (approx.): 21149591915
29.95% emsd emsd [.] runtime.scanobject
21.11% emsd emsd [.] runtime.memclrNoHeapPointers
10.52% emsd emsd [.] runtime.greyobject
9.67% emsd emsd [.] runtime.heapBitsForObject
2.76% emsd emsd [.] runtime.gcDrain
1.52% emsd emsd [.] runtime.scanblock
[snip]

@dfawley
Member

dfawley commented Sep 18, 2017

This and similar problems have come up a few times lately, so we would like to do something soon. The tentative plan is:

  1. Add a dial option that lets the user choose to wait until a settings frame is received from the server before considering the connection established. This is what the Java client does, and it will prevent this problem when connecting to a server that is accepting connections but isn't a gRPC endpoint or isn't healthy enough to send settings frames.

  2. Make this option opt-out after one release, instead of opt-in.

  3. Remove this option after one more release.

If a server is healthy enough to send a settings frame, and then disconnects, and that happens repeatedly...I'm not sure there's much we can do about that without changes to the gRPC spec.

One incompatibility with this behavior that I'm aware of is cmux -- there is a workaround, but it would require a small code change. If anyone knows of other possible problems with this approach, please let us know.
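
For reference, a sketch of the cmux workaround being referred to, based on cmux's documented gRPC matcher (the actual gRPC and HTTP servers are elided):

```go
package main

import (
	"net"

	"github.com/soheilhy/cmux"
)

func main() {
	l, err := net.Listen("tcp", ":50051")
	if err != nil {
		panic(err)
	}
	m := cmux.New(l)

	// Match gRPC with a MatchWriter that sends the HTTP/2 SETTINGS frame itself,
	// so a client that waits for SETTINGS before proceeding is satisfied.
	grpcL := m.MatchWithWriters(
		cmux.HTTP2MatchHeaderFieldSendSettings("content-type", "application/grpc"))
	httpL := m.Match(cmux.Any())

	_, _ = grpcL, httpL // serve the gRPC server on grpcL and the HTTP server on httpL
	m.Serve()
}
```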

@vtubati

vtubati commented Sep 18, 2017

Thanks for the details of the fix being considered. I wonder whether you are in a position to share when the patch is likely to be available. Thanks.

@dfawley
Member

dfawley commented Sep 19, 2017

Let's consolidate this with #1444.

@dfawley dfawley closed this as completed Sep 19, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018