
fix leaking connections when user client closes connection #220

Closed

Conversation

@koonpeng (Contributor) commented Sep 7, 2023

When the user client closes the connection, it is not propagated to the backend. For simple "one shot" requests there is no huge impact, as the response would just be dropped; however, for long-running requests (chunked encoding or websockets), the backend doesn't know that the user client has closed the connection and keeps sending new updates to the relay client, which also doesn't know and so keeps forwarding them to the relay server.

The fix applied is:

  • Add `StopRelayRequest` to the broker to "forget" a relaying request. This causes the relay server to respond with a permanent error the next time the relay client tries to send a response for that id, which in turn causes the relay client to close the connection to the backend.
  • Fix a bug in the relay client's check for backoff permanent errors: when the backoff operation encounters a permanent error, the retry loop unwraps it and returns the underlying error, but the client was still checking for `backoff.Permanent`, so the check always failed and the error was never handled (see the sketch below).
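
To illustrate the second point, here is a minimal, self-contained sketch of the pitfall using github.com/cenkalti/backoff/v4, whose retry loop unwraps permanent errors as described above; the names are illustrative, not the relay client's actual code:

package main

import (
	"errors"
	"fmt"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	sentinel := errors.New("relay server rejected the chunk")

	op := func() error {
		// Wrapping the error in backoff.Permanent tells the retry loop to stop.
		return backoff.Permanent(sentinel)
	}

	err := backoff.Retry(op, backoff.NewExponentialBackOff())

	// The retry loop unwraps the permanent error before returning it, so this
	// type assertion never succeeds:
	if _, ok := err.(*backoff.PermanentError); ok {
		fmt.Println("never reached")
	}

	// Instead, the underlying error itself comes back:
	fmt.Println(err == sentinel) // true
}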

Signed-off-by: Teo Koon Peng <koonpeng@google.com>
@drigz (Contributor) left a comment:


Thank you, this is a really great change! And also a helpful commit message. I will add you to the organization so you can push to the googlecloudrobotics/core repo as otherwise the presubmits will fail with a permissions error.

}
}
}))
defer func() { ts.Close() }()
Contributor:

I think this can be `defer ts.Close()`

if err := r.start(backendAddress); err != nil {
t.Fatal("failed to start relay: ", err)
}
defer func() {
Contributor:

I would just do `defer r.stop()` here. As far as I can tell, `t.Fatal()` inside a deferred function is not handled well: golang/go#29207
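
A sketch of that suggestion, assuming `r.stop()` returns nothing that needs checking (if it does return an error, `t.Error` is safe to call from a deferred function, unlike `t.Fatal`):

if err := r.start(backendAddress); err != nil {
	t.Fatal("failed to start relay: ", err)
}
// Defer the call directly instead of wrapping it in a closure.
defer r.stop()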

// StopRelayRequest forgets a relaying request; this causes the next chunk from the backend
// with the relay id to not be recognized, resulting in the relay server returning an error.
func (r *broker) StopRelayRequest(requestId string) {
delete(r.resp, requestId)
Contributor:

Please hold the mutex while accessing the map; Go maps are not thread-safe.
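
A minimal sketch of that fix, assuming the broker guards its maps with a sync.Mutex field named m (the actual field name in the repo may differ):

// StopRelayRequest forgets a relaying request; this causes the next chunk from the backend
// with the relay id to not be recognized, resulting in the relay server returning an error.
func (r *broker) StopRelayRequest(requestId string) {
	// Lock around the map access, since Go maps are not safe for
	// concurrent use.
	r.m.Lock()
	defer r.m.Unlock()
	delete(r.resp, requestId)
}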

@@ -384,7 +396,7 @@ func (c *Client) postResponse(remote *http.Client, br *pb.HttpResponse) error {
return fmt.Errorf("couldn't read relay server's response body: %v", err)
}
if resp.StatusCode != http.StatusOK {
- err := fmt.Errorf("relay server responded %s: %s", http.StatusText(resp.StatusCode), body)
+ err := NewRelayServerError(fmt.Sprintf("relay server responded %s: %s", http.StatusText(resp.StatusCode), body))
if resp.StatusCode == http.StatusBadRequest {
// http-relay-server may have restarted during the request.
Contributor:

Please extend this comment to say "or the client cancelled the request."
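
The `NewRelayServerError` constructor used in the diff above is not shown in this excerpt; a minimal definition consistent with its call site might look like the following (a hypothetical sketch, not necessarily the repo's actual code):

// RelayServerError marks a non-OK response from the relay server so callers
// can distinguish it from transport errors.
type RelayServerError struct {
	msg string
}

func NewRelayServerError(msg string) *RelayServerError {
	return &RelayServerError{msg: msg}
}

func (e *RelayServerError) Error() string {
	return e.msg
}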

@@ -643,8 +655,11 @@ func (c *Client) handleRequest(remote *http.Client, local *http.Client, pbreq *p
log.Printf("[%s] Failed to post response to relay: %v", *resp.Id, err)
},
)
- if _, ok := err.(*backoff.PermanentError); ok {
+ if _, ok := err.(*RelayServerError); ok {
Contributor:

This is a change in behavior: Before, only HTTP 400 would terminate the loop, now anything other than HTTP 200 terminates the loop. For example, if the nginx ingress is overloaded and returns 500s for a while, this will now terminate the request. I think this is a good thing: There is no reason for us to drop a chunk from the middle of the response body, and it's better to terminate the request.

I think you could go even further: any error returned from `RetryNotify` indicates that we aren't sure we've posted the response, and that we should stop this loop rather than incorrectly continuing with the next response chunk. WDYT?

Please also delete the next comment, which is no longer true (this could be a "transient" 5xx error that lasted too long). Maybe `// The relay server was unreachable for too long, so we dropped the chunk and should abort the request.`
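
A sketch of that stricter handling inside the response-forwarding loop, reusing the names from the diff above (the enclosing loop is elided, and whether to break or return depends on the surrounding function):

err := backoff.RetryNotify(
	func() error { return c.postResponse(remote, resp) },
	backoff.NewExponentialBackOff(),
	func(err error, _ time.Duration) {
		log.Printf("[%s] Failed to post response to relay: %v", *resp.Id, err)
	},
)
if err != nil {
	// The relay server was unreachable for too long or rejected the chunk,
	// so we dropped it and should abort the request rather than continuing
	// with the next response chunk.
	break
}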

@koonpeng (Contributor, Author) replied:

That sounds good; silently dropping chunks would cause clients to receive corrupted data, which breaks the normal assumptions of HTTP/TCP connections.

@koonpeng (Contributor, Author):

Moved to #222

@koonpeng koonpeng closed this Sep 13, 2023
@koonpeng koonpeng deleted the kp/fix-unclosed-connections branch September 15, 2023 02:54
@koonpeng (Contributor, Author):

Thanks for the comments; I updated the other PR with the suggestions here.

drigz pushed a commit that referenced this pull request Sep 15, 2023
Same as #220, but ported to this repo.


Signed-off-by: Teo Koon Peng <koonpeng@google.com>