
fix leaking connections when user client closes connection #220

Closed

Conversation

@koonpeng (Contributor) commented Sep 7, 2023

When the user client closes the connection, it is not propagated to the backend. For simple "one shot" requests there is no huge impact, as the response would just be dropped; however, for long-running requests (chunked encoding or websockets), the backend doesn't know that the user client has closed the connection and keeps sending new updates to the relay client, which also doesn't know and so keeps forwarding them to the relay server.

The fix applied is:

  • Add `StopRelayRequest` to the broker to "forget" a relaying request. This causes the relay server to respond with a permanent error the next time the relay client tries to send a response for that id, which in turn causes the relay client to close the connection to the backend.
  • Fix a bug in the relay client's check for backoff permanent errors: when the backoff operation encounters a permanent error, the retry loop unwraps it and returns the underlying error, but the client was still checking for `backoff.Permanent`, so the check always failed and the error was never handled (see the sketch below).
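
To illustrate the second point, here is a minimal, self-contained sketch of the pitfall using github.com/cenkalti/backoff/v4, whose retry loop unwraps permanent errors as described above; the names are illustrative, not the relay client's actual code:

package main

import (
	"errors"
	"fmt"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	sentinel := errors.New("relay server rejected the chunk")

	op := func() error {
		// Wrapping the error in backoff.Permanent tells the retry loop to stop.
		return backoff.Permanent(sentinel)
	}

	err := backoff.Retry(op, backoff.NewExponentialBackOff())

	// The retry loop unwraps the permanent error before returning it, so this
	// type assertion never succeeds:
	if _, ok := err.(*backoff.PermanentError); ok {
		fmt.Println("never reached")
	}

	// Instead, the underlying error itself comes back:
	fmt.Println(err == sentinel) // true
}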

Signed-off-by: Teo Koon Peng <koonpeng@google.com>
@drigz (Contributor) left a comment:


Thank you, this is a really great change! And also a helpful commit message. I will add you to the organization so you can push to the googlecloudrobotics/core repo as otherwise the presubmits will fail with a permissions error.

}
}
}))
defer func() { ts.Close() }()
Contributor:

I think this can be `defer ts.Close()`

if err := r.start(backendAddress); err != nil {
t.Fatal("failed to start relay: ", err)
}
defer func() {
Contributor:

I would just do `defer r.stop()` here. As far as I can tell, `t.Fatal()` inside a deferred function is not handled well: golang/go#29207
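
A sketch of that suggestion, assuming `r.stop()` returns nothing that needs checking (if it does return an error, `t.Error` is safe to call from a deferred function, unlike `t.Fatal`):

if err := r.start(backendAddress); err != nil {
	t.Fatal("failed to start relay: ", err)
}
// Defer the call directly instead of wrapping it in a closure.
defer r.stop()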

// StopRelayRequest forgets a relaying request; this causes the next chunk from the backend
// with the relay id to not be recognized, resulting in the relay server returning an error.
func (r *broker) StopRelayRequest(requestId string) {
delete(r.resp, requestId)
Contributor:

Please hold the mutex while accessing the map; Go maps are not thread-safe.
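
A minimal sketch of that fix, assuming the broker guards its maps with a sync.Mutex field named m (the actual field name in the repo may differ):

// StopRelayRequest forgets a relaying request; this causes the next chunk from the backend
// with the relay id to not be recognized, resulting in the relay server returning an error.
func (r *broker) StopRelayRequest(requestId string) {
	// Lock around the map access, since Go maps are not safe for
	// concurrent use.
	r.m.Lock()
	defer r.m.Unlock()
	delete(r.resp, requestId)
}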

@@ -384,7 +396,7 @@ func (c *Client) postResponse(remote *http.Client, br *pb.HttpResponse) error {
return fmt.Errorf("couldn't read relay server's response body: %v", err)
}
if resp.StatusCode != http.StatusOK {
- err := fmt.Errorf("relay server responded %s: %s", http.StatusText(resp.StatusCode), body)
+ err := NewRelayServerError(fmt.Sprintf("relay server responded %s: %s", http.StatusText(resp.StatusCode), body))
if resp.StatusCode == http.StatusBadRequest {
// http-relay-server may have restarted during the request.
Contributor:

Please extend this comment to say "or the client cancelled the request."
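
The `NewRelayServerError` constructor used in the diff above is not shown in this excerpt; a minimal definition consistent with its call site might look like the following (a hypothetical sketch, not necessarily the repo's actual code):

// RelayServerError marks a non-OK response from the relay server so callers
// can distinguish it from transport errors.
type RelayServerError struct {
	msg string
}

func NewRelayServerError(msg string) *RelayServerError {
	return &RelayServerError{msg: msg}
}

func (e *RelayServerError) Error() string {
	return e.msg
}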

@@ -643,8 +655,11 @@ func (c *Client) handleRequest(remote *http.Client, local *http.Client, pbreq *p
log.Printf("[%s] Failed to post response to relay: %v", *resp.Id, err)
},
)
- if _, ok := err.(*backoff.PermanentError); ok {
+ if _, ok := err.(*RelayServerError); ok {
Contributor:

This is a change in behavior: Before, only HTTP 400 would terminate the loop, now anything other than HTTP 200 terminates the loop. For example, if the nginx ingress is overloaded and returns 500s for a while, this will now terminate the request. I think this is a good thing: There is no reason for us to drop a chunk from the middle of the response body, and it's better to terminate the request.

I think you could go even further: any error returned from `RetryNotify` indicates that we aren't sure we've posted the response, and that we should stop this loop rather than incorrectly continuing with the next response chunk. WDYT?

Please also delete the next comment, which is no longer true (this could be a "transient" 5xx error that lasted too long). Maybe `// The relay server was unreachable for too long, so we dropped the chunk and should abort the request.`
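
A sketch of that stricter handling inside the response-forwarding loop, reusing the names from the diff above (the enclosing loop is elided, and whether to break or return depends on the surrounding function):

err := backoff.RetryNotify(
	func() error { return c.postResponse(remote, resp) },
	backoff.NewExponentialBackOff(),
	func(err error, _ time.Duration) {
		log.Printf("[%s] Failed to post response to relay: %v", *resp.Id, err)
	},
)
if err != nil {
	// The relay server was unreachable for too long or rejected the chunk,
	// so we dropped it and should abort the request rather than continuing
	// with the next response chunk.
	break
}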

@koonpeng (Contributor, Author) replied:

That sounds good; silently dropping chunks would cause clients to receive corrupted data, which breaks the normal assumptions of HTTP/TCP connections.

@koonpeng (Contributor, Author):

Moved to #222

@koonpeng koonpeng closed this Sep 13, 2023
@koonpeng koonpeng deleted the kp/fix-unclosed-connections branch September 15, 2023 02:54
@koonpeng (Contributor, Author):

Thanks for the comments; I updated the other PR with the suggestions here.

drigz pushed a commit that referenced this pull request Sep 15, 2023
Same as #220, but ported to this repo.


Signed-off-by: Teo Koon Peng <koonpeng@google.com>