
[Streams] Question About Graceful Stream Termination #6504

Closed

moh-osman3 opened this issue Aug 3, 2023 · 16 comments

Comments

@moh-osman3

What is the issue?

We are currently using the grpc-go library to perform streaming RPCs between two components (OpenTelemetry Collectors) that have span/metric instrumentation. What is the proper way for a stream to be terminated when instrumentation is involved? We are seeing errors returned from the server when the stream is terminated, instead of an OK status.

What I'm currently seeing:

Seeing NO_ERROR or UNKNOWN error codes returned from the server.

We first noticed an issue where streams are being closed on the server side when the server's keepalive settings reach max_connection_age + max_connection_age_grace. This causes the server to return NO_ERROR to the client, and this error shows up on spans from the server. This is not ideal because it adds a noisy/misleading signal to many spans suggesting that there is an issue.

On the other hand, we tried to close the stream on the client side by calling client.CloseSend() before the server's max_connection_age is reached, which prevents the server from sending NO_ERROR. Instead we now get EOF with codes.Unknown, which is still not ideal.

My question:

Our instrumentation currently uses the gRPC interceptor pattern, and we are trying to determine a couple of things.

  1. When the stream is closed by gRPC because of the keepalive settings, why is a NO_ERROR error returned rather than an OK status? i.e. can we improve on the NO_ERROR?
  2. When the client instead handles shutdown to try to be graceful, what is the best way for the client to shut down and inspect the Code() to ensure that spans are OK?

Overall our goal is for stream shutdowns to be graceful and have no errors. Any thoughts? Thank you!

@jmacd

jmacd commented Aug 3, 2023

Adding a few relevant details on this, since I'm part of @moh-osman3's investigation.

The streaming client we use is defined here: https://github.com/f5/otel-arrow-adapter/tree/main/collector/gen/exporter/otlpexporter/internal/arrow

The streaming server we use is defined here:
https://github.com/f5/otel-arrow-adapter/tree/main/collector/gen/receiver/otlpreceiver/internal/arrow

It's working quite well; we just haven't been able to figure out how to terminate the streams in such a way that the instrumentation produced by the instrumentation package shows a clean exit.

As a user of the gRPC library, I was expecting that each side of the stream could call CloseSend() and the other would receive some kind of EOF response to indicate healthy termination, but we haven't been able to find a reliable way to do that. Thanks!

@easwars
Contributor

easwars commented Aug 8, 2023

We first noticed an issue where streams are being closed on the server side when the server's keepalive settings reaches a max_connection_age+ grace. This causes the server to return NO_ERROR to the client and this error shows up on spans from the server. This is not ideal because this adds a noisy/misleading signal in many spans that there is an issue.

When max_connection_age is reached, the server tries to gracefully close the connection. It does this by sending a GOAWAY frame with the stream ID set to MaxUint32. It then waits for the grace period to expire before sending another GOAWAY frame (this time with the stream ID set to that of the highest stream that was processed), and then the connection is hard closed.

Is this not what you are seeing when the max_connection_age expires?

Also, what version of gRPC are you using?

  • We recently made a change to include some debug data in the GOAWAY frame sent by the server to the client during graceful shutdown, and this should be visible on the client. With this change, you should see the RPC on the client failing with UNAVAILABLE:
    st = status.Newf(codes.Unavailable, "closing transport due to: %v, received prior goaway: %v", err, goAwayDebugMessage)
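
For reference, the server-side keepalive behavior described above is configured through grpc-go's keepalive.ServerParameters. A minimal sketch, with illustrative durations (not a recommendation):

package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	// After MaxConnectionAge (plus jitter) the server sends the first,
	// graceful GOAWAY; MaxConnectionAgeGrace later it sends the second
	// GOAWAY and hard-closes the connection.
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      time.Minute,      // illustrative value
		MaxConnectionAgeGrace: 10 * time.Minute, // illustrative value
	}))
}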

@github-actions

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Aug 15, 2023
@moh-osman3
Author

moh-osman3 commented Aug 16, 2023

Thanks for the response @easwars!

Also, what version of gRPC are you using?

We seem to be using grpc v1.56.0 https://github.com/f5/otel-arrow-adapter/blob/main/go.mod#L60

RPC on the client failing with UNAVAILABLE

To us, "graceful" means that both the client and server see no errors, and that when the server has a keepalive setting, the client and server are able to coordinate a clean shutdown. We aren't sure how gRPC expects us to do this, and we need to be able to do it in a way that standard tracing instrumentation sees as a clean exit.

The exact status we are seeing is

status.Code(): Unavailable
status.Message(): closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "max_age"

This leaves us with two questions:

  1. Is there a way the client can somehow learn about the first GOAWAY? We haven't found any documentation or APIs related to this. With this signal, we would expect the client to call CloseSend() to initiate the stream shutdown and wait for all the replies it expects, but frankly we are unsure what should happen next.

  2. We need to teach the OTel-Go-gRPC instrumentation, which is based on gRPC interceptors, how to recognize a clean shutdown and report an "OK" status for the corresponding span. Without special support, today we would have to modify the instrumentation to treat Unavailable with a message substring containing NO_ERROR as a special case. If handling "Unavailable with a message substring containing NO_ERROR" is in fact the gRPC team's officially recommended approach, that's fine, but is there a document you can provide explaining this stream shutdown process? It would be helpful to point to when we propose that OTel-Go-gRPC add the special-case logic so the client can recognize a clean shutdown and report "OK" instead of an error when shutting down.
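
To make the interceptor question concrete, here is a rough sketch of the hook where such recognition would have to live; this is not what OTel-Go-gRPC actually does, and the otelRecordStatus helper is hypothetical:

package instrumentation

import (
	"context"
	"io"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// otelRecordStatus is a hypothetical helper that would set the span status.
func otelRecordStatus(method string, code codes.Code) {}

// StreamClientInterceptor wraps each client stream so that the final RecvMsg
// error can be translated into a span status.
func StreamClientInterceptor(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
	method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
	s, err := streamer(ctx, desc, cc, method, opts...)
	if err != nil {
		return nil, err
	}
	return &wrappedStream{ClientStream: s, method: method}, nil
}

type wrappedStream struct {
	grpc.ClientStream
	method string
}

func (w *wrappedStream) RecvMsg(m interface{}) error {
	err := w.ClientStream.RecvMsg(m)
	switch {
	case err == nil:
		// Stream still open; nothing to record yet.
	case err == io.EOF:
		// The server ended the stream with an OK status: a clean shutdown.
		otelRecordStatus(w.method, codes.OK)
	default:
		// e.g. Unavailable with the NO_ERROR/"max_age" debug text discussed above.
		otelRecordStatus(w.method, status.Code(err))
	}
	return err
}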

@dfawley
Member

dfawley commented Aug 21, 2023

Is there a way the client can somehow learn about the first GOAWAY? We haven't found any documentation or APIs having to do with this. With this signal, we expect the client to call CloseSend() to initiate the stream shutdown, and wait for all replies it expects, and frankly unsure what should happen next?

No, sorry. It's not possible to distinguish between a server shutting down due to max age vs. another kind of connection loss.

We need to teach the OTel-Go-gRPC instrumentation, which is based on gRPC interceptors, how to recognize a clean shutdown and report "OK" status for the corresponding span

Why do you need this?

Long-lived streams killed by the connection being lost (including via max age) will be terminated with a non-OK status. This is not a "clean shutdown", because the connection was hard-closed out from underneath it, meaning it could have been in the middle of receiving data/etc. It is proper to classify it as an error.

If the gRPC team actually recommends handling "Unavailable with a message substring containing NO_ERROR" as the officially recommended way of doing that, that's fine

We definitely do not recommend relying upon the strings contained in errors to be used for anything besides manual debugging.

@jmacd

jmacd commented Aug 22, 2023

Is there a way the client can somehow learn about the first GOAWAY?

No, sorry. It's not possible to distinguish between a server shutting down due to max age vs. another kind of connection loss.

@dfawley -- We asked a different question. If the stream client could learn about the first GOAWAY, then it could responsibly initiate a clean shutdown for itself. We are looking for a way for application-level client code to detect a GOAWAY. Is there a signal that gRPC can deliver to the client code? How else can we describe the disconnect as "graceful" if the client can't see it coming and only sees a connection loss? We'll do whatever it takes, but it doesn't look like gRPC gives the client a way to know about the keepalive, and we're not sure how else to close a stream without data loss.

We need to teach the OTel-Go-gRPC instrumentation, which is based on gRPC interceptors, how to recognize a clean shutdown and report "OK" status for the corresponding span

Why do you need this?

After we resolve the question above, about how we are meant to shut down streaming RPCs without data loss, we need to be sure that it will look like a non-error in the instrumentation. When we resolve the first question above, the question will be:

  • under what conditions does a stream client generate codes.OK when the RPC ends? I assume it must exit nil in the client. We will be able to arrange for this after we get a GOAWAY signal, we think.
  • under what conditions does a stream server generate codes.OK when the RPC ends? I assume it must exit nil, but possibly the client has to call CloseSend() first. Looking for guidance.

@dfawley
Member

dfawley commented Aug 22, 2023

We asked a different question. If the stream client could learn about the first GOAWAY, then it could responsibly initiate a clean shutdown for itself.

Yes, that's what I was answering. The client cannot distinguish between a server shutting down due to max age vs. another kind of connection loss. Note that if there is a proxy involved, it's possible to have one client using many streams on a single connection that are actually being routed to multiple backend servers. So there would have to be something in the gRPC stream protocol to transmit this information to the client, but nothing like that exists, and it is probably not even possible to add.

Server-side, it may be possible to add an API that indicates to all the streams on a connection that they will be terminated, but this would be a pretty significant cross-language effort, with a relatively low benefit compared to our other priorities at the moment. Feel free to propose something more concrete on our list or at the main gRPC repo, but unfortunately we probably would not be able to prioritize it very highly.

How else can we describe the disconnect as "graceful" if the client can't see it coming and sees a connection loss?

The word "graceful" is used because it allows a "grace period" for clients to finish their open RPCs.

Generally, the advice we give is to make your long-lived streams have a finite amount of time that they can last, and to set your grace timeout to be longer than that time.

@dfawley dfawley closed this as completed Aug 22, 2023
@jmacd

jmacd commented Aug 22, 2023

The word "graceful" is used because it allows a "grace period" for clients to finish their open RPCs.

How does it allow that for streaming RPCs? I think that is the problem here: we have no way to know to stop starting new RPCs when a GOAWAY has been received. I don't see why this requires server changes rather than just a signal to the client to start winding down.

Generally, the advice we give is to make your long-lived streams have a finite amount of time that they can last, and to set your grace timeout to be longer than that time.

I don't see how "allows a 'grace period' for clients to finish their open RPCs" is true. Can you clarify how we're to do that?

@dfawley
Member

dfawley commented Aug 22, 2023

How does it allow that for streaming RPCs?

Streaming RPCs and long-lived streams are two different topics. You're apparently talking about long-lived streams, which means this does not work nicely for them. The same thing would be true of a long-poll unary RPC as well. If the RPC's deadline is longer than the keepalive grace time, or infinite, then it will ultimately be hard killed when the grace period is over.

I think that is the problem here, we have no way to know to stop starting new RPCs when there has been a GOAWAY received.

The grpc.ClientConn itself will stop using that connection for new RPCs once a GOAWAY has been received, and open a new connection instead. This already is happening without you being aware of it.

I don't see why this requires server changes, just a signal to the client to start winding down?

The issue is that the client can't reliably get this signal if it goes through a proxy, since it's a signal on the connection and not the stream. (Proxies can forward different streams for a single incoming connection to multiple outgoing connections.)

I don't see how "allows a 'grace period' for clients to finish their open RPCs" is true. Can you clarify how we're to do that?

By setting a maximum time limit on all your RPCs to less than your grace period. This is the only way to do that right now.

@moh-osman3
Author

By setting a maximum time limit on all your RPCs to less than your grace period. This is the only way to do that right now.

@dfawley So I think @jmacd and I have tried this, but we are wondering if maybe we are not doing it correctly. On the stream client we set a max_stream_lifetime which is less than the grace period, at which point the client calls CloseSend and waits for a response from the server using Recv. However, our issue is still that the server sends a different error.

Now we are seeing:

status.Message(): EOF
status.Code(): codes.Unknown

Is there any possible way for us to set up the server to send a status of codes.OK in this case, where the client has called CloseSend? i.e. how can we client-initiate a stream shutdown with no errors?
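
A rough sketch of the client-initiated shutdown sequence described here, assuming hypothetical generated types (pb.ArrowServiceClient and its Method) and a timer driven by max_stream_lifetime:

import (
	"context"
	"io"
	"time"

	pb "example.com/otelarrow/gen" // hypothetical generated package
)

func runStream(ctx context.Context, client pb.ArrowServiceClient, maxStreamLifetime time.Duration) error {
	stream, err := client.Method(ctx)
	if err != nil {
		return err
	}

	lifetime := time.NewTimer(maxStreamLifetime)
	defer lifetime.Stop()

	// ... send requests on the stream until the lifetime expires ...
	<-lifetime.C

	// Half-close: tell the server we are done sending.
	if err := stream.CloseSend(); err != nil {
		return err
	}

	// Drain the remaining replies. io.EOF here means the server returned an
	// OK status; any other error is the status observed above (Unknown/EOF).
	for {
		if _, err := stream.Recv(); err != nil {
			if err == io.EOF {
				return nil
			}
			return err
		}
	}
}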

@jmacd

jmacd commented Aug 22, 2023

long-lived streams

Yes, we want streams to live as long as the server will allow because the longer they live the more compression benefit we get. As you said, proxies could lower the timeout so the client will have to guess at the maximum lifetime it should use.

Well, since we're here -- why not use Go's context to solve this problem? The client could easily select on a context associated with the stream if there were one available.

@dfawley
Member

dfawley commented Aug 22, 2023

Is there any possible way for us to set up the server to send a status of codes.OK in this case, where the client has called CloseSend? i.e. how can we client-initiate a stream shutdown with no errors?

Definitely, CloseSend() should be able to initiate a happy stream termination. The server should be calling stream.Recv(), and if that returns io.EOF, it means the client called CloseSend(); the server's method handler should then terminate ASAP with a nil error to indicate an OK status. Maybe your server is just returning the io.EOF directly?
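
A minimal sketch of that server-side pattern, assuming a hypothetical bidi-streaming handler; pb.Service_MethodServer, the server type, and s.handle are placeholders, not from the actual codebase:

import (
	"io"

	pb "example.com/otelarrow/gen" // hypothetical generated package
)

func (s *server) Method(stream pb.Service_MethodServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			// The client called CloseSend(): finish up and return nil so the
			// client sees codes.OK. Returning io.EOF itself would surface to
			// the client as codes.Unknown with message "EOF".
			return nil
		}
		if err != nil {
			return err // transport-level failure; surfaces as a non-OK status
		}
		if err := s.handle(stream, req); err != nil {
			return err
		}
	}
}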

why not use Go's context to solve this problem? The client could easily select on a context associated with the stream if there were one available.

It's not an API-level concern here. The issue is that there is no stream-level signal that can go from the server to the client when it's ready to shut down the connection; GOAWAYs are used for this, and they are connection-level. The presence of proxies would undermine this approach. And even if you aren't using a proxy today, you might decide to use one in the future. So if you were to design around this and then wanted to add proxies, you'd be unable to.

@jmacd

jmacd commented Aug 22, 2023

I'm still confused by

The issue is that there is no stream-level signal that can go from the server to the client when it's ready to shut down the connection; GOAWAYs are used for this, and they are connection-level.

The client's gRPC logs contain the first GOAWAY, so the gRPC library or its subordinate libraries know about the GOAWAY. What is preventing these signals from being connected? I.e., I don't see how this has to go from the server to the client; the client just wants a way to know before the connection is terminated, and the client has already logged a GOAWAY event. The client has the information, only the stream doesn't?

@dfawley
Member

dfawley commented Aug 22, 2023

Yes, the client knows about the GOAWAY. But not if it's behind a proxy that doesn't forward the GOAWAY from the server, which it wouldn't do if it was multiplexing streams to multiple backend servers, which is pretty normal.

I don't see how this has to go from the server to the client

The server is the thing deciding to cycle the connection. It's the thing sending the GOAWAY.

the client just wants a way to know before the connection is terminated and the client has already logged about a GOAWAY event.

Yes, we could technically plumb any GOAWAYs the client receives to the streams on that connection. The problem is that this is no longer a reliable way to determine that a stream will be closed soon once you put it behind a proxy. So if you designed around that and then added a proxy, your design wouldn't work anymore.

@jmacd

jmacd commented Aug 24, 2023

Yes, we could technically plumb any GOAWAYs the client receives to the streams on that connection. The problem is that that is not a reliable way to determine a stream will be closed soon anymore if you put it behind a proxy.

Can you explain this in more detail? In the present configuration with no proxy, I'm seeing the GOAWAY in the client's logs. If there were a proxy, would a GOAWAY signal not be delivered? I can't understand how the proxy would break this design; it would just shorten the effective keepalive. As long as a GOAWAY signal is delivered to the client, the client can end streams gracefully and reconnect more often. Presumably this would shorten the maximum connection lifetime, but the long-lived RPC would see the signal and be able to shut itself down sooner, and still gracefully.

I've looked into the code a bit more, and there is a potential problem that I see with your recommendation to limit RPC duration to less than the stream max_connection_age_grace setting. gRPC automatically applies jitter to max_connection_age, but not to max_connection_age_grace. I will no longer benefit from gRPC's automatic jitter, so to avoid the risk of a connection storm, I will have to apply similar jitter on the client. I can do that, but I still don't know how to recommend configuring the max_connection_age setting.
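
As a sketch of the client-side jitter mentioned above (the ±10% factor mirrors what grpc-go applies to MaxConnectionAge on the server, and is an assumption here):

import (
	"math/rand"
	"time"
)

// jitteredLifetime spreads stream restarts so that many clients configured
// with the same max_stream_lifetime do not all reconnect at the same instant.
func jitteredLifetime(configured time.Duration) time.Duration {
	factor := 0.9 + 0.2*rand.Float64() // uniform in [0.9, 1.1)
	return time.Duration(float64(configured) * factor)
}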

These streams will be used heavily during their lifetime, and we do not expect to see a benefit from connection sharing. We are setting a keepalive to ensure that load is rebalanced, and since there is no benefit from connection sharing, I am inclined to set max_connection_age to 0 while setting max_connection_age_grace to the actual limit, per your advice. However, setting max_connection_age to 0 means unlimited, and I haven't been able to determine what a safe, small value for max_connection_age is from reading the code. We are going to advise a configuration like this:

keepalive:
  server_parameters:
    max_connection_age: 1m
    max_connection_age_grace: 10m

but it will require a fair bit of explaining why the age is shorter than the grace period, and still it requires a completely separate mechanism for the client to implement its own timeout. From a system design perspective, it would be a lot nicer if the client could get this signal from the connection (i.e., the server or the intermediate proxy).

The recommendation will read as follows:

As a gRPC streaming service, the OTel Arrow receiver is able to limit
stream lifetime through configuration of the underlying http/2
connection via keepalive settings.

Keepalive settings are vital to the operation of OTel Arrow, because
longer-lived streams use more memory and streams are fixed to a single
host. Since every stream of data is different, we recommend
experimenting to find a good balance between memory usage, stream
lifetime, and load balance.

gRPC libraries do not build in a facility for long-lived RPCs to learn
about impending http/2 connection state changes, including the event
that initiates connection reset. While the receiver knows its own
keepalive settings, a shorter maximum connection lifetime can be
imposed by intermediate http/2 proxies, and therefore the receiver and
exporter are expected to independently configure these limits.

receivers:
  otelarrow:
    protocols:
      grpc:
        keepalive:
          server_parameters:
            max_connection_age: 1m
            max_connection_age_grace: 10m

In the example configuration above, OTel-Arrow streams will have reset
initiated after 10 minutes. Note that max_connection_age is set to
a small value and we recommend tuning max_connection_age_grace.

OTel Arrow exporters are expected to configure their
max_stream_lifetime property to a value that is slightly smaller
than the receiver's max_connection_age_grace setting, which causes
the exporter to cleanly shut down streams, allowing requests to
complete before the http/2 connection is forcibly closed. While the
exporter will retry data that was in-flight during an unexpected
stream shutdown, instrumentation about the telemetry pipeline will show
RPC errors when the exporter's max_stream_lifetime is not configured
correctly.

See the exporter README for more guidance. For the example where
max_connection_age_grace is set to 10 minutes, the exporter's
max_stream_lifetime should be set to the same number minus a
reasonable timeout to allow in-flight requests to complete. For
example, an exporter with a 9m30s stream lifetime:

exporters:
  otelarrow:
    # other configuration (e.g., endpoint)
    ...
    arrow:
      # other configuration (e.g., num_streams)
      max_stream_lifetime: 9m30s

@dfawley
Member

dfawley commented Aug 24, 2023

Can you explain this in more detail? In the present configuration with no proxy, I'm seeing the GOAWAY in the client's logs. If there were a proxy, would there not be a GOAWAY signal delivered?

Not all proxies would have this behavior. It would happen for proxies that multiplex connections from clients across multiple servers. E.g.:

[diagram: a single client connection to a proxy, which maintains separate connections to three backend servers]

Consider the case where the client has one connection to the proxy, which has one connection to each of the 3 servers. The client starts 3 streams and the proxy decides to forward one to each of the servers for processing. Now if "Server 3" wants to shut down, it will send a GOAWAY on its connection back to the proxy:

[diagram: Server 3 sends a GOAWAY on its connection back to the proxy]

At this point, if the proxy wants to forward that GOAWAY to the client, then the client would have to restart all 3 of its streams. Typically this is not what people would want, and so the proxy would only respond by stopping routing new traffic on that connection, and potentially creating a new connection to that server for new traffic. But the important part is that the real client would never see the GOAWAY, and would not know to clean up the stream.

The only way to do this would be to inform the server application about which streams would be impacted by the connection termination. That way the method handlers for the open streams could do something, e.g. close the stream with an OK status, or send a message on the stream to the client to notify it that shutdown was happening soon, and to clean up gracefully.

But instead of that, why not set a maximum stream age instead of a maximum connection age? This could be set on either the client or server or negotiated between the two, and either the server could notify the client to gracefully end the stream (or face a hard stream termination after some small grace period), or the client could just do it on its own.

jmacd added a commit to open-telemetry/otel-arrow that referenced this issue Aug 24, 2023
…_stream_lifetime (#23)

As discussed in grpc/grpc-go#6504, the client
should add jitter when configuring `max_connection_age_grace` because we
expect each stream will create a new connection. Since connection storms
will not be spread automatically by gRPC in this case, apply client
jitter.

Part of #6.
jmacd pushed a commit to open-telemetry/otel-arrow that referenced this issue Aug 31, 2023
Fixes #6

Applying learning from
grpc/grpc-go#6504 (comment),
which pointed out that the server was returning EOF directly when the client
calls CloseSend(), which was causing an error signal on spans.

Instead the server should check if it received EOF from client
(indicating CloseSend() was called) and send StatusOK to the client. The
client will know to restart the stream when it gets a response with
`batchID=-1` and `status=OK`.