[Streams] Question About Graceful Stream Termination #6504
Adding a few relevant details on this, since I'm part of @moh-osman3's investigation. The streaming client we use is defined here: https://github.com/f5/otel-arrow-adapter/tree/main/collector/gen/exporter/otlpexporter/internal/arrow, and the streaming server we use is defined in the same repository. It's working quite well; we just haven't been able to figure out how to terminate the streams in such a way that the spans produced by the instrumentation package record a clean exit. As a user of the gRPC library, I was expecting that each side of the stream could close its half of the stream cleanly.
When the server's keepalive `max_connection_age` plus the grace period elapses, the connection is closed and in-flight streams are terminated. Is this not what you are seeing when the limit is reached? Also, what version of gRPC are you using?
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
Thanks for the response @easwars!
We seem to be using grpc v1.56.0 https://github.com/f5/otel-arrow-adapter/blob/main/go.mod#L60
To us, the meaning of graceful is that both the client and server see no errors, and that when the server has a keepalive setting the client and server are able to coordinate to shut down cleanly. We aren't sure how gRPC expects us to do this, and we need to be able to do it in a way that standard tracing instrumentation sees as a clean exit. The exact status we are seeing is `NO_ERROR`.
This leaves us with two questions:
No, sorry. It's not possible to distinguish between a server shutting down due to max age vs. another kind of connection loss.
Why do you need this? Long-lived streams killed by the connection being lost (including via max age) will be terminated with a non-OK status. This is not a "clean shutdown", because the connection was hard-closed out from underneath it, meaning it could have been in the middle of receiving data/etc. It is proper to classify it as an error.
We definitely do not recommend relying on the strings contained in errors for anything besides manual debugging.
@dfawley -- We asked a different question. If the stream client could learn about the first GOAWAY, then it could responsibly initiate a clean shutdown for itself. We are looking for a way the client can detect a GOAWAY as application-level code. Is there a signal that gRPC can deliver to the client code? How else can we describe the disconnect as "graceful" if the client can't see it coming and sees a connection loss? We'll do whatever it takes, but it doesn't look like gRPC gives the client a way to know about the keepalive, and we're not sure what else to do to close a stream without data loss.
After we resolve the question above, about how we are meant to shut down streaming RPCs without data loss, we need to be sure that the shutdown will look like a non-error in the instrumentation.
Yes, that's what I was answering. The client cannot distinguish between a server shutting down due to max age vs. another kind of connection loss. Note that if there is a proxy involved, it's possible to have one client using many streams on a single connection that are actually being routed to multiple backend servers. So there would have to be something in the gRPC stream protocol to transmit this information to the client, but nothing like that exists, or (probably?) is even possible to add. Server-side, it may be possible to add an API that indicates to all the streams on a connection that they will be terminated, but this would be a pretty significant cross-language effort, with a relatively low benefit compared to our other priorities at the moment. Feel free to propose something more concrete on our list or at the main gRPC repo, but unfortunately we probably would not be able to prioritize it very highly.
The word "graceful" is used because it allows a "grace period" for clients to finish their open RPCs. Generally, the advice we give is to make your long-lived streams have a finite amount of time that they can last, and to set your grace timeout to be longer than that time.
How does it allow that for streaming RPCs? I think that is the problem here: we have no way to know to stop starting new RPCs once a GOAWAY has been received. I don't see why this requires server changes; isn't it just a signal to the client to start winding down? I also don't see how the grace period helps a long-lived stream that outlives it.
Streaming RPCs and long-lived streams are two different topics. You're apparently talking about long-lived streams, which means this does not work nicely for them. The same thing would be true of a long-poll unary RPC as well. If the RPC's deadline is longer than the keepalive grace time, or infinite, then it will ultimately be hard killed when the grace period is over.
The GOAWAY is the connection-level mechanism for this. The issue is that the client can't reliably get this signal if it goes through a proxy, since it's a signal on the connection and not the stream. (Proxies can forward different streams from a single incoming connection to multiple outgoing connections.)
By setting a maximum time limit on all your RPCs to less than your grace period. This is the only way to do that right now.
@dfawley So I think @jmacd and I have tried this, but we're wondering if maybe we are not doing it correctly. On the stream client we set a timeout and call `CloseSend()` before the server's `max_connection_age` is reached. Now we are seeing an `EOF` with `codes.Unknown` instead.
Is there any possible way for us to set up the server to send a status of `codes.OK` in this case, where the client has called `CloseSend()`? i.e. how can we client-initiate a stream shutdown with no errors?
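The fix that eventually landed for this (see the linked commits below in this thread) is for the server handler to treat `io.EOF` from `Recv` as the client's `CloseSend` and return `nil`, which ends the RPC with `codes.OK`. A minimal sketch, with `recvStream` standing in for the generated server-side stream type:

```go
package main

import (
	"fmt"
	"io"
)

type Batch struct{ ID int64 }

// recvStream stands in for the generated server-side stream type.
type recvStream interface {
	Recv() (*Batch, error)
}

// serveStream returns nil when the client closes its send side, so the
// RPC finishes with codes.OK. Returning the io.EOF itself is what
// surfaces on the client as an error with codes.Unknown.
func serveStream(stream recvStream, handle func(*Batch) error) error {
	for {
		b, err := stream.Recv()
		if err == io.EOF {
			return nil // client called CloseSend: clean finish
		}
		if err != nil {
			return err
		}
		if err := handle(b); err != nil {
			return err
		}
	}
}

// fakeRecv yields two batches and then io.EOF, mimicking a client that
// sends a little data and calls CloseSend.
type fakeRecv struct{ n int }

func (f *fakeRecv) Recv() (*Batch, error) {
	if f.n >= 2 {
		return nil, io.EOF
	}
	f.n++
	return &Batch{ID: int64(f.n)}, nil
}

func main() {
	var seen int
	err := serveStream(&fakeRecv{}, func(*Batch) error { seen++; return nil })
	fmt.Println(err, seen) // <nil> 2
}
```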
Yes, we want streams to live as long as the server will allow, because the longer they live the more compression benefit we get. As you said, proxies could lower the timeout, so the client will have to guess at the maximum lifetime it should use. Well, since we're here -- why not use Go's context to solve this problem? The client could easily select on a context associated with the stream if there were one available.
A context could certainly carry such a signal, but it's not an API-level concern here. The issue is that there is no stream-level signal that can go from the server to the client when it's ready to shut down the connection; GOAWAYs are used for this and they are connection-level. The presence of proxies would undermine this approach. And even if you aren't using a proxy today, you might decide to use one some day in the future. So if you were to design around this and then wanted to add proxies, you'd be unable to.
I'm still confused by the proxy argument.
The client's gRPC logs contain the first GOAWAY, so the gRPC library or its subordinate libraries know about the GOAWAY. What is preventing connecting these signals? I.e., I don't see how this has to go from the server to the client -- the client just wants a way to know before the connection is terminated, and the client has already logged a GOAWAY event. The client has the information; only the stream doesn't?
Yes, the client knows about the GOAWAY. But not if it's behind a proxy that doesn't forward the GOAWAY from the server, which it wouldn't do if it was multiplexing streams to multiple backend servers, which is pretty normal.
The server is the thing deciding to cycle the connection. It's the thing sending the GOAWAY.
Yes, we could technically plumb any GOAWAYs the client receives to the streams on that connection. The problem is that this is no longer a reliable way to determine that a stream will be closed soon once you put it behind a proxy. So if you designed around that and then added a proxy, your design wouldn't work anymore.
Can you explain this in more detail? In the present configuration with no proxy, I'm seeing the GOAWAY in the client's logs. If there were a proxy, would there not be a GOAWAY signal delivered? I can't understand how the proxy would break this design; it would just shorten the effective keepalive. As long as a GOAWAY signal is delivered to the client, the client can end streams gracefully and reconnect more often. Presumably this would shorten the maximum connection lifetime, but the long-lived RPC would see the signal and be able to shut itself down sooner, and still gracefully.

I've looked into the code a bit more, and there is a potential problem that I see with your recommendation to limit RPC duration to less than the keepalive grace period. These streams will be used heavily during their lifetime, and we do not expect to see a benefit from connection sharing. We are setting a keepalive to ensure that load is rebalanced, and since there is no benefit from connection sharing, I am inclined to set `max_connection_age` shorter than `max_connection_age_grace`,
but it will require a fair bit of explaining why the age is shorter than the grace period, and still it requires a completely separate mechanism for the client to implement its own timeout. From a system design perspective, it would be a lot nicer if the client could get this signal from the connection (i.e., the server or the intermediate proxy). The recommendation will read as follows: as a gRPC streaming service, the OTel Arrow receiver is able to limit how long its streams live using keepalive settings. Keepalive settings are vital to the operation of OTel Arrow, because gRPC libraries do not build in a facility for long-lived RPCs to learn that the connection is about to be terminated.
In the example configuration above, OTel-Arrow streams will have reset before the server begins hard-closing connections. OTel Arrow exporters are expected to configure their `max_stream_lifetime` accordingly. See the exporter README for more details.
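A sketch of what such a receiver configuration might look like in collector YAML. This is an assumption for illustration: the receiver name `otelarrow` and the keepalive key names follow the collector's standard gRPC server settings and should be checked against the receiver's README.

```yaml
receivers:
  otelarrow:
    protocols:
      grpc:
        keepalive:
          server_parameters:
            # Connections are cycled after roughly 5 minutes; streams
            # then have the grace period below to finish cleanly.
            max_connection_age: 5m
            max_connection_age_grace: 1m
```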
…_stream_lifetime (#23) As discussed in grpc/grpc-go#6504, the client should add jitter when configuring `max_connection_age_grace` because we expect each stream will create a new connection. Since connection storms will not be spread automatically by gRPC in this case, apply client jitter. Part of #6.
Fixes #6. Applying learning from grpc/grpc-go#6504 (comment), which pointed out that the server returns EOF directly when the client calls `CloseSend()`, which is causing an error signal on spans. Instead the server should check whether it received EOF from the client (indicating `CloseSend()` was called) and send status `OK` to the client. The client will know to restart the stream when it gets a response with `batchID=-1` and `status=OK`.
What is the issue?
Currently using the grpc-go library to perform streaming RPCs between two components (OpenTelemetry Collectors) that have span/metric instrumentation. I'm wondering what the proper way is for a stream to be terminated when there is instrumentation involved. I'm seeing errors returned from the server when the stream is terminated instead of an `OK` status.

What I'm currently seeing:
Seeing `NO_ERROR` or `UNKNOWN` error codes returned from the server. We first noticed an issue where streams are being closed on the server side when the server's `keepalive` settings reach `max_connection_age` + grace. This causes the server to return `NO_ERROR` to the client, and this error shows up on spans from the server. This is not ideal because it adds a noisy/misleading signal in many spans that there is an issue. On the other hand, we tried to close the stream on the client side by calling `client.CloseSend()` before `max_connection_age` is reached on the server, which prevents the server from sending `NO_ERROR`. Instead we are now getting `EOF` with `codes.Unknown`, which is still not ideal.

My question:
Our instrumentation is currently using the gRPC interceptor pattern, and we are trying to determine a couple of things.

- Why is `NO_ERROR` returned rather than an `OK` status? i.e. can we improve on the `NO_ERROR`?
- How should our interceptors use the status `Code()` to ensure that spans are `OK`?

Overall our goal is for stream shutdowns to be graceful and have no errors. Any thoughts? Thank you!