Takes 20 minutes to reconnect to redis #1466
Any reason this issue is closed? The logs from HTTP handlers don't seem related to Lettuce.
The "Connection timed out" part actually seems okay, as that kicks off the ConnectionWatchdog trying to re-establish the connection. The issue is more that it takes 10-15 minutes for this to happen. From what I can tell, the CommandHandler is not detecting that the channel is inactive.
Thanks for clarifying. If a remote peer dies and comes back up, there's no way to detect that state without having traffic. You can generally enable TCP keep-alive for more timely detection of dead peers. #1428 describes a workaround to configure keep-alive options. Lettuce makes proper use of the TCP stack. Depending on what active components Azure introduces, you might experience a different behavior than on your local machine. I'd suggest reaching out to the Azure team to discuss that issue.
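For reference, the keep-alive workaround from #1428 looks roughly like this in Lettuce 6.1+. The timings below are illustrative, not recommendations, and the fine-grained idle/interval/count settings only take effect on a transport that supports extended keep-alive options (epoll, io_uring, or Java 11+ NIO):

```java
import java.time.Duration;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.SocketOptions;

public class KeepAliveConfig {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost");

        SocketOptions socketOptions = SocketOptions.builder()
                .keepAlive(SocketOptions.KeepAliveOptions.builder()
                        .enable()
                        .idle(Duration.ofSeconds(15))    // start probing after 15 s idle
                        .interval(Duration.ofSeconds(5)) // probe every 5 s
                        .count(3)                        // declare peer dead after 3 failed probes
                        .build())
                .build();

        client.setOptions(ClientOptions.builder()
                .socketOptions(socketOptions)
                .build());
    }
}
```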
Thanks for the feedback.
May I close this ticket or is there anything else I can assist you with?
yes you may... thanks
Do the extended keep-alive settings from #1428 really help to solve this problem? What I found out is that keep-alive only works when the connection is idle and there is no traffic. If there is constant traffic going between the application and Redis, and the Redis connection stops responding (but is not closed), we see TCP retransmission kicking in, after which keep-alive messages aren't sent even if the time between retransmission messages is higher than the TCP_KEEPIDLE setting. The count of TCP retransmissions is controlled by the Linux TCP_RETRIES2 setting, whose default value is 15, and only after that is the connection closed from the client OS side. Also, retransmissions are repeated using a kind of exponential back-off policy, so in total it takes around 15-20 minutes for Lettuce to reconnect in such a scenario. During this period, all commands to Redis get timeout errors.

Couldn't Lettuce just re-establish the connection if it gets x consecutive timeout errors on an existing connection? We are using Azure Redis, and during its maintenance windows we sometimes see the Redis connection stay open with no FIN packet arriving, but it stops responding; after restarting the application, a new connection is established successfully. Azure Support helplessly keeps saying that the client application has to be resilient to all types of failures, including such dead connections. However, Lettuce isn't resilient to this problem, and this means that after Azure maintenance we get downtime of ~15 minutes.

We currently worked around this by setting up a health probe and restarting the application if the Redis connection is broken for ~1 minute. Another option would be to reduce the TCP_RETRIES2 setting in Linux. However, the most reliable solution would be for Lettuce to re-establish the connection if it gets x consecutive timeout errors on an existing connection.
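The ~15-minute figure matches what the default retransmission policy predicts. A rough back-of-the-envelope sketch (assuming a 200 ms initial RTO, the Linux defaults tcp_retries2=15 and a 120 s RTO cap; real timings depend on the measured RTT):

```java
// Estimate how long Linux keeps retransmitting before closing a dead
// connection. The 200 ms initial RTO is an assumption; the kernel derives
// the actual RTO from the measured round-trip time.
public class RetransmissionEstimate {
    // Linux caps the retransmission timeout at TCP_RTO_MAX (120 s).
    static final double RTO_MAX_SECONDS = 120.0;

    /** Total seconds spent retransmitting with exponential back-off. */
    static double totalRetransmitSeconds(double initialRtoSeconds, int retries) {
        double total = 0.0;
        double rto = initialRtoSeconds;
        for (int i = 0; i < retries; i++) {
            total += Math.min(rto, RTO_MAX_SECONDS);
            rto *= 2; // RTO doubles on each unanswered retransmission
        }
        return total;
    }

    public static void main(String[] args) {
        // tcp_retries2 = 15 with a 200 ms initial RTO gives ~804.6 s, i.e. ~13.4 min,
        // consistent with the 15-20 minutes observed above.
        double seconds = totalRetransmitSeconds(0.2, 15);
        System.out.printf("~%.1f minutes before the kernel gives up%n", seconds / 60.0);
    }
}
```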
How is the client-side supposed to know that Redis is not available when there's no traffic?
One option I see is that Lettuce could try to create a new connection after getting x consecutive timeout errors. Another option could be a background thread sending periodic "ping" messages to Redis and reconnecting once they start failing. What are your thoughts? Azure told us that they aren't going to fix the missing FIN packet.
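The first option could be sketched roughly like this. This is a hypothetical illustration, not Lettuce API; the class and method names are made up:

```java
// Hypothetical sketch: track consecutive command timeouts and signal that
// the connection should be torn down and rebuilt once a threshold is hit.
public class ConsecutiveTimeoutTracker {
    private final int threshold;
    private int consecutiveTimeouts = 0;

    public ConsecutiveTimeoutTracker(int threshold) {
        this.threshold = threshold;
    }

    /** Call on every command completion. Returns true when a reconnect is due. */
    public synchronized boolean record(boolean timedOut) {
        if (timedOut) {
            consecutiveTimeouts++;
        } else {
            consecutiveTimeouts = 0; // any successful command resets the streak
        }
        return consecutiveTimeouts >= threshold;
    }
}
```

Note the caveat raised later in the thread: a legitimately long-blocking command would look exactly like a timeout streak to such a tracker.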
Keepalive with a short timing is likely to be the appropriate fix here.
Imagine your connection has issued a BLPOP that legitimately blocks for a long time: after x timed-out commands, the client would tear down a connection that is actually healthy. That doesn't seem right.
Do we have any solution?
Currently facing the same kind of issue from a Spring Boot 2.5.3 application (using Lettuce) to a managed Redis service in Azure: it takes about 15 minutes for the application to reconnect to Redis when Redis reboots. The actual reboot duration is much shorter, and during these 15 minutes connections from a third-party client such as RedisInsight work fine. During the 15 minutes, the exception received by the application is:
org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 16 second(s)
For the record, I get the same answers from Azure tech support as @rimvydas-pranciulis-ruptela, i.e.:
I also expected an improvement in Lettuce such as an optional flag meaning "when some queries time out, drop the connection and create a new one" (even if it does not work in some cases such as the "BLPOP" example you gave). @mp911de any chance such an improvement could happen?
I'm facing the same issue, especially when connecting to other data centers. I found that gRPC has the same issue: gRPC performs a kind of self health check on the connection and makes a new connection when the ping fails. I tried this kind of code for the health check and it works well.
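A self health check of that style could look roughly like this. All names here are illustrative, not Lettuce or gRPC API; the ping itself is injected, e.g. a Redis PING issued with a short timeout:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: run a ping periodically and invoke a reconnect
// callback when it fails, instead of waiting for the TCP stack to give up.
public class ConnectionHealthChecker {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** One health-check iteration; returns true if a reconnect was triggered. */
    static boolean checkOnce(BooleanSupplier ping, Runnable reconnect) {
        if (ping.getAsBoolean()) {
            return false; // connection looks healthy
        }
        reconnect.run();  // ping failed: force a fresh connection
        return true;
    }

    public void start(BooleanSupplier ping, Runnable reconnect, long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> checkOnce(ping, reconnect),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```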
This issue has already become a mess, with comments pointing in all sorts of directions.
The maintainers do not have an infinite amount of time. The maintainers do not have the ability to buy accounts for every cloud. The maintainers would accept contributions that help to identify the problem. The maintainers would even accept a pull request. Open source is not a one-way road where maintainers can be seen as an infinite resource. Open source lives from a community, and a single person who is expected to fix other people's problems is not a community.
Bug Report
Current Behavior
Upon losing the connection to Redis, it takes around 15 minutes for the connection to be re-established. We consistently see the following over and over in the logs until the connection is re-established.
After 15 or so minutes, we then see:
Expected behavior/code
We expect the reconnect to happen in a minute or less.
Note: we only see this behavior in Azure Kubernetes. Using the same Docker image (i.e. the one deployed in the cloud) run locally (on our Macs), we see the reconnect happening in less than 1 minute.
Environment