Takes 20 minutes to reconnect to redis #1466
Any reason this issue is closed? The logs from HTTP handlers don't seem related to Lettuce.
The "Connection timed out" part actually seems okay, as that kicks off the ConnectionWatchdog trying to re-establish the connection. The issue is more that it takes 10-15 minutes for this to happen. From what I can tell, the CommandHandler is not detecting that the channel is inactive.
Thanks for clarifying. If a remote peer dies and comes back up, there's no way to detect that state without having traffic. You can generally enable TCP keep-alive for more timely detection of dead peers. #1428 describes a workaround to configure keep-alive options. Lettuce makes proper use of the TCP stack. Depending on what active components Azure introduces, you might experience a different behavior than on your local machine. I'd suggest reaching out to the Azure team to discuss that issue.
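For reference, the keep-alive workaround from #1428 looks roughly like this in Lettuce 6.1+. The timings below are illustrative, not recommendations, and the fine-grained idle/interval/count settings only take effect on a transport that supports extended keep-alive options (epoll, io_uring, or Java 11+ NIO):

```java
import java.time.Duration;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.SocketOptions;

public class KeepAliveConfig {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost");

        SocketOptions socketOptions = SocketOptions.builder()
                .keepAlive(SocketOptions.KeepAliveOptions.builder()
                        .enable()
                        .idle(Duration.ofSeconds(15))    // start probing after 15 s idle
                        .interval(Duration.ofSeconds(5)) // probe every 5 s
                        .count(3)                        // declare peer dead after 3 failed probes
                        .build())
                .build();

        client.setOptions(ClientOptions.builder()
                .socketOptions(socketOptions)
                .build());
    }
}
```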
Thanks for the feedback.
May I close this ticket or is there anything else I can assist you with?
yes you may... thanks
Do the extended keep-alive settings from #1428 really help to solve this problem? What I found out is that keep-alive only works when the connection is idle and there is no traffic. If there is constant traffic going between the application and Redis, and the Redis connection stops responding (but is not closed), we see TCP retransmission kicking in, after which keep-alive messages aren't sent even if the time between retransmission messages is higher than the TCP_KEEPIDLE setting. The count of TCP retransmissions is controlled by the Linux TCP_RETRIES2 setting, whose default value is 15, and only after that is the connection closed from the client OS side. Also, retransmissions are repeated using a kind of exponential back-off policy, so in total it takes around 15-20 minutes for Lettuce to reconnect in such a scenario. During this period, all commands to Redis get timeout errors.

Couldn't Lettuce just re-establish the connection if it gets x consecutive timeout errors on an existing connection? We are using Azure Redis, and during its maintenance windows we sometimes see the Redis connection stay open with no FIN packet arriving, but it stops responding; after restarting the application, a new connection is established successfully. Azure Support helplessly keeps saying that the client application has to be resilient to all types of failures, including such dead connections. However, Lettuce isn't resilient to this problem, and this means that after Azure maintenance we get downtime of ~15 minutes.

We currently worked around this by setting up a health probe and restarting the application if the Redis connection is broken for ~1 minute. Another option would be to reduce the TCP_RETRIES2 setting in Linux. However, the most reliable solution would be for Lettuce to re-establish the connection if it gets x consecutive timeout errors on an existing connection.
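The ~15-minute figure matches what the default retransmission policy predicts. A rough back-of-the-envelope sketch (assuming a 200 ms initial RTO, the Linux defaults tcp_retries2=15 and a 120 s RTO cap; real timings depend on the measured RTT):

```java
// Estimate how long Linux keeps retransmitting before closing a dead
// connection. The 200 ms initial RTO is an assumption; the kernel derives
// the actual RTO from the measured round-trip time.
public class RetransmissionEstimate {
    // Linux caps the retransmission timeout at TCP_RTO_MAX (120 s).
    static final double RTO_MAX_SECONDS = 120.0;

    /** Total seconds spent retransmitting with exponential back-off. */
    static double totalRetransmitSeconds(double initialRtoSeconds, int retries) {
        double total = 0.0;
        double rto = initialRtoSeconds;
        for (int i = 0; i < retries; i++) {
            total += Math.min(rto, RTO_MAX_SECONDS);
            rto *= 2; // RTO doubles on each unanswered retransmission
        }
        return total;
    }

    public static void main(String[] args) {
        // tcp_retries2 = 15 with a 200 ms initial RTO gives ~804.6 s, i.e. ~13.4 min,
        // consistent with the 15-20 minutes observed above.
        double seconds = totalRetransmitSeconds(0.2, 15);
        System.out.printf("~%.1f minutes before the kernel gives up%n", seconds / 60.0);
    }
}
```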
How is the client-side supposed to know that Redis is not available when there's no traffic?
One option I see is that Lettuce could try to create a new connection after getting x consecutive timeout errors. Another option could be a background thread sending periodic "ping" messages to Redis and reconnecting once they start failing. What are your thoughts? Azure told us that they aren't going to fix the missing FIN packet.
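The first option could be sketched roughly like this. This is a hypothetical illustration, not Lettuce API; the class and method names are made up:

```java
// Hypothetical sketch: track consecutive command timeouts and signal that
// the connection should be torn down and rebuilt once a threshold is hit.
public class ConsecutiveTimeoutTracker {
    private final int threshold;
    private int consecutiveTimeouts = 0;

    public ConsecutiveTimeoutTracker(int threshold) {
        this.threshold = threshold;
    }

    /** Call on every command completion. Returns true when a reconnect is due. */
    public synchronized boolean record(boolean timedOut) {
        if (timedOut) {
            consecutiveTimeouts++;
        } else {
            consecutiveTimeouts = 0; // any successful command resets the streak
        }
        return consecutiveTimeouts >= threshold;
    }
}
```

Note the caveat raised later in the thread: a legitimately long-blocking command would look exactly like a timeout streak to such a tracker.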
Keepalive with a short timing is likely to be the appropriate fix here.
Imagine your connection has issued a BLPOP that legitimately blocks for a long time: after x timed-out commands, the client would tear down a connection that is actually healthy. That doesn't seem right.
Do we have any solution?
Currently facing the same kind of issue from a Spring Boot 2.5.3 application (using Lettuce) to a managed Redis service in Azure: it takes about 15 minutes for the application to reconnect to Redis when Redis reboots. The actual reboot duration is much shorter, and during these 15 minutes connections from a third-party client such as RedisInsight work fine. During the 15 minutes, the exception received by the application is:
org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 16 second(s)
For the record, I get the same answers from Azure tech support as @rimvydas-pranciulis-ruptela, i.e.:
I also expected an improvement in Lettuce such as an optional flag meaning "when some queries time out, drop the connection and create a new one" (even if it does not work in some cases such as the "BLPOP" example you gave). @mp911de any chance such an improvement could happen?
I'm facing the same issue, especially when connecting to other data centers. I found that gRPC has the same issue: gRPC performs a kind of self health check on the connection and makes a new connection when the ping fails. I tried this kind of code for the health check and it works well.
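A self health check of that style could look roughly like this. All names here are illustrative, not Lettuce or gRPC API; the ping itself is injected, e.g. a Redis PING issued with a short timeout:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: run a ping periodically and invoke a reconnect
// callback when it fails, instead of waiting for the TCP stack to give up.
public class ConnectionHealthChecker {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /** One health-check iteration; returns true if a reconnect was triggered. */
    static boolean checkOnce(BooleanSupplier ping, Runnable reconnect) {
        if (ping.getAsBoolean()) {
            return false; // connection looks healthy
        }
        reconnect.run();  // ping failed: force a fresh connection
        return true;
    }

    public void start(BooleanSupplier ping, Runnable reconnect, long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> checkOnce(ping, reconnect),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```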
This issue has already become a mess, with comments pointing in all sorts of directions.
The maintainers do not have an infinite amount of time. The maintainers do not have the ability to buy accounts for every cloud. The maintainers would accept contributions that help to identify the problem. The maintainers would even accept a pull request. Open source is not a one-way road where maintainers can be seen as an infinite resource. Open source lives from a community, and a single person who is expected to fix other people's problems is not a community.
Bug Report
Current Behavior
Upon losing the connection to Redis, it takes around 15 minutes for the connection to be re-established. We consistently see the following over and over in the logs until the connection is re-established.
After 15 or so minutes, we then see:
Expected behavior/code
We expect the reconnect to happen in a minute or less.
Note: we only see this behavior in Azure Kubernetes. Using the same Docker image (i.e. the one deployed in the cloud) run locally (on our Macs), we see the reconnect happening in less than 1 minute.
Environment