Re-establish connections in the case of Master/Slave failover #338
Hi @tomfitzhenry, Static-Master-Slave connections are designed intentionally that way - AWS ElastiCache reports internal IP addresses and the connection point details need to be user-provided. Regular Master-Slave connections could be self-discovering. That would lead to a crawler-like behavior in which lettuce could discover all members of the Master-Slave setup, but it comes with some challenges:
|
I work with Tom and put together a bit of external code (it calls shutdown and creates a new client) to recreate the Lettuce client whenever it seemed to need topology or node updates. Based on tests with that workaround, it seems something along those lines would be a great help if it could be included in Lettuce.
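For reference, a minimal sketch of what such an external workaround could look like with the Lettuce 5.x Master/Slave API; the class and method names are purely illustrative, not part of Lettuce:

```java
import java.util.List;

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

class RedisRebuilder {

    // Shuts the stale client down and builds a fresh client and connection so
    // the master/replica roles are discovered again from the given endpoints.
    // The caller should also keep a reference to the new client for shutdown later.
    static StatefulRedisMasterSlaveConnection<String, String> recreate(
            RedisClient oldClient, List<RedisURI> endpoints) {
        oldClient.shutdown();
        RedisClient fresh = RedisClient.create();
        return MasterSlave.connect(fresh, StringCodec.UTF8, endpoints);
    }
}
```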
|
Thanks for the detailed description. Lettuce provides a similar facility for Redis Cluster (listening to events during operations; adaptive topology refresh). I think it would make sense to expose the refresh trigger API and accept a custom implementation, so users are able to build their own. |
@mp911de Follow my use case:

```java
@Bean
public RedisClient redisClient() {
    return RedisClient.create(DefaultClientResources
            .builder()
            .dnsResolver(new DirContextDnsResolver())
            .reconnectDelay(Delay.constant(Duration.ofSeconds(reconnectionDelay)))
            .build());
}

@Bean
public StatefulRedisMasterSlaveConnection<String, String> redisConn(RedisClient redisClient) {
    RedisURI master = RedisURI.create("redis://****-001.****.****.****.amazonaws.com:6379");
    RedisURI slave = RedisURI.create("redis://****-002.****.****.****.amazonaws.com:6379");
    StatefulRedisMasterSlaveConnection<String, String> connect = MasterSlave.connect(
            redisClient,
            Utf8StringCodec.UTF8,
            Arrays.asList(master, slave));
    connect.setTimeout(Duration.ofSeconds(readTimeout));
    return connect;
}
```
|
Hey @jaimebrolesi Longer version: AWS ElastiCache Master/Slave (and Master/Slave as known from Redis, without Sentinel) does not provide any details about topology changes. There is no way to discover that a failover (or reconfiguration) has happened. I'm not terribly familiar with AWS; maybe AWS provides events that can be captured in such a case. Because Master/Slave changes are typically an operational task performed outside of Redis, we made the assumption that these things don't happen while an application is running. Changing a Master/Slave setup is basically not constrained in any way, so we can't assume that the currently connected nodes will persist across a change. Failovers/changes require an application restart to pick up the new configuration. |
@mp911de OR we can define a new topology strategy for AWS: if a command timeout happens 3 times (configurable), we can start a re-connection policy using the same INFO replication strategy as MasterSlaveTopologyProvider's getNodes() method. This is possible because AWS gives us two kinds of connection: a load-balanced one (with DNS issues) and an individual one (a hostname for each node). What do you think?! I can help with an AWS explanation or coding :) hehehe. |
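A rough sketch of that idea, assuming the Lettuce 5.x Master/Slave API; the timeout threshold, class name and helper methods are illustrative assumptions, not existing Lettuce configuration:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisCommandTimeoutException;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

class TimeoutTriggeredRefresh {

    private final RedisClient client;
    private final List<RedisURI> endpoints;
    private final int maxTimeouts = 3;               // configurable threshold
    private final AtomicInteger timeouts = new AtomicInteger();
    private volatile StatefulRedisMasterSlaveConnection<String, String> connection;

    TimeoutTriggeredRefresh(RedisClient client, List<RedisURI> endpoints) {
        this.client = client;
        this.endpoints = endpoints;
        this.connection = MasterSlave.connect(client, StringCodec.UTF8, endpoints);
    }

    String get(String key) {
        try {
            String value = connection.sync().get(key);
            timeouts.set(0);                         // reset the counter on success
            return value;
        } catch (RedisCommandTimeoutException e) {
            if (timeouts.incrementAndGet() >= maxTimeouts) {
                refresh();                           // re-discover the roles
            }
            throw e;
        }
    }

    // Rebuild the Master/Slave connection; Lettuce re-reads the roles on connect.
    private synchronized void refresh() {
        connection.close();
        connection = MasterSlave.connect(client, StringCodec.UTF8, endpoints);
        timeouts.set(0);
    }
}
```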
I think we can increase visibility/provide an SPI to either trigger a refresh from outside or supply endpoint details so AWS-specific tooling can contribute to the topology/topology update. We will not introduce Cloud-specific (in this case AWS-specific) functionality to Lettuce that uses non-Redis infrastructure. |
I'm investigating a similar issue with AWS ElastiCache, but I'm also experiencing problems related to DNS caching. In our use case it seems like hostname resolutions are cached forever, regardless of the DNS resolver used, because of the behaviour of SocketAddressResolver. We're using Lettuce 5.0.4 in Spring Boot 2.0.1. We set up our RedisConnectionFactory in the following way:

```java
ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.ofSeconds(30))
        .enableAllAdaptiveRefreshTriggers()
        .build();
ClientOptions clientOptions = ClusterClientOptions.builder()
        .topologyRefreshOptions(topologyRefreshOptions)
        .build();
ClientResources clientResources = DefaultClientResources.builder().dnsResolver(DnsResolvers.JVM_DEFAULT).build();
RedisClusterConfiguration redisClusterConfiguration = new RedisClusterConfiguration(clusterNodes);
LettuceClientConfiguration lettuceClientConfiguration = LettuceClientConfiguration.builder()
        .clientResources(clientResources)
        .clientOptions(clientOptions).build();
return new LettuceConnectionFactory(redisClusterConfiguration, lettuceClientConfiguration);
```

(I've also tried not changing the DNS resolver at all and using the default.) We're not using the static topology because we have a VPC connection into AWS and the member node IPs are accessible to us. Also, if I understand DATAREDIS-580 and DATAREDIS-762 correctly, it's currently not possible to use the static topology with Spring Data. The above works fine except in the case where the IP address mapped to the hostname changes. This can be triggered manually by deleting and recreating the cluster with the same name, but the AWS docs also explicitly warn that DNS mappings should not be cached and are prone to change. My experience with AWS services matches that. The problem seems to be in SocketAddressResolver: the DNS resolution is skipped if the address has already been resolved. It's very possible I'm missing something, but I don't see any way to configure the connection so that DNS is resolved again upon reconnection. Is there something we can do within the bounds of Spring Data to resolve this? Will StatefulRedisMasterSlaveConnection improve the situation? |
The mentioned line is used when configuring Lettuce to use Unix domain sockets; then we use local file resolution to resolve the file descriptor. Using … If this does not help, please file a new bug report along with some details so we can have a closer look. |
Guys, I am dealing with this same issue. After an AWS failover I got an exception because writes were going to a read instance, since the master/read roles were only discovered at the beginning. Is there any config to allow reconnections in this exception case? Do you think we can do something similar to issue 822, trying to reconnect in case of an exception based on some configuration? Do you think it is useful to create a Jira for this, or is this something that is not going to be added? |
Luis,

Whenever a failover occurs your program will need 60 seconds (the time AWS needs to change the IP behind the ELB) to identify the change between the write and read machine. For some weird reason the Java DNS resolver has problems picking up the IP change in the AWS environment; for this reason Mark developed the *DirContextDnsResolver*. All you can do is change the resolver used for reconnection and live with these 60 seconds of "write" exceptions, because, like I said, that is the time AWS needs to change the IP.
|
@ldebello I haven't had a chance to do a failover test on the newer versions but in 5.0.4 setting |
@mp911de Do you have any pointers on this, do you think this is still a good way to solve this issue? Wouldn't mind working on this. |
The issue requires some design and this is the hard part. Writing down the code is the easy part here. |
No. #672 is a Redis Cluster issue. This one is Master/Slave without Redis Cluster. |
Hi @mp911de, I've been trying to read quite a bit on this (in particular #1008 and this issue) as we're having a very similar issue and have been looking for a workaround.
We use the default Spring Boot Data Redis autoconfiguration (with pooling) by providing host and port (so basically a standalone configuration). When a primary/replica failover occurs, where the primary doesn't die but just changes role, the application is not able to recover from it. Existing connections are still connected to the former primary node and write commands fail with READONLY errors. We've been looking at ways to catch that particular exception and force the connection factory to re-create its connections. This is the best we could find at the moment.
It looks something like this:

```java
try {
    // write to redis
} catch (RedisSystemException e) {
    if (e.getCause() instanceof RedisCommandExecutionException) {
        if (e.getCause().getMessage().startsWith("READONLY")) {
            final RedisConnection connection = connectionFactory.getConnection();
            connectionFactory.resetConnection();
            ((RedisAsyncCommands) connection.getNativeConnection()).getStatefulConnection().close();
        }
    }
}
```

After a few errors, eventually the connections are recreated and the application recovers. Of course this is ugly as hell and most likely not intended at all. I'm also worried about unintended side-effects. We're looking for guidance on how to best handle this scenario gracefully (manually restarting our dozen instances is not an option). |
Hi guys, |
@charlesardsilva AWS has added a master endpoint and a reader endpoint, so you could use those DNS names to solve the issue.

```java
@Bean
public RedisStaticMasterReplicaConfiguration redisStaticMasterReplicaConfiguration(
        @Value("${spring.redis.master-host:localhost}") String masterHost,
        @Value("${spring.redis.slave-host:localhost}") String slaveHost,
        @Value("${spring.redis.port:9991}") Integer port) {
    RedisStaticMasterReplicaConfiguration redisStaticMasterReplicaConfiguration =
            new RedisStaticMasterReplicaConfiguration(masterHost, port);
    redisStaticMasterReplicaConfiguration.addNode(slaveHost, port);
    return redisStaticMasterReplicaConfiguration;
}

@Bean
public LettuceConnectionFactory connectionFactory(
        RedisStaticMasterReplicaConfiguration redisStaticMasterReplicaConfiguration) {
    final SocketOptions socketOptions = SocketOptions.builder().connectTimeout(this.redisConnectTimeout).build();
    final ClientOptions clientOptions = ClientOptions.builder()
            .socketOptions(socketOptions)
            .autoReconnect(true)
            .build();
    LettuceClientConfiguration clientConfiguration = LettuceClientConfiguration.builder()
            .readFrom(ReadFrom.SLAVE_PREFERRED)
            .clientOptions(clientOptions)
            .commandTimeout(this.redisCommandTimeout)
            .build();
    return new LettuceConnectionFactory(redisStaticMasterReplicaConfiguration, clientConfiguration);
}
```
|
@ldebello Were you able to solve this problem? |
Currently we have accepted the proposed approach; it is not perfect, but at least we can use master/replicas. |
Nothing that we could do from a Lettuce-only perspective. Lettuce requires additional information about the new (changed) topology, and that is something that needs to be provided externally. We have faced the requirement to refresh the topology on demand a few times. We could provide a way to trigger such a refresh externally; we still need to figure out what a suitable API for that would look like. |
@mp911de I don't know if I'm right, but Lettuce uses the INFO command by default to identify which node is the master and which is the slave, right? So, I believe that is the problem. When you use AWS ElastiCache, the IP returned by the INFO command is an AWS internal IP. The real IP or VPC IP is unavailable in the INFO output. For AWS, the correct way to reload the nodes is to use the DNS names provided in the topology configuration. |
That is exactly why we have the static Master/Replica mode that takes user-specified endpoint addresses; Lettuce figures out the roles from the given array of endpoints. The discovery-based mode is intended mostly for on-premise setups. We do not want to integrate with any sort of cloud provider SDK, as Lettuce is a Redis client, not a Swiss army cloud knife. Single-node failover already works this way, as the hostname is resolved upon (re)connect. |
Let me see if my understanding is correct...

**AWS ElastiCache behavior**

AWS provides DNS CNAMEs for the primary and reader endpoints. AWS ElastiCache will automatically reconfigure the topology based on various events (individual node failure, manual master promotion, manually adding or removing a node in the AWS web console, etc.). When any of these events occurs, ElastiCache makes the changes to the topology and then updates the DNS CNAME records when complete.

**Example use case**

Let's assume we have a long-running process using Lettuce to interact with the ElastiCache topology. This process has write requirements, consistent read requirements (read from master), and high-volume read requirements where stale data is acceptable (e.g. it is ok to read from a replica). For a short-lived program we could probably get away with simply creating a "Static Master/Replica using provided endpoints" configuration based on the DNS names for the master and replica load balancer. For a long-lived program we need (1) the ability to detect a topology change event and (2) the ability to reconfigure our Lettuce client when (1) is triggered. Let's also assume we are not running Redis in a cluster topology, just a simple master/replica configuration, and that the client is running in AWS as well, so we can take advantage of DNS resolving to the non-public IP addresses of the Redis nodes.

**Approach**

For ElastiCache it seems like there are two possible approaches we can take, covering initialization, topology change discovery, and the Lettuce client update (a sketch follows this comment).

My understanding is that there is more advanced support in Lettuce for detecting topology changes and updating the Lettuce client accordingly for both Sentinel and Redis Cluster. For the simple Master/Replica configuration some of the Lettuce APIs do not apply. It seems reasonable for Lettuce not to build cloud-specific logic into Lettuce itself. At the same time, the ElastiCache use case in non-cluster mode seems to be a common one. Users want the ElastiCache update mechanism to play nicely with Lettuce in a concurrent processing system. At this point, it is not clear how to do that; there seems to be an impedance mismatch. Hence this open GitHub issue. An ideal outcome would be documentation and example code on the Lettuce wiki that demonstrates how to use Lettuce with ElastiCache for high-availability failover in a non-cluster, master/replica configuration. This may or may not prompt enhancements to Lettuce itself. |
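To make the detect-and-reconfigure idea above concrete, here is a hedged sketch of a long-running process that polls the ElastiCache DNS name and rebuilds a static Master/Replica connection when the resolved address changes. The endpoint hostnames, the 30-second poll interval, and the class name are placeholders, not anything prescribed by Lettuce or AWS:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

class DnsWatchingMasterReplica {

    private final RedisClient client = RedisClient.create();
    private final List<RedisURI> endpoints = List.of(
            RedisURI.create("redis://primary.example.cache.amazonaws.com:6379"),   // placeholder
            RedisURI.create("redis://reader.example.cache.amazonaws.com:6379"));   // placeholder

    private volatile String lastPrimaryAddress = "";
    private volatile StatefulRedisMasterSlaveConnection<String, String> connection =
            MasterSlave.connect(client, StringCodec.UTF8, endpoints);

    void startWatching() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::checkDns, 0, 30, TimeUnit.SECONDS);
    }

    private void checkDns() {
        try {
            String current = InetAddress
                    .getByName(endpoints.get(0).getHost())
                    .getHostAddress();
            String previous = lastPrimaryAddress;
            lastPrimaryAddress = current;
            if (!previous.isEmpty() && !previous.equals(current)) {
                // The CNAME now resolves to a different address: ElastiCache has most
                // likely reconfigured the topology, so rebuild the connection to pick
                // up the new master/replica roles.
                StatefulRedisMasterSlaveConnection<String, String> old = connection;
                connection = MasterSlave.connect(client, StringCodec.UTF8, endpoints);
                old.close();
            }
        } catch (UnknownHostException e) {
            // transient DNS failure; try again on the next tick
        }
    }

    StatefulRedisMasterSlaveConnection<String, String> connection() {
        return connection;
    }
}
```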
Let's go back to the first problem and talk about the response to a failover: before the failover, node 1 is the primary; after the failover, node 1 is a replica. However, in the StaticMasterSlave configuration the nodes known to Lettuce still list node 1 as primary. This is because the StaticMasterSlave configuration does not support refreshing RedisNodeDescription after the initial connection.
This is understood. However, I think it needs to be exposed so that the user can raise the event or act on it directly.
or
I also think that detecting a dynamically growing number of nodes is not something the StaticMasterSlave strategy will do. However, the above approach can achieve redistribution of the static nodes, and the reconnect logic for a single node can work well. |
It would be great if we could have an approach like this: https://github.com/luin/ioredis#reconnect-on-error, even with Spring wrappers. |
Hi everyone, I am trying to write some code in my application to re-initialise the Redis client after a failover. Instead of catching the exception on every read/write, is there a way to intercept all read/write operations and initialise a new client connection on a failure resulting from a failover? Can someone suggest a way to do it? |
Right now, there's no method to apply a new topology to a StatefulRedisMasterReplicaConnection. The only thing possible is to re-obtain StatefulRedisMasterReplicaConnection. |
Hi Mark,

Thanks for the quick response. Indeed I am trying to re-obtain the Redis connection, but I am not in favor of obtaining the connection by surrounding the read/write operation with a try-catch block, as it needs code changes at every place where a read/write operation is invoked. Instead I want to intercept all read/write methods using Spring aspects. Read/write operations are defined inside the DefaultValueOperations class, and I can't create a bean of it to implement the aspect as its access modifier is default (package-private). Is there a way to intercept the read/write operations?

Thanks in advance.
Vikram
|
Multi-node connections are always subject to mixed availability. Depending on what hosts are up/down, a connection may work for parts of the commands. There's no way to tell from the outside. |
Ok. Can we do something in one central place, instead of littering the code base with try-catch blocks, so that the re-initialisation is triggered on read/write failure? Any small clue or idea would be a great help. Thanks. |
Using Spring, enable |
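For what it's worth, one way to keep this logic in a single place (a hedged sketch, not an official Lettuce or Spring Data facility; the class and method names are illustrative) is to funnel Redis access through one helper that resets the LettuceConnectionFactory when the READONLY error from the earlier comments shows up:

```java
import java.util.function.Supplier;

import io.lettuce.core.RedisCommandExecutionException;

import org.springframework.data.redis.RedisSystemException;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.core.StringRedisTemplate;

class FailoverAwareRedisOperations {

    private final StringRedisTemplate template;
    private final LettuceConnectionFactory connectionFactory;

    FailoverAwareRedisOperations(StringRedisTemplate template,
            LettuceConnectionFactory connectionFactory) {
        this.template = template;
        this.connectionFactory = connectionFactory;
    }

    void set(String key, String value) {
        execute(() -> { template.opsForValue().set(key, value); return null; });
    }

    String get(String key) {
        return execute(() -> template.opsForValue().get(key));
    }

    // Single choke point: every operation goes through here, so the failover
    // handling does not have to be repeated at each call site.
    private <T> T execute(Supplier<T> operation) {
        try {
            return operation.get();
        } catch (RedisSystemException e) {
            if (e.getCause() instanceof RedisCommandExecutionException) {
                String msg = e.getCause().getMessage();
                if (msg != null && msg.startsWith("READONLY")) {
                    // Writes are hitting a node that became a replica after a failover:
                    // drop the shared native connection so it is re-initialized on next access.
                    connectionFactory.resetConnection();
                }
            }
            throw e;
        }
    }
}
```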
Can you explain in more detail why I'm getting the following error on failovers? |
Ran into this -- somehow one of our app's replicas didn't catch the topology update coming from Sentinel, or didn't respond to it correctly. Ideally there would be an easy way to trigger a topology update and recover when we encounter READONLY exceptions. |
Is this issue being solved? I've just encountered such a problem with ElastiCache: one replica was promoted to master (3 shards, master + 2 replicas), and the app couldn't write (PUT and DEL operations were failing) for 15 minutes. The problem was solved by redeploying the app (thus the Redis client was reinitialised with the new topology). How can we mitigate that problem? |
Hi all, any news on this? I have a similar problem. I configured my Spring Boot application to use the Lettuce client, with 6 Redis nodes on ElastiCache (3 masters + 3 slaves). When a master goes down (to test a failover), the application stops working because it keeps trying to connect to the old master and gets a connection timeout. The configuration follows: spring … |
The master-slave docs say:
So that users don't have to implement this themselves, it'd be great if lettuce could do this transparently.