ClusterClient Receptionist & Client failure detector race conditions #2312

Closed
Aaronontheweb opened this issue Sep 20, 2016 · 5 comments · Fixed by #3866

Comments

@Aaronontheweb (Member)

I've been able to verify the existence of a bug with two different sets of logs from the Akka.Cluster.Tools.MultiNode.ClusterClientSpec, which I've attached as zip files here.

Akka.Cluster.Tools.Tests.MultiNode.Client.ClusterClientMultiNode - failure set 1.zip

Akka.Cluster.Tools.Tests.MultiNode.Client.ClusterClientMultiNode - failure set 2.zip

There's a possibility that this could be a bug in the spec itself, but I'm skeptical of that given the lengthy periods of time this error spans and the fact that the error occurs in both directions:

Failure set 1 reveals that the ClusterClient's subscribers are not notified in time that a receptionist has gone down.

RunOn(() =>
{
    // Locate the test listener from a previous test and see that it agrees
    // with what the client is telling it about what receptionists are alive
    var l = Sys.ActorSelection("/user/reporter-client-listener");
    var expectedContacts = _remainingServerRoleNames.Select(c => Node(c) / "system" / "receptionist");
    Within(10.Seconds(), () =>
    {
        AwaitAssert(() =>
        {
            var probe = CreateTestProbe();
            l.Tell(ClusterClientSpecConfig.TestClientListener.GetLatestContactPoints.Instance, probe.Ref);
            probe.ExpectMsg<ClusterClientSpecConfig.TestClientListener.LatestContactPoints>()
                .ContactPoints.Should()
                .BeEquivalentTo(expectedContacts);
        });
    });
}, _config.Client);

EnterBarrier("after-4");

Node 1, the client, fails because:

[Node1][FAIL-EXCEPTION] Type: Xunit.Sdk.XunitException
--> [Node1][FAIL-EXCEPTION] Message: Expected collection {akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53082/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53084/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53080/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53083/system/receptionist} to be equivalent to {akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53083/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53080/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53082/system/receptionist}, but it contains too many items.

The receptionist at 53084 was terminated earlier in the spec, and we verified that it was terminated, yet the subscriber actor on the client never received a notification, even though roughly 15-20 seconds elapsed between the node being terminated and this assertion running out of time (I checked this in the logs). That indicates an unsafe or unaccounted-for failed write somewhere inside the ClusterClient, not a problem with the spec.
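
For reference, the reporter-client-listener used above is essentially an actor that subscribes to the ClusterClient's contact-point notifications and replies with the latest set it has seen. Below is a minimal sketch of that pattern, assuming the SubscribeContactPoints / ContactPoints / ContactPointAdded / ContactPointRemoved messages from Akka.Cluster.Tools.Client (property names assumed by analogy with ClusterClients.ClusterClientsList used later in this spec); GetLatestContactPoints and LatestContactPoints are hypothetical stand-ins for the spec's TestClientListener messages.

using System.Collections.Immutable;
using Akka.Actor;
using Akka.Cluster.Tools.Client;

// Hypothetical request/reply types for this illustration only.
public sealed class GetLatestContactPoints
{
    public static readonly GetLatestContactPoints Instance = new GetLatestContactPoints();
    private GetLatestContactPoints() { }
}

public sealed class LatestContactPoints
{
    public LatestContactPoints(IImmutableSet<ActorPath> contactPoints) => ContactPoints = contactPoints;
    public IImmutableSet<ActorPath> ContactPoints { get; }
}

// Sketch of a client-side listener: subscribe to the ClusterClient's contact-point
// notifications and answer queries with the latest known set. Message and property names
// (SubscribeContactPoints, ContactPoints.ContactPointsList, ContactPointAdded.ContactPoint,
// ContactPointRemoved.ContactPoint) are assumed, not verified against the library.
public class ContactPointListener : ReceiveActor
{
    private IImmutableSet<ActorPath> _contactPoints = ImmutableHashSet<ActorPath>.Empty;

    public ContactPointListener(IActorRef clusterClient)
    {
        // Ask the ClusterClient to push contact-point notifications to this actor.
        clusterClient.Tell(SubscribeContactPoints.Instance);

        Receive<ContactPoints>(cp => _contactPoints = cp.ContactPointsList);
        Receive<ContactPointAdded>(cp => _contactPoints = _contactPoints.Add(cp.ContactPoint));
        Receive<ContactPointRemoved>(cp => _contactPoints = _contactPoints.Remove(cp.ContactPoint));
        Receive<GetLatestContactPoints>(_ => Sender.Tell(new LatestContactPoints(_contactPoints)));
    }
}

If the ClusterClient never delivers a ContactPointRemoved (or a refreshed ContactPoints set) to a subscriber like this, the listener keeps reporting the dead receptionist indefinitely, which is exactly what failure set 1 shows.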

Failure set 2 is from a different test run, and it shows the same problem in reverse: the receptionist's subscribers aren't notified about a client becoming unreachable, even after a long period of time has elapsed.

The exception occurs on just one of the receptionist nodes, although the others fail immediately afterwards since the barrier never gets passed.

RunOn(() =>
{
    // Only run this test on a node that knows about our client. It could be that no node knows
    // but there isn't a means of expressing that at least one of the nodes needs to pass the test.
    var r = ClusterClientReceptionist.Get(Sys).Underlying;
    r.Tell(GetClusterClients.Instance);
    var cps = ExpectMsg<ClusterClients>();
    if (cps.ClusterClientsList.Any(c => c.Path.Name.Equals("client")))
    {
        Log.Info("Testing that the receptionist has just one client");
        var l = Sys.ActorOf(
            Props.Create(() => new ClusterClientSpecConfig.TestReceptionistListener(r)),
            "reporter-receptionist-listener");

        var c = Sys
            .ActorSelection(Node(_config.Client) / "user" / "client")
            .ResolveOne(Dilated(2.Seconds())).Result;

        var expectedClients = ImmutableHashSet.Create(c);
        Within(10.Seconds(), () =>
        {
            AwaitAssert(() =>
            {
                var probe = CreateTestProbe();
                l.Tell(ClusterClientSpecConfig.TestReceptionistListener.GetLatestClusterClients.Instance, probe.Ref);
                probe.ExpectMsg<ClusterClientSpecConfig.TestReceptionistListener.LatestClusterClients>()
                    .ClusterClients.Should()
                    .BeEquivalentTo(expectedClients);
            });
        });
    }
}, _config.First, _config.Second, _config.Third);

Similar issue as before:

[Node4][FAIL-EXCEPTION] Type: Xunit.Sdk.XunitException
--> [Node4][FAIL-EXCEPTION] Message: Expected collection {[akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53454/user/client#986553129], [akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53454/user/client1#1210653891]} to be equivalent to {[akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53454/user/client#986553129]}, but it contains too many items.

client1 was terminated two test methods earlier, during ClusterClient_must_communicate_to_any_node_in_cluster, and according to the logs roughly 15 seconds elapsed between the Context.Stop call on client1 and this assertion failing. This again indicates that the state of at least one receptionist isn't being updated consistently.
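
For context, the receptionist-side listener is the mirror image of the client-side sketch above: subscribe to the receptionist's client notifications and answer queries with the latest set. A minimal sketch follows, again with assumed message and property names (SubscribeClusterClients, ClusterClientUp.ClusterClient, ClusterClientUnreachable.ClusterClient); only ClusterClients.ClusterClientsList appears in the spec code above, and GetLatestClusterClients / LatestClusterClients are hypothetical stand-ins for the spec's TestReceptionistListener messages.

using System.Collections.Immutable;
using Akka.Actor;
using Akka.Cluster.Tools.Client;

// Hypothetical request/reply types for this illustration only.
public sealed class GetLatestClusterClients
{
    public static readonly GetLatestClusterClients Instance = new GetLatestClusterClients();
    private GetLatestClusterClients() { }
}

public sealed class LatestClusterClients
{
    public LatestClusterClients(IImmutableSet<IActorRef> clusterClients) => ClusterClients = clusterClients;
    public IImmutableSet<IActorRef> ClusterClients { get; }
}

// Sketch of a receptionist-side listener; SubscribeClusterClients, ClusterClientUp and
// ClusterClientUnreachable are assumed to follow the same pattern as GetClusterClients /
// ClusterClients used earlier in this spec.
public class ReceptionistListener : ReceiveActor
{
    private IImmutableSet<IActorRef> _clusterClients = ImmutableHashSet<IActorRef>.Empty;

    public ReceptionistListener(IActorRef receptionist)
    {
        // `receptionist` is the underlying receptionist actor,
        // i.e. ClusterClientReceptionist.Get(Sys).Underlying.
        receptionist.Tell(SubscribeClusterClients.Instance);

        Receive<ClusterClients>(cc => _clusterClients = cc.ClusterClientsList);
        Receive<ClusterClientUp>(cc => _clusterClients = _clusterClients.Add(cc.ClusterClient));
        Receive<ClusterClientUnreachable>(cc => _clusterClients = _clusterClients.Remove(cc.ClusterClient));
        Receive<GetLatestClusterClients>(_ => Sender.Tell(new LatestClusterClients(_clusterClients)));
    }
}

If the receptionist never publishes an "unreachable" notification for the stopped client1, a listener like this keeps reporting two clients, which matches the Node4 assertion failure above.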

Any ideas what the common element is between these two bugs? cc @alexvaluyskiy @Horusiath

@Horusiath (Contributor) commented Jan 4, 2017

After investigating the issue with Sean, I think a possible reason is that the ClusterReceptionist doesn't get notified with ClusterEvent.MemberRemoved. The most probable reason behind that is that unreachable nodes don't get downed correctly.

From what you've shown, it looks like the actual list of contacts has one more entry than the expected one (the node listening on port 53084). However, from the logs (search for the phrase Leader is auto-downing unreachable node) it looks like that node is never downed for some reason.
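
To illustrate the dependency described here, below is a minimal sketch of an actor that only prunes state on cluster domain events, using the public Cluster.Subscribe API. The MemberWatcher actor is hypothetical and stands in for the event flow only; it is not the actual ClusterReceptionist implementation. The point is that if the leader never downs the unreachable node, MemberRemoved is never published and the stale entry is never removed.

using Akka.Actor;
using Akka.Cluster;

// Hypothetical illustration of the event flow described above, not the actual
// ClusterReceptionist implementation: an actor that only prunes state when the cluster
// publishes MemberRemoved. If an unreachable node is never downed, MemberRemoved never
// arrives and the stale entry is kept forever.
public class MemberWatcher : ReceiveActor
{
    private readonly Cluster _cluster;

    public MemberWatcher()
    {
        _cluster = Cluster.Get(Context.System);

        Receive<ClusterEvent.UnreachableMember>(m =>
            Context.System.Log.Warning("Member {0} is unreachable but has not been downed yet", m.Member.Address));
        Receive<ClusterEvent.MemberRemoved>(m =>
            Context.System.Log.Info("Member {0} removed; pruning it from local state", m.Member.Address));
    }

    protected override void PreStart()
    {
        // Receive cluster domain events as individual messages.
        _cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.UnreachableMember), typeof(ClusterEvent.MemberRemoved));
    }

    protected override void PostStop()
    {
        _cluster.Unsubscribe(Self);
    }
}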

@SeanFarrow (Contributor)

Do you have any idea why this node isn't auto-downed? Is there something we can do with the unreachable event?

@Horusiath (Contributor) commented Jan 4, 2017 via email

@SeanFarrow (Contributor) commented Jan 4, 2017 via email

@alexvaluyskiy alexvaluyskiy modified the milestones: 1.2.0, 1.3.0 Apr 6, 2017
@alexvaluyskiy alexvaluyskiy removed this from the 1.3.0 milestone Jul 18, 2017
Aaronontheweb added a commit that referenced this issue Jul 26, 2019
Fixed broken `IComparer` for ClusterClient hash ring and ported over other handoff fixes.

Close #2535
Close #2312
Close #3840

* implemented akka/akka#24167

* implemented akka/akka#22992