ClusterClient Receptionist & Client failure detector race conditions #2312

Closed
Aaronontheweb opened this issue Sep 20, 2016 · 5 comments · Fixed by #3866

Comments

@Aaronontheweb (Member)

I've been able to verify the existence of a bug with two different sets of logs from the Akka.Cluster.Tools.MultiNode.ClusterClientSpec, which I've attached as zip files here.

Akka.Cluster.Tools.Tests.MultiNode.Client.ClusterClientMultiNode - failure set 1.zip

Akka.Cluster.Tools.Tests.MultiNode.Client.ClusterClientMultiNode - failure set 2.zip

There's a possibility that this could be a bug in the spec itself, but I'm skeptical of that given the lengthy periods of time this error spans and the fact that the error occurs in both directions:

Failure set 1 reveals that the ClusterClient's subscribers are not notified in time that a receptionist has gone down.

RunOn(() =>
{
    // Locate the test listener from a previous test and see that it agrees
    // with what the client is telling it about what receptionists are alive
    var l = Sys.ActorSelection("/user/reporter-client-listener");
    var expectedContacts = _remainingServerRoleNames.Select(c => Node(c) / "system" / "receptionist");
    Within(10.Seconds(), () =>
    {
        AwaitAssert(() =>
        {
            var probe = CreateTestProbe();
            l.Tell(ClusterClientSpecConfig.TestClientListener.GetLatestContactPoints.Instance, probe.Ref);
            probe.ExpectMsg<ClusterClientSpecConfig.TestClientListener.LatestContactPoints>()
                .ContactPoints.Should()
                .BeEquivalentTo(expectedContacts);
        });
    });
}, _config.Client);

EnterBarrier("after-4");

Node 1, the client, fails because:

[Node1][FAIL-EXCEPTION] Type: Xunit.Sdk.XunitException
--> [Node1][FAIL-EXCEPTION] Message: Expected collection {akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53082/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53084/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53080/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53083/system/receptionist} to be equivalent to {akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53083/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53080/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53082/system/receptionist}, but it contains too many items.

The receptionist at 53084 was terminated earlier in the spec, and we verified that it was terminated, yet the subscriber actor on the client never received a notification, even though roughly 15-20 seconds elapsed between the node being terminated and this assertion running out of time (I checked this in the logs). That indicates an unsafe or unaccounted-for failed write somewhere inside the ClusterClient, not a problem with the spec.
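
For reference, the reporter-client-listener used above is essentially an actor that subscribes to the ClusterClient's contact-point notifications and replies with the latest set it has seen. Below is a minimal sketch of that pattern, assuming the SubscribeContactPoints / ContactPoints / ContactPointAdded / ContactPointRemoved messages from Akka.Cluster.Tools.Client (property names assumed by analogy with ClusterClients.ClusterClientsList used later in this spec); GetLatestContactPoints and LatestContactPoints are hypothetical stand-ins for the spec's TestClientListener messages.

using System.Collections.Immutable;
using Akka.Actor;
using Akka.Cluster.Tools.Client;

// Hypothetical request/reply types for this illustration only.
public sealed class GetLatestContactPoints
{
    public static readonly GetLatestContactPoints Instance = new GetLatestContactPoints();
    private GetLatestContactPoints() { }
}

public sealed class LatestContactPoints
{
    public LatestContactPoints(IImmutableSet<ActorPath> contactPoints) => ContactPoints = contactPoints;
    public IImmutableSet<ActorPath> ContactPoints { get; }
}

// Sketch of a client-side listener: subscribe to the ClusterClient's contact-point
// notifications and answer queries with the latest known set. Message and property names
// (SubscribeContactPoints, ContactPoints.ContactPointsList, ContactPointAdded.ContactPoint,
// ContactPointRemoved.ContactPoint) are assumed, not verified against the library.
public class ContactPointListener : ReceiveActor
{
    private IImmutableSet<ActorPath> _contactPoints = ImmutableHashSet<ActorPath>.Empty;

    public ContactPointListener(IActorRef clusterClient)
    {
        // Ask the ClusterClient to push contact-point notifications to this actor.
        clusterClient.Tell(SubscribeContactPoints.Instance);

        Receive<ContactPoints>(cp => _contactPoints = cp.ContactPointsList);
        Receive<ContactPointAdded>(cp => _contactPoints = _contactPoints.Add(cp.ContactPoint));
        Receive<ContactPointRemoved>(cp => _contactPoints = _contactPoints.Remove(cp.ContactPoint));
        Receive<GetLatestContactPoints>(_ => Sender.Tell(new LatestContactPoints(_contactPoints)));
    }
}

If the ClusterClient never delivers a ContactPointRemoved (or a refreshed ContactPoints set) to a subscriber like this, the listener keeps reporting the dead receptionist indefinitely, which is exactly what failure set 1 shows.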

Failure set 2 is from a different test run, and it shows the same problem in reverse: the receptionist's subscribers aren't notified about a client becoming unreachable, even after a long period of time has elapsed.

The exception occurs on just one of the receptionist nodes, although the others fail immediately afterwards since the barrier never gets passed.

RunOn(() =>
{
    // Only run this test on a node that knows about our client. It could be that no node knows
    // but there isn't a means of expressing that at least one of the nodes needs to pass the test.
    var r = ClusterClientReceptionist.Get(Sys).Underlying;
    r.Tell(GetClusterClients.Instance);
    var cps = ExpectMsg<ClusterClients>();
    if (cps.ClusterClientsList.Any(c => c.Path.Name.Equals("client")))
    {
        Log.Info("Testing that the receptionist has just one client");
        var l = Sys.ActorOf(
            Props.Create(() => new ClusterClientSpecConfig.TestReceptionistListener(r)),
            "reporter-receptionist-listener");

        var c = Sys
            .ActorSelection(Node(_config.Client) / "user" / "client")
            .ResolveOne(Dilated(2.Seconds())).Result;

        var expectedClients = ImmutableHashSet.Create(c);
        Within(10.Seconds(), () =>
        {
            AwaitAssert(() =>
            {
                var probe = CreateTestProbe();
                l.Tell(ClusterClientSpecConfig.TestReceptionistListener.GetLatestClusterClients.Instance, probe.Ref);
                probe.ExpectMsg<ClusterClientSpecConfig.TestReceptionistListener.LatestClusterClients>()
                    .ClusterClients.Should()
                    .BeEquivalentTo(expectedClients);
            });
        });
    }
}, _config.First, _config.Second, _config.Third);

Similar issue as before:

[Node4][FAIL-EXCEPTION] Type: Xunit.Sdk.XunitException
--> [Node4][FAIL-EXCEPTION] Message: Expected collection {[akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53454/user/client#986553129], [akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53454/user/client1#1210653891]} to be equivalent to {[akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53454/user/client#986553129]}, but it contains too many items.

client1 was terminated two test methods earlier, during ClusterClient_must_communicate_to_any_node_in_cluster, and according to the logs roughly 15 seconds elapsed between the Context.Stop call on client1 and this assertion failing. This again indicates that the state of at least one receptionist isn't being updated consistently.
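
For context, the receptionist-side listener is the mirror image of the client-side sketch above: subscribe to the receptionist's client notifications and answer queries with the latest set. A minimal sketch follows, again with assumed message and property names (SubscribeClusterClients, ClusterClientUp.ClusterClient, ClusterClientUnreachable.ClusterClient); only ClusterClients.ClusterClientsList appears in the spec code above, and GetLatestClusterClients / LatestClusterClients are hypothetical stand-ins for the spec's TestReceptionistListener messages.

using System.Collections.Immutable;
using Akka.Actor;
using Akka.Cluster.Tools.Client;

// Hypothetical request/reply types for this illustration only.
public sealed class GetLatestClusterClients
{
    public static readonly GetLatestClusterClients Instance = new GetLatestClusterClients();
    private GetLatestClusterClients() { }
}

public sealed class LatestClusterClients
{
    public LatestClusterClients(IImmutableSet<IActorRef> clusterClients) => ClusterClients = clusterClients;
    public IImmutableSet<IActorRef> ClusterClients { get; }
}

// Sketch of a receptionist-side listener; SubscribeClusterClients, ClusterClientUp and
// ClusterClientUnreachable are assumed to follow the same pattern as GetClusterClients /
// ClusterClients used earlier in this spec.
public class ReceptionistListener : ReceiveActor
{
    private IImmutableSet<IActorRef> _clusterClients = ImmutableHashSet<IActorRef>.Empty;

    public ReceptionistListener(IActorRef receptionist)
    {
        // `receptionist` is the underlying receptionist actor,
        // i.e. ClusterClientReceptionist.Get(Sys).Underlying.
        receptionist.Tell(SubscribeClusterClients.Instance);

        Receive<ClusterClients>(cc => _clusterClients = cc.ClusterClientsList);
        Receive<ClusterClientUp>(cc => _clusterClients = _clusterClients.Add(cc.ClusterClient));
        Receive<ClusterClientUnreachable>(cc => _clusterClients = _clusterClients.Remove(cc.ClusterClient));
        Receive<GetLatestClusterClients>(_ => Sender.Tell(new LatestClusterClients(_clusterClients)));
    }
}

If the receptionist never publishes an "unreachable" notification for the stopped client1, a listener like this keeps reporting two clients, which matches the Node4 assertion failure above.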

Any ideas what the common element is between these two bugs? cc @alexvaluyskiy @Horusiath

@Horusiath (Contributor) commented Jan 4, 2017

After investigating the issue with Sean, I think a possible reason is that the ClusterReceptionist doesn't get notified with ClusterEvent.MemberRemoved. The most probable reason behind that is that unreachable nodes don't get downed correctly.

From what you've shown, it looks like the actual list of contacts has one more entry than the expected one (the node listening on port 53084). However, from the logs (search for the phrase Leader is auto-downing unreachable node) it looks like that node is never downed for some reason.
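
To illustrate the dependency described here, below is a minimal sketch of an actor that only prunes state on cluster domain events, using the public Cluster.Subscribe API. The MemberWatcher actor is hypothetical and stands in for the event flow only; it is not the actual ClusterReceptionist implementation. The point is that if the leader never downs the unreachable node, MemberRemoved is never published and the stale entry is never removed.

using Akka.Actor;
using Akka.Cluster;

// Hypothetical illustration of the event flow described above, not the actual
// ClusterReceptionist implementation: an actor that only prunes state when the cluster
// publishes MemberRemoved. If an unreachable node is never downed, MemberRemoved never
// arrives and the stale entry is kept forever.
public class MemberWatcher : ReceiveActor
{
    private readonly Cluster _cluster;

    public MemberWatcher()
    {
        _cluster = Cluster.Get(Context.System);

        Receive<ClusterEvent.UnreachableMember>(m =>
            Context.System.Log.Warning("Member {0} is unreachable but has not been downed yet", m.Member.Address));
        Receive<ClusterEvent.MemberRemoved>(m =>
            Context.System.Log.Info("Member {0} removed; pruning it from local state", m.Member.Address));
    }

    protected override void PreStart()
    {
        // Receive cluster domain events as individual messages.
        _cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.UnreachableMember), typeof(ClusterEvent.MemberRemoved));
    }

    protected override void PostStop()
    {
        _cluster.Unsubscribe(Self);
    }
}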

@SeanFarrow (Contributor)

Do you have any idea why this node isn't auto-downed? Is there something we can do with the unreachable event?

@Horusiath (Contributor) commented Jan 4, 2017 via email

@SeanFarrow (Contributor) commented Jan 4, 2017 via email

@alexvaluyskiy alexvaluyskiy modified the milestones: 1.2.0, 1.3.0 Apr 6, 2017
@alexvaluyskiy alexvaluyskiy removed this from the 1.3.0 milestone Jul 18, 2017
Aaronontheweb added a commit that referenced this issue Jul 26, 2019
Fixed broken `IComparer` for ClusterClient hash ring and ported over other handoff fixes.

Close #2535
Close #2312
Close #3840

* implemented akka/akka#24167

* implemented akka/akka#22992