ClusterClient Receptionist & Client failure detector race conditions #2312
Just for clarification: in the first scenario it's https://github.com/akkadotnet/akka.net/blob/dev/src/contrib/cluster/Akka.Cluster.Tools.Tests.MultiNode/ClusterClient/ClusterClientSpec.cs#L520 that fails on Node 1, and in the second scenario it's https://github.com/akkadotnet/akka.net/blob/dev/src/contrib/cluster/Akka.Cluster.Tools.Tests.MultiNode/ClusterClient/ClusterClientSpec.cs#L448 that fails on Node 4.
After investigating the issue with Sean, I think a possible reason is that the ClusterReceptionist doesn't get notified about the downed node. From what you've shown, the actual list of contacts has one more entry than the expected one (a node listening on port 53084). However, from the logs (search for the phrase "Leader is auto-downing unreachable node") it looks like that node is never downed for some reason.
Do you have any idea why this node isn't auto-downed? Is there something we can do with the unreachable event?
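For reference on "something we can do with the unreachable event": a minimal sketch (mine, not from this codebase) of an actor subscribing to the cluster's `UnreachableMember` and `MemberRemoved` events via the standard Akka.NET `Cluster.Subscribe` API. The `DownWatcher` name is hypothetical:

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Event;

// Hypothetical watcher actor, named here for illustration only.
public class DownWatcher : ReceiveActor
{
    private readonly Cluster _cluster = Cluster.Get(Context.System);
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public DownWatcher()
    {
        Receive<ClusterEvent.UnreachableMember>(msg =>
        {
            // The failure detector has flagged the node; with auto-down
            // enabled the leader should eventually issue a Down for it.
            _log.Warning("Unreachable: {0}", msg.Member.Address);
            // As a last resort one could down it manually instead of
            // waiting for auto-down: _cluster.Down(msg.Member.Address);
        });
        Receive<ClusterEvent.MemberRemoved>(msg =>
        {
            // Only at this point is the node really gone from membership.
            _log.Info("Removed: {0}", msg.Member.Address);
        });
    }

    protected override void PreStart()
    {
        _cluster.Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.UnreachableMember), typeof(ClusterEvent.MemberRemoved));
    }

    protected override void PostStop()
    {
        _cluster.Unsubscribe(Self);
    }
}
```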
This is strange: although the leader logs that it is auto-downing a node on a different port (53080), ultimately the correct node (53084) gets downed. Maybe it's a bug in the logging itself. This is from the logs:

```
[WARNING][9/20/2016 12:59:02 AM][Thread 0022][[akka://MultiNodeClusterSpec/system/cluster/core/daemon#1317598395]] Cluster Node [akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53080] - Marking node(s) as UNREACHABLE [Member(address = akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53084, status = Up, role=[], upNumber=1)]. Node roles []
[INFO][9/20/2016 12:59:02 AM][Thread 0022][Cluster] Cluster Node [akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53080] - Leader is auto-downing unreachable node [akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53080]
[DEBUG][9/20/2016 12:59:02 AM][Thread 0022][[akka://MultiNodeClusterSpec/system/cluster/core/daemon#1317598395]] [Initialized] Received Akka.Cluster.ClusterUserAction+Down
[INFO][9/20/2016 12:59:02 AM][Thread 0022][[akka://MultiNodeClusterSpec/system/cluster/core/daemon#1317598395]] Marking unreachable node [akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53084] as [Down]
```
OK, can you add that to the issue? Also, if this is the case, what is causing the test to fail?
Fixed broken `IComparer` for ClusterClient hash ring and ported over other handoff fixes. Close #2535 Close #2312 Close #3840
* implemented akka/akka#24167
* implemented akka/akka#22992
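As an aside for readers landing here from the commit above: a hedged, self-contained illustration of the class of bug a broken ring `IComparer` causes (this is not the actual Akka.NET comparer). A comparer backing a sorted hash ring must impose a total, consistent order; if it reports two distinct nodes as equal, sorted collections silently drop one of them:

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Immutable;

// Illustrative buggy comparer: ignores the port component entirely.
public sealed class HostOnlyComparer : IComparer<(string Host, int Port)>
{
    public int Compare((string Host, int Port) x, (string Host, int Port) y) =>
        // BUG: two distinct nodes on the same host compare as "equal".
        string.Compare(x.Host, y.Host, StringComparison.Ordinal);
}

public static class RingDemo
{
    public static void Main()
    {
        var ring = ImmutableSortedSet.Create<(string, int)>(new HostOnlyComparer())
            .Add(("localhost", 53080))
            .Add(("localhost", 53084)); // treated as a duplicate and dropped

        Console.WriteLine(ring.Count); // prints 1: one receptionist vanished
    }
}
```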
I've been able to verify the existence of a bug with two different sets of logs from the `Akka.Cluster.Tools.MultiNode.ClusterClientSpec`, which I've attached as zip files here:
Akka.Cluster.Tools.Tests.MultiNode.Client.ClusterClientMultiNode - failure set 1.zip
Akka.Cluster.Tools.Tests.MultiNode.Client.ClusterClientMultiNode - failure set 2.zip
There's a possibility that this could be a bug in the spec itself, but I'm skeptical of that given the lengthy periods of time this error spans and the fact that the error occurs in both directions.
Failure set 1 shows that the ClusterClient's subscribers are not notified in time that a receptionist has gone down.
On Node 1, the client fails because:

```
[Node1][FAIL-EXCEPTION] Type: Xunit.Sdk.XunitException
--> [Node1][FAIL-EXCEPTION] Message: Expected collection {akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53082/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53084/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53080/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53083/system/receptionist} to be equivalent to {akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53083/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53080/system/receptionist, akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53082/system/receptionist}, but it contains too many items.
```
The receptionist at 53084 was terminated earlier in the spec, and we verified that it was terminated, yet the subscriber actor on the client never received a notification, even though (per the logs) roughly 15-20 seconds elapsed between the node being terminated and this assertion running out of time. That indicates an unsafe or unaccounted-for failed write somewhere inside the `ClusterClient`, not a problem with the spec. The client-side subscription mechanism involved is sketched below.
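A hedged sketch of that client-side subscription mechanism, assuming the `SubscribeContactPoints` / `ContactPoints` / `ContactPointAdded` / `ContactPointRemoved` messages from `Akka.Cluster.Tools.Client`; the `ContactPointTracker` actor and its `HashSet` bookkeeping are illustrative, not the spec's code:

```csharp
using System.Collections.Generic;
using Akka.Actor;
using Akka.Cluster.Tools.Client;

// Hypothetical tracker actor: mirrors what the spec's test probe observes.
public class ContactPointTracker : ReceiveActor
{
    private readonly HashSet<ActorPath> _contacts = new HashSet<ActorPath>();

    public ContactPointTracker(IActorRef clusterClient)
    {
        // Ask the ClusterClient to stream contact-point changes to us.
        clusterClient.Tell(SubscribeContactPoints.Instance);

        Receive<ContactPoints>(cp => _contacts.UnionWith(cp.ContactPointsList));
        Receive<ContactPointAdded>(added => _contacts.Add(added.ContactPoint));
        // Failure set 1 suggests this removal event never reaches subscribers
        // for the receptionist at 53084, leaving a stale entry in the set.
        Receive<ContactPointRemoved>(removed => _contacts.Remove(removed.ContactPoint));
    }
}
```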
Failure set 2 is from a different test run, and it shows the same problem in reverse: the receptionist's subscribers aren't notified about a client becoming unreachable, even after a long period of time has elapsed. On Node 4 a similar assertion fails:

```
[Node4][FAIL-EXCEPTION] Type: Xunit.Sdk.XunitException
--> [Node4][FAIL-EXCEPTION] Message: Expected collection {[akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53454/user/client#986553129], [akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53454/user/client1#1210653891]} to be equivalent to {[akka.trttl.gremlin.tcp://MultiNodeClusterSpec@localhost:53454/user/client#986553129]}, but it contains too many items.
```
`client1` was terminated two test methods earlier, during `ClusterClient_must_communicate_to_any_node_in_cluster`, and according to the logs roughly 15 seconds elapsed between the `Context.Stop` call on `client1` and this assertion failing. This would again indicate that the state of at least one receptionist isn't being updated consistently; a sketch of the receptionist-side subscription these assertions depend on follows below. Any ideas what the common element between these two bugs is? cc @alexvaluyskiy @Horusiath
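For the receptionist side, the mirror-image sketch, assuming the `SubscribeClusterClients` / `ClusterClients` / `ClusterClientUp` / `ClusterClientUnreachable` messages (the receptionist ref can be obtained via `ClusterClientReceptionist.Get(system).Underlying`); the `ClusterClientTracker` actor is illustrative:

```csharp
using System.Collections.Generic;
using Akka.Actor;
using Akka.Cluster.Tools.Client;

// Hypothetical tracker actor for the receptionist side.
public class ClusterClientTracker : ReceiveActor
{
    private readonly HashSet<IActorRef> _clients = new HashSet<IActorRef>();

    public ClusterClientTracker(IActorRef receptionist)
    {
        // Ask the receptionist to stream cluster-client changes to us.
        receptionist.Tell(SubscribeClusterClients.Instance);

        Receive<ClusterClients>(cc => _clients.UnionWith(cc.ClusterClientsList));
        Receive<ClusterClientUp>(up => _clients.Add(up.ClusterClient));
        // Failure set 2 suggests this event can lag far behind the client's
        // termination, so client1 lingers in the receptionist's view.
        Receive<ClusterClientUnreachable>(gone => _clients.Remove(gone.ClusterClient));
    }
}
```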