Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestKit and Akka.Remote.TestKit: diagnostic improvements and code modernization #7321

Merged
merged 11 commits into from
Aug 19, 2024

Conversation

Aaronontheweb
Copy link
Member

@Aaronontheweb Aaronontheweb commented Aug 16, 2024

Changes

Based on some of the racy spec output we're seeing, it looks to me like there's some cases where the BarrierCoordinator might be creating race conditions where tests can inadvertently have multiple nodes crossing different barriers at the same time despite being programmed correctly. Investigating.

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

@Aaronontheweb Aaronontheweb added this to the 1.5.28 milestone Aug 16, 2024
@Aaronontheweb Aaronontheweb changed the title Akka.Remote.TestKit: resolving issues with BarrierCoordinator [WIP] TestKit and Akka.Remote.TestKit: diagnostic improvements and code modernization Aug 16, 2024
@Aaronontheweb Aaronontheweb added the akka-testkit Akka.NET Testkit issues label Aug 16, 2024
@Aaronontheweb Aaronontheweb marked this pull request as ready for review August 16, 2024 20:30
Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't brave enough to address nullability enable inside the MNTR, but I thought this was an acceptable set of changes and improvements to help debug some of these tests.

@@ -37,7 +37,7 @@
<FsCheckVersion>2.16.6</FsCheckVersion>
<HoconVersion>2.0.3</HoconVersion>
<ConfigurationManagerVersion>6.0.1</ConfigurationManagerVersion>
<MultiNodeAdapterVersion>1.5.19</MultiNodeAdapterVersion>
<MultiNodeAdapterVersion>1.5.25</MultiNodeAdapterVersion>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrade to latest MNTR version

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to migrate as many tests as I could to use async Task and the async TestKit methods, since those tend to run faster and perform less thread-blocking. The rest of the changes in this file are just compilation fixes for the changes I made to the EnterBarrier / FailBarrier messages - which I'll describe in more detail on the files where those types are defined.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No substantive changes here other than leveraging the additional diagnostic data added to EnterBarrier and FailBarrier - most of this code has been touched since 2014. I just modernized it a bit to clear up some compiler warnings surrounding the use of .GetHashCode() and mutable values.

}

public sealed class Data
{
public Data(IEnumerable<Controller.NodeInfo> clients, string barrier, IEnumerable<IActorRef> arrived, Deadline deadline) :
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused CTOR - removed it.

public WrongBarrierException(string barrier, IActorRef client, Data barrierData)
: base($"[{client}] tried to enter '{barrier}' while we were waiting for '{barrierData.Barrier}'")
public WrongBarrierException(string barrier, IActorRef client, RoleName roleName, Data barrierData)
: base($"[{client}] [{roleName}] tried to enter '{barrier}' while we were waiting for '{barrierData.Barrier}'")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expanded the description here of WHICH ROLE tried to enter the wrong barrier - this should make it much easier to debug tests in the future that have issues with nodes crossing barriers at inappropriate times.

{
Name = failBarrier.Name,
Op = Proto.Msg.EnterBarrier.Types.BarrierOp.Fail,
RoleName = failBarrier.Role.Name
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added RoleName encoding to FailBarrier

@@ -549,7 +549,7 @@ public bool IsNode(params RoleName[] nodes)
/// </summary>
public void EnterBarrier(params string[] name)
{
TestConductor.Enter(RemainingOr(TestConductor.Settings.BarrierTimeout), name.ToImmutableList());
TestConductor.Enter(RemainingOr(TestConductor.Settings.BarrierTimeout), Myself, name.ToImmutableList());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever we call EnterBarrier during testing now, we always pass in Myself - which is set to the value of our role during the testing system.

}
}

/// <summary>
/// Enter the named barriers, one after the other, in the order given. Will
/// throw an exception in case of timeouts or other errors.
/// </summary>
public void Enter(string name)
public void Enter(RoleName roleName, string name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to add RoleName support to the Enter methods for transmitting barrier information to the BarrierCoordinator

var stopped = Now + t;
if (stopped >= stop)
{
Sys.Log.Warning("AwaitAssert failed, timeout [{0}] is over after [{1}] attempts and [{2}] elapsed time", max, attempts, stopped - start);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diagnostic improvements to AwaitAssert - I wanted to understand how many times an assertion actually ran before it failed, so I capture that data and the total elapsed time the test used and pipe it into a Warning log right before we throw the assertion exception.

@@ -31,6 +31,7 @@ message EnterBarrier {
string name = 1;
BarrierOp op = 2;
int64 timeout = 3;
string roleName = 4;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added roleName to the wire format of the MNTR test conductor.

@Aaronontheweb
Copy link
Member Author

Even though I made some changes to type signatures, they're all internal, and the wire format changes are irrelevant because they only occur during testing (so everyone gets synced all at once.)

Copy link
Contributor

@Arkatufus Arkatufus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Aaronontheweb Aaronontheweb merged commit 355439e into akkadotnet:dev Aug 19, 2024
7 of 12 checks passed
@Aaronontheweb Aaronontheweb deleted the mntr-barrier-cleanup branch August 19, 2024 23:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
akka-testkit Akka.NET Testkit issues multi node spec
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants