Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Akka.Remote: don't log aborted connection as disassociation error #4101

Merged
merged 2 commits into from
Dec 20, 2019

Conversation

Aaronontheweb
Copy link
Member

@Aaronontheweb Aaronontheweb commented Dec 16, 2019

In some instances with Akka.Remote, even during a normal graceful shutdown where a node terminates its ActorSystem, it's still possible to get errors such as

[ERROR][12.16.2019 13:50:26][Thread 0021][Akka.Remote.Transport.DotNetty.TcpServerHandler] Error caught channel [[::ffff:127.0.0.1]:56060->[::ffff:127.0.0.1]:51761](Id=73c17098)
Cause: System.Net.Sockets.SocketException (0x80004005): An established connection was aborted by the software in your host machine
   at DotNetty.Transport.Channels.Sockets.TcpSocketChannel.DoReadBytes(IByteBuffer byteBuf)
   at DotNetty.Transport.Channels.Sockets.AbstractSocketByteChannel.SocketByteChannelUnsafe.FinishRead(SocketChannelAsyncOperation operation)

I spent several hours looking into this today, trying to understand whether this was a bug with DotNetty, Akka.Remote, or some of our networking settings. The conclusion I came to is that ultimately, an aborted connection is something that is possible during the following scenario, as best described by the Oracle documentation for TCP sockets in Java:

Another common scenario, which may result in an unintended SocketException, is the following: Say A has sent data to B, but B closes the socket without reading all the data. In this situation, B’s TCP stack knows that some data is effectively lost and it will forcibly terminate with RST rather than use the orderly FIN procedure. A will get a SocketException if it then tries to send or receive data from the socket.

In the case of DotNetty, it follows the best practices for graceful socket closure and calls Socket.Shutdown(SocketShutdown.Both) prior to disposing the socket:

https://github.com/Azure/DotNetty/blob/47f5ec7303037f1360615d182939e04d8619a2b3/src/DotNetty.Transport/Channels/Sockets/TcpSocketChannel.cs#L169-L186

So we are following the best practices when we terminate an ActorSystem and shut down our Akka.Remote connections - we wait for all of the currently open connections to terminate, which invokes this code:

public override async Task<bool> Shutdown()
{
try
{
var tasks = new List<Task>();
foreach (var channel in ConnectionGroup)
{
tasks.Add(channel.CloseAsync());
}
var all = Task.WhenAll(tasks);
await all.ConfigureAwait(false);
var server = ServerChannel?.CloseAsync() ?? TaskEx.Completed;
await server.ConfigureAwait(false);
return all.IsCompleted && server.IsCompleted;
}
finally
{
// free all of the connection objects we were holding onto
ConnectionGroup.Clear();
#pragma warning disable 4014 // shutting down the worker groups can take up to 10 seconds each. Let that happen asnychronously.
_clientEventLoopGroup.ShutdownGracefullyAsync();
_serverEventLoopGroup.ShutdownGracefullyAsync();
#pragma warning restore 4014
}
}

However, it is still possible for a node ("node A") that is terminating to receive a message (from "node B") on its socket after the Socket.Shutdown and Socket.Close / Socket.Dispose calls have been made. This will result in a "connection aborted exception" thrown on node B, because node A tells node B that "I can't process this message because I've already shutdown" and then sends a TCP signal indicating an abortive shutdown.

So we can only receive this type of error message when a socket is in the process of shutting down anyway - in the event of a true network outage, such as a failing piece of network hardware, the exception thrown by the socket will be a SocketError.TimedOut or SocketError.ConnectionRefused in the event of trying to connect to an unreachable address.

Therefore, we should handle SocketError.ConnectionAborted as though it were part of the shutdown process and just log it without barfing up a ton of disassocation error messages - because that's what it really means.

@Aaronontheweb
Copy link
Member Author

I'm going to leave this issue open for discussion if anyone feels strongly that we should or should not do this.

@Aaronontheweb Aaronontheweb added this to the 1.4.0 milestone Dec 20, 2019
@Aaronontheweb Aaronontheweb merged commit 4eb5a5d into akkadotnet:dev Dec 20, 2019
@Aaronontheweb Aaronontheweb deleted the dontfail-connection-abort branch December 20, 2019 02:17
Aaronontheweb added a commit to Aaronontheweb/akka.net that referenced this pull request Dec 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant