
Pool does not detect killed connections (from MariaDB) #340

Closed
rspreen opened this issue Sep 14, 2018 · 3 comments · Fixed by #660

rspreen commented Sep 14, 2018

Problem: The aiomysql.Pool cannot detect certain situations when MariaDB breaks connections from the server side, and the pool will permanently continue to hand out (via acquire()) these broken connections, even though each use of that connection will cause PyMySQL exceptions (either OperationalError or InternalError).

This situation occurs whenever MariaDB (version 10.2.7 and presumably others) is stopped or restarted and is holding current connections to an active aiomysql Connection Pool. (We see this both on MacOS and Centos VMs.)

Severity: This situation is toxic to any application, and is not simple to code around at the application level.

Underlying Reason: In some situations when MariaDB kills its connections (specifically, when it shuts down), it sends a MySQL error packet across the connection before it closes the connection. As a result, in the aiomysql client, the pooled connection's StreamReader (the "_reader" field, which is an asyncio.StreamReader) first gets a "feed_data()" call with 34 bytes that make up a proper MySQL error packet, and those bytes are appended to its "_buffer" field; then, immediately after, the reader gets a "feed_eof()" call, which sets its "_eof" field to true (i.e., indicating that its stream has been disconnected).

Note that, at this point, the reader's "_eof" field is true, and its "_buffer" field is not empty.

When an application then calls pool.acquire(), internally this calls pool._fill_free_pool(), which tests each connection in the pool, discarding it if "connection._reader.at_eof()" is true.

But here's the problem: "connection._reader.at_eof()" does not return True! Thus, the pool keeps this connection, thinking it to be valid, and continues to hand it out for use.

The problem is that asyncio.StreamReader.at_eof() is implemented as "self._eof and not self._buffer". There appears to be no way for a user of a StreamReader to tell that an EOF has been detected until all the data has been read from its buffer - but no one ever reads data from an unused, pooled connection sitting in the "_free" list, and why would you?
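The behavior is easy to reproduce without MariaDB at all; a minimal sketch using only asyncio, feeding a StreamReader the same bytes the transport would (the packet contents are the ones quoted in the final note below):

```python
import asyncio

async def main():
    # Simulate what the transport does when MariaDB shuts down: feed the
    # error packet into the pooled connection's StreamReader, then feed EOF.
    reader = asyncio.StreamReader()
    reader.feed_data(b'\x1e\x00\x00\x00\xff\x87\x07#70100Connection was killed')
    reader.feed_eof()
    # at_eof() is "self._eof and not self._buffer": the unread error packet
    # keeps it False, so the pool's liveness check thinks the socket is fine.
    return reader.at_eof(), reader._eof

at_eof, eof_flag = asyncio.run(main())
print(at_eof, eof_flag)  # False True
```

The connection is dead (the private "_eof" flag knows it), yet the only public accessor, at_eof(), reports otherwise as long as the error packet sits unread in the buffer.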

Solution Discussion

This would seem to be easy to fix, but it's not.

The simple but inappropriate hack solution is to change pool._fill_free_pool() to directly test the "conn._reader._eof" flag instead of calling the "conn._reader.at_eof()" method. Obviously it seems wrong to read an internal/private field in another library's class, but I would claim that this is a problem with the asyncio.StreamReader API, since there is no way to detect this situation independently of the buffer not being empty.

And making it even harder to work around properly, the StreamReader API is also problematic in that it does not provide a non-blocking way to read or empty that buffer, and you cannot do a blocking read in "_fill_free_pool()" which is simply trying to weed out broken connections from the good ones. Ultimately, this is a deficiency in asyncio.StreamReader that is causing aiomysql to be impossible to use robustly, but I don't know how to get StreamReader fixed.

[Additional discovery by original author: more on the bad connection test]

Since my original posting, I have hit upon another situation where the pool fails to detect that it is using a connection it shouldn't: a closed connection sitting in the pool. This can happen if the server releases the connection while it is in use - for example, if the MySQL server's "net_read_timeout" expires because not all the data from the client application arrives in a timely fashion (which a flaky or faulty network connection can cause). This connection drop occurs, by definition, while the connection is "in use", not sitting "free" in the pool. When it happens, the connection gets closed, and the pool does not detect it. This makes my suggested fix above (directly checking "conn._reader._eof") incorrect, as "_reader" can be None. I believe the better test is:

        if conn.closed or conn._reader._eof:
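As a standalone predicate, that combined check might look like the sketch below. The attribute names ("closed", "_reader", "_eof") are the aiomysql/asyncio internals described above; "connection_looks_dead" itself is a hypothetical helper, not part of aiomysql:

```python
def connection_looks_dead(conn):
    """Return True if a pooled connection should be discarded.

    `conn.closed` catches connections dropped while in use (where
    `_reader` may already be None); `conn._reader._eof` catches the
    error-packet-then-EOF case that `at_eof()` misses.
    """
    return bool(conn.closed or conn._reader._eof)
```

The short-circuit matters: when "closed" is true, "_reader" may be None, so the two tests cannot be swapped.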

Related cases

PyMySQL dealt with this exact MariaDB "error packet" situation in PyMySQL issue #526, so this is not a problem unique to aiomysql among MySQL drivers, even though it manifests itself differently within aiomysql.

I believe this may be the same underlying problem as issue #132, "Pool auto reconnect", which was addressed by adding the "recycle" feature to time out connections from the client side. But that is only a band-aid that sidesteps the actual problem by timing connections out on the client side before MariaDB does it from the server side (when they exceed wait_timeout); it does not help when you restart the database server.

This is also the underlying problem causing issue #339, "Don't reuse connections when StreamReader has an exception". That author's proposed solution of culling a connection that has an exception is a good idea in itself, but it only limits the damage rather than solving the problem - instead of acquiring a toxic connection repeatedly and permanently, you will get it once (per connection). Applications will still suffer unnecessary, un-handleable database errors that would not occur if aiomysql could detect that _reader._eof is true.

Final note:

In case you wonder, the error packet that comes across the wire from MariaDB is bytearray(b'\x1e\x00\x00\x00\xff\x87\x07#70100Connection was killed'). This is a correctly formatted MySQL packet; notice that the sequence number field is zero, which is why one manifestation of this bug is that when you try to use the broken connection, you get a pymysql.err.InternalError, "Packet sequence number wrong - got 0 expected 1". Basically, your query goes nowhere when you send it out the StreamWriter, and then you read this already-buffered error packet out of the StreamReader as your reply.
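For reference, those bytes can be decoded field by field with nothing but the standard library; a sketch following the MySQL error-packet layout (3-byte little-endian payload length, 1-byte sequence number, 0xff error header, 2-byte error code, '#' plus a 5-byte SQLSTATE, then the message):

```python
raw = bytes(b'\x1e\x00\x00\x00\xff\x87\x07#70100Connection was killed')

payload_len = int.from_bytes(raw[0:3], 'little')  # 30-byte payload
sequence = raw[3]                                 # 0 -- hence "got 0 expected 1"
header = raw[4]                                   # 0xff marks an error packet
error_code = int.from_bytes(raw[5:7], 'little')   # 1927, MariaDB's ER_CONNECTION_KILLED
sqlstate = raw[8:13].decode('ascii')              # '70100', after the '#' marker at raw[7]
message = raw[13:].decode('ascii')                # 'Connection was killed'
```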

@huangyuefeng

Concerned about this problem.

@pedrokiefer

I was reading the code from mysql-connector-python, and all the methods that write something to the socket call handle_unread_result. So maybe the pool could test whether the connection has a pending buffer to be read, read it, and check whether it is a "Connection was killed" packet. After that the connection would be closed, and the pool would reconnect / acquire a new available connection.
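A rough sketch of that idea, assuming direct access to the StreamReader's private "_buffer" attribute (not a public API, so this is fragile by design; "pending_error_packet" is a hypothetical helper):

```python
import asyncio

def pending_error_packet(reader):
    # Peek at unread bytes without consuming them.  In a MySQL packet the
    # payload starts at byte 4, and 0xff there marks an error packet.
    buf = bytes(reader._buffer)
    return len(buf) > 4 and buf[4] == 0xFF

async def demo():
    reader = asyncio.StreamReader()
    reader.feed_data(b'\x1e\x00\x00\x00\xff\x87\x07#70100Connection was killed')
    return pending_error_packet(reader)

print(asyncio.run(demo()))  # True
```

Unlike the blocking read() coroutines, peeking at the buffer this way never suspends, so a check like this could run inside _fill_free_pool().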

There are others refactors that we could do to improve aiomysql, such as moving the low level packet handling to a custom transport. Then we would have access to eof_received and could expose it in a better way to Connection.

@TimothyFitz
Contributor

A short-term solution, although it looks like a better fix is on the way, would be to amend my change #339 to also check conn._writer.transport.is_closing(), which will be true immediately after connection_lost is called, sidestepping the limitation of StreamReader.

We've been monkeypatching my fix in production (until 0.0.20 drops), you could easily adapt it to handle your case until the real fix is released: https://gist.github.com/TimothyFitz/bfbc43260c8cc2d3442dbbaac983e8b2
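Consult the gist for the actual patch; as a toy illustration of the check it describes ("should_discard" is a hypothetical predicate with the attribute layout assumed above, while transport.is_closing() is the real asyncio API):

```python
def should_discard(conn):
    # is_closing() flips to True as soon as connection_lost() is called,
    # independent of any unread bytes left in the StreamReader buffer,
    # so it catches the dead connections that at_eof() misses.
    return conn._writer.transport.is_closing() or conn._reader.at_eof()
```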
