
Pool does not detect killed connections (from MariaDB) #340

Closed
rspreen opened this issue Sep 14, 2018 · 3 comments · Fixed by #660

rspreen commented Sep 14, 2018

Problem: The aiomysql.Pool cannot detect certain situations when MariaDB breaks connections from the server side, and the pool will permanently continue to hand out (via acquire()) these broken connections, even though each use of that connection will cause PyMySQL exceptions (either OperationalError or InternalError).

This situation occurs whenever MariaDB (version 10.2.7 and presumably others) is stopped or restarted and is holding current connections to an active aiomysql Connection Pool. (We see this both on MacOS and Centos VMs.)

Severity: This situation is toxic to any application, and is not simple to code around at the application level.

Underlying Reason: In some situations when MariaDB kills its connections (specifically, when it shuts down), it sends a MySQL error packet across the connection before it closes the connection. As a result, in the aiomysql client, the pooled connection's StreamReader (the "_reader" field, which is an asyncio.StreamReader) first gets a "feed_data()" call with 34 bytes that make up a proper MySQL error packet, and those bytes are appended to its "_buffer" field; then, immediately after, the reader gets a "feed_eof()" call, which sets its "_eof" field to true (i.e., indicating that its stream has been disconnected).

Note that, at this point, the reader's "_eof" field is true, and its "_buffer" field is not empty.

When an application then calls pool.acquire(), internally this calls pool._fill_free_pool(), which tests each connection in the pool, discarding it if "connection._reader.at_eof()" is true.

But here's the problem: "connection._reader.at_eof()" does not return True! Thus, the pool keeps this connection, thinking it to be valid, and continues to hand it out for use.

The problem is that asyncio.StreamReader.at_eof() is implemented as "self._eof and not self._buffer". There appears to be no way for a user of a StreamReader to tell that an EOF has been detected until all the data has been read from its buffer - but no one ever reads data from an unused, pooled connection sitting in the "_free" list, and why would you?
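The behavior is easy to reproduce without MariaDB at all; a minimal sketch using only asyncio, feeding a StreamReader the same bytes the transport would (the packet contents are the ones quoted in the final note below):

```python
import asyncio

async def main():
    # Simulate what the transport does when MariaDB shuts down: feed the
    # error packet into the pooled connection's StreamReader, then feed EOF.
    reader = asyncio.StreamReader()
    reader.feed_data(b'\x1e\x00\x00\x00\xff\x87\x07#70100Connection was killed')
    reader.feed_eof()
    # at_eof() is "self._eof and not self._buffer": the unread error packet
    # keeps it False, so the pool's liveness check thinks the socket is fine.
    return reader.at_eof(), reader._eof

at_eof, eof_flag = asyncio.run(main())
print(at_eof, eof_flag)  # False True
```

The connection is dead (the private "_eof" flag knows it), yet the only public accessor, at_eof(), reports otherwise as long as the error packet sits unread in the buffer.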

Solution Discussion

This would seem to be easy to fix, but it's not.

The simple but inappropriate hack solution is to change pool._fill_free_pool() to directly test the "conn._reader._eof" flag instead of calling the "conn._reader.at_eof()" method. Obviously it seems wrong to read an internal/private field in another library's class, but I would claim that this is a problem with the asyncio.StreamReader API, since there is no way to detect this situation independently of the buffer not being empty.

And making it even harder to work around properly, the StreamReader API is also problematic in that it does not provide a non-blocking way to read or empty that buffer, and you cannot do a blocking read in "_fill_free_pool()" which is simply trying to weed out broken connections from the good ones. Ultimately, this is a deficiency in asyncio.StreamReader that is causing aiomysql to be impossible to use robustly, but I don't know how to get StreamReader fixed.

[Additional discovery by original author: more on the bad connection test]

Since my original posting, I have hit upon another situation where the pool fails to detect that it is using a connection it shouldn't: a closed connection sitting in the pool. This can happen if the server releases the connection while it is in use - for example, if the MySQL server's "net_read_timeout" expires because not all the data from the client application arrives in a timely fashion (which a flaky or faulty network connection can cause). This connection drop occurs, by definition, while the connection is "in use", not sitting "free" in the pool. When it happens, the connection gets closed, and the pool does not detect it. This makes my suggested fix above (directly checking "conn._reader._eof") incorrect, as "_reader" can be None. I believe the better test is:

        if conn.closed or conn._reader._eof:
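As a standalone predicate, that combined check might look like the sketch below. The attribute names ("closed", "_reader", "_eof") are the aiomysql/asyncio internals described above; "connection_looks_dead" itself is a hypothetical helper, not part of aiomysql:

```python
def connection_looks_dead(conn):
    """Return True if a pooled connection should be discarded.

    `conn.closed` catches connections dropped while in use (where
    `_reader` may already be None); `conn._reader._eof` catches the
    error-packet-then-EOF case that `at_eof()` misses.
    """
    return bool(conn.closed or conn._reader._eof)
```

The short-circuit matters: when "closed" is true, "_reader" may be None, so the two tests cannot be swapped.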

Related cases

PyMySQL dealt with this exact MariaDB "error packet" situation in PyMySQL issue #526, so this is not a problem unique to aiomysql among MySQL drivers, even though it manifests itself differently within aiomysql.

I believe this may be the same underlying problem as issue #132, "Pool auto reconnect", which was addressed by adding the "recycle" feature to time out connections from the client side. But that is only a band-aid that sidesteps the actual problem by timing connections out on the client side before MariaDB does it from the server side (when they exceed wait_timeout); it does not help when you restart the database server.

This is also the underlying problem causing issue #339, "Don't reuse connections when StreamReader has an exception". That author's proposed solution of culling a connection that has an exception is a good idea in itself, but it only limits the damage rather than solving the problem - instead of acquiring a toxic connection repeatedly and permanently, you will get it once (per connection). Applications will still suffer unnecessary, un-handleable database errors that would not occur if aiomysql could detect that _reader._eof is true.

Final note:

In case you wonder, the error packet that comes across the wire from MariaDB is bytearray(b'\x1e\x00\x00\x00\xff\x87\x07#70100Connection was killed'). This is a correctly formatted MySQL packet; notice that the sequence number field is zero, which is why one manifestation of this bug is that when you try to use the broken connection, you get a pymysql.err.InternalError, "Packet sequence number wrong - got 0 expected 1". Basically, your query goes nowhere when you send it out the StreamWriter, and then you read this already-buffered error packet out of the StreamReader as your reply.
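For reference, those bytes can be decoded field by field with nothing but the standard library; a sketch following the MySQL error-packet layout (3-byte little-endian payload length, 1-byte sequence number, 0xff error header, 2-byte error code, '#' plus a 5-byte SQLSTATE, then the message):

```python
raw = bytes(b'\x1e\x00\x00\x00\xff\x87\x07#70100Connection was killed')

payload_len = int.from_bytes(raw[0:3], 'little')  # 30-byte payload
sequence = raw[3]                                 # 0 -- hence "got 0 expected 1"
header = raw[4]                                   # 0xff marks an error packet
error_code = int.from_bytes(raw[5:7], 'little')   # 1927, MariaDB's ER_CONNECTION_KILLED
sqlstate = raw[8:13].decode('ascii')              # '70100', after the '#' marker at raw[7]
message = raw[13:].decode('ascii')                # 'Connection was killed'
```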

@huangyuefeng

Concerned about this problem.

@pedrokiefer

I was reading the code from mysql-connector-python, and all the methods that write something to the socket call handle_unread_result. So maybe the pool could test whether the connection has a pending buffer to be read, read it, and check whether it is a "Connection was killed" packet. After that the connection would be closed, and the pool would reconnect / acquire a new available connection.
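A rough sketch of that idea, assuming direct access to the StreamReader's private "_buffer" attribute (not a public API, so this is fragile by design; "pending_error_packet" is a hypothetical helper):

```python
import asyncio

def pending_error_packet(reader):
    # Peek at unread bytes without consuming them.  In a MySQL packet the
    # payload starts at byte 4, and 0xff there marks an error packet.
    buf = bytes(reader._buffer)
    return len(buf) > 4 and buf[4] == 0xFF

async def demo():
    reader = asyncio.StreamReader()
    reader.feed_data(b'\x1e\x00\x00\x00\xff\x87\x07#70100Connection was killed')
    return pending_error_packet(reader)

print(asyncio.run(demo()))  # True
```

Unlike the blocking read() coroutines, peeking at the buffer this way never suspends, so a check like this could run inside _fill_free_pool().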

There are others refactors that we could do to improve aiomysql, such as moving the low level packet handling to a custom transport. Then we would have access to eof_received and could expose it in a better way to Connection.

@TimothyFitz
Contributor

A short-term solution, although it looks like a better fix is on the way, would be to amend my change #339 to also check conn._writer.transport.is_closing(), which will be true immediately after connection_lost is called, sidestepping the limitation of StreamReader.

We've been monkeypatching my fix in production (until 0.0.20 drops), you could easily adapt it to handle your case until the real fix is released: https://gist.github.com/TimothyFitz/bfbc43260c8cc2d3442dbbaac983e8b2
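Consult the gist for the actual patch; as a toy illustration of the check it describes ("should_discard" is a hypothetical predicate with the attribute layout assumed above, while transport.is_closing() is the real asyncio API):

```python
def should_discard(conn):
    # is_closing() flips to True as soon as connection_lost() is called,
    # independent of any unread bytes left in the StreamReader buffer,
    # so it catches the dead connections that at_eof() misses.
    return conn._writer.transport.is_closing() or conn._reader.at_eof()
```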
