November Supplement

At 7:16 AM Central on Saturday, November 16th, Tabroom’s database server had this to say:

2024-11-16 13:16:00 0 [ERROR] InnoDB: tried to purge non-delete-marked record in index uk_ballots of table tabroom.ballot: tuple: TUPLE (info_bits=0, 4 fields): {NULL,[4] \ (0x805CEFD5),[4] u (0x807514D1),[4] ZO(0x82B95A4F)}, record: COMPACT RECORD(info_bits=0, 4 fields): {NULL,[4] \ (0x805CEFD5),[4] u (0x807514D1),[4] ZO(0x82B95A4F)}

Poof goes the ballots table.

A ballot in this context is a data record. One is created for every judge in a section, for every entry in that section. So someone judging a single flight of debate would have two ‘ballots’; a three-judge panel in a room with six speech contestants would have 18 of them. Each one tracks what side the entry is on, what order they speak in, when that judge hit start, and whether the round is finished. All the points, wins, and losses the judges hand out are stored below it; all the information about room assignments, scheduled times, the flip, and event type sits above it. So it’s a rather critical table. And it’s large: there are 16.1 million such records in Tabroom, making it the second-largest table in the database.
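
If it helps to picture it, here is a heavily stripped-down sketch of such a table, using the singular name from the error message above and columns simplified for illustration rather than Tabroom’s real schema:

-- Simplified, illustrative sketch; the real table has many more columns.
CREATE TABLE ballot (
    id            INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    judge         INT UNSIGNED NOT NULL,      -- the judge this ballot belongs to
    section       INT UNSIGNED NOT NULL,      -- the section (room/pairing) it sits in
    entry         INT UNSIGNED NOT NULL,      -- the competitor being judged
    side          TINYINT,                    -- which side the entry is on
    speakerorder  TINYINT,                    -- what order they speak in
    started_at    DATETIME,                   -- when the judge hit start
    complete      BOOLEAN NOT NULL DEFAULT 0  -- whether the round is finished
);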

At 7:16 CST this morning, Tabroom had to delete just one of those records. Maybe a judge needed to be replaced. Maybe a round was being re-paired and all the entries were dumped. Whatever the reason, a ballot was queued for deletion. That happens thousands of times on a Saturday. But in deleting that particular ballot, the database server software wobbled just a little bit. Perhaps it hit a very obscure bug. Perhaps it wrote the information to a part of the disk with a small chemical flaw buried in its atoms, and so the write failed. Or perhaps a cosmic ray hit the memory and flipped a zero to a one, and changed our world. However it happened, part of the table’s index was transformed into nonsense.

An index is a data structure used to speed up reading the database. If you ask for all the ballots in Section 2424101, the database server would otherwise have to scan all 16.1 million ballots in Tabroom to deliver the 12 you are looking for. That’s very slow on a large table. So for commonly accessed data, you create an index: in this case, an ordered list of all the Section IDs in the ballots table. The database finds the range you’re looking for quickly, and all 12 ballot IDs are listed there together.

But indexes aren’t free; you can’t just create them for every data element. Each one takes up space, increasing the disk size of the database. They also slow down writes: you have to update the index every time you create new data. So you only create them for data elements that you search by: the Section ID of a ballot, yes; the time of your speech, no.
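
In SQL terms, that decision looks roughly like this, continuing the simplified sketch from above:

-- Worth it: ballots are constantly looked up by section.
CREATE INDEX index_ballot_section ON ballot (section);

-- With the index, this query jumps straight to the 12 matching rows
-- instead of scanning all 16.1 million:
SELECT * FROM ballot WHERE section = 2424101;

-- Not worth it: nobody searches ballots by speech time, so an index
-- there would only cost disk space and slow down every write.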

That little glitch at 7:16 AM deleted the index records for that one doomed ballot, but not the ballot itself. Suddenly the number of rows in the index did not match the table. Therefore, the database stopped using it; it knew the index was no longer reliable. The slowdowns, lockups, and downtime on Saturday morning are what it feels like to use the ballots table without its indexes: it starts out slow, and goes downhill from there.

First, I tried the gentle fix: a utility that tries to verify the data and rebuild just the indexes, without any invasive changes to the data itself. If it succeeds, the database just starts working again afterwards. It takes about 12 minutes to run on that large ballots table, a fact I learned this morning. And then it failed. It can fail for a lot of reasons, but mostly it has a very hard time verifying data that is changing as it operates, which is what a live database must do.
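
The exact utility doesn’t matter for this story; the same idea, written out as plain MariaDB statements, is a consistency check followed by rebuilding the damaged index in place (the column list below is a stand-in, not the real definition of uk_ballots):

-- Verify that the row data and the indexes still agree with each other.
CHECK TABLE ballot EXTENDED;

-- Then drop and re-create the broken secondary index.
ALTER TABLE ballot
    DROP INDEX uk_ballots,
    ADD UNIQUE INDEX uk_ballots (judge, section, entry);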

So I had to turn to invasive procedures. What you do is cut off everyone’s access to the database, so nothing changes in the data in the middle of your surgery. Then you dump a backup copy of the table. Then you run the scariest command I’ve yet typed into a database:

DROP TABLE ballots;

That’s right: that deletes them all. And then you hope beyond hope that your backup data file is accurate and not itself corrupt. In reality, in my paranoia, I took four backups: two of the primary database and two of the replica. That took eight minutes, which included comparing them against each other to make sure they were all identical. If they had disagreed about how many ballots exist, I would have had to figure out, or just guess, which one was right. Today I was spared that.
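
The comparison itself was between the dump files, but you can run the same kind of sanity check straight from SQL against the primary and the replica before trusting anything:

-- Run on both the primary and the replica; the answers should match.
SELECT COUNT(*) FROM ballot;
CHECKSUM TABLE ballot;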

Then I had to make a choice between the Right Way and the Fast Way to reload the data. Loading up the ballots data takes about 15 minutes. Deleting all the ballots takes about 0.15 seconds, and can’t be undone. So if I do a test run first, the downtime is longer. If I don’t test it and the file turns out to be bad, I’d have nothing left to recover from. In trying to shorten the downtime by 20 minutes, I could lengthen it by several hours.

So today, caution won. I took one of the backup files and loaded it into my test machine. Simply copying the file took a few minutes, and then I got to sit there and watch as it re-created ballots in batches of about 8,500. All 2,000 of them. Each batch took about 0.45 seconds on average, so it was roughly 15 minutes of just sitting and waiting as line after line of data was reloaded, like this:

Query OK, 8770 rows affected (0.260 sec)
Records: 8770 Duplicates: 0 Warnings: 0


Query OK, 8594 rows affected (0.272 sec)
Records: 8594 Duplicates: 0 Warnings: 0


Query OK, 8927 rows affected (0.270 sec)
Records: 8927 Duplicates: 0 Warnings: 0
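
The reload itself is nothing exotic; it is just feeding a dump file back into the database client, something along these lines (with a made-up file name):

SOURCE /backups/ballot-2024-11-16.sql;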

It’s a real fun thing when you are sitting and waiting and can do nothing, while you know everyone else is doing the same. But eventually, it worked. And so I braced, dropped the real database’s ballots, and ran it again.

And phew, it was fine.

The site came back immediately, though naturally it was a bit slow at first, because then EVERYONE was pairing their first round at once, which is far from typical. But that worked itself out fast.

And then I got to clean up the resulting mess. I have 31 terminal windows open with full access to the entire database; better not typo in any of those! I spun up a bunch of servers to get spare capacity going, and in the process found a different, less grievous bug; but that one, thankfully, was Hardy’s fault, and so I shoved it off on him. And then, of course, you know how I get an email every time that error message screen happens? I got to clear out the 92,822 error reports that were queued up in the email server before any other messages would send.

And then I wrote this post.

After a downtime, you want to take apart the causes and figure out how to keep it from happening again that way. Last year’s downtimes were all capacity related; we had too few resources for too many users. It was mostly me figuring out how powerful our new cloud system was, and sometimes wasn’t. We also lacked a system that could quickly bring new resources online when I guessed short. I spent a fair chunk of August building a system to help; now it takes me about 5 minutes to spin up new servers instead of an hour. So neither of our episodes this fall was caused by that.

The one in October was in the category of “my fault, preventable, but super unlucky.” It’s the type of thing where there does exist a level of care that might have prevented it. But practically speaking, that level of care would also paralyze me if I adopted it; I would do nothing else if I were that fanatical about validating code and queries. So instead, I created some automated systems to check for slow queries during the week and notify me, to try to find these issues before they explode. That system has already ferreted out a number of annoyingly (but not tragically) slow functions. These things only blow up when there are hundreds of them running at once; but if they don’t exist at all, that will never happen instead of rarely happening.
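
The general shape of such a check, assuming the slow query log is written to a table rather than a file (the threshold and query here are illustrative, not the actual system), looks like this:

-- Tell the server to log slow queries to the mysql.slow_log table.
SET GLOBAL log_output = 'TABLE';
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 2;   -- anything over 2 seconds counts as slow

-- Then a periodic job can look for repeat offenders and send a notification.
SELECT sql_text, COUNT(*) AS hits, MAX(query_time) AS worst_time
FROM mysql.slow_log
WHERE start_time > NOW() - INTERVAL 1 DAY
GROUP BY sql_text
ORDER BY hits DESC
LIMIT 20;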

Today’s episode was worse: there’s no way for me to prevent errors deep in the underlying database code. I have neither the control nor the capacity to address them. I will probably schedule a very early morning downtime in the next week or so, or maybe over the Thanksgiving break, to do a full rebuild of all the database tables and to deep-scan the disk they live on. That’s worth doing anyway; rebuilding the tables gets rid of the empty spaces that once held deleted records, and makes the whole thing run a few percent faster.
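
In database terms, that rebuild is a table-by-table pass of something like:

-- InnoDB implements this as a full recreate of the table and its indexes,
-- which also reclaims the space left behind by deleted rows.
OPTIMIZE TABLE ballot;

-- ...repeated for every table; ballot is just the biggest one.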

And I might just move all the data onto a new disk altogether. That’s proactive and reduces some risk, but the truth is I might be chasing chimeras. And that’s life with computing. Technical complexity can cause a lot of grief. Human error causes even more. And sometimes it’s neither; it’s just that the stars decided today is not your day, and you’re going to know it.

And such a day was Saturday the 16th of November. It’s been many years since we’ve had a problem of this particular flavor; may it be many years before we have another.