December 2024

TABROOM

Had a short hiccup on Friday night; it was largely because I’d spun up extra capacity but the process of spinning it up didn’t quite finish, alas, for a really stupid reason.  But fortunately it was also quick to fix; the process to get around the stupid reason was fast, and we were back after like 6-7 minutes.  I can’t promise never to have issues, after all; nobody can in tech.  But it’s nicely affirming when the problem is just a little turbulence instead of a full plane crash, especially when I recovered so fast as a direct consequence of some blood sweat & tears I’ve recently put in.

That brings me to a wider point about the rewriting process and the concept of a feature freeze.  I’m trying to not code up new material in the old programming environment as much as possible, except for direct bug fixes and flaws, while I get the infrastructure rolling behind the new framework. However, to some degree that is impossible.  Tabroom’s reality is constantly changing, because you all keep using it.

Even if I never add another feature and only fix bugs and errors, Tabroom must change, simply because the scale increases. We get more traffic year over year, more tournaments, more students, and every one of our end participants uses the tech more heavily too; we bring three devices per person to tournaments now. That growing load represents unavoidable change permanently baked into Tabroom, one that will always demand a measure of attention.  Software in active use can never be paused.

So after our fun times in November I combed through our records of moments the database locked up with heavy write traffic, and rewrote every page and query that featured there to avoid them. The big one was the pref entry screen.  Did that cause our corrupted index?  I’ll never know. But it will perhaps make them less likely in the future, and it will definitely make parts of the site run faster and better.

The expanded load also means the software is more unforgiving.  Smaller mistakes become big problems. To a degree the expansion of Tabroom represents an expansion of the world of forensics.  This is good!  But it does demand I keep up with it, so I’ll never be able to entirely focus on the rewrite.

But all the same, I’ve made some good progress there. One big advantage of the new framework is it runs a lot faster on less powerful hardware.  The other big win is that I’m a far better coder than I was twenty years ago; the code I put out after rewriting will be more robust and capable.  I can already feel the system reaping the benefits of both of those things.  I’m not traveling at all in December, either for myself or tournaments, after this weekend.  I’m hoping I can use the stillness to hunker down and turn the corner; I’ve seen its edge, so we are perhaps near to seeing some reality there.

THE OTHERS AROUND

My sister got a new gig already after the old one had a round of layoffs; the new one seems much more comfortable and promising, though, so props to her for landing so quickly.  I continue to have some pretty phenomenal nephews and nieces, as finding things they’ll like during my travels has confirmed for me.  But then, I am somewhat biased.

JUST LITTLE OL’ ME

Welcome to the Holiday Season, such as it is.  I confess a dearth of conventional Christmas spirit, and generally I try to avoid traditional observance of the holidays. For one, Christmas was really my dad’s holiday; he always made it a big deal, and since we lost him some 13 or so years ago, his passion for the day adds a tang to the holiday that I find it better to avoid. Don’t take up smoking, kids.

And as it happens, December 25th is the least common birthday of the year, but it was still the birthday of Humphrey Bogart, Jimmy Buffett, Sissy Spacek, Rickey Henderson, and your humble Tabroom programmer.  I therefore prefer to spend the day away from indoor trees and too much rib roast. Instead I go off and find someplace quiet with more outdoor trees.  It works for me.

My European Gallivant was lovely for the most part. I found Munich warm, comfortable and welcoming. Venice was fun as always especially for me having company there — I’ve never actually traveled with people in Europe before, and they spurred me into seeing and doing things I’d not usually find on my own, such as a performance of Verdi’s Otello at the iconic La Fenice opera house. I confess I’m not much of an opera person, despite loving classical music generally.   But it was still great to go if only the once.  And then I swung through Amsterdam for some time in coffee shops reading tech docs, rijsttafel, and cloudy skies.

I confess however that travel to Random European Cities has grown easier for me, but also less interesting and exotic. I found myself walking around places feeling more at home, but less engaged by them as interesting in their own right as a result.  I intend to focus my wanderings on more rural places and probably further afield in the days to come.

November Supplement

At 7:16 AM Central, on Saturday November 16th, Tabroom’s database server had this to say:

2024-11-16 13:16:00 0 [ERROR] InnoDB: tried to purge non-delete-marked record in index uk_ballots of table tabroom.ballot: tuple: TUPLE (info_bits=0, 4 fields): {NULL,[4] \ (0x805CEFD5),[4] u (0x807514D1),[4] ZO(0x82B95A4F)}, record: COMPACT RECORD(info_bits=0, 4 fields): {NULL,[4] \ (0x805CEFD5),[4] u (0x807514D1),[4] ZO(0x82B95A4F)}

Poof goes the ballots table.

A ballot in this context is a data record.  One is created for every judge in a section for every entry in that section.  So someone judging a single flight of debate would have two ‘ballots’; a three-judge panel in a room with six speech contestants would have 18 of them.  Each one tracks what side the entry is on, what order they speak in, when that judge hit start, and whether the round is finished. All the points, wins and losses the judges hand out are stored below it; all the information about room assignments, scheduled times, the flip, and event type are above it. So, it’s a rather critical table.  And it’s large: there are 16.1 million such records in Tabroom, making it the second largest.
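That counting rule is simple enough to check in a few lines. This is only a sketch: the function name is made up for illustration and is not Tabroom's actual code, and the debate case assumes a flight with two entries.

```python
def ballots_in_section(judges: int, entries: int) -> int:
    """One ballot record per judge per entry in the section."""
    return judges * entries

# One judge hearing a single flight of debate (assumed here to be two entries).
debate_flight = ballots_in_section(judges=1, entries=2)

# A three-judge panel hearing six speech contestants.
speech_panel = ballots_in_section(judges=3, entries=6)

print(debate_flight, speech_panel)
```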

At 7:16 CST this morning, Tabroom had to delete just one of those records. Maybe a judge needed to be replaced. Maybe a round was being re-paired and all the entries were dumped. Whatever the reason, a ballot was queued for deletion.  That happens thousands of times on a Saturday. But in deleting that particular ballot, the database server software wobbled just a little bit. Perhaps it hit a very obscure bug. Perhaps it wrote the information on a part of the disk that has a small chemical flaw buried in its atoms, and so it failed. Or perhaps a cosmic ray hit the memory and flipped a zero to 1, and changed our world. However it happened, the table’s index was transformed to nonsense.

An index is a data structure used to speed up reading the database. If you ask for all the ballots in Section 2424101 and there is no index, the database server has to scan all 16.1 million ballots in Tabroom to deliver the 12 you are looking for. That’s very slow on a large table. So for commonly accessed data, you create an index, which is a sorted record of all the Section IDs in the Ballots table. The database finds the range you’re looking for quickly, and all 12 ballot IDs are listed there together.

But indexes aren’t free; you can’t just create them for every data element. Each one takes up space, increasing the disk size of the database. They also slow down writes: you have to update the index every time you create new data. So you only create them for data elements that you search by; the Section ID of a ballot yes, but the time of your speech, no.
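To make the scan-versus-index difference concrete, here is a small sketch using SQLite from Python's standard library. It is a stand-in, not Tabroom's real database server or schema: the table layout, column names and index name are all assumptions made up for the illustration. It asks the database how it plans to run the same lookup before and after the index exists.

```python
import sqlite3

# Hypothetical, tiny stand-in for the ballots table (SQLite, made-up schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ballot (id INTEGER PRIMARY KEY, section INTEGER, speech_time REAL)"
)
conn.executemany(
    "INSERT INTO ballot (section, speech_time) VALUES (?, ?)",
    [(2424000 + n % 500, 0.0) for n in range(50_000)],
)

def plan(sql: str) -> str:
    """Return the human-readable steps SQLite intends to use for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[-1] for row in rows)

lookup = "SELECT id FROM ballot WHERE section = 2424101"

plan_without_index = plan(lookup)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_ballot_section ON ballot (section)")
plan_with_index = plan(lookup)     # now the lookup goes through the index

matching = conn.execute(lookup).fetchall()
print(plan_without_index)
print(plan_with_index)
print(len(matching), "ballots found")
```

The index is indexed on section but not on speech_time, matching the trade-off described above: every extra index costs disk space and slows down writes, so only columns that are actually searched get one.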

That little glitch at 7:16 AM deleted the index records for that one doomed ballot, but not the ballot itself.  Suddenly the number of rows in the index did not match the table. Therefore, the database stopped using it, since it knew the index was no longer reliable. The slowdowns, lockups and downtime on Saturday morning are what it feels like to use the ballots table without indexes: it starts out slow, and goes downhill from there.

First, I tried the gentle fix: a utility that tries to verify the data and rebuild just the indexes, which it does without any invasive changes to the data itself. If it succeeds, the database just starts working afterwards. It takes about 12 minutes to run on that large ballots table; a fact I learned this morning. And then, it failed.  It can fail for a lot of reasons, but mostly it has a very hard time verifying data that is changing as it operates, which is what a live database must do.

So I had to turn to invasive procedures. What you do is cut off the ability of anyone to access the database, so nothing changes in the data in the middle of your surgery. Then you dump a backup copy of the table. Then you run the most scary command I’ve yet typed into a database:

DROP TABLE ballots;

That’s right, that deletes them all. And then you hope beyond hope that your backup data file is accurate and not itself corrupt. In reality, in my paranoia, I took four backups. Two of the primary database, and two of the replica. That took eight minutes, which included me comparing them against each other to make sure they were all identical. If they disagreed as to how many ballots exist, you then have to try to figure out, or then guess, which one was right.  Today I was spared that.
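One generic way to compare backup files against each other is to checksum them. The sketch below is an assumption about how such a check could look, not a description of what was actually run that morning; the function names are invented. It also assumes the dumps are byte-for-byte identical when the data match, which is stricter than necessary, since real dump tools can embed timestamps or comments that differ between runs.

```python
import hashlib
from pathlib import Path
from typing import Iterable

def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large dumps do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backups_agree(paths: Iterable[Path]) -> bool:
    """True when every backup file has exactly the same contents."""
    return len({file_digest(p) for p in paths}) == 1
```

If the digests disagree, the check says nothing about which copy is right; that judgment still has to be made by hand.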

Then I had to make a choice between the Right Way and the Fast Way to reload the data. Loading up the ballots data takes about 15 minutes. Deleting all the ballots takes about 0.15 seconds, and can’t be undone. So if I do a test run, your downtime is longer. If I don’t test it but the file is bad, then I’d have nothing to recover it from. In trying to shorten the downtime by 20 minutes, I would lengthen it by several hours.

So today, caution won. I took one of the backup files, and loaded it into my test machine. Simply copying the file took a few minutes, and then I got to sit there and watch as it re-created ballots in batches of about 8,500. Roughly 2,000 batches in all. Each batch takes about 0.45 seconds to run on average, thus it was about 15 minutes of time total of just sitting and waiting as line after line of data was reloaded in, like this:

Query OK, 8770 rows affected (0.260 sec)
Records: 8770 Duplicates: 0 Warnings: 0


Query OK, 8594 rows affected (0.272 sec)
Records: 8594 Duplicates: 0 Warnings: 0


Query OK, 8927 rows affected (0.270 sec)
Records: 8927 Duplicates: 0 Warnings: 0
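The arithmetic behind that wait is easy to check roughly. The sketch below uses only the approximate figures quoted in this post (16.1 million ballots, batches of about 8,500, about 0.45 seconds per batch), so the result is an estimate, not a measurement.

```python
import math

total_ballots = 16_100_000   # table size quoted in this post
batch_size = 8_500           # approximate rows per batch
seconds_per_batch = 0.45     # approximate average per batch

batches = math.ceil(total_ballots / batch_size)
minutes = batches * seconds_per_batch / 60

print(f"{batches} batches, about {minutes:.1f} minutes")
```

That lands a little under 2,000 batches and a little under 15 minutes, which squares with the rounded numbers above.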

It’s a real fun thing when you are sitting and waiting and can do nothing while you know everyone else is doing the same. But eventually, it worked. And so I braced, dumped the real database’s  ballots, and ran it again.

And phew, it was fine.

The site came back immediately, though naturally was a bit slow at first because then EVERYONE was pairing their first round at once, which is far from typical. But that worked itself out fast.

And then I got to clean up the resulting mess.  I have 31 terminal windows open with full access to the entire database; better not typo in any of those!  I spun up a bunch of servers to get spare capacity going, and found a different, less grievous bug in the process with that, but that thankfully was Hardy’s fault, and so I shoved it off on him.  And then of course, you know how I get email every time that error message screen happens? I got to clear out the 92,822 error reports that were queued up in the email server before any other messages would send.

And then I wrote this post.

After a downtime, you want to take apart the causes and figure out how to make it not happen again that way. The last year’s downtimes were all capacity related; we had too few resources for too many users. It was mostly me figuring out how powerful our new cloud system was, and sometimes wasn’t. We also lacked a system that could quickly bring new resources online when I guessed short. I spent a fair chunk of August building a system to help; now it takes me about 5 minutes to spin up new servers, instead of an hour.  So, neither of our episodes this fall was caused by that.

The one in October was in the category of “my fault, preventable, but super unlucky.” It’s the type of thing where there does exist a level of care that might have prevented it. But practically speaking that level of care would also paralyze me if I adopted it; I would do nothing else if I were that fanatic about validating code and queries. So instead, I created some automated systems to check for slow queries during the week and notify me, to try to find these issues before they explode. That system has already ferreted out a number of annoyingly — but not tragically — slow functions. These things only blow up when there’s hundreds of them running at once, but if they don’t exist at all, then that will never happen instead of rarely happening.
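The general shape of a slow-query check like that can be sketched in a few lines. Everything here is hypothetical: the query names, the timings, and the one-second threshold are invented for illustration, and the real system presumably reads its timings from the database server and sends a notification where this sketch just prints.

```python
from dataclasses import dataclass

@dataclass
class QueryTiming:
    name: str       # hypothetical label for a page or query
    seconds: float  # how long one run took

def slow_queries(timings, threshold_seconds=1.0):
    """Return timings slower than the threshold, slowest first."""
    flagged = [t for t in timings if t.seconds > threshold_seconds]
    return sorted(flagged, key=lambda t: t.seconds, reverse=True)

# Made-up sample data standing in for a week of recorded timings.
sample = [
    QueryTiming("pref_entry", 0.08),
    QueryTiming("results_summary", 4.2),
    QueryTiming("ballot_lookup", 1.6),
]
report = slow_queries(sample)
print([t.name for t in report])
```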

Today’s episode was worse: there’s no way for me to prevent errors deep in the underlying database code. I have neither control nor capacity to address it. I will probably schedule a very early morning downtime in the next week or so — or maybe over the Thanksgiving break — to do a full rebuild of all the database tables, and to deep scan the disk they live on. That’s worth doing anyway; rebuilding the tables gets rid of empty spaces that once held deleted records, and makes the whole thing run a few percent faster.

And I might just move all the data onto a new disk altogether. That’s proactive and reduces some risk, but the truth is I might be chasing chimeras. And that’s life with computing. Technical complexity can cause a lot of grief. Human error causes even more. And sometimes, it’s neither; it’s just the stars decided today is not your day, and you’re going to know it.

And such a day was Saturday the 16th of November. It’s been many years since we’ve had a problem of this particular flavor; may it be many years before we have another.

November 2024

TABROOM

The rewrite continues apace.  I still feel stuck in the cellar, as it were, working on foundational framework issues instead of the ooh-pretty of an actual page.  But building software is often like making a building; the stage where it looks like a giant pit of dirt seems to go on forever, but once you see a support beam sticking in the air, it starts looking like a real building very quickly after that.   Or at least, so I hope.

I had a fun time today upgrading the ol’ laptop to the latest version of the Linux distro I run.  In the process, somehow this version of Chromium combined with my particular Radeon drivers inverted all the colors of all pages and plugins inside the browser.  I mostly use Firefox anyway, but still.

I gather the Reddits have noted that my logo-hover Easter egg is feeling dumpy.  One enterprising user complained that I waste time on these things, and haven’t updated my “90s style website”. Please. Tabroom’s current design is firmly rooted in the early 2010s era web. And this may shock you, but overhauling a site’s design does take a touch more time and effort than throwing up a goofy quote under a logo; and if the latter gives me a moment of happiness amidst a very large todo list, then why begrudge me it?  More to the point, the outward appearance is not as important a priority as the inward functionality, which has changed an awful lot.

But while Tabroom’s design is not of the 90s, I did a core chunk of my growing up then.  So I have switched to a quote from that era, a core cinematic masterpiece of the transition between the 80s and 90s.

THE HUMAN SIDE

I’ve been corralling and cataloging the vast swathes of photos I’ve taken over the years into something approaching order, and might even print some of them out so I can enjoy them, after hiking hundreds of miles and spending so much time taking them in the first place.  It’s amazing what following through can do.

Since I go away for some of the heavier holidays, I enjoy the lighter ones with family.  This was my third Halloween trick or treating with my nephew.  The first time around he was inert, a 10 month old being hauled around and gawked at.  Last year he was just barely able to say “Ap-eee Allow-een” but not really “Trick or treat!”

This year he still didn’t quite understand the purpose of the ritual; his parents don’t give him much in the way of sweets and sugar, so the purpose of the outing is more the experience.  His mother mostly hopes for Reese’s cups that she can abscond with.  But he’s now nearly three years old, and he has Opinions about what he wants to do, and they did not include the ritual of trick or treat.  He is mostly interested in pressing buttons that make noise. Halloween lawn ornaments looked to him like the type of thing that should do that.  Cue an evening spent chasing him running into people’s lawns to keep him from bashing their stuff.  But we managed to depart before his first major meltdown and get him to bed almost on time, so it was a good night.

I have sharply reduced my news and social media intake due to Events.  I’ve grown very tired of the competing fear stories, true or not; one’s energy is finite, and there’s a certain grandiosity we do cultivate in our debaters and speechies that Being Informed matters, and that it is essential that we stay on top of these questions of great policy. So right now my phone is a much less consulted device, and my book reader is front and center.  It’s helped me write more, too.

WHERE ARE YOU

This month, I’m headed to Venice for Thanksgiving then a brief sojourn to Amsterdam before returning to Austin for the Longhorn Classic.  Venice is now an old friend, that city that I should probably hate but instead have loved.  It’s a good time to go back to some old favorites, I think, and be at peace in remote shores.

I haven’t planned anything yet for my birthday, but I can promise you it will not be spent here moping.

 

 

October 2024

TRAVEL

I had a fantastic time in Taiwan. First I got to see the country solo, though not so much of it as I had planned. I landed on the northern end of the island at the same time a typhoon was busy pummeling the southern. My side trip to see the old city of Tainan was therefore canceled; instead I stuck to the northern third, where the storm turned out to be a nothingburger despite two canceled schooldays.  But a train ride down to Hualien and a car tour of Jiufen and Shifen on the eastern coasts satisfied my standing urge to escape the city.  And Taipei is marvelously comfortable for an American.

After the solo portion, I was wrapped in the over the top hospitality of the parents of the Taipei American School debate program, where I was wined, dined and regaled with about 81,102 toasts, though I confess I didn’t count. I confess that for whatever reason Taiwan has never been on my priority list of places to visit. But now it’s on my list of places to return to.

One of the things I did manage while there is to spend a whole day writing travel blog posts from earlier trips, so my sequence about last year’s Japan adventure will be posting once a week until done. Let’s see if I can get more written between the end of that and now.

No more big journeys for me until Thanksgiving, when I launch on a European Gallivant.

TABROOM

We had a brief downtime on Saturday, which thankfully happened just as I was waking up anyway in Taipei. The downtime has apparently caused a fair fury of speculative debugging around the socials, because the error messages indicated that a disk was out of space. But the lesson of this speculation is that errors can be misleading if you don’t have the full picture.

Some myths, debunked: Tabroom does indeed run in the cloud, not on a server we run ourselves. That fact isn’t so magic as you might expect: all “the cloud” means is someone else’s server. Cloud services are still subject to the resource limits any other server has. In particular, database servers are tricky to parallelize, that is, to run on multiple machines so that we are not vulnerable to a single instance’s downtime. So while our web servers run 2 instances during the week and anywhere from 4-16 instances during the weekend, our primary database server remains singular. That limitation is from the core tech, and is not specific to Tabroom. So, we are stuck with it.

The root cause of this downtime was not insufficient disk space. Tabroom’s database takes up 36GB of space; that’s all your registered entries, ballot comments, and event descriptions rolled up into one mess of data. The database has its own dedicated disk, separate from the operating system and general server it runs on. That disk currently has 128GB of space; not the largest, but we pay for fast instead of big here, and it’s still roughly 3.5x as large as Tabroom’s data needs. It’ll do fine for a decade, and we can expand it at will when and if we need to.

That was not the full disk you were seeing errors about.

Instead, the culprit was a badly written query, created years ago for a rarely used results page that experienced a sudden surge in popularity this weekend. That query failed to limit its scope: in order to calculate its output, it was pulling every ballot and ballot score in Tabroom. In 2016 when it was written, that made it slow but not particularly noteworthy. In 2024, every time someone went to that page, a 23 gigabyte temporary file was created on the server disk to run this one query.

At that point, it was only a matter of time: no server can indefinitely handle several hundred 23 GB files being dumped on it. At 4:30 PM CST, the disk hit its limit. Kaboom.
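What "failing to limit its scope" means is easier to see side by side. This sketch uses SQLite from Python's standard library with an invented two-table schema; the table names, the `tourn` column and the data are assumptions for illustration, not the real query or Tabroom's real layout. The only difference between the two queries is a WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ballot (id INTEGER PRIMARY KEY, tourn INTEGER);
    CREATE TABLE score  (id INTEGER PRIMARY KEY, ballot INTEGER, points REAL);
""")
# 200 made-up tournaments with 50 ballots each, one score per ballot.
conn.executemany(
    "INSERT INTO ballot (id, tourn) VALUES (?, ?)",
    [(n, n % 200) for n in range(1, 10_001)],
)
conn.executemany(
    "INSERT INTO score (ballot, points) VALUES (?, ?)",
    [(n, 28.5) for n in range(1, 10_001)],
)

unscoped = "SELECT b.id, s.points FROM ballot b JOIN score s ON s.ballot = b.id"
scoped = unscoped + " WHERE b.tourn = ?"

all_rows = conn.execute(unscoped).fetchall()        # every ballot on the site
one_tourn = conn.execute(scoped, (42,)).fetchall()  # only what one page needs

print(len(all_rows), "rows unscoped vs", len(one_tourn), "rows scoped")
```

On a toy table the unscoped version is merely wasteful. The gap between the two grows with every season of data added, which is how a query can go from slow to dangerous without anyone touching it.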

Fortunately, it was a simple matter, once I woke up, to clear the disk, fix the query, and kick the server. That’s life when you’re the sole maintainer of a project sometimes. I simply cannot go back and test and check every one of the thousands of queries that Tabroom regularly runs. When I do confront something like this, I do a review and put up guardrails around this exact thing happening again, but that only solves for the problems I’m aware of. Are there other ticking time bombs in the code?  Probably! Is this true of every other online service on earth?  Definitely!

But I promise you the issue is not that we haven’t found a good enough deal on enterprise disks, or the NSDA is being cheap on the hosting provider. We’ve been pretty lavish this year in terms of server resources, actually. But this particular problem would have blown up no matter how much overkill we’d built into our hosting setup; it was simply the result of terribly common human error. That’s what my job is. Consider that your typos at worst can insult someone, or temporarily hurt a student’s grade. Mine can bring down most of speech & debate. No amount of paranoid care can entirely prevent that, even though I do take quite a bit of it.

The worst part of it?  That results page with the query doesn’t actually work properly anyway; its formatting is broken. And since people are for some reason now fascinated by this page, they won’t stop emailing us. I’m going to put up a notice about that, but perhaps will just take it down. There’s little value in spending a week trying to fix this bad spaghetti code when I’m just going to have to rewrite it soon anyway; instead I’ll just move it up the list of things to be rewritten early.

The rewrite goes apace. Right now I’m working on standing up a testing framework which should very much help in finding bad queries before they go bad in production. Having a proper testing framework from the beginning of the rewrite reduces the chances that future changes will go back and hurt existing code without me knowing about it. But it also means I have to slim down that 36GB of data y’all have created over the years, to a set of data that is complete enough that I can test every scenario but doesn’t take 2 hours to load on the testing database. This work is drudgery, but invaluable, which is the worst kind of drudgery. But given the jetlag, it’s probably about what my brain is up for right now.

OTHERWISE

Fall is here!  It’s the best time of the year in New England, except that I’m allergic to it, and still jet lagged from Taiwan. But the days are also growing sadly shorter, which isn’t helping with the lag. I’m hoping to get up into the north country this weekend and spend some time in the outdoors crispness; I’m hoping it won’t be a total tourist mob scene in the White Mountains, but likely we’re past foliage peak up there anyhow.

This blog is and has long been hosted on WordPress. But recently, the WordPress project has decided to set itself on fire, thanks to an apparent hissy fit by the founder. He runs both the nonprofit that owns the open source code and update servers, and a for-profit hosting company, Automattic, built on the same software. He claims that a competing hosting company, WPEngine, isn’t giving enough back to the community and somehow abusing trademarks in a nebulous way. But WPEngine isn’t required to give anything back at all, and the trademark claims seem spurious to me; that was enough to raise my antenna. And then the founder leveraged control over the nonprofit to cut off WPEngine from the open source code and security updates. They took over a plugin created and maintained by WPEngine, and pushed out their own changes to it, as well as renaming it, under the guise of “security.”  This update would have auto-installed on thousands of blogs without the administrators thereof consenting to the change or even being aware of it.

That final bit crossed the Rubicon in my book; I no longer trust WordPress, and will therefore soon transition this site to another platform. Honestly, WordPress was never a perfect fit for me anyway, and because it is so common on the web it also requires a lot of security filtering; even my little blog suffers near constant hacking attempts. The most obvious alternative appears to be Ghost, which has the virtue of integrating in email subscriptions, so if you are one of the three people who regularly like to keep updated on my blather, you’ll have that as an option soon.

I don’t really want or need an additional side project, but so it goes sometimes in the world of open source.

September 2024

September steals into the world with a lovely crisp week of New England fall weather, cool and perfect, but with that bright bright sun that we see all summer and miss all winter.  I went up to York, ME for Labor Day, giving me a chance to walk in the cooling breeze during sunset and wave goodbye to summer, a stolen tradition but one I quite like.

And now we have the reality of the approaching school year.  Boo, hiss.

Tabroom

I am crawling to the bitter end of the list of things I have to do with Tabroom before I can feature freeze it.  I had hoped to finish that before leaving for a sojourn with Mock Trial folks last month but it was fantasy, as ever.  My last challenge is to build a front end that can take advantage of Tabroom living in the cloud now, which will let people other than me autoscale the power of our installation upwards if the service is lagging.  Right now we can scale it up, but the process is picky and technical which means only Hardy and I can do it, and as anyone in front line support will tell you, you need at least 3 people for 24/7 coverage.

So once this is done, I can permit others in the NSDA to hit the “More power!!!” button when there is a slowdown.  The process of programming it is quite tedious, however.  One major requirement is making sure that folks without a programming background can understand the nature of the problem before hitting buttons that will cost us a lot of money.  There are times when there are Tabroom slowdowns for individuals that aren’t actually server overloads: their local internet is having trouble, or the provider’s is.  A system can report load metrics to tell you if they’re struggling and why, but these are a little arcane and hard to read, and stored in multiple locations.  So part of this task is me having to read badly formatted data from six different sources and present it to a colleague in my department such that an intelligent non-programmer can understand and act on it.

That job is nitpicky and tedious, and prone to look right when in fact it is wrong.  When you’re trying to sift through a dozen bits of data that are all decimal numbers between 0 and 1, and you pick the wrong column, it still appears okay unless you check it very carefully.  It’d be easier if the wrong answers were all 439,981 when the right answers were 0.31.
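One common guard against the wrong-column trap is to look fields up by name instead of by position, so a wrong name fails loudly while a wrong position just hands back another plausible decimal. The sketch below is a generic illustration in Python, not the backend described here (which is written in NodeJS); the column names, hosts, and the 0.85 limit are all made up.

```python
import csv
import io

# Made-up metrics export; the column names are illustrative only.
raw = io.StringIO(
    "host,cpu_load,memory_used,disk_busy\n"
    "web1,0.31,0.62,0.05\n"
    "web2,0.92,0.58,0.07\n"
)

def overloaded_hosts(report, field="cpu_load", limit=0.85):
    """Return hosts whose named metric exceeds the limit."""
    hosts = []
    for row in csv.DictReader(report):
        value = float(row[field])  # raises KeyError if the column name is wrong
        if value > limit:
            hosts.append(row["host"])
    return hosts

busy = overloaded_hosts(raw)
print(busy)
```

A range check would not help here, because every column is a decimal between 0 and 1; the name lookup is the part doing the protecting.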

But the nice news is I’ve been able to write more of this backend in NodeJS and not increase my rewrite woes yet further on the cusp of being able to focus on it exclusively.

I’m also working on a pretty comprehensive set of documents for Tabroom aimed at Mock Trial usage.  It’s coming along well, though it is reminding me that we really do need to show some love to the docs for Speech & Debate usage as well. I’m hoping I can actually use some of this MT stuff to help out S&D.  Two public speaking activities, helping each other.

Tournaments & Travels

My slimmed down schedule includes two this month:  the Kentucky Season Opener on 9/7 weekend, and the Jack Howe Memorial at CSU Long Beach on 9/28 weekend.   And then, in more distant and exotic news, I depart for Taiwan, partly for the Taiwan Speech & Debate Invitational on 10/12 weekend, but also for a week of seeing what the island has to offer first hand.  I admit Taiwan has never been high on my travel radar before, because I didn’t know much about it.  I sat down on a long flight last week to do some reading, and it took me exactly one blog post to go from “How should I fill my time there?” to “How on earth can I narrow this list down so it’s manageable?”

Otherwise this is also the stunning time of year when New England gets to lord our superior weather over the rest of the world.  It doesn’t happen often, so we tend to grab it with both hands when it does.  The humidity blows out into the ocean, taking the bugs with it, and then the leaves turn bright, and I start moseying northwards to the forests more often.

Writing

I haven’t done squat with the eight or so ideas I have for a travel blog that people keep pestering me about.  But I’ve written three full chapters of this book I’ve been toying with.  I’ll likely never have the gumption to share it with anyone else, but it’s been edifying practice to write it out, and it gives me an excuse to put the coding linter down sometimes.

I made a clipboard to write with out of purple heartwood that came out decent. No photos yet, and I suspect I made it a touch too thin and it’ll warp, but as a first shot working with a new hardwood, I’m decently pleased.  If it does warp, I’ll try a layered version next and see.