Sigh.
So look, I could just make up a pile of believable nonsense sprinkled with technical terms here and pawn the blame off on a crevulating internexus or something.
Nah. I simply screwed up. I built the thing that scales Tabroom’s web server fleet up and down as a guided process, because I didn’t have enough data to make it automatic; since I didn’t know the proper ratio of tournaments and users to servers, I couldn’t tell the computer what to do. It was a judgment call. So the process was manual.
And well, today I simply forgot to do it. I didn’t go to a tournament myself, so I slept in a little bit and didn’t remember until my phone flipped out. And then, two hours later than I should have, I hit the Big Red Button and the gears spun up and it was all fine 15 minutes later. Mea culpa.
There’s a silver lining, however. At first I spun us up from our weekday standard pair of servers to a full 10. Tabroom, for you, came back up and was fine from that point forward. But the performance numbers were cheerfully and consistently in the orange range. Nothing was overloaded, but we had little spare capacity.
That’s interesting, because now I have a sense of where the line is.
It’s especially indicative because Tabroom was maximally busy right then. One of the challenges I have with Tabroom blips is that for 10-30 minutes after, Tabroom experiences much heavier load than usual. That’s because you all build up a backlog of things for it to do; some rounds are delayed, some judges wait to enter their ballots, and so on. When the site comes back, everyone rips through their backlog at the same time. So as I watched my newly adequate servers balance on the edge of what they could do, I knew this moment was also likely the limit of how many operations they’d ever be asked to run on a weekend with this many tournaments and users.
This weekend Tabroom is hosting 92 tournaments with 14,688 individual competitors and 6,141 individual judges. That implies 10 servers can just about handle 20,000 prospective users, which makes 2,000 users per server. That, my friends, is what we call actual data, not “Palmer’s gut.”
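For the curious, here’s that back-of-the-envelope math spelled out. The rounding is mine, and the snippet is purely illustrative:

```python
# This weekend's numbers, straight from the paragraph above.
competitors = 14_688
judges = 6_141
servers = 10

prospective_users = competitors + judges          # 20,829 -- call it 20,000
users_per_server = prospective_users / servers    # ~2,083 -- call it 2,000

print(f"{prospective_users:,} prospective users / {servers} servers "
      f"= about {users_per_server:,.0f} users per server")
```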
Computer folks sometimes call themselves, or are called, ‘engineers.’ I don’t use the term — I refer to myself as a software developer instead. Real Engineers™ have the duty, but also the luxury, of checking things exhaustively before they actually build anything. We all want them to, since they are building bridges, schools and hospitals. Today, fantastic structures are created through a deep understanding of physics, material tolerances, weather, and so on. The process is guided by obsessive checking and regulation. It takes a lot for an engineer to stamp and sign a set of plans before any dirt is dug or hammers are swung.
In the Roman era, an engineer who built a bridge or aqueduct had to live under it for a year after with his family. They understood physics and materials less well than we do, so instead they overbuilt the hell out of things. It’s small wonder so many of their creations still stand.
The pace and resources of computing don’t permit us software developers that standard of care. We’re expected to deliver changes and new features at a speed engineers are never asked for. So we often end up out over our skis and the whole thing comes down. That’s not great for Tabroom, but it’s nowhere near the tragedy of a bridge collapse. More resources don’t help, because with them come more demands: did you know that Facebook and Google each had more total downtime than Tabroom did in 2024? That fact makes me feel… a bit better.
Therefore, in computing we end up with systems like “Palmer has to press a button every Friday night or the whole thing explodes. Let’s hope he remembers!” Imagine how quickly everyone involved would be fired if a railroad were built that way.
However, we do have some commonalities with Real Engineering. We share a sacred dedication to safety margins. Ten machines this weekend was just barely enough, so instead of watching like a dunce to see if it would tip over again, I immediately spun up six more until all the numbers were vividly green. In the safety of hindsight, I can say that two more servers would have been fine, and even four more was a touch excessive. Six was blatant overkill, birthed of a morning’s panic. I’m not sorry.
But now I am armed with actual data. I can set up Tabroom to automatically spin up and run a new server for every 1,500 or so anticipated users; provisioning for 1,500 on a box that just handled 2,000 keeps about 25% of each server in reserve, a generous but not overly expensive safety margin. And that automated process will not be forgotten. What makes computers useful is that they have different strengths and weaknesses than humans. Computers cannot be told to “eyeball the number of tournaments, think about what we’ve done in the past, and spin up a bit more than you think we need.” Even modern AI is likely to take that instruction and try to run 3,400 servers and bankrupt us, or -12 servers and break the laws of reality. They require a real formula: “run one server per 1,500 anticipated users” is something they can do.
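In code, that rule is about as dull as rules get. A minimal sketch, assuming a hypothetical `servers_needed` helper and a floor of the weekday pair of servers (the names and the floor are my illustration, not Tabroom’s actual code):

```python
from math import ceil

def servers_needed(anticipated_users: int,
                   users_per_server: int = 1_500,
                   weekday_minimum: int = 2) -> int:
    """One server per 1,500 anticipated users, never below the weekday pair.
    Provisioning for 1,500 on a box that handled ~2,000 this weekend keeps
    about 25% of each server's capacity in reserve."""
    return max(weekday_minimum, ceil(anticipated_users / users_per_server))

print(servers_needed(20_000))   # this weekend's load -> 14 servers
print(servers_needed(1_000))    # a sleepy summer weekend stays at the weekday pair
```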
But if I tell the machines to spin up that many servers every Friday at 4 PM Eastern, then they absolutely will do that every weekend, within seconds of 4 PM Eastern. My imperfect human memory is replaced with a guarantee. But there’s still a catch: it’s only a guarantee the job will be attempted. The automation code will still be the product of my imperfect hands, and therefore might fail even though it ran. When I run the job by hand and it fails, I see it and fix it right away. An automatic job cannot self-correct.
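Concretely, the shape of that automation could look like the sketch below: a cron entry fires every Friday afternoon, and the wrapper’s only extra job is to yell at a human when the attempt fails outright. (Silent failures are exactly why a human still checks.) Every name here, from the script path to `spin_up_servers` and `alert_humans`, is hypothetical rather than Tabroom’s real code:

```python
#!/usr/bin/env python3
# Hypothetical Friday cron entry (box assumed to be on Eastern time):
#   0 16 * * FRI  /usr/local/bin/scale_for_weekend.py
import sys
from math import ceil

def anticipated_users_this_weekend() -> int:
    """Placeholder: would be derived from registered competitors and judges."""
    return 20_000

def spin_up_servers(count: int) -> None:
    """Placeholder for whatever actually launches the instances."""
    ...

def alert_humans(message: str) -> None:
    """Placeholder for the email/pager hook, a.k.a. the scream-at-Palmer channel."""
    ...

def main() -> int:
    try:
        users = anticipated_users_this_weekend()
        spin_up_servers(max(2, ceil(users / 1_500)))   # one server per 1,500 users
        return 0
    except Exception as exc:
        # The guarantee is only that the job was *attempted*; failures get reported.
        alert_humans(f"Weekend scale-up failed: {exc}")
        return 1

if __name__ == "__main__":
    sys.exit(main())
```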
So I’ll still check it. But let’s say that I was 99% certain to remember to spin up the Tabroom servers manually. That sounds good, except when you consider that we have 365 days in a year, so that’d be 3 1/2 days of downtime on average per year from this cause alone. Today was that 1%.
That’s not nearly good enough. So we multiply it against another 99% certainty: that I can build an automatic scaling system that runs correctly. Now there’s a 1% chance the automation fails, and a 1% chance I forget to check on it. Both have to happen on the same day for things to break, which leaves a 99.99% certainty that at least one of them catches the problem on any given day. On average, that would take decades to explode again. That’s likely good enough, but we’ll still add another layer. We’ll make sure another NSDA staffer also checks every Friday, so they can scream at me if the scale-up hasn’t happened and I haven’t noticed.
Now we’re at 99.9999% certainty. At that rate of risk, downtime from this type of screw-up would average out to about half a minute per year, even charging a full day for every miss. If only we could handle all risks so easily. Getting to that “six nines” of coverage — which is how the computing industry refers to it — costs me the building of a script, plus five minutes of another employee’s time each Friday. Getting there in some other areas of our installation would cost us several million dollars.
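If you want to check my arithmetic, here’s the layering spelled out, assuming the safeguards fail independently and using the same deliberately pessimistic accounting as above, where every missed scale-up is charged as a full day of downtime:

```python
# Expected downtime per year from this failure mode as 99%-reliable safeguards stack up.
DAYS_PER_YEAR = 365
p_miss = 0.01                       # each safeguard fails 1% of the time

for layers in (1, 2, 3):
    p_all_miss = p_miss ** layers   # downtime requires every layer to miss at once
    downtime_seconds = DAYS_PER_YEAR * p_all_miss * 24 * 60 * 60
    print(f"{layers} safeguard(s): {1 - p_all_miss:.4%} safe, "
          f"~{downtime_seconds:,.0f} seconds of downtime per year")
```

One layer works out to about 3½ days a year, two layers to under an hour, and three layers to roughly half a minute.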
So we do what we can. Maybe I should have chosen a lower stress career, like disarming landmines or cleaning up nuclear waste or something.