Runaway Intervals, Stuck Shut-Off Valves and Nacho Cheese


I had a pretty interesting bug come up at work a couple days ago. The bug itself and the associated events illustrate several things: runaway setIntervals, why I love living on the East Coast, why you should sever connections to internal resources before hitting an external API, and, most of all, the sheer inventiveness of end users in finding any opportunity, however esoteric, to bring down your system.

Let’s start with the last one. This is one of my favorite things about where I currently work. I have the privilege of working for a company with a very large user base. Out of a total of over 100,000 customers, at any moment several thousand are logged into our site and cruising around, doing what they do. When you get into that large a number of people constantly hammering on your code, it tends to turn up really weird, obscure bugs that you never in a million years would have uncovered on your own, no matter how much testing you do pre-release. Here’s a contrived example…

Let’s say you have a contact form where a customer types a review of one of your products, and you have some javascript that monitors what they type and periodically fires off Ajax requests. You may have tested it six ways to Sunday, using unit tests, integration tests, focus groups, the works. But once you unleash it into the wild, if you have a big enough install base, eventually you get someone like Adrian. Adrian is a Norwegian expatriate living in Japan who wants to enter a negative review of one of your nose hair trimmers. He’s seriously pissed off because that thing HURTS. While he types out his caustic review, he’s eating nachos and drinking a 2010 Simi Valley merlot. He’s on his third glass and feeling a little tipsy. Before you know it, Adrian passes out after dripping nacho cheese all over his keyboard, leaving it stuck in a loop typing the Japanese characters for pain (I was going to include them here parenthetically, but WordPress converts them to question marks, so I’ll have to describe them for you instead. The first character looks like a fat guy in a bus stop; the second one is a kind of cursive capital H). Anyway, your javascript doesn’t like the fact that these keys have been depressed for over an hour and bad things happen. Now ask yourself… how could you possibly have anticipated this? Why would anyone ever intentionally keep a key down that long? Plus, what kind of person mixes red wine with nachos? I mean, it’s common knowledge you go with white wine when eating fish or cheese.

Anyway, it was someone like Adrian who almost caused a site outage for us last week. We use memcached as a caching mechanism on our site, and our Dev Ops guys noticed a climbing number of connections from a couple of our servers. I started to look into the issue, but then my wife called and told me a pipe had burst at our house. We live near Philadelphia and we’d had single-digit temperatures for a couple days straight, and a short piece of pipe on the outside of our house, feeding a hose faucet, had frozen and burst. I knew Dev Ops had the situation under control, so I cleared it with my boss and headed home to deal with my plumbing issue.

A helpful neighbor had alerted my wife to the problem; he’d then come inside and tried to close the valve leading to the faucet, but it was rusted shut, so he had closed the main shut-off instead. I couldn’t close the valve either, which I needed to do before I could safely turn the water back on, so I drained the pipes and took off the bottom half of the stuck valve. It was crazy old, but I figured I might be able to get a replacement for just that piece and not have to pay a plumber to solder a new shut-off into place. I spent an hour going to both Lowe’s and Home Depot, but neither of them had the part, which isn’t surprising since Eisenhower was probably president when it was installed. Luckily, one of the sales associates recommended I try some Liquid Wrench. After soaking the valve in the oil for five minutes, I was able to ease it closed, which let me shut off the line to the broken faucet and turn the water back on to the rest of the house.

Which leads me to why I love living on the East Coast. I’m the type of person who needs four seasons to be happy. The brutal, painful, freezing cold of winter makes you appreciate summer all the more. I don’t understand how anyone can live in L.A. — warm weather all the time, women walking around in short-shorts in the middle of December, seeing celebrities when you stop by the mall to buy a new nose hair trimmer. Who wants to live like that? I’ll take pipes exploding and knuckles cracking open from the cold any day, since it means I never take the Jersey shore for granted.

So I get back into work the next day, having resolved my plumbing emergency in a way that makes me feel like a true man’s man (even though all I did was close a fucking valve), and Jimmy brings me up to speed on what went down.

Our website has a Facebook integration that allows customers to link their Facebook account up to our service. Apparently, Michelle S. from Alabama had somehow managed to hammer away at our authorization endpoint 5 times per second for the better part of an hour. Before too long, this swamped memcached and brought us close to an outage, which Jimmy was able to avert by temporarily moving some data out of her account to silence the endpoint.

The nuts and bolts of what went wrong illustrate the right and wrong way to architect an endpoint that queries an external service. This particular endpoint assembles data from three different resources: our database, our memcached cluster and Facebook’s external API. There is a central design philosophy that you need to adhere to when working with this kind of system: sever all connections to your own internal resources before you open up any connection to an external API. Let’s say Facebook has issues on their end and calls to their endpoint don’t complete for 90 seconds. If you open a database connection to do some queries on your site and don’t bother to close it before issuing the API call to Facebook, the 90 seconds you spend waiting on Facebook is an extra 90 seconds that the database connection is held open. Multiply that by the number of people trying to use your Facebook integration at that particular moment. Then multiply it again by 2 or 3 to account for people getting fed up waiting and hitting refresh every five seconds to try and get the damn page to load. Before you know it, your database connections are all gone, and your customers are seeing that hilarious graphic of the nerdy-looking hamsters running around the colo cage that your design department came up with last week to display during a site outage. Who knew it would come in handy so soon?
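
Concretely, the safe ordering looks something like this (a sketch only; getDbConnection, getMemcachedConnection and callFacebookApi are hypothetical stand-ins for whatever clients your stack actually uses):

```javascript
// Sketch of the ordering, not production code: release every internal
// connection before the slow, unpredictable external call is made.
async function loadFacebookProfile(userId) {
  // Do the internal lookups first, while we hold internal connections...
  const db = await getDbConnection();
  const cache = await getMemcachedConnection();
  const account = await db.query('SELECT fb_token FROM accounts WHERE id = ?', [userId]);
  const cachedProfile = await cache.get('fb_profile:' + userId);

  // ...then sever those connections BEFORE talking to the external API.
  await db.release();
  await cache.release();

  if (cachedProfile) {
    return cachedProfile;
  }

  // If Facebook hangs for 90 seconds, the only thing left waiting is this
  // request, not a pooled database or memcached connection.
  return callFacebookApi(account.fb_token);
}
```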

A prior outage had led us to ensure our code closes the database connections before querying Facebook, but we had neglected to do the same for memcached. Still, this wasn’t the only bug that had come into play; there was also a frontend bug that had caused those requests to hammer away at us, every 200ms like clockwork. And this illustrates another important point: Sometimes, when something goes wrong, if you dig deep enough, it’s actually several somethings. And you should find and fix them all.

Here’s how frontends for authorizing a Facebook application typically work. You click a link on the site and it pops open a new window displaying Facebook’s “authorize app” page. The javascript on the page uses setInterval to register a method to run every 200ms to check if the authorization pop-up has been closed. Once you click to authorize the app, the window closes and your Facebook profile can then be accessed via an ajax call to the integration endpoint. Here is some pseudo-code I wrote that shows the general process flow:

[Screenshot: the pseudo-code for the integration frontend, shown in Aptana Studio]
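
In outline, it looks roughly like this (a simplified sketch; the jQuery calls, the renderProfile helper and the endpoint URL are illustrative rather than verbatim):

```javascript
var Namespace = Namespace || {};

Namespace.facebookIntegration = {
  interval: null,
  authWindow: null, // set by the click handler that opens the pop-up (not shown)

  init: function () {
    // Swap the "connect" button out for a loading message.
    $('#Load-Facebook-Div').html('Waiting for Facebook authorization...');

    // Poll every 200ms to see whether the auth pop-up has been closed.
    // Note that this blindly overwrites whatever interval id was stored before.
    this.interval = setInterval(this.checkWindow.bind(this), 200);
  },

  checkWindow: function () {
    if (this.authWindow && this.authWindow.closed) {
      // The customer finished (or abandoned) the auth flow; stop polling.
      clearInterval(this.interval);
      this.onAuthorized();
    }
  },

  onAuthorized: function () {
    // Fetch the newly stored Facebook profile details and render them.
    $.get('/integrations/facebook/profile', function (profile) {
      $('#Load-Facebook-Div').html(renderProfile(profile));
    });
  }
};
```

The detail that matters later is that init() stores the new interval id in this.interval without checking whether one is already there.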

There’s some code that isn’t shown that opens the new window to Facebook’s auth page when you click on a button in the #Load-Facebook-Div and then calls the above init() method. So basically:

  1. Click a button in a DIV to open the new Facebook window
  2. The init() method replaces the button with a loading message and polls every 200ms to see if the Facebook window is closed (meaning the auth happened).
  3. Once the window has been closed, the interval is canceled and an ajax request is triggered to fetch the newly loaded Facebook profile details out of the database and show them on the page

It seemed pretty obvious that the high frequency of the customer hammering on the endpoint implicated the 200ms setInterval call. But what exactly was going on? I looked closer at the code and noticed what appeared to be a minor race condition. We store the return value of the setInterval call in “this.interval” so we can cancel it later. But what would happen if a second interval were created? The reference to the first interval would be overwritten, so it would never be canceled and would keep firing forever. But how could I make that happen?
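
In other words:

```javascript
// A minimal reproduction of the buggy pattern (poll stands in for the
// window-checking callback).
const poll = () => console.log('checking whether the Facebook window is closed...');

let interval;
interval = setInterval(poll, 200); // first interval; its id is saved
interval = setInterval(poll, 200); // second call; the old id is overwritten

clearInterval(interval); // stops only the second interval;
                         // the orphaned first one keeps firing forever
```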

My first thought was that a double-click on the button might do it. Maybe the customer clicked it so fast that they snuck a second click in before the javascript could hide the button and prevent a second request. But no matter how fast I clicked on it, I couldn’t trigger this second request.

I briefly considered what would happen if a customer had two tabs open in their browser, each loading this page, and then clicked the Facebook integration in each one. But I dismissed that, since each tab would have its own “window” object and maintain state independently.

In the end, I was able to make the Ajax request fire every 200ms by first clicking the link to start up an initial interval. Then I went into the Chrome Dev Console and entered Namespace.facebookIntegration.init() to manually trigger a second setInterval that would overwrite the reference to the first. Sure enough, when I closed the Facebook window, the orphaned interval began hammering away on the endpoint.

I had proved that a lack of defensive programming on the frontend could result in a runaway interval that sent large amounts of traffic to our integration endpoint. But I still didn’t know exactly what Michelle S. in Alabama had done to uncover the bug. I’m going to assume that she didn’t de-minify our code, pore over it to find race conditions, then open the Chrome Dev Console and manually call the endpoint a second time just to fuck with us.

Even though I didn’t know exactly what caused the bug, I knew how to fix it. There are actually two ways, both sketched below:

  1. Try to cancel any pending interval before opening another one, rather than assuming it’s already been canceled
  2. Make this.interval an array so it can support tracking multiple intervals. When the Facebook page is closed, clear all the intervals, rather than just the last one
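
Here’s a rough sketch combining both approaches (the method names mirror the earlier sketch and aren’t our production code):

```javascript
var Namespace = Namespace || {};

Namespace.facebookIntegration = {
  intervals: [], // fix #2: track every interval we create, not just the last one
  authWindow: null,

  init: function () {
    // Fix #1: defensively clear anything still running, rather than assuming
    // the previous interval was already canceled.
    this.stopPolling();
    this.intervals.push(setInterval(this.checkWindow.bind(this), 200));
  },

  stopPolling: function () {
    this.intervals.forEach(clearInterval);
    this.intervals = [];
  },

  checkWindow: function () {
    if (this.authWindow && this.authWindow.closed) {
      this.stopPolling(); // clears *all* pending intervals, not just the last
      this.onAuthorized();
    }
  },

  onAuthorized: function () {
    // Fetch and render the Facebook profile, as in the earlier sketch.
  }
};
```

Either way, a stray second call to init() can no longer leave an unstoppable interval behind.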

This frontend bug is a great example of a class of bugs which I rarely encounter and find somewhat frustrating when I do: a bug that you are able to fix without fully understanding what’s causing it. It’s always better to have a complete understanding of the scope and progression of a problem to ensure that your solution adequately addresses everything that went wrong. But sometimes you simply can’t know for sure exactly what went down.

In the end, if you’re able to make your frontend code cleanly handle someone triggering hundreds of ajax lookups because a key on their keyboard is held depressed for over an hour, it really isn’t necessary to know that it was nacho cheese sauce that was holding down the key.