Don’t Let Your Buttons Overlap

While testing some changes I made to a site for work, I caught myself making a somewhat common UI/UX mistake.

Here were the original screens. They show Step 1 and Step 2 of a “Record” functionality:


Step 1

[Screenshot: Step 1 of the Record flow]

Step 2

[Screenshot: Step 2 of the Record flow]

Can you spot the problem?

The issue, which is admittedly minor, comes when users double-click the Stop button in Step 1. The first click advances the screen to Step 2 and the second click triggers the Cancel button. Most users know to only use a single click on buttons, but the capacity for users to misuse your UI can never be discounted.

One solution to the problem is to ensure there is an “Are you sure?” confirmation on the Cancel button. I already had that in place but didn’t consider it sufficient.

Another solution is to use Javascript to detect and suppress the second “half” of the double-click. But… ain’t nobody got time for that.
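For what it’s worth, a rough sketch of that approach might look like the following. The selectors, the 500ms window and the guard itself are all invented for illustration; this isn’t the production code:

// When Step 2 appears, ignore clicks on Cancel for a brief window so the
// second half of a double-click on Stop can't trigger it.
// Selectors and the 500ms window are illustrative only.
var step2ShownAt = 0;

function showStepTwo() {
    step2ShownAt = Date.now();
    // ... existing code that swaps Step 1 out for Step 2 ...
}

// Assumes this guard is bound before the real Cancel handler.
$('#cancel-button').on('click', function (e) {
    if (Date.now() - step2ShownAt < 500) {
        e.preventDefault();
        e.stopImmediatePropagation(); // swallow the stray second click
    }
});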

In the end, I tweaked the layout on the Step 1 design to push the button further down so it no longer overlaps the other button in terms of screen position:

[Screenshot: the updated Step 1 layout]

Campaign Error Logging with New Relic Insights

Several weeks ago, a project which I’d worked on for the better part of two months finally went live across all LeadiD customer campaigns. In many ways, I am prouder of this project than I am of anything else I’ve done in the ten years or so that I’ve been doing web development — in large part because the finished product is just so damn cool.

Because I can’t help myself, I’m going to jump right to the pretty pictures. By leveraging the Insights reporting backend provided by New Relic, LeadiD now has access to a full dashboard that tracks all the errors and debug data associated with our customers’ campaign scripts:

[Screenshot: the Campaign JS error dashboard in New Relic Insights]

Why is this Cool?

Thousands of businesses use New Relic or its competitors to monitor their web sites and track errors. Why is what LeadiD did so interesting?

For the answer to this, you need to look at what it is we are actually monitoring. In order for LeadiD’s core services to operate, our customers must include the LeadiD campaign javascript on their site. Simply by including this script in the footer of their site, our customers gain unique insight and visibility into the actions and motivations of their site visitors. Our javascript was written with two main areas of focus: performance and stability. Even so, millions of individual consumers execute our javascript each day, and some of those visits result in javascript errors. It is these errors that we are now monitoring.

Specifically, whenever an individual consumer triggers an error in our campaign javascript on any of the 20,000+ web domains that have partnered with LeadiD, we get a record of that error in our New Relic Insights dashboard, where the data is tracked and graphed so we can immediately see the impact as we push out new features or fix pre-existing bugs.

How Does it Work?

[Diagram: the campaign javascript logging flow]

As shown in the diagram, a customer site, AwesomeWalruses.com, has implemented the LeadiD campaign javascript. Every time a visitor encounters an error in our script (or if debug logging, which I’ll discuss later, is enabled), a record is sent back to a logging microservice running on our domain. Typically, this request is sent as an asynchronous XmlHttpRequest from the client browser and includes all the available details about the error. From there, the microservice pushes the data into New Relic Insights.

One key item to note is that regular javascript errors on AwesomeWalruses.com are not reported back to our logging system. It is only errors in our campaign javascript which are being tracked.

To get this level of granularity, much of our campaign javascript is wrapped in try/catch blocks, instead of using the alternative method of binding to the global window.onerror handler. While the latter approach can likely provide a more reliable way of capturing all errors, it is riskier and more complicated once you consider the need to filter out errors from the parent site. Try/catch blocks, however, carry their own performance considerations.
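In broad strokes, the wrapping looks something like the sketch below. The endpoint, field names and the guarded() helper are invented for illustration; the real script is structured differently:

// Wrap a unit of campaign work so any exception is reported to the logging
// microservice instead of bubbling up to the host page.
function guarded(name, fn) {
    return function () {
        try {
            return fn.apply(this, arguments);
        } catch (err) {
            var xhr = new XMLHttpRequest();
            xhr.open('POST', 'https://logging.example.com/errors', true); // async
            xhr.setRequestHeader('Content-Type', 'application/json');
            xhr.send(JSON.stringify({
                fn: name,
                message: err.message,
                stack: err.stack,
                url: window.location.href,
                userAgent: navigator.userAgent
            }));
        }
    };
}

// Usage: expose guarded versions of the library's entry points.
var safeInit = guarded('init', function init() { /* campaign work */ });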

One crucial feature which our campaign javascript now supports is a verbose debug logging setting. If this is enabled for a specific customer campaign, LeadiD receives javascript performance data from all visitors to a site, even if an error is never triggered. Because this is such a huge amount of data, the messages need to be batched together when they are sent to us and then unbatched in the microservice so each item is saved to New Relic separately. It’s interesting to note that enabling debug logging for a single customer triples our daily Insights data usage with New Relic (visible in the “Volume of Log Requests” graph above).
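Client-side, the batching is conceptually simple. A minimal sketch, with invented names and a made-up flush interval:

// Buffer verbose debug messages and flush them as one request every few
// seconds instead of firing one request per message.
var debugBuffer = [];

function logDebug(message, context) {
    debugBuffer.push({ message: message, context: context, ts: Date.now() });
}

setInterval(function () {
    if (!debugBuffer.length) { return; }
    var batch = debugBuffer.splice(0, debugBuffer.length);
    var xhr = new XMLHttpRequest();
    xhr.open('POST', 'https://logging.example.com/debug', true);
    xhr.setRequestHeader('Content-Type', 'application/json');
    xhr.send(JSON.stringify(batch)); // the microservice un-batches these into individual events
}, 5000);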

Preliminary Project Design and Initial Prototype

When the project was first conceived, the specific backend where the logging data would be stored was an open question.

LeadiD was already using New Relic for server monitoring and RayGun for application error logging. As a result, either service had the potential to be a good fit for this project. Additionally, I had previous experience with the open source project Sentry and, as a free software alternative to New Relic and RayGun, there were definite cost advantages to going with Sentry. As a second free software alternative, we considered the ELK stack (Elasticsearch, Logstash and Kibana).

With assistance from our first-rate Dev Ops team, I prototyped a backend that injected data into all the services except ELK, which we never tested due to the setup effort involved. (The prototype microservice was thankfully thrown away and reimplemented by Senior Engineer Ian Carpenter, which is what you hope for when you build a prototype.) Next, we drew up a comparison table showing each choice’s strengths:


[Screenshot: the backend comparison table from our Confluence page]

To cut to the chase, New Relic Insights was very compelling for 2 reasons:

  • The ability to run SQL-like queries against the data and graph the results with very little work
  • As a SaaS-based solution, we save large amounts of time versus writing and maintaining our own solution

One of the interesting realizations we had when comparing the chart above with our “technical requirements” was that error data and logging data are fundamentally two different things. Three of the four alternatives are more directly geared toward error data than storing normal application logging data, but New Relic Insights (which is essentially a NoSQL database masquerading as a relational one) proved itself able to be molded to fit both error messages and logging data. A few of the negatives to New Relic noted in the table ended up not mattering after all:

  • The 256-character limit on tag length is mitigated by simply spreading your data across multiple fields
  • While we relinquish some control by having our error and logging data stored in New Relic’s system, the Insights API allows integrations with this data. (I even wrote a script called insights2csv to easily pull down our data so we can import it into PostgreSQL and elsewhere)

The Results

LeadiD now has a system that allows us to monitor in real time the impact of our client-side releases. This enables us to watch a new version of our javascript as it goes live across all our customer sites and compare the volume of errors thrown by the new version against the historical norm from prior versions. If we begin seeing more errors than expected for the new version, we can immediately roll back the deployment and begin digging into the logging data itself to figure out what went wrong.

To this end, we can easily pull the data down from New Relic Insights and feed it into other databases to enable more advanced queries. This also facilitates resource sharing with our Data Science department, who can now use the data about which consumers encountered errors to exclude those problematic records from the datasets they use to train machine-learning and scoring models.

Several specific examples of how LeadiD Engineering is using the Insights data are visible in the dashboard graphs above. We are now able to identify specific customer sites which are sending us high volumes of error data. It is far better for a company to proactively identify issues and reach out to its customers than it is to fix the problem only after customers have noticed and complained. LeadiD is now in a position to notice any such errors before our customers do and before they have a chance to cause a major negative impact on our products.

By partnering with client sites to run trials of the verbose debug logging, we have already identified several potential improvements to our libraries. Being able to follow the breadcrumbs left by the debug logging is an essential tool in keeping our scripts up-to-date with the latest browser APIs and in ensuring smooth co-existence with the various javascript frameworks in use today. Without this logging ability, it is an order of magnitude more difficult to even recognize that something is wrong, let alone figure out where in the code the problem may reside.

All in all, the finished Campaign JS Logging system has met or exceeded all the project goals. I had a lot of fun developing the finished solution and it continues to provide value to LeadiD and our customers on a daily basis.

MS-7352 Manual

[Photo: the MSI MS-7352 motherboard]

I just finished one of the trickiest computer builds I’ve ever done. The main annoyance was that the motherboard I bought doesn’t have a manual (since it’s an OEM board, no manual exists). I wanted to document everything I’ve learned about this motherboard for posterity.

First some info on the motherboard so this post shows up in search engines: MSI HP Intel Q33 Socket 775 mATX Motherboard manual. Also called: MSI MS-7352 manual and installation. The motherboard itself was apparently used exclusively in Hewlett-Packard desktops, which explains why no manual exists anywhere online.

I bought the motherboard as a replacement for the one in my PC, which had died. Since it was only $40 I thought it was a deal (and it was), but it is an extremely temperamental mobo and was very difficult to get installed and working. Thankfully there were some very helpful resources online, so I wanted to make this blog post to gather all the relevant info on this mobo in one place.

The initial problems I had installing this were some combination of the following:

  • It won’t boot without RAM and a CPU. When I say “it won’t boot” here and below, I mean it shuts off immediately without going into the BIOS or giving any indication that it isn’t simply broken. It doesn’t show anything on the monitor and it doesn’t send power to your USB devices (although the fans will briefly spin).
  • It won’t boot if the RAM is in slot 4 instead of slot 1. The slots are numbered on the mobo so make sure you’re plugged into slot 1 with a single stick first.
  • It won’t boot if the case power switch is hooked up incorrectly, which is easy to do since it’s mislabeled on the board itself
  • It won’t boot if the CPU fan isn’t plugged in
  • It won’t boot if the case fan isn’t plugged in
  • It won’t boot if the CPU power connector on the mobo isn’t connected
  • (Potentially) it won’t boot if the CPU fan/heatsink isn’t secured properly or isn’t compatible
  • There’s no feedback when it won’t boot to tell you what’s wrong, except maybe a few beeps from the speaker, but there’s no manual to tell you what the beeps mean and the 2-wire connector on the mobo was incompatible with my case’s 4-pin speaker anyway. So I had nothing to go on.

Case Power Switch Hook-Up

I’m pretty sure the labels on the mobo for where to put the various switches are wrong. I used this config instead and everything seems to work great. Credit due to http://www.mxdatarecovery.net/msi-ms-7525-bostongl6-motherboard-power-switch-jumper-settings.html

[Diagram: MS-7525 front-panel power switch and HD LED connector layout]

CPU Issues

I tried to use an Intel heatsink that came with my Core 2 Duo, but the snap-based connectors didn’t fit the holes in the mobo. I tried holding it in place just to get it to boot, but that didn’t work, so I think it may have been electronically incompatible with the hardware check (although the fan did power on). Anyway, this was the heatsink I eventually bought that worked perfectly. It comes with its own thermal paste and also a backplate that it’s supposed to screw into, but I ended up not needing the backplate since the threads on the fan screws matched the threaded holes in the mobo. Don’t overtighten it though.

Fan Issues

As mentioned above, the CPU fan must be compatible, and both the fan power connector and the CPU power plug on the mobo must be plugged in.

Additionally, the case fan must be plugged in and working or the system will refuse to boot. This was a problem since my fan needs a 3-pin connector and the mobo only provides 2-pin. I bought this adapter, which lets me send power to my system fan via my power supply. This bypasses the mobo connector, however, so the mobo still thinks there’s no fan and shuts itself off with a broken-fan warning after briefly showing the BIOS. There may be a better adapter than the one I used which lets you actually use the 2-pin connector on the mobo.

What I ended up doing, however, was disabling the fan hardware check in the BIOS, but it’s tricky to do. As the computer boots you need to continually hit Control-F10 until you get into the BIOS screen where you can access the Hardware Monitor / Fan Check options and disable all of them.

Credit (and full instructions for disabling the Hardware Monitoring) goes to http://retrohelix.com/2012/09/how-to-fix-the-f2-system-fan-error-on-some-hp-computers/

Success!

Once I addressed all the above issues over the course of about 4 hours, I finally got the motherboard to boot into my preexisting install of Ubuntu and everything worked great. I was happy that Ubuntu had no problem with me switching to a completely different make and model motherboard and that the HP MSI MS-7352 eventually worked as expected. The only curve balls were the need to buy a new heat sink, that fan connector converter and the 4 goddamn hours it took me to figure all this shit out.

Connection Management

In my experience, an often-overlooked principle of endpoint design is minimizing connection use. Avoid connecting to an external resource at all if you can; if you really do need to connect, connect for as short a period of time as possible; and when you do connect, don’t have anything else connected at the same time.

Avoid Connecting to External Resources

If you have a system that rejects requests because they’re invalid, request validation should happen before you connect to the DB so that invalid requests never even trigger a DB connection.

Similarly, don’t open a DB connection on ALL requests just because some percentage of requests need to execute a particular operation. Wait until you actually need to perform the operation to open the connection.

For other use cases, can you cache something locally and periodically invalidate the cache to avoid touching the DB on every request? In other words, does your application really need real-time data or is some level of delay acceptable so you can implement caching?
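As a rough illustration of the last two points, here’s a Node-flavored sketch. getDbConnection() is a stand-in for whatever your data layer provides, and the 60-second TTL is arbitrary:

// Serve reads from a local cache and only touch the DB when the cached copy
// has gone stale. Invalid requests never reach this function at all.
var cache = { value: null, fetchedAt: 0 };
var TTL_MS = 60 * 1000; // how stale is acceptable for this data

function getSettings(callback) {
    if (cache.value && Date.now() - cache.fetchedAt < TTL_MS) {
        return callback(null, cache.value); // no DB connection at all
    }
    var db = getDbConnection();             // opened only when actually needed
    db.query('SELECT * FROM settings', function (err, rows) {
        db.close();                         // closed before we do anything with the data
        if (err) { return callback(err); }
        cache.value = rows;
        cache.fetchedAt = Date.now();
        callback(null, rows);
    });
}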

Connect as Briefly as Possible

An anti-pattern in PHP is to open a DB connection immediately then let PHP close it for you when the page load is finished. If you do this, you are keeping the DB connection open for longer than you are actually using it, meaning you may need more total connections to support your app.

A best practice is to wait until you need a connection, then open it, run your queries and immediately close the connection before proceeding to actually use the data you fetched.

Another, more debatable piece of advice is to do data transformations and joins in your application code instead of in your DB queries. Again, this will make things easier on your database by distributing work to your (cheaper) application servers. But the cost in complexity and some level of waste may make this a non-starter.

Don’t Overlap TCP Connections

Let’s say you have 1000 available MySQL connections and 50 available memcached connections and you keep them both open at the same time. You now have a situation where a slow-running MySQL query results in memcached connections being held open and your system goes down once your 50 memcached connections are swamped (even though you potentially had sufficient MySQL connections to actually handle the slow query, had the connections not overlapped).

This also holds true for other types of connections, such as Curl requests to external APIs or messaging systems like RabbitMQ or Kafka. I once saw slow-running requests to Facebook’s API cause my company’s memcached connection pool to be exhausted because the two were allowed to overlap.
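The fix is mostly a matter of ordering: finish with (or release) the scarce resource before you start the slow external call. A hedged sketch with invented helper names:

// Grab what you need from memcached, release the connection, and only then
// make the slow external call. memcachedGet, release and callFacebook are
// illustrative stand-ins, not real client APIs.
function handleRequest(userId, callback) {
    memcachedGet('fb_token:' + userId, function (err, token, release) {
        release(); // hand the memcached connection back before the slow part

        if (err) { return callback(err); }

        callFacebook(token, function (fbErr, profile) {
            // Nothing scarce was held open while we waited on Facebook.
            callback(fbErr, profile);
        });
    });
}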

Final Thoughts

Connection pooling solutions can help reduce the latency of opening new connections but this doesn’t alter the inherent danger of depleting the maximum number of allowable connections within the pool, so much of the above still applies.

One thing that may turn the advice above on its head, however, is the move to multi-threading, asynchronous requests and the concepts of Actors and Futures in newer languages like Scala. In these environments, I can envision scenarios where you may want to preemptively fetch some data on a separate thread before you’re sure you need it because the cost in connection usage and the waste of sometimes getting data you don’t need is a fair price to pay for a modest decrease in average response time.

The ability to do two things at once turns all the advice above on its head. The times they are a’changing.

——————————————————-

Comment on reddit here.

food != food

I ran into this issue at work several weeks back and figured I’d drop a quick blog post about it, as it’s kinda nifty.

Some string matching code in our backend was reporting mismatches on identical text. I found an example of the text and played around with it in console and was able to isolate it to the following:

[Screenshot: the console showing the two “food” strings comparing as not equal]

This was a bit of a headscratcher. If you don’t believe me, copy and paste this code into console and try it yourself:

'food​' == 'food'

So… can you figure out what’s going on here?

Go ahead. I’ll wait…

It took me a hot minute to suss it out, but the thing that clued me in was using the arrow keys to step through the string. You’ll notice that it takes an extra keypress to navigate between the letter d in the first “food” and the closing quote.

It turns out there was a unicode character there called the zero width space. This character exists on the page and will come into play when doing string comparisons, but is effectively invisible.
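Once you know the culprit, you can reproduce it (and defend against it) without relying on invisible characters in your source:

var tricky = 'food' + '\u200B';            // the invisible culprit
tricky == 'food';                          // false
tricky.length;                             // 5
tricky.charCodeAt(4).toString(16);         // "200b", i.e. ZERO WIDTH SPACE

// Stripping zero-width characters before comparing fixes the mismatch:
tricky.replace(/[\u200B-\u200D\uFEFF]/g, '') == 'food';   // true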

This got me thinking… I know there are quite a few characters that appear identical but have different code points. So there might be an ê with a little hat on it in one language that appears identical to the ê in a different language, but they are technically different unicode characters and thus not at all the same.

So, if we can have 2 words that look the same but are different, can we have 2 words that look different but are the same?

This was the best I could come up with:

[Screenshot: the Unicode character reference page for RIGHT-TO-LEFT OVERRIDE (U+202E)]

There is a unicode character called the right-to-left override which basically opens a portal to Hell by turning all the text appearing after it backwards. Go ahead and paste this into your console and play around with it a little. But make sure to step away before you shoot yourself:

'‮food‮' == '‮food‮'

It turns out this character has some interesting security implications, as it can be used to manipulate URLs to make certain phishing attacks easier.

Shortening the Feedback Path

A subconscious best practice I’ve noticed in the world of web development is the drive to shorten the feedback path. This is best illustrated by an example.

Let’s pretend that a web form has various collapsible sections. The top section is expanded by default, pushing the bottom (collapsed) section down below the fold. Once the bottom section is expanded, there is a link that when clicked opens up a modal date-picker.

If you don’t optimize things, your workflow may look something like this:

  1. Make some changes to the date picker javascript
  2. Refresh the page
  3. Click to collapse the top section, bringing the bottom section into view
  4. Click to expand the bottom section
  5. Click the link to show the date picker
  6. Test your change

Obviously, this is a somewhat sub-optimal workflow that can grow quite painful quickly. Luckily, there are a number of different approaches to streamlining things a bit.

The Best Best Practice: Tests

The ideal situation for developing your date picker is to do it in a test-driven manner.

If you can make the case that tests are worth doing (and they usually are), they’re definitely the way to go. This is especially true if you are writing a library or a widget that is intended to be used in numerous different places. If our fictitious date picker is something we are writing from scratch, even if it is only likely to be used on this one page, the best way to go about it is to treat it as a widget with its own test suite and develop it in isolation from the rest of the page.

Tests provide immediate feedback. When you make a change to your date picker, you can have your tests auto-run in a separate window and get instant gratification. This completely side-steps the painful process of testing a widget that is hidden deep in the interactive guts of a complicated form.

The idea of doing test-driven development in javascript only became a reality fairly recently. There are many different combinations of javascript libraries that can be used to facilitate this, but the ones I find myself using most frequently are:

  • Jasmine: Provides a BDD-based testing framework
  • PhantomJS: Enables headless browser testing. You can run meaningful tests against a WebKit-based browser solely from the commandline
  • Grunt: A task-runner to tie all the pieces together
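With that stack in place, a minimal spec for our hypothetical date picker might look like this (the DatePicker API here is invented purely for illustration):

// spec/datePickerSpec.js -- run headlessly via Grunt + PhantomJS
describe('DatePicker', function () {
    var picker;

    beforeEach(function () {
        // Work against a detached fixture instead of the real, deeply nested
        // form: no collapsing sections, no modals, no clicking.
        picker = new DatePicker($('<div class="picker-fixture"></div>'));
    });

    it('defaults to today', function () {
        expect(picker.getDate().toDateString()).toEqual(new Date().toDateString());
    });

    it('rejects malformed input', function () {
        expect(picker.setDate('not-a-date')).toBe(false);
    });
});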

Sometimes You Can’t Do What You Should

I wouldn’t quite call unit tests for javascript a luxury for the same reason a seatbelt isn’t a luxury. Actually, a better analogy is probably an air bag in the early-90s. All the nice cars had them and everyone knew air bags are the way to go and something you should have in your car. But you spent all summer working at Dairy Queen for $4.65 an hour and only saved up enough money to buy an ’82 Buick Regal with white leather seats, an FM radio and no air bags.

And so, sometimes we find ourselves debugging a date picker without recourse to tests.

In order to avoid the painful multi-step process of refresh, click, scroll, then click again to see the results of our change, there are a couple things we can do.

Automate the Clicking

Rather than forcing yourself to click on the page elements each time, figure out what the equivalent jQuery code is to trigger these elements and then automate it.

The code might look something like this:

$('.collapsible-header:first').click(); // hide the top section
$('.collapsible-header:nth-child(3)').click(); // show the bottom section
$('#dingleberry-form a.picker-trigger').click(); // show the date picker

There are a couple of ways you can trigger this code. The easiest is to simply type it into the javascript console as a one-liner and trigger it manually. You can then use “arrow-up” to get to it after each page refresh. This isn’t a bad way to go, but we can do better.

Another option is to surround it with a $(function() { }) so it will be called onload and put it into our page’s JS code somewhere. This completely automates it with only a single drawback: you may forget to remove it later. Add a “TODO: Remove this shit” comment above the code to help draw attention to it later.
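Wrapped up, the whole thing is just:

// TODO: Remove this shit -- dev-only shortcut straight to the date picker
$(function() {
    $('.collapsible-header:first').click(); // hide the top section
    $('.collapsible-header:nth-child(3)').click(); // show the bottom section
    $('#dingleberry-form a.picker-trigger').click(); // show the date picker
});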

Or even better, install dotjs or a similar extension so you can execute arbitrary javascript on your pages without actually putting it in your versioned, “production” files. But at that point, maybe you should just write a test.

Keep Shit Modular

Another way to handle hiding the section you don’t want to see is to simply delete it off the page temporarily. If you’ve kept your code modular and concise then each page section will be its own template that you can suppress by commenting out a single line.

This technique proved invaluable to a page redesign I’m currently working on. There are 3 sections to the page and I placed each section in its own template. When I need to make a change to one of these sections, I can easily comment out the 2 lines in the parent template that include the sections I don’t want to see.

This technique really makes me appreciate jQuery’s complacent attitude when something doesn’t exist — e.g., $('.jacobs-ladder').css('height', '10000px') won’t throw any errors if there are no such elements on the page.

Auto-Refresh the Page

Manually clicking the refresh button on the browser is for suckers. There are a couple good options for automating this:

  • grunt-contrib-watch: A Grunt plugin for directory watching. A great option even for non-javascript projects
  • Tincr: A chrome extension for auto-updating JS/CSS that’s been changed without needing a full-page reload
  • fswatch: A general purpose solution that lets you trigger directory monitoring on-the-fly.
  • reload.sh: A shell script written by yours truly. It is basically a wrapper around fswatch that lets you specify defaults and has added support for triggering browser reloads on OS-X.

One interesting aside is I’ve noticed that none of these solutions (even mine) really give me exactly what I want. Sometimes the detection of when a file has changed is slow and I’m left staring at the browser waiting on it to pick up the change, until I get frustrated and click it myself (thus defeating the purpose of the directory monitoring)*. Other times I will save a file and I won’t want it to refresh the browser. Or I’ll save 5 files and only want a refresh on the last save. Still, it’s worth it for those times when I make a big change then switch over to my browser and the refreshed code is there waiting for me.

*One notable case where changes take a long time to be detected occurs when VMs are in the mix. I’ve noticed that if you are monitoring a very large directory on a VM which is using a shared file-system such as NFS, changes can take tens of seconds to be detected. A great workaround is to monitor the large directory natively (outside of the VM), where directory monitoring performs well. When a change is detected, jiggle a file in a small, otherwise empty directory and have the VM monitor that instead. This generally performs much better.

Case Study: Writing reload.sh

I wrote reload.sh because I wanted to be able to set up browser reloading on-the-fly without needing to configure a Gruntfile. Also, it was a great excuse to write something in bash. My process for writing the script is itself illustrative of several steps I took to shorten the path to feedback.

While it would have been ideal to write tests for reload.sh then use it to trigger tests against a separate instance of itself, this wasn’t to be. This is unfortunate as it would’ve been super-meta to be able to use this script on itself. Instead, I opted to simply have it execute an “echo foo” when a file changed so I could see it was working. This initially resulted in the following workflow during development:

  1. Make a change to the script
  2. Click into a console window and run reload.sh (ask it to monitor /tmp/ and “echo foo” when it detects a file change)
  3. Click into a second console window and “touch /tmp/myself”. Watch for the output on console 1.
  4. Click back into the original console window and hit control-c to exit the script so I’m ready for my next change

This proved cumbersome but I quickly hit on a way to remove steps 3 & 4. I wrote a separate shell script to touch myself and then kill myself so I wouldn’t have to do it by hand:

touch /tmp/myself ; sleep 2; pkill -f reload.sh

The only downside to this workflow was that because reload.sh ran in a loop, I had to invoke my second script from within my first script (which meant I had one line in my script that I’d have to remove or comment out prior to committing it back into version control).

In Conclusion

Shortening the feedback path may not always save you time, but it will save your sanity. Being able to immediately see the results of your work creates a much more pleasant experience. Use automation and whatever hacks you need to keep yourself from needing to click on anything. You’ll be better off for it.

 

Conway’s Game of Life

Several months ago I was interviewing with various companies, primarily for senior web developer roles. In total I probably spoke with just under a dozen companies. Some had phone screens and some had in-person interviews. Some had written tests and some used white boards. My favorite of all the different interview approaches was the companies that asked me to build something. And of the companies that asked me to build something, my favorite was being asked to build Conway’s Game of Life (wikipedia link).

[Screenshot: my Conway’s Game of Life implementation]

Before I get into Conway’s Life (the main thrust of this blog post), let me briefly comment on why open-ended assignments are the way to go when it comes to job interviews:

  • Do you want to see how well someone can bullshit? Or do you want to see how well they can code?
  • Assigning a project measures not only technical prowess, but dedication and interest in the position
  • Instead of assessing how well someone will do on tasks tangentially related to the position, why not see how well they’ll actually do with something more real-world?
  • Best of all, it’s fun

Of the three interviews in which I was given projects, two were open-ended. Of the two, my favorite one was this: “Build an implementation of Conway’s Game of Life”.

In a nutshell, Conway’s Life is a programming exercise built around a grid of cells, each of which is either on or off. On each “turn”, the number of adjacent cells that are on determines each cell’s fate: a live cell stays on with two or three live neighbors, and a dead cell turns on with exactly three. This simple logic can result in fairly complicated patterns emerging from an extremely simple starting point.
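The heart of any implementation is just counting neighbors each turn. Here’s a stripped-down version of the rule (not my actual life class, which tracks state differently):

// One generation of Conway's Life over a 2D array of 0/1 cells.
// Cells outside the grid are treated as dead.
function nextGeneration(grid) {
    return grid.map(function (row, y) {
        return row.map(function (cell, x) {
            var neighbors = 0;
            for (var dy = -1; dy <= 1; dy++) {
                for (var dx = -1; dx <= 1; dx++) {
                    if (dx === 0 && dy === 0) { continue; }
                    var r = grid[y + dy];
                    if (r && r[x + dx]) { neighbors++; }
                }
            }
            // Live cells survive with 2 or 3 neighbors; dead cells are born with exactly 3.
            if (cell) { return (neighbors === 2 || neighbors === 3) ? 1 : 0; }
            return neighbors === 3 ? 1 : 0;
        });
    });
}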

My implementation was written in Javascript and HTML.

Check it out by visiting this page and clicking Play.

Here are a few interesting points about my design:

  • I split the design into two modules. The life class implements an internal state of the Life board and the lifedom class handles exposing this state via the DOM.
  • The animation in lifedom works by toggling the background of the table cells on and off. Originally, the animation was so fast that I actually needed to slow it down (via setTimeout) to create a more aesthetically appealing effect. I added a slider that lets you slow down or speed up the animation by changing the value of the timeout.
  • I added unit tests. They make sense for the life class, less so for lifedom, so I got lazy as I added various bells and whistles to the latter.
  • Because I originally wanted tests for lifedom, it led me to a regrettable design decision: there is HTML in the javascript. I wanted to avoid duplication and only define the HTML of the game board once. Generally speaking, there are 3 ways to do this:
    • Use PHP to define an HTML template to use in both the unit tests and the application. Pro: Easy. Con: Requires PHP.
    • Use Javascript for the template. Pro: Easy. Con: Ugly. Clunky.
    • Use Grunt to manage different builds for test versus production. Pro: Best pure javascript solution. Con: More complicated than the alternatives.

Performance

For me, the most interesting part of this project came shortly after I finished my first implementation. I decided to check it out on my 1st generation iPad Mini. The animation looked horrible. The “frame rate” was way too low and it looked visibly underpowered and stuttered. I tried reducing the timeout to try and speed things up, but this didn’t help: entire frames would be dropped and it still looked like shit.

It’s funny because I knew immediately how this would go down:

  • I’d first trust my gut and try a quick fix
  • This wouldn’t work so I’d get more analytical and go through looking for additional optimizations
  • I would once again fail and resort to replacing the animation with canvas as a last ditch effort to get something that looked decent. I had no idea if this would work.

My first attempt at better performance was to replace .toggleClass('on off') with direct manipulation of the backgroundColor property. My rationale was that this would avoid CSS repaints and perform better. I also added some caching of which cells were “on”, which avoided a jQuery find() operation. You can see the github changeset here.

I added some basic benchmarking to measure the impact (or lack thereof) of this change. I created a running counter of “turns per second” which turned out to be a great benchmark. The less the animation slowed down the program, the higher the turns per second would be.

To measure each change, I used the R-Pentomino pattern and grabbed my benchmark at turn 150, which represents a peak in the number of enabled cells (and a consequent steep drop in animation performance on the iPad Mini). By comparing the Baseline/Control stats against a Test Run with all the animation suppressed (such that each turn is calculated in the life class but never displayed on-screen), you can see how much the animation itself actually slows things down.

Baseline/Control Stats

  • Chrome on Macbook: 16tps
  • iOS Simulator on Macbook: 16tps
  • iPad Mini (1st generation): 5tps
  • iPhone 5S: 14tps

Animation Disabled

  • Chrome on Macbook: 19tps
  • iOS Simulator on Macbook: 19tps
  • iPad Mini (1st generation): 17tps
  • iPhone 5S: 18tps

Next, we see the impact of my first attempt at improving the animation performance by replacing toggleClass('on off') with backgroundColor. It helped, but not enough.

Classes Removed

  • Chrome on Macbook: 18tps
  • iOS Simulator on Macbook: 19tps
  • iPad Mini (1st generation): 7tps
  • iPhone 5S: 16tps

I tried some additional optimizations which didn’t really do much:

  • Instead of turning the entire board off and then turning cells on, figure out which ones should remain on and don’t turn them off in the first place: commit.
  • Keep the internal array state of life sorted so we iterate less: commit.
  • Replace document.getElementById with a cache so it only gets called the first time: commit. This last change made a tiny difference, at least on the iPhone.

Cached DOM Lookups

  • Chrome on Macbook: 18tps
  • iOS Simulator on Macbook: 19tps
  • iPad Mini (1st generation): 7tps
  • iPhone 5S: 17tps

At this point, I knew my only hope of getting acceptable performance would be to drop the idea of animating the table altogether, swap out the table for a canvas element when the user clicks Play and use that for the animation (changeset here). The results were pretty interesting and gratifying. Most importantly, it gets performance to an acceptable level on the iPad Mini.

Canvas Animation

  • Chrome on Macbook: 18tps
  • iOS Simulator on Macbook: 19tps
  • iPad Mini (1st generation): 11tps
  • iPhone 5S: 16tps

By reducing the timeout value, I’m able to get the turns per second higher on an iPad without dropping frames. Note how it actually performs slightly worse on an iPhone 5S. Even on my iPad Mini, canvas actually starts out worse than the table solution (peaking at 17tps versus 18tps) before eventually overtaking it by turn 150.

If I were motivated to work on performance more, I might consider coding something that turns the timeout value down dynamically to respond to a low tps value. This would allow a consistent animation speed on both overpowered and underpowered devices — you would set the desired tps, rather than setting the timeout interval.
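A rough sketch of that idea, with invented names (this isn’t in the repo):

// Nudge the animation delay toward a target turns-per-second instead of
// hard-coding the setTimeout interval.
var targetTps = 15;
var delayMs = 50;

function scheduleNextTurn(game) {
    setTimeout(function () {
        var started = Date.now();
        game.runTurn();
        game.draw();

        var actualTps = 1000 / (Date.now() - started + delayMs);
        if (actualTps < targetTps) {
            delayMs = Math.max(0, delayMs - 5); // underpowered device: shrink the delay
        } else if (actualTps > targetTps) {
            delayMs += 5;                       // overpowered device: pad it back out
        }
        scheduleNextTurn(game);
    }, delayMs);
}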

What’s Next

I’m happy with the performance aspect of my program at this point. I might eventually add some drag and drop functionality so you can drag gliders, lines, pentominos, etc. onto the game board to more easily build complex patterns. I also toyed with the idea of adding different colors to the animation such that areas of the game board which are more busy would be colored darker than the parts which are burned out.

If you’re bored, feel free to fork my implementation, add something cool and send me a pull request.

Runaway Intervals, Stuck Shut-Off Valves and Nacho Cheese

 

I had a pretty interesting bug come up at work a couple days ago. The bug itself and the associated events illustrate several things: runaway setIntervals, why I love living on the East Coast, why you should sever connections to internal resources before hitting an external API and, most of all, the sheer inventiveness of end users in finding any opportunity, however esoteric, to bring down your system.

Let’s start with the last one. This is one of my favorite things about where I currently work. I have the privilege of working for a company with a very large user base. Out of a total of over 100,000 customers, at any moment several thousand are logged into our site and cruising around, doing what they do. When you get into that large a number of people constantly hammering on your code, it tends to turn up really weird, obscure bugs that you never in a million years would have uncovered on your own, no matter how much testing you do pre-release. Here’s a contrived example…

Let’s say you have a contact form where a customer types a review of one of your products and you have some javascript that monitors what they type and periodically fires off Ajax requests. You may have tested it six ways to Sunday, using unit tests, integration tests, focus groups, the works. But once you unleash it into the wild, if you have a big enough install base, eventually you get someone like Adrian. Adrian is a Norwegian expatriate living in Japan who wants to enter a negative review of one of your nose hair trimmers. He’s seriously pissed off because that thing HURTS. While he types out his caustic review, he’s eating Nachos and drinking a 2010 Simi Valley merlot. He’s on his third glass and feeling a little tipsy. Before you know it, Adrian passes out after dripping Nacho cheese all over his keyboard, making it get stuck in a loop while typing the Japanese characters for pain (I was going to include them here parenthetically but WordPress converts them to question marks, so I’ll have to describe them for you instead. The first character looks like a fat guy in a bus stop; the second one is a kind of cursive capital H). Anyway, your javascript doesn’t like the fact that these keys have been depressed for over an hour and bad things happen. Now ask yourself… how could you possibly have anticipated this? Why would anyone ever intentionally keep a key down that long? Plus, what kind of person mixes red wine with nachos? I mean, it’s common knowledge you go with white wine when eating fish or cheese.

Anyway, it was someone like Adrian who almost caused a site outage for us last week. We use memcached as a caching mechanism on our site, and our Dev Ops guys noticed a climbing number of connections from a couple of our servers. I started to look into the issue, but then my wife called me and told me a pipe had burst at our house. We live near Philadelphia and we’d had single-digit temperatures for a couple days straight, causing a short piece of pipe on the outside of our house, leading to a hose faucet, to freeze up and burst. I knew Dev Ops had the situation under control so I cleared it with my boss and headed home to deal with my plumbing issue.

A helpful neighbor had alerted my wife to the problem; he’d then come inside and tried to close the valve leading to the faucet, but it was rusted shut, so he had closed the main shut-off instead. I couldn’t close the valve either to be able to safely turn the water back on, so I drained the pipes and took off the bottom half of the stuck valve. It was crazy old but I figured I might be able to get a replacement for just that piece and not have to pay a plumber to solder a new shut-off into place. I spent an hour going to both Lowe’s and Home Depot, but neither of them had the part — not surprising since Eisenhower was probably president when it was installed. Luckily, the one sales associate recommended I try some liquid wrench. After soaking the valve in the oil for five minutes, I was able to ease it closed, letting me close off the line to the broken faucet and turn the water back on to the rest of the house.

Which leads me to why I love living on the East Coast. I’m the type of person who needs four seasons to be happy. The brutal, painful, freezing cold of winter makes you appreciate summer all the more. I don’t understand how anyone can live in L.A. — warm weather all the time, women walking around in short-shorts in the middle of December, seeing celebrities when you stop by the mall to buy a new nose hair trimmer. Who wants to live like that? I’ll take pipes exploding and knuckles cracking open from the cold any day, since it means I never take the Jersey shore for granted.

So I get back into work the next day, having resolved my plumbing emergency in a way that makes me feel like a true Man’s man (even though all I did was close a fucking valve) and Jimmy brings me up to speed on what went down.

Our website has a Facebook integration that allows customers to link their Facebook account up to our service. Apparently, Michelle S. from Alabama had somehow managed to hammer away at our authorization endpoint 5 times per second for the better part of an hour. Before too long, this swamped memcached and brought us close to an outage, which Jimmy was able to avert by temporarily moving some data out of her account to silence the endpoint.

The nuts and bolts of what went wrong illustrates the right and wrong way to architect an endpoint that queries an external service. This particular endpoint assembles data from three different resources: our database, our memcached cluster and Facebook’s external API. There is a central design philosophy that you need to adhere to when working with this kind of system: sever all connections to your own internal resources before you open up any connection to an external API. Let’s say Facebook has issues on their end and calls to their endpoint don’t complete for 90 seconds. If you open a database connection to do some queries on your site and don’t bother to close it before issuing the API call to Facebook, that 90 seconds that you’re waiting on Facebook is an extra 90 seconds that that database connection is being held open. Multiply that by the number of people trying to use your Facebook integration at that particular moment. Then multiply it again by 2 or 3 to account for people getting fed up waiting and hitting refresh every five seconds to try and get the damn page to load. Before you know it, your database connections are all gone, and your customers are seeing that hilarious graphic of the nerdy looking hamsters running around the colo cage that your design department came up with last week to display during a site outage. Who knew it would come in handy so soon?

A prior outage had led us to ensure our code closes the database connections before querying Facebook, but we had neglected to do the same for memcached. Still, this wasn’t the only bug that had come into play; there was also a frontend bug that had caused those requests to hammer away at us, every 200ms like clockwork. And this illustrates another important point: Sometimes, when something goes wrong, if you dig deep enough, it’s actually several somethings. And you should find and fix them all.

Here’s how frontends for authorizing a Facebook application typically work. You click a link on the site and it pops open a new window displaying Facebook’s “authorize app” page. The javascript on the page uses setInterval to register a method to run every 200ms to check if the authorization pop-up has been closed. Once you click to authorize the app, the window closes and your Facebook profile can then be accessed via an ajax call to the integration endpoint. Here is some pseudo-code I wrote that shows the general process flow:

[Screenshot: the pseudo-code, viewed in Aptana Studio]

There’s some code that isn’t shown that opens the new window to Facebook’s auth page when you click on a button in the #Load-Facebook-Div and then calls the above init() method. So basically:

  1. Click a button in a DIV to open the new Facebook window
  2. The init() method replaces the button with a loading message and polls every 200ms to see if the Facebook window is closed (meaning the auth happened).
  3. Once the window has been closed, the interval is canceled and an ajax request is triggered to fetch the newly loaded Facebook profile details out of the database and show it on the page
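In rough strokes, the pseudo-code looks something like this (the names are approximations, not our actual production code):

// Approximate reconstruction of the integration flow described above.
Namespace.facebookIntegration = {
    interval: null,
    fbWindow: null,

    init: function () {
        var self = this;
        $('#Load-Facebook-Div').html('Waiting for Facebook authorization...');

        // Poll every 200ms to see whether the auth pop-up has been closed.
        this.interval = setInterval(function () {
            if (self.fbWindow && self.fbWindow.closed) {
                clearInterval(self.interval);
                self.loadProfile();
            }
        }, 200);
    },

    loadProfile: function () {
        // Fetch the freshly stored Facebook profile from our endpoint.
        $.get('/integrations/facebook/profile', function (profile) {
            $('#Load-Facebook-Div').html(profile.html); // response shape is illustrative
        });
    }
};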

It seemed pretty obvious that the high frequency of the customer hammering on the endpoint implicated the 200ms setInterval call. But what exactly was going on? I looked closer at the code and noticed what appeared to be a minor race condition. We are storing the return value of the setInterval call in “this.interval” so we can cancel it later. But what happens if a second interval were created? The reference to the first interval would be overwritten and it would never be canceled and would keep firing forever. But how could I make this happen?

My first thought was that a double-click on the button might do it. Maybe the customer clicked it so fast that they snuck a second click in before the javascript could hide the button and prevent a second request. But no matter how fast I clicked on it, I couldn’t trigger this second request.

I briefly considered what would happen if a customer had two tabs open in their browser, loading this page twice, and then clicked the Facebook integration on each one. But I dismissed that, since each tab would have its own “window” object and maintain state independently.

In the end, I was able to make the Ajax request fire every 200ms by first clicking the link to start up an initial interval. Then I went into Chrome Dev Console and entered Namespace.facebookIntegration.init() to manually trigger a second setInterval that would overwrite the reference to the first. Sure enough, when I closed the Facebook window, the orphaned interval began hammering away on the endpoint.

I had proved that a lack of defensive programming on the frontend could result in a runaway interval that sent large amounts of traffic to our integration endpoint. But I still didn’t know exactly what Michelle S. in Alabama had done to uncover the bug. I’m going to assume that she didn’t de-minify our code, pore over it to find race conditions, then open Chrome Dev Console and manually call the endpoint a second time just to fuck with us.

Even though I didn’t know exactly what caused the bug, I knew how to fix it. There are 2 ways actually:

  1. Try to cancel any pending interval before opening another one, rather than assuming it’s already been canceled
  2. Make this.interval an array so it can support tracking multiple intervals. When the Facebook page is closed, clear all the intervals, rather than just the last one
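Using the sketch above, the first fix is just a little added defensiveness in init():

init: function () {
    var self = this;

    // Fix #1: never assume the previous interval was already canceled.
    if (this.interval) {
        clearInterval(this.interval);
        this.interval = null;
    }

    this.interval = setInterval(function () {
        if (self.fbWindow && self.fbWindow.closed) {
            clearInterval(self.interval);
            self.interval = null;
            self.loadProfile();
        }
    }, 200);
}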

This frontend bug is a great example of a class of bugs which I rarely encounter and find somewhat frustrating when I do: a bug that you are able to fix without fully understanding what’s causing it. It’s always better to have a complete understanding of the scope and progression of a problem to ensure that your solution adequately addresses everything that went wrong. But sometimes you simply can’t know for sure exactly what went down.

In the end, if you’re able to make your frontend code cleanly handle someone triggering hundreds of ajax lookups because a key on their keyboard is held depressed for over an hour, it really isn’t necessary to know that it was nacho cheese sauce that was holding down the key.

Validanguage and Node.js

It’s been over 2 years since I last updated my Validanguage javascript library. In that time, several interesting trends have emerged in the world of Javascript. One of them is the introduction of HTML 5’s form validation support. Without getting into too much detail, I feel that HTML 5’s built-in validation is a good fit for many applications, but other sites will likely still benefit from a more full-featured form validation library. The second and third major trends of the past couple years have been the rise of Node.js and single-page web apps.

Validanguage was designed for the Web 2.0 world: websites that still use standard forms but augment them with occasional ajax requests to fetch supplemental data. In 2014, however, the bleeding edge has shifted to implementing entire sites as a single app within a framework such as Backbone.js, Ember or AngularJS; and, often, doing away with forms entirely.

Tentative Roadmap

  • Add a shitload of unit tests via Node.js/Grunt/Jasmine/PhantomJS. Manual testing of new versions of Validanguage is excruciating and I need to automate this if I’m ever going to make progress on any eventual refactoring.
  • To fit within the context of a Node.js-based single-page web app, validation rules should be defined once (in a model). This model will then expose the validation rules for use on both the server and client. If the validation needs to change, you change it once in the model and both server and client are updated.
  • My vision for future versions of Validanguage is for it to retain its non-reliance on a particular framework. To the extent that it can be easily integrated into existing libraries or frameworks, I’ll see what I can do, but I’m not intending it to ever require Node.js, jQuery, Angular, etc.
  • Currently, Validanguage relies on form tags. I’d like to do away with this reliance and allow validation behavior to be defined on arbitrary “groups” of DOM elements/form controls.
  • Similarly, Validanguage is bound to a form control via DOM ID. I’d like it to support using arbitrary selectors as well.
  • Validanguage should be compatible with asynchronous loading and implemented as an AMD module.

I’m starting with the first two bullet points. You can view my progress on Validanguage’s github page, which is finally seeing some action. I was very fortunate in my choice of JSON and HTML comments as the two methods for defining validation rules in Validanguage. Using Validanguage’s JSON API within a model on the backend (in Node.js) will work out perfectly. Validanguage middleware (installable via npm) can interpret and execute the rules to validate content posted to the server. To implement the rules client-side, custom helpers for Handlebars, Jade and other template languages can export the rules from the model into either script tags or the HTML markup in comment tags, which will play nicely with asynchronous loading. This approach works well in the CakePHP Validanguage helper I’d written years ago.
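To make the “define once, validate everywhere” idea concrete, here is the general shape I have in mind. The rule format and file layout below are illustrative only; they are not Validanguage’s actual API:

// models/user.js -- plain JSON validation rules shared by server and client.
var userRules = {
    email: { required: true, pattern: '^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$' },
    age:   { required: false, min: 13 }
};

// On the server (Node), middleware would validate POSTed fields against
// these rules before the controller ever sees them.
if (typeof module !== 'undefined' && module.exports) {
    module.exports = userRules;
}

// On the client, a Handlebars/Jade helper would export the same object into
// a script tag (or HTML comment) for the in-browser validator to pick up,
// so the rules only ever live in one place.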

I’m really excited to see where this takes me. If you have any interest in collaborating with me, please make sure to get in touch. Now that I’m doing all new development against the github project, it should be pretty easy to team up with other interested developers to start moving this forward. As of last night, I have Jasmine installed in the Node project and I’m ready to start grinding out unit tests before getting started on the fun stuff.

[Screenshot: Validanguage development in a bash terminal alongside the VirtualBox Manager]

Ubuntu Ink Levels

Using Ubuntu is somewhat of a masochistic choice.

I have one desktop in the basement running Ubuntu, with an HP printer shared to my network over CUPS. Pages were printing faded so I knew I had to replace either the black ink, color ink or both. Unfortunately, the printer settings in the main settings pane said “Ink levels cannot be detected” or some such.

So I spent 20 minutes googling around and installing stuff to try and find out how to view my ink levels. I wasn’t wearing my glasses so I’m squinting at the screen and cursing myself for being so lazy that I won’t run upstairs to grab them.

Finally, I discover there is a program called hp-toolbox that is successfully able to display the ink levels — I need to replace both cartridges after all.

The kicker is I spent 20 minutes trying to figure this out, only to realize I need to use a program I already had installed. Furthermore, the intense feeling of Deja Vu suggests that I did this exact same thing 12 months ago when I last had to view the ink levels.

All this has happened before, and will happen again…