Screwups are important

02 Aug 2012 - sysadmin mozilla

At my day job, one of the tasks that Kiran and I have to do frequently is to watch the job posting on the HasGeek Job Board and reject the ones that don’t conform to our Terms of Service. Last night, Kiran pinged me on IRC with a link to a job that he wanted me to knock off. Usually, I use phpmyadmin to do this, but I thought I’d turn it into a bash script and started writing a mysql update query.

Head in Hands by Alex E Proimos on Flickr

In retrospect, that’s probably my first mistake right there. Production server is really not the right place to have done this and I don’t even know why I thought it was a good idea. I wrote the query and executed it. Suddenly, I realized that I screwed up. I had that sinking feeling where you know exactly what went wrong and that it’s entirely your fault. I forgot the WHERE clause in the query and managed to reject every job in the job board. Thankfully, there was a backup handy, from about 10 minutes earlier too. Because of the power outages in North India and the fact that our servers are hosted by E2E Networks in Delhi, I had set up hourly backups earlier in the day. Quickly, I brought down apache and started restoring from the backup. In about 10 minutes from executing the wrong query, we were back and running.

We had two things to take away from this mistake – modifying the database directly should stop, and hourly backups are a good. We don’t have a lot of data yet, and hourly backups don’t take a lot of time. I spent all day today writing code so that we don’t touch the database manually ever for this. There’s been a plan to write this code for months, but true motivation came in the form of this embarrassing mistake. Most of us hate admitting our mistakes, but when working on servers, it is essential we move to a culture of blameless post-mortems to fix broken systems and ensure the same mistakes aren’t repeated or at least occur less frequently.