Blog

  • What I learned about the cloud

    My day job primarily involves maintaining a bunch of Ubuntu servers. What the last few months have taught me is to plan for failure. With the ‘cloud’ being everywhere, we’re probably lulled into a false sense of security. I have three anecdotes to share from my brief experience.

    One – One of our database instances needed to be restarted. After the restart, we couldn’t connect to it. It took a few hours for Support to get back to us (it was an Amazon RDS instance) and figure out what the problem was. Our init_connect parameter, into which we had put a time-zone hack, was causing connections to fail after a restart.
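
    For context, here is a minimal sketch of what such a hack can look like (the exact statement we used isn’t shown here; the value below is purely illustrative). init_connect holds SQL that the server runs for every new client connection (users with the SUPER privilege are exempt), so if that SQL errors out after a restart, ordinary clients simply can’t connect even though the server itself is up.

        # set via the RDS DB parameter group (illustrative value)
        init_connect = "SET SESSION time_zone = '+05:30'"

        # if this statement errors out for any reason, every non-SUPER
        # user is disconnected immediately on login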

    Two – One machine died randomly due to a hardware failure on the host. Luckily, I had just launched a new instance that was meant to replace it eventually. Within a few minutes, I switched the IP address over to the new instance. Thankfully, there was no service disruption.

    Three – I have a VPS with a small provider. This is the VPS that powers this blog and my IRC session. In the first week of July, the provider notified me that a targeted network attack was hitting two of their hosts (one of which hosts my VPS) and that they would be power-cycling those hosts several times a day. This, of course, brought down my website (for a short while) and my IRC session (until I manually started it again). Note that my website is not high-availability, nor is it hosted with one of the major providers.

    I don’t blame the providers for any of these failures/issues. It is, and always will be, the responsibility of the customer to make sure there are backups and disaster recovery plans in place, because the only thing servers consistently do is fail. It may be after an hour, a week, a month, or a decade. But they fail. Eventually.

    At a recent conference I attended, there was a whole session about planning for failure. That can mean making sure you have backup servers, that new servers can be brought up quickly in an automated manner, that there is no dependency on a single provider or service, that the application handles not being able to reach another machine gracefully, and much more. I’ve agonizingly gone over disaster scenarios over the past few days: situations in which any of the servers goes down, whether it be the app server, the DB server, the monitoring server, the load balancer, or even the entire data center. In conclusion, all I have to say is ‘Prepare for failure.’
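
    As a tiny illustration of the ‘handle an unreachable machine gracefully’ point, here is a rough Python sketch (the host name and the two helper functions are made up for illustration) of probing a dependency with a short timeout and falling back instead of hanging or crashing:

        import socket

        def dependency_reachable(host, port, timeout=2.0):
            """Return True if a TCP connection to host:port can be opened quickly."""
            try:
                sock = socket.create_connection((host, port), timeout=timeout)
                sock.close()
                return True
            except (socket.timeout, OSError):
                return False

        # Illustrative usage: fall back to a cached copy instead of blocking forever.
        if dependency_reachable("db.internal.example.com", 3306):
            rows = query_database()          # hypothetical helper
        else:
            rows = read_from_local_cache()   # hypothetical helper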

  • Shooting myself in the foot with AppArmor

    The other day at work, I was setting up a new database server. It was the first time in a while we’d done this, and almost no one remembered who did it last time or how. Our data is kinda big, so we tend to put the MySQL data files on an EBS volume of their own, so that the data is always separate from the machine and we get as much space as we want. We created the new machine and the new disk, changed the path of the data directories, and started MySQL. BAM! It threw a whole bunch of errors about permissions.

    I went in and checked the ownership, which seemed to be correct, but chowned everything again anyway. Tried again. Nope, didn’t work. Out of frustration, I tried again after doing a chmod -R 777. Still failed. For a while, we googled extensively for the error, which gave us nothing much to go on. Before this, we had some backup work to do, so I think it was close to 1 am when we actually got down to troubleshooting. After some time, we had the sense to google for what we wanted to accomplish instead, and that led me to AppArmor.

    Then my memory kicked in about AppArmor and what it does. I figured out that mysqld probably didn’t have permission to use the new directories. We gave it permission and it worked! But we ended up not having enough time to restore the data onto the new server and rotate out the old one. Overall, we were working on this from 12 am to 4 am. The next day, my QOTD came from a friend, who shall not be named: ‘Oops. That said, it’s happened to me, too. The irony bit is that I’m one of the primary upstream apparmor devs.’
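
    For anyone who hits the same wall: on Ubuntu, mysqld ships with an AppArmor profile that only allows the stock data directory, so a relocated datadir gets permission errors no matter what the file ownership or mode says. The fix we needed was roughly along these lines (a sketch only; /data/mysql is an example path, not our actual one):

        # add to /etc/apparmor.d/usr.sbin.mysqld
        /data/mysql/ r,
        /data/mysql/** rwk,

        # then reload the profile
        sudo apparmor_parser -r /etc/apparmor.d/usr.sbin.mysqld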

  • RTFD and Summit

    Writing documentation isn’t easy. And maintaining up-to-date documentation isn’t easy either. readthedocs.org is a Django project that was written as part of Django Dash. It is backed by RevSys, the Python Software Foundation, and Mozilla Webdev. We can write our docs in Sphinx and import them into Read the Docs.

    I’ve just got it set up for Summit. New contributors to Summit can see its developer documentation at summit.rtfd.org.
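
    If you want to do the same for your project, the rough shape (a typical layout, not necessarily Summit’s exact one) is a docs/ directory with a Sphinx conf.py and an index.rst in your branch, which Read the Docs then imports and builds for you:

        docs/
          conf.py      # Sphinx configuration: project name, version, theme, etc.
          index.rst    # top-level page with a toctree listing the other pages

        # build locally to check it before importing:
        sphinx-build -b html docs docs/_build/html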

  • Summit improvements and bug fixes

    ‘If I do that, I might break Summit!’

    That’s something organizers often hear at UDSes. Indeed, Summit has historically had stability issues, especially during the high-usage week of a UDS. But Summit is starting to outgrow its troubled youth, gaining better code, better testing, and, most importantly, more stability.

    The Summit team, consisting of Michael Hall, Chris Johnston, and me, has been hacking on Summit much more this cycle than ever before. We even had a few new contributors this cycle. Our focus this cycle was to make it more stable first, and then more usable. There are lots of UI fixes that people have requested; we haven’t gotten to them yet only because we want Summit to be very stable this cycle. If you’d like to help us make Summit more awesome, please file bugs on things that you think Summit should do or places where Summit sucks. We can’t promise anything, but it’s great to nail down the things we should fix.

    The Summit team has fixed a whole bunch of bugs this cycle. Big shout-out to Chris Johnston and Michael Hall for setting the pace of development early on. Before I had even reached home from UDS, there were, I think, 4 merge proposals for Summit (!?!). We’d also like to thank Maris Fogel for helping us set up a test framework. We’d like Summit to be more stable, and having unit tests drives us in that direction.

    We’ve been busy making Summit much more awesome!

  • Helping with Breakpad

    Wednesday was a fun day. I finally decided to take the plunge and step in to help with Breakpad. It was a fine day to make that decision, too, since the Breakpad status meetings are on Wednesdays at 11 am Pacific Time. I ended up being on the call via Google Voice. (Side note: Skype on Linux had problems with Mozilla’s toll-free number.)

    I now have the editbugs privilege on Bugzilla, and I’ve already fixed my first bug in Breakpad!