My day job primarily involves maintaining a bunch of Ubuntu servers. What the last few months have taught me is to plan for failure. With the ‘cloud’ being everywhere, we’ve probably been lulled into a false sense of security. I have three anecdotes to share from my brief experience.
One - One of our database instances needed to be restarted. After the restart, we couldn’t connect to it. It was an Amazon RDS instance, and it took a few hours for Support to get back to us and figure out what the problem was. Our init_connect parameter, where we had put a hack to set the timezone, was causing connections to fail after a restart.
Two - One machine died without warning due to a hardware failure on the host. Luckily, I had just launched a new instance that was meant to replace it eventually. Within a few minutes, I moved the IP address over to the new instance. Thankfully, there was no service disruption.
Three - I have a VPS with a small provider. This is the VPS that powers this blog and my IRC session. In the first week of July, the provider notified me that a targeted network attack was hitting two of their hosts (one of which hosts my VPS) and that they would be power cycling the hosts several times a day. This, of course, brought down my website (for a short while) and my IRC session (until I manually restarted it). Note that my website is not high-availability or hosted with one of the major providers.
I don’t blame the providers for any of these failures. It is and always will be the responsibility of the customer to make sure there are backups and disaster recovery plans in place, because the only thing servers consistently do is fail. It may be after 1 hour, 1 week, 1 month, or a decade. But they fail. Eventually.
At a recent conference I attended, there was a whole session about planning for failure. This may include keeping backup servers, making sure new servers can be brought up quickly and automatically, avoiding dependence on a single provider or service, making sure the application gracefully handles being unable to reach another machine, and much more. I’ve agonized over disaster scenarios for the past few days: situations in which any of the servers goes down, whether it is the app server, DB server, monitoring server, load balancer, or even the entire data center. In conclusion, all I have to say is ‘Prepare for failure.’
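To make the point about an application gracefully handling an unreachable machine a little more concrete, here is a minimal Python sketch. It is not what we run in production; the function name call_with_fallback and its parameters are made up for illustration, and it assumes the failing call raises OSError (which covers Python's built-in connection and timeout errors). The idea is to retry a few times with an increasing delay and then fall back to something safe, such as a cached value, instead of crashing.

```python
import time


def call_with_fallback(primary, fallback, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call primary(); on connection trouble, retry with backoff, then use fallback().

    All names here are hypothetical. `primary` and `fallback` are
    zero-argument callables supplied by the caller.
    """
    for attempt in range(retries):
        try:
            return primary()
        except OSError:
            # ConnectionError and TimeoutError are subclasses of OSError.
            # Wait longer after each failed attempt, but not after the last one.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    # The other machine is still unreachable: degrade instead of failing.
    return fallback()
```

You would use it by wrapping the risky call, for example call_with_fallback(fetch_from_db, read_from_cache), where both function names are placeholders. Whether a stale fallback is acceptable depends entirely on the application, so treat this as a starting point rather than a recipe.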