The Funniest Incident Postmortem

Recently, I had a chance to think back on an outage that I debugged and fixed a few years ago. It involved Jenkins and systemd (or, in this case, the lack thereof!).

Generally, if you want to run a task at the end of every Jenkins job, whether the job has passed or failed, you have two options. You could use trap and write a cleanup function. I would highly recommend that you use trap. Or you could be like me and write a post-build publisher that runs a script if it finds the line “Building remotely” in the console output. It’s quite hacky, but since the first line of every job is “Building remotely”, it works. I used to depend on this for cleanup on a couple of Jenkins jobs and later removed it because of this infamous outage.
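For reference, here is a minimal sketch of the trap approach. The run_job wrapper and the scratch directory are hypothetical names for illustration, not what our jobs actually looked like, and the sketch assumes a shell such as bash where an EXIT trap set inside a subshell fires when that subshell exits.

```shell
#!/bin/bash
# Hypothetical wrapper: runs a build command and always cleans up afterwards.
run_job() (
    # The function body is a subshell, so the trap only applies to this job.
    workdir=$(mktemp -d)

    cleanup() {
        rm -rf "$workdir"
        echo "cleaned up $workdir"
    }
    # EXIT fires whether the build below passes or fails.
    trap cleanup EXIT

    echo "building in $workdir"
    "$@"    # the real build step goes here; its exit status becomes the job's
)
```

Calling run_job with a failing command (for example run_job false) still prints the cleanup line and removes the scratch directory, which is exactly the guarantee my console-output hack was trying to fake.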

The Problem

Let me preface this by stating that this happened due to a combination of factors that I don’t expect to repeat. We were using an old version of Jenkins on an old version of CentOS. This means it was still using init scripts and not systemd. The init file is just a shell script.

If you didn’t already know, SSH tends to forward your locale settings (LANG and friends) to the server you connect to, so that the remote session uses the same locale as your local one. I use en_US, but my French sysadmin colleague uses the fr_FR locale. This means that if I connect to a server, I get English error messages, and if he does, he gets French ones.

When my colleague restarted Jenkins on that fateful day, his environment leaked into the Jenkins init script, possibly due to a bug. Voilà! Jenkins now spoke French. This meant my cleanup didn’t work anymore. Instead of “Building remotely” we had “Construction à distance”. Obviously, all the jobs failed.

The Solution

I had to stop and start Jenkins again so it spoke English. We made plans to upgrade both the OS and Jenkins so we didn’t run into this specific bug again. Aside from making sure that Jenkins didn’t accidentally speak French again, we also removed the clean up script.
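One way to guard against this kind of leak, sketched here as an assumption rather than as what we actually did, is to start the service with a scrubbed environment and a pinned locale. The helper name start_with_fixed_locale is hypothetical, and en_US.UTF-8 is just the locale I would pick.

```shell
#!/bin/sh
# Hypothetical helper: run a command with an emptied environment and a pinned
# locale, so the caller's LANG/LC_* (for example, forwarded over SSH) cannot
# leak into the service being started.
start_with_fixed_locale() {
    env -i \
        PATH="$PATH" \
        HOME="${HOME:-/}" \
        LANG=en_US.UTF-8 \
        LC_ALL=en_US.UTF-8 \
        "$@"
}
```

The idea would be to wrap the restart command in this helper, so that whoever runs the restart, Jenkins sees the same locale.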

In this case, the job was creating rpms using mock. We would run mock with sudo, which meant the rpms were owned by the root user and the jenkins user could not delete them. My solution back then was to use ACLs to give the jenkins user write access to files in the Jenkins workspace folder irrespective of the real owner. You can read my original postmortem on the gluster-infra mailing list archives.

We are currently in the process of changing hosting providers. The fix with ACLs always seemed hacky to me and I wanted to take this chance to remove the ACLs entirely. I’ve just added the jenkins user to the mock group and we build rpms without using sudo. That solves all the problems much more cleanly.

But hey, it brings me great joy to say we had a bug where Jenkins spoke French and thus caused a fun day of debugging and fixing.

Getting rpcbind to work without IPv6

This advice is going to be useful to a small subset of folks, but it’s useful nonetheless. With IPv4 addresses nearly exhausted, we should probably not be disabling IPv6, but there are some rare situations where tests depend on IPv4 only. The Glusterfs regression test framework makes a lot of assumptions. One of them is that the network is always an IPv4 network. Gluster does work with IPv6. However, our tests and related regular expressions haven’t yet moved to IPv6.

We’re in the process of moving cloud providers. Every time we move, we run into some trouble with server setup, because some of the setup differs between base images. Every time, we run into trouble with rpcbind refusing to start. Every time, we think we have it figured out and automated away. This time we found a new way it could break!

Generally, this is how you disable IPv6:

  • Add the line IPV6INIT=no in /etc/sysconfig/network-scripts/ifcfg-eth0
  • Add the line NETWORKING_IPV6=no in /etc/sysconfig/network
  • Run sysctl net.ipv6.conf.all.disable_ipv6=1
  • Run sysctl net.ipv6.conf.default.disable_ipv6=1
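Since we keep redoing these steps on every new provider, here is a rough sketch of how they could be scripted. The set_kv helper is a hypothetical name of mine, not something from our automation; it makes sure a file contains exactly one KEY=VALUE line, so it is safe to run more than once.

```shell
#!/bin/sh
# Hypothetical helper: make sure FILE has exactly one "KEY=VALUE" line.
set_kv() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp)

    touch "$file"
    # Keep every line except existing assignments of KEY...
    grep -v "^${key}=" "$file" > "$tmp" || true
    # ...then append the value we want.
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Intended use, as root (not run here):
#   set_kv /etc/sysconfig/network-scripts/ifcfg-eth0 IPV6INIT no
#   set_kv /etc/sysconfig/network NETWORKING_IPV6 no
#   sysctl net.ipv6.conf.all.disable_ipv6=1
#   sysctl net.ipv6.conf.default.disable_ipv6=1
```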

After you disable IPv6, rpcbind will fail with the following error:

rpcbind.socket failed to listen on sockets: Address family not supported by protocol

To fix the error, you need to rebuild the initramfs with dracut -v -f and reboot. This process is described on the Red Hat Knowledgebase and has worked for us in the past.

On the new provider, we ran into the same error despite doing that. What we discovered is that we also need to remove all /etc/hosts entries that have ::1 in them. If a lookup resolves to an IPv6 entry, rpcbind tries to make IPv6 connections, and the error looks just as though you did not run the dracut -v -f command.
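A small sketch of that last step, assuming the entries to drop are the lines whose address field is exactly ::1. The function name without_ipv6_loopback is hypothetical; it only prints the filtered file, so you can review the result before replacing anything.

```shell
#!/bin/sh
# Hypothetical helper: print a hosts file without the IPv6 loopback entries,
# i.e. without lines whose first field is exactly "::1".
without_ipv6_loopback() {
    awk '$1 != "::1"' "$1"
}

# Intended use, as root (not run here):
#   cp /etc/hosts /etc/hosts.bak
#   without_ipv6_loopback /etc/hosts.bak > /etc/hosts
```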

Catching up with Infrastructure Debt

If you run an infrastructure, there’s a good chance you have some debt tucked in your system somewhere. There’s also a good chance that you’re not getting enough time to fix those debts. There will most likely be a good reason why something is done in the way it is. This is just how things are in general. After I joined Gluster, I’ve worked with my fellow sysadmin to tackle our large infrastructure technical debt over the course of time. It goes like this:

  • We run a pretty old version of Gerrit on CentOS 5.
  • We run a pretty old version of Jenkins on CentOS 6.
  • We run CentOS 6 for all our regression machines.
  • We run CentOS 6 for all our build machines.
  • We run NetBSD on Rackspace in a setup that is neither easy to automate nor currently part of our automation.
  • We have a bunch of physical machines in a DC, but we haven’t had time to move our VMs over and use Rackspace as burstable capacity.

That is in no way an exhaustive list. But we’ve managed to tackle 2.5 items from the list. Here’s what we did, in order:

  • Upgraded Gerrit to the then latest version.
  • Set up a Gerrit staging instance to regularly test newer versions before scheduling migrations.
  • Created new CentOS 7 VMs on our hardware and moved the builds in there.
  • Moved Gerrit over to a new CentOS 7 host.
  • Wrote Ansible scripts to manage most of Gerrit, though they are currently deployed only to staging.
  • Upgraded Jenkins to the latest LTS.
  • Moved Jenkins to a CentOS 7 host (Done last week, more details coming up!)

If I look at it, it almost looks like I’ve failed. But again, as with most infrastructure debt, you touch one thing and you realize it’s broken in some way and someone depended on that breakage. I’ve had to pick and prioritize what I would spend my time on. At the end of the day, I have to justify my time in terms of moving the project forward. Fixing the infrastructure debt for Gerrit was a great example. I could actually focus on it with everyone’s support. Fixing Jenkins was a priority since we wanted to use some of the newer features, so again I had backing to do that. Moving things to our hardware is where things get tricky. There are some financial goals we can hit if we make the move, but outside of that, we have no reason to move. Long-term, though, we want to be mostly on our own hardware, since we spent money on it. This is, understandably, going slowly. There’s a subtle capacity difference, and the noisy neighbor problem affects us quite strongly when we try to do anything in this regard.

Weird IE8 error. Nginx to the rescue!

As a server-side developer, I don’t run into IE-specific errors very often. Last month, I ran into a very specific one, which is spectacular by itself: IE8 does not like downloads with cache control headers when they are served over HTTPS. The client has plenty of IE8 users and preferred that we serve IE8 over plain HTTP so that the site worked for sure.

Nginx has a very handy module called ngx_http_browser_module to help! All I needed was a few lines of Nginx config.

    location / {
        # every browser is to be considered modern
        modern_browser unlisted;

        # these particular browsers are ancient
        ancient_browser "MSIE 6.0" "MSIE 7.0" "MSIE 8.0";

        # redirect to HTTP if ancient
        if ($ancient_browser) {
            return 301 http://$server_name$request_uri;
        }

        # handle requests that are not redirected
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

Yet another day I’m surprised by Nginx 🙂

Arrrgh! Tracebacks and Exceptions

My colleague asked me to take a look at a logging issue on a server last week. He noticed that the error logs had way too little information about exceptions. In this particular instance, we had switched to Nginx + gunicorn instead of our usual Nginx + Apache + mod_wsgi (yeah, we’re weird). I took a quick look this morning and everything looked exactly as it should. I think I’ve read more gunicorn docs today than I ever have before.

Eventually, I asked my colleague Tryggvi for help. I needed another person to tell me if I was making an obvious mistake. He asked me if I had tried running gunicorn without supervisor, which I hadn’t. I tried that locally first, and it worked! I was all set to blame supervisor for my woes and tried it on production. Nope. No luck. As any good sysadmin would do, I checked if the versions matched, and they did. CKAN itself has its dependencies frozen, which led to more confusion in my brain. It didn’t make sense.

I started looking at the exception in more detail. There was a note about email not working, followed by the actual traceback. Since I didn’t have a mail server on my local machine, I commented those configs out, and now I just had the right traceback. A few minutes later, it dawned on me. It’s a Pylons “feature”: the full traceback is printed to stdout if and only if there’s no email handling. Our default configs have an email address configured, our servers have postfix installed, and all the errors go to an email alias that’s way too noisy to be useful (Sentry. Soon). I went and commented out the relevant bits of configuration and voilà, it works!

Palm Face

Image source: Unknown, but provided by Tryggvi 🙂