The Funniest Incident Postmortem

Recently, I had a chance to think about an outage that I debugged and fixed a few years ago. It involved Jenkins and systemd (or, in this case, the lack thereof!).

Generally, if you want to run a task at the end of every Jenkins job, whether the job has passed or failed, you have two options. You could use trap and write a cleanup function. I would highly recommend that you use trap. Or you could be like me and write a post-build publisher that runs a script if it finds the line “Building remotely” in the console output. It’s quite hacky, but since the first line of every job is “Building remotely”, it works. I used to depend on this for cleanup on a couple of Jenkins jobs and later removed it because of this infamous outage.
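To make the trap recommendation concrete, here is a minimal sketch. The names build_with_cleanup, cleanup, and WORKDIR are hypothetical, and the scratch directory stands in for whatever a job actually needs to tidy up.

```shell
#!/bin/sh
# Minimal sketch of trap-based cleanup. build_with_cleanup, cleanup and
# WORKDIR are hypothetical names, not taken from the original jobs.

build_with_cleanup() {
    # Run in a subshell so the EXIT trap fires when this build finishes,
    # whether the build command passes or fails.
    (
        WORKDIR=$(mktemp -d) || exit 1
        cleanup() { rm -rf "$WORKDIR"; }
        trap cleanup EXIT

        echo "$WORKDIR"   # print the scratch path (for demonstration only)
        "$@"              # the build command, passed in by the caller
        exit $?           # propagate the build's exit status
    )
}
```

In a real Jenkins shell step the subshell wrapper is usually unnecessary: setting the EXIT trap near the top of the script should have the same effect for the whole step.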

The Problem

Let me preface this by stating that this happened due to a combination of factors that I don’t expect to repeat. We were using an old version of Jenkins on an old version of CentOS. This means it was still using init scripts and not systemd. The init file is just a shell script.

If you didn’t already know, SSH tends to forward your LANG information to the environment you connect to, so that the remote session uses the same locale as your local one. I use en_US, but my French sysadmin colleague uses the fr_FR locale. This means that if I connect to a server, I get English error messages, and if he does, he gets French ones.

When my colleague restarted Jenkins on that fateful day, his environment leaked into the Jenkins init script, possibly due to a bug. Voilà! Jenkins now speaks French. This meant my cleanup didn’t work anymore. Instead of “Building remotely” we had “Construction à distance”. Obviously, all the jobs failed.

The Solution

I had to stop and start Jenkins again so that it spoke English. We made plans to upgrade both the OS and Jenkins so that we didn’t run into this specific bug again. Aside from making sure that Jenkins didn’t accidentally speak French again, we also removed the cleanup script.
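One way to keep a daemon from inheriting the caller's locale is to pin it explicitly when the process is launched. The sketch below is an assumption about how that could look in a shell init script, not a record of what was actually done here; with_pinned_locale is a hypothetical helper, and en_US.UTF-8 is assumed to be installed on the host.

```shell
#!/bin/sh
# Sketch only: with_pinned_locale is a hypothetical helper, and en_US.UTF-8
# is an assumed locale that must exist on the host.

with_pinned_locale() {
    # LC_ALL takes precedence over LANG and every other LC_* variable, so
    # setting both ignores whatever locale the caller's session carried in.
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 "$@"
}

# Hypothetical use inside an init script (the path is illustrative):
#   with_pinned_locale /path/to/start-jenkins
```

With something like this in place, it should not matter whose SSH session triggers the restart.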

In this case, the job was creating rpms using mock. We would run mock with sudo, which meant the rpms were owned by the root user and the jenkins user could not delete them. My solution back then was to use ACLs to give the jenkins user write access to files in the Jenkins workspace folder irrespective of the real owner. You can read my original postmortem on the gluster-infra mailing list archives.
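For readers who have not used ACLs, the idea can be sketched roughly as follows. The exact commands used at the time are not recorded in this post, so treat this as an illustration: grant_workspace_acl is a hypothetical helper and the workspace path is a placeholder.

```shell
#!/bin/sh
# Sketch only: grant_workspace_acl is a hypothetical helper, and the exact
# commands used at the time are not recorded in this post.

grant_workspace_acl() {
    # $1 = user to grant access to, $2 = workspace directory.
    # -R applies recursively; rwX gives read/write, plus execute on
    # directories (and on files that are already executable).
    setfacl -R -m "u:$1:rwX" "$2" || return 1
    # -d sets a default ACL, which new files and directories created
    # under the workspace inherit.
    setfacl -R -d -m "u:$1:rwX" "$2"
}

# Needs to run as root or as the directory owner, for example:
#   grant_workspace_acl jenkins /path/to/jenkins/workspace
```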

We are currently in the process of changing hosting providers. The fix with ACLs always seemed hacky to me and I wanted to take this chance to remove the ACLs entirely. I’ve just added the jenkins user to the mock group and we build rpms without using sudo. That solves all the problems much more cleanly.
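As a rough sketch of the group-based setup: the one-time change is a single usermod call, and a job could check the membership before trying to build without sudo. user_in_group is a hypothetical helper written for this example; jenkins and mock are the user and group named above.

```shell
#!/bin/sh
# Sketch only: user_in_group is a hypothetical helper for this example.

user_in_group() {
    # $1 = user name, $2 = group name.
    # id -nG prints the user's group names separated by spaces.
    for g in $(id -nG "$1" 2>/dev/null); do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

# One-time setup, run as root (shown as a comment, not executed here):
#   usermod -a -G mock jenkins
#
# Pre-flight check a job could run before building without sudo:
#   user_in_group jenkins mock || { echo "jenkins is not in the mock group" >&2; exit 1; }
```

Note that a running process only picks up a new group membership after it starts a fresh session, so Jenkins or its agent would likely need a restart after the usermod call.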

But hey, it brings me great joy to say we had a bug where Jenkins spoke French and thus caused a fun day of debugging and fixing.
