The Funniest Incident Postmortem

Recently, I had a chance to think about an outage that I debugged and fixed a few years ago that involves Jenkins and systemd (or in this case lack thereof!).

Generally, if you want to run a task at the end of every Jenkins job whether the job has passed or failed, you have two options. You could use trap and write a clean up function. I would highly recommend that you use trap. Or you could be like me and write a post-build publisher that would run a script if it finds the line “Building remotely” in the console output. It’s quite hacky, but since the first line of every job is “Building remotely“, it works. I used to depend on this for clean up on a couple of Jenkins jobs a while ago and later removed it because of this infamous outage.

The Problem

Let me preface this by stating and this happened due to a combination of factors that I don’t expect to repeat. We were using an old version of Jenkins on an old version of CentOS. This means, it was still using init scripts and not systemd. The init file is just a shell script.

If you didn’t already know, SSH tends to forward your LANG information to the environment you connect to and force that environment to be similar to your current locale. I use en_US, but my French sysadmin colleague uses fr_FR locale. Which mean if I connect to a server, I would have English errors messages and if he did, he would have French ones.

When my colleague restarted Jenkins on that fateful day, his environment leaked into the Jenkins init script possibly due to a bug. Voila! Jenkins now speaks French. This meant my clean up didn’t work anymore. Instead of “Building Remotely” we had “Construction à distance“. Obviously, all the jobs failed.

The Solution

I had to stop and start Jenkins again so it spoke English. We made plans to upgrade both the OS and Jenkins so we didn’t run into this specific bug again. Aside from making sure that Jenkins didn’t accidentally speak French again, we also removed the clean up script.

In this case, the the job was creating rpms using mock. We would run mock with sudo and that meant the rpms were owned by the root user and the jenkins user could not delete the rpms. My solution back then was to use ACLs to give the jenkins user write access to files in the Jenkins workspace folder irrespective of the real owner. You can read my original postmortem on the gluster-infra mailing list archives.

We are currently in the process of changing hosting providers. The fix with ACLs always seemed hacky to me and I wanted to take this chance to remove the ACLs entirely. I’ve just added the jenkins user to the mock group and we build rpms without using sudo. That solves all the problems much more cleanly.

But hey, it brings me great joy to say we had a bug where Jenkins spoke French and thus caused a fun day of debugging and fixing.

Building Gluster with Address Sanitizer

We occasionally find leaks in Gluster via bugs filed by users and customers. We definitely have benefits from checking for memory leaks and address…

We occasionally find leaks in Gluster via bugs filed by users and customers. We definitely have benefits from checking for memory leaks and address corruption ourselves. The usual way has been to run it under valgrind. With ASAN, the difference is we can compile the binary with ASAN and then anyone can run their tests on top of this binary and it should crash in case it comes across a memory leak or memory corruption. We’ve fixed at least one bug with the traceback from ASAN.

Here’s how you run Gluster under ASAN.

./ ./configure --enable-gnfs --enable-debug --silent make install CFLAGS="-g -O0 -fsanitize-recover=address -fsanitize=address" -j 4 

You need to make sure you have libasan installed or else this might error out. Once this is done, compile and install like you would normally. Now run tests and see how it works. There are problems with this approach though. If there’s a leak in cli, it’s going to complain about it all the time. The noise doesn’t imply that fixing that is important. The Gluster CLI is going away soon. Additionally, the CLI isn’t a long running daemon. It’s started, does it’s job, and dies immediately.

The tricky part though is catches memory you’ve forgotten to free. It does not catch memory that you’ve allocated unnecessarily. In the near future, I want to create downloadable RPMs which you can download and run tests against.

The configuration I’ve setup lets you continue to run the program after the first memory corruption by setting the environment variable ASAN_OPTIONS=halt_on_error=0. If you find an existing leak you are not interested in fixing, you can suppress it. More information on the wiki page

Static Analysis for Gluster

Static analysis programs are quite useful, but also prone to false positives. It’s really hard to…

Static analysis programs are quite useful, but also prone to false positives. It’s really hard to keep track of static analysis failures on a fairly large project. We’ve looked at several approaches in the past. The one that we used to do was to publish a report every day which people could look at if they wished. This guaranteed that nobody looked at it. Despite knowing where to look for it, even I barely looked at it.

The second approach was to run them twice, before your patch is merged and after your patch is merged in. If the count goes up with your patch, the test fails. This has a problem that it doesn’t account for false positives. An argument could be made that you could go fix another static analysis failure in your patch. But that means your patch now does two things, which isn’t fun for when you want to do a backport, for instance. Or even for history purposes. That’s landing two unrelated changes in one patch.

The approach that we’ve now gone with is to have them run on a nightly basis with Jenkins. Deepshika did almost all the work for this and wrote about it on her blog. It has more details on the actual implementation. This puts all the results in one place for everyone to take a look at. Jenkins also gives us a visual view of what changed over the course of time, which wasn’t as easy in the past.

She’s working on further improving the visual look by uniting all the jobs that are tied to static analysis. That way, we’ll have a nightly pipeline run for each branch that will put all the tests we care about for a particular branch in one place.

Gluster Summit 2017

Right after Open Source Europe, we had Gluster Summit. It was a 2-day event with talks and BoFs. I had two key things to do at the Gluster Summit…

Right after Open Source Europe, we had Gluster Summit. It was a 2-day event with talks and BoFs. I had two key things to do at the Gluster Summit. One was build out the minnowboard setup to demo Tendrl. This didn’t work out. I had volunteered to help with the video work as well. According to my plans. The setup for minnowboards would take about 1h and then I’d be free to help with camera work. I had a talk scheduled for the second day of the event. I’d have expected one of these to two wrong. I didn’t expect all to go wrong 🙂

The venue had a balcony, which made for great photos

On the first day, Amar and I arrived early and did the camera setup. The venue staff were helpful. They gave us a line out from their audio setup for the camera. Our original plan was that speakers would have a lapel mic for the camera. That was prone to errors from speakers and also would need us to check batteries every few hours. When we first tried to work with the line in, we had interference. The camera power supply wasn’t grounded (there wasn’t even a ground out. The venue staff switched out the boxes they used for line out and it worked like a charm after that.

We did not have a good start for the demo. Jim had pre-setup the networking on the boards from home and brought them to Prague. But whatever we did, we couldn’t connect to it’s network the night before the event. That was the day we kept free to do this. That night we gave up, because we needed a monitor, an HDMI cable, and a keyboard to debug it. At the venue, we borrowed a keyboard and hooked up the board to the monitor. There was no user for dnsmasq, so it wasn’t assigning out IPs and that’s why the networking didn’t work. Once we got past that point, it was about getting the network to work with my laptop. That took a while. We decided to go with a server in the cloud as the Tendrl server. By evening, we got the playbook run and get everything installed and configured. But I’d made a mistake. I used IPs instead of FQDNs, so the dashboard wouldn’t work. This meant re-installing the whole setup. That’s the point where I gave up on it.

We even took the group picture from the balcony

My original content for my talk was to look at our releases. Especially to list out what we committed to at the start of the release and what we finished with. There is definitely a gap. This is common for software projects and how people estimate work. This topic was more or less covered on the first day. I instead focused on how we fail. How we fail our users, developers, and community. I followed the theme of my original talk a bit, pointing out that we can small large problems in smaller chunks.

We’re running a marathon, not a sprint.

Upgrading the Gluster Jenkins Server

I’ve been wanting to work on upgrading setup for ages. There’s a lot about that setup that isn’t ideal in how people use Jenkins anymore.

I’ve been wanting to work on upgrading setup for ages. There’s a lot about that setup that isn’t ideal in how people use Jenkins anymore.

We used the unix user accounts for access to Jenkins. This means Jenkins needs to read /etc/passwd and everyone has SSH access via passwords by default. Very often, the username wasn’t tied to an actual email address. I had to guess the account owner based on their usernames elsewhere. This was also open to brute force attacks. The only way to change passwords was to login to the server and run passwd command. We fixed this problem a few months ago by switching our auth to Github. Now access control is a Github group which gives you more permissions. Logging in will not give you any more permissions than not logging in.

Our todo list during the Jenkins upgrade

Jenkins community now recommends not running jobs on the master node at all. But our old setup depended on certain jobs always running on master. One by one, I’ve eliminated them so that they can now run on any node agent. The last job left is our release job. We make the tar from every release available on an FTP-like server. In our old setup, the this server and Jenkins were the same machine. The job ran on master and depended on them both being the same machine. We decided to split up the systems so we could take down Jenkins without any issue. We intend to fix this with an SCP command at the end of the release job to copy artifacts to the FTP-like server.

One of the Red Hat buildings in Brno

Now, we have a Jenkins setup that I’m happy with. At this point, we’ve fixed a vast majority of the annoying CI-related infra issues. In a few years, we’ll rip them all out and re-do them. For now, spending a week with my colleague in Brno working on an Infra sprint has been well worth our time and energy.