The Funniest Incident Postmortem

Recently, I had a chance to think back on an outage that I debugged and fixed a few years ago, one that involved Jenkins and systemd (or in this case, the lack thereof!).

Generally, if you want to run a task at the end of every Jenkins job, whether the job has passed or failed, you have two options. You could use trap and write a cleanup function; I would highly recommend that you use trap. Or you could be like me and write a post-build publisher that runs a script if it finds the line “Building remotely” in the console output. It’s quite hacky, but since the first line of every job is “Building remotely”, it works. I used to depend on this for cleanup on a couple of Jenkins jobs a while ago, and later removed it because of this infamous outage.

The Problem

Let me preface this by stating that this happened due to a combination of factors that I don’t expect to repeat. We were using an old version of Jenkins on an old version of CentOS. This meant it was still using init scripts rather than systemd units, and the init file is just a shell script.

If you didn’t already know, SSH tends to forward your LANG and LC_* information to the environment you connect to, forcing that environment to match your current locale. I use en_US, but my French sysadmin colleague uses the fr_FR locale. This means that if I connect to a server, I get English error messages, and if he does, he gets French ones.
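This behavior comes from OpenSSH’s SendEnv/AcceptEnv pair; many distributions ship defaults along these lines, so the client’s locale variables ride along with the session:

```
# Client side: /etc/ssh/ssh_config (Debian-style default)
SendEnv LANG LC_*

# Server side: /etc/ssh/sshd_config
AcceptEnv LANG LC_*
```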

When my colleague restarted Jenkins on that fateful day, his environment leaked into the Jenkins init script, possibly due to a bug. Voilà! Jenkins now spoke French. This meant my cleanup didn’t work anymore: instead of “Building remotely”, the console said “Construction à distance”. Obviously, all the jobs failed.

The Solution

I had to stop and start Jenkins again so that it spoke English. We made plans to upgrade both the OS and Jenkins so we wouldn’t run into this specific bug again. Aside from making sure that Jenkins didn’t accidentally speak French again, we also removed the cleanup script.

In this case, the job was creating RPMs using mock. We ran mock with sudo, which meant the RPMs were owned by the root user and the jenkins user could not delete them. My solution back then was to use ACLs to give the jenkins user write access to files in the Jenkins workspace folder, irrespective of the real owner. You can read my original postmortem in the gluster-infra mailing list archives.

We are currently in the process of changing hosting providers. The ACL fix always seemed hacky to me, and I wanted to take this chance to remove the ACLs entirely. I’ve now added the jenkins user to the mock group, and we build RPMs without using sudo. That solves all the problems much more cleanly.
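For the curious, the two fixes boil down to something like this (the workspace path is from memory, so treat it as illustrative):

```
# Old fix: a default ACL so jenkins can always write (and delete)
# root-owned artifacts left behind in the workspace
setfacl -R -m d:u:jenkins:rwX /var/lib/jenkins/workspace

# Current fix: let the jenkins user run mock without sudo
usermod -aG mock jenkins
```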

But hey, it brings me great joy to say we had a bug where Jenkins spoke French and thus caused a fun day of debugging and fixing.

Writing Golang as a Python Dev

I’ve gone through the Golang tutorial once before, but in the last month or so, I fully dove into it. I started by writing a simple hello world web application. I found the built-in webserver so neat that most of what I’d use a framework for is redundant; the standard library already handles most of my use cases. I wrote a couple of views and a couple of templates, and it all seems to be working well.

As someone coming from Python, I keep tripping over types. I started my professional career with PHP and then moved to Python, and neither language is statically typed. So it’s been fun to find type errors and fix them. I keep learning that I can’t be lazy.

The other thing that’s tripped me up in Go is tabs versus spaces. Python’s spaces-only convention has been second nature to me for years, so it’s fun to find out that’s not how you do things in Go. I’ve downloaded vim-go, and that’s helped make most of it second nature.

All of this started in a fun way. I wrote a tiny Flask app to try out Kubernetes on Google Cloud. It’s a simple hello world app with some templates. I’ve figured out some k8s basics, and now I see why having one binary helps. The next step in that learning process is to plug in my Go binary rather than copying in multiple files.

I’ve played around with the date libraries in Go now, and jeez, that’s a fun one 🙂 I’m used to specifying a format string. The Go way is to specify the layout by writing out a particular reference date, Mon Jan 2 15:04:05 MST 2006, in the shape you want (so the layout “2006-01-02” means YYYY-MM-DD). That took me quite a long time to figure out. I’ll admit it’s definitely easier, but damn, it tripped me up for a while since I couldn’t believe what I was seeing.

One of the things I’ve done quite a lot of is writing web scrapers. I used colly briefly to do some scraping. It has a very cool API, and it was quite easy to write some decent scrapers with it. The bit I found amusing is that I found a bug in a dependent library and sent in a pull request with a fix.

I didn’t have to wait for it to be merged; I could compile my binary with my fixes in place. Of course, the side effect is that every time one of my dependencies has a security bug fixed, I have to pull in the fix and rebuild myself. That kinda sucks, but in the brave new world of containers, that’s the sort of thing I expect to happen.

This tiny experiment is slowly increasing in scope as I figure out the next bit of knowledge I need to gain. My current goal is to replace the Python hello world web application in my Kubernetes cluster with a Go one.

At some point in this process, I can also work in figuring out multi-stage builds. That should make the process of pushing out a new build of my hello-world app easier.
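A multi-stage build boils down to one Dockerfile with a build image and a minimal runtime image. For a static Go binary, it looks roughly like this (the Go version and paths are placeholders, not what my app actually uses):

```dockerfile
# Build stage: full toolchain
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /hello .

# Runtime stage: nothing but the binary
FROM scratch
COPY --from=build /hello /hello
ENTRYPOINT ["/hello"]
```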

The other thing I’m thinking of is converting the salary converter backend to Go. If it works well, I might move it out to be hosted on a Kubernetes cluster rather than on a full-sized server. That should help split out some of the applications I run into a shared but highly available cluster whose uptime I don’t have to manage.

Testing Ansible With Molecule

My colleague was recently assigned a task to create tests for an ansible role that she works on. She pinged me for help and we got started in figuring out what to do.

The first thing we attempted was to run the tests inside Docker using Ansible, following the instructions in a blog post. The idea was that we would run the role we wanted to test, then run a second test playbook that would do a couple of asserts. I was stuck here for a bit for various reasons. The containers used in the blog post have not been updated in over a year, and we ran into some trouble trying to find a public container image with systemd running inside. The right way to do that would be to generate the container from a Dockerfile on the fly and run the tests inside it. That was okay with me, but it added more complexity.
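The on-the-fly image would be the classic systemd-in-a-container pattern; sketched from memory for CentOS 7, something like:

```dockerfile
FROM centos:7
ENV container docker
# Keep systemd as PID 1 so roles that manage services can be tested.
RUN yum -y install systemd && yum clean all
VOLUME ["/sys/fs/cgroup"]
CMD ["/usr/sbin/init"]
```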

For two days or so, I looked at the idea of doing this in VMs generated on the fly, but it added way too much overhead. Michael, my colleague, pointed me to molecule. His team has been using it regularly, though he himself hasn’t looked at it.

Molecule is an interesting project. It seems to do what I need, but there isn’t spectacular documentation on how to use it with a project that already exists. There are asciinema videos, but I’m more a fan of reading than watching. Getting molecule to work on Fedora 28 was a bit of a pain. Ansible needs libselinux-python to work with Docker on a host that has SELinux enabled, and you can’t install libselinux-python from pip; it has to be installed from system packages. I tried installing molecule in a virtualenv with site packages, and I tried installing it from packages; both failed in interesting ways that I’m yet to debug.
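For an existing role, the heart of a molecule scenario is a molecule.yml along these lines (image name and layout are assumptions; this matches the molecule 2.x era we were working with):

```yaml
driver:
  name: docker
platforms:
  - name: centos7
    image: centos:7
provisioner:
  name: ansible
verifier:
  name: testinfra
```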

Eventually, I gave up and created a CentOS 7 VM for this. A virtualenv with site packages actually worked inside my CentOS 7 VM. This is great news, because that’s the sort of environment I expect to run molecule in. The bit I really like about molecule is that it takes care of the harness, and I can write asserts in Python; the tests actually look like Python tests.

The bit I don’t like is that its documentation isn’t as thorough as I’d like. I plan to submit a pull request to the docs with a full flow of how to write tests with molecule. I found various blog posts on the internet that were far more helpful. It took some guesswork to realize that testinfra is its own Python module and that I should be looking at that module’s documentation to learn how to write my own asserts. This is still a work in progress, but I expect a lot of our Ansible pieces will end up being better tested now that we have this in place.

Moving from pyrax to libcloud: A story in 3 parts

Softserve is a service that lets our community loan machines to debug test failures. It creates cloud VMs based on the image that we use for our test machines. We originally used pyrax to create and delete the VMs. I spent some time trying to redo that with libcloud.

Part 1: Writing libcloud code

I started by writing a simple script that created a VM with libcloud. Then I modified it to do an SSHKeyDeployment, and further rewrote that code to work with a MultiStepDeployment with two keys. Once I got that working, all I had left was deleting the server. All went well; I plugged in the code and pushed it. Have you seen the bug yet?
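Roughly, the shape of that script was as follows. This is a sketch, not the real code: the provider, IDs, and key paths are placeholders, and libcloud is assumed to be installed (the imports live inside the functions so the module stays importable without it):

```python
def create_loaner_vm(name, image_id, size_id, key_paths,
                     username, api_key, region="iad"):
    """Create a VM and push public keys onto it via MultiStepDeployment."""
    from libcloud.compute.providers import get_driver
    from libcloud.compute.types import Provider
    from libcloud.compute.deployment import (
        MultiStepDeployment,
        SSHKeyDeployment,
    )

    driver = get_driver(Provider.RACKSPACE)(username, api_key, region=region)

    steps = []
    for path in key_paths:
        with open(path) as f:
            # .read() returns the key as text; the type of this value
            # is exactly what bit us later
            steps.append(SSHKeyDeployment(f.read()))

    image = driver.get_image(image_id)
    size = [s for s in driver.list_sizes() if s.id == size_id][0]

    # deploy_node creates the node, waits for SSH, and runs each step
    return driver.deploy_node(
        name=name, image=image, size=size,
        deploy=MultiStepDeployment(steps),
    )

def delete_vm(node):
    """Tear the loaner down when the community member is done."""
    node.destroy()
```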

Part 2: Deepshikha tries to deploy it

Because I’m an idiot, I didn’t test my code with our application, and our tests don’t actually go and create a cloud server. We ran into bugs, we rolled back, and I went about fixing them. The first one we hit was installing dependencies: it turns out that installing dependencies for libcloud was slightly more complicated for some reason (more on that later!), and we needed to pull in a few new devel packages. I sat down and actually fixed all the bugs I could trigger. Turns out, there were plenty.

Part 3: I find bugs

Now I ran into subtle installation bugs: pip would throw up some weird error. The default Python on CentOS 7 is pretty old, so I upgraded pip and setuptools inside my virtualenv to see if that would solve the pip errors, and it did. I suspect some newer packages depend on newer setuptools and pip features and fail quite badly when those are older.

After that, I ran into a bug that was incredibly satisfying to debug. The logs had a traceback that said I wasn’t passing a string as the pubkey. I couldn’t reproduce that bug locally; on my local setup the type was str, so I had to debug on the server with a few print statements. It turns out that the variable had type unicode. Well, that’s weird. I don’t know why that’s happening; unicode sounds right, so something is broken on my local setup. A check for “strings” in Python 2 should accept both str and unicode, but the code does the following check, which returns False when pubkey is a unicode:

isinstance(pubkey, basestring) 

At first glance, that looked right: on Python 2, str and unicode are both instances of basestring. A bit of sleuthing later, I discovered that libcloud has its own overridden basestring, which does not consider unicode to be an instance of basestring. I found this definition of basestring for Python 2:

basestring = unicode = str 

That doesn’t work the way I expect on Python 2. How did this ever work? Is it that almost everyone passes a str and never runs into this bug? I have a bug filed, and when I figure out how to fix this correctly, I’ll send a patch. My first instinct is to simply not override the builtin basestring on Python 2; there, it should Just Work™. Link to code in case anyone is curious. One part of me is screaming to replace all of this with a library like six, which handles these edge cases much better.
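For reference, the six-style approach is just a tuple of “string” types for isinstance checks; a minimal sketch of it looks like this (equivalent to six.string_types):

```python
import sys

# On Python 2 the builtin basestring already covers both str and unicode,
# which is exactly what the buggy shim (basestring = unicode = str) lost.
if sys.version_info[0] == 2:
    string_types = (basestring,)  # noqa: F821 (py2 builtin)
else:
    string_types = (str,)

def is_string(value):
    return isinstance(value, string_types)
```

With this, the pubkey check becomes isinstance(pubkey, string_types) and behaves the same on both Python versions.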

Remote Triggering a Jenkins Job

All I wanted to do was trigger a job on another Jenkins instance. Here’s all the things I tried.

  • Tried an existing plugin. It does not work.
  • Forked the plugin and applied some of the patches that had been contributed.
  • Wrote Python code to do it myself.
  • Tried to get a “Build Cause” working, and since that didn’t work on the first few tries, added it as a parameter.
Pretty much what I kept hitting

It turns out that what I thought was working wasn’t actually working: I wasn’t passing parameters to my remote job, so it was using the defaults. The fix for this problem is the most hilarious one I’ve seen. Turns out that if you use the crumbs API and Jenkins auth, you don’t need the token at all. This was a lovely discovery after all that pain.
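A minimal sketch of that approach: fetch a CSRF crumb, then POST to buildWithParameters with user/API-token auth. The host, job, and credentials below are placeholders, and the requests library is assumed to be installed:

```python
def trigger_job(base_url, job, params, user, api_token):
    """Trigger a parameterized job on a remote Jenkins via the crumb API.

    No trigger token is needed when authenticating with a user and API
    token; the crumb satisfies Jenkins' CSRF protection.
    """
    import requests  # assumed available; not part of the stdlib

    auth = (user, api_token)
    # Ask Jenkins for a crumb to send along with the POST.
    crumb = requests.get(f"{base_url}/crumbIssuer/api/json", auth=auth).json()
    headers = {crumb["crumbRequestField"]: crumb["crumb"]}
    resp = requests.post(
        f"{base_url}/job/{job}/buildWithParameters",
        params=params,  # the actual job parameters, not the defaults
        headers=headers,
        auth=auth,
    )
    resp.raise_for_status()
    # Jenkins returns the queue item URL in the Location header.
    return resp.headers.get("Location")
```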

Now I need to figure out how to follow the Jenkins job, i.e. get the console output of the remote job in real time. I found a Python script that does exactly that. I tested it, and it works.
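The mechanism behind that script is Jenkins’ progressiveText endpoint: you poll /logText/progressiveText with a start offset, and the X-Text-Size and X-More-Data response headers tell you where to resume and whether the build is still producing output. A rough sketch (the build URL is a placeholder; requests is assumed to be installed):

```python
import time

def parse_progress(headers):
    """Extract the next offset and whether more output is expected."""
    next_start = int(headers.get("X-Text-Size", 0))
    more = headers.get("X-More-Data", "false").lower() == "true"
    return next_start, more

def follow_console(build_url, auth=None, interval=2):
    """Stream a build's console by polling logText/progressiveText."""
    import requests  # assumed available; not part of the stdlib

    start = 0
    while True:
        resp = requests.get(
            f"{build_url}/logText/progressiveText",
            params={"start": start},
            auth=auth,
        )
        if resp.text:
            print(resp.text, end="")
        start, more = parse_progress(resp.headers)
        if not more:
            break
        time.sleep(interval)
```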