The Funniest Incident Postmortem

Recently, I had a chance to think about an outage that I debugged and fixed a few years ago that involves Jenkins and systemd (or in this case lack thereof!).

Generally, if you want to run a task at the end of every Jenkins job whether the job has passed or failed, you have two options. You could use trap and write a cleanup function. I would highly recommend that you use trap. Or you could be like me and write a post-build publisher that would run a script if it finds the line “Building remotely” in the console output. It’s quite hacky, but since the first line of every job is “Building remotely”, it works. I used to depend on this for cleanup on a couple of Jenkins jobs a while ago and later removed it because of this infamous outage.

The Problem

Let me preface this by stating that this happened due to a combination of factors that I don’t expect to repeat. We were using an old version of Jenkins on an old version of CentOS. This means it was still using init scripts and not systemd. The init file is just a shell script.

If you didn’t already know, SSH tends to forward your LANG information to the environment you connect to and force that environment to match your current locale. I use en_US, but my French sysadmin colleague uses the fr_FR locale. This means that if I connect to a server, I get English error messages, and if he does, he gets French ones.

When my colleague restarted Jenkins on that fateful day, his environment leaked into the Jenkins init script, possibly due to a bug. Voilà! Jenkins now speaks French. This meant my cleanup didn’t work anymore. Instead of “Building remotely” we had “Construction à distance”. Obviously, all the jobs failed.
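To make the failure concrete, here is a minimal Go sketch of the kind of check the publisher was doing. It is not the actual Jenkins configuration: the function name shouldRunCleanup and the host name builder1 are made up for illustration. The point is only that a substring match on a translated message silently stops matching once the locale changes.

```go
package main

import (
	"fmt"
	"strings"
)

// marker is the English console line the cleanup hook looked for.
const marker = "Building remotely"

// shouldRunCleanup mimics the hacky check (hypothetical name): cleanup
// fires only when the console output contains the English marker text.
func shouldRunCleanup(consoleOutput string) bool {
	return strings.Contains(consoleOutput, marker)
}

func main() {
	// builder1 is a made-up node name used only for this example.
	english := "Building remotely on builder1\n+ make rpms"
	french := "Construction à distance on builder1\n+ make rpms"

	fmt.Println(shouldRunCleanup(english)) // true
	fmt.Println(shouldRunCleanup(french))  // false
}
```

A cleanup that runs unconditionally (which is what trap gives you in a shell script) does not depend on any message text at all, which is why it is the better option.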

The Solution

I had to stop and start Jenkins again so it spoke English. We made plans to upgrade both the OS and Jenkins so we didn’t run into this specific bug again. Aside from making sure that Jenkins didn’t accidentally speak French again, we also removed the clean up script.

In this case, the job was creating rpms using mock. We would run mock with sudo and that meant the rpms were owned by the root user and the jenkins user could not delete the rpms. My solution back then was to use ACLs to give the jenkins user write access to files in the Jenkins workspace folder irrespective of the real owner. You can read my original postmortem on the gluster-infra mailing list archives.

We are currently in the process of changing hosting providers. The fix with ACLs always seemed hacky to me and I wanted to take this chance to remove the ACLs entirely. I’ve just added the jenkins user to the mock group and we build rpms without using sudo. That solves all the problems much more cleanly.

But hey, it brings me great joy to say we had a bug where Jenkins spoke French and thus caused a fun day of debugging and fixing.

Getting rpcbind to work without IPv6

This advice is going to be useful to a small subset of folks. But it’s useful nonetheless. With IPv4 addresses nearly exhausted, we should probably not be disabling IPv6, but there are some rare situations where some tests depend on IPv4 only. The GlusterFS regression test framework makes a lot of assumptions. One of them is that the network is always an IPv4 network. Gluster does work with IPv6. However, our tests and related regular expressions haven’t yet moved to IPv6.

We’re in the process of moving cloud providers. Every time we move, we run into some trouble with server setup. There’s some setup that’s different in base images across the spectrum. Every time, we run into trouble with rpcbind refusing to start. Every time, we think we have it figured out and automated away. This time we found a new way it could break!

Generally, this is how you disable IPv6:

  • Add the line IPV6INIT=no in /etc/sysconfig/network-scripts/ifcfg-eth0
  • Add the line NETWORKING_IPV6=no in /etc/sysconfig/network
  • Run sysctl net.ipv6.conf.all.disable_ipv6=1
  • Run sysctl net.ipv6.conf.default.disable_ipv6=1

After you disable IPv6, rpcbind will fail with the following error:

rpcbind.socket failed to listen on sockets: Address family not supported by protocol

To fix the error, you need to run dracut -v -f and reboot. This process is described on the Red Hat Knowledgebase and has worked for us in the past.

In the new provider, we ran into the same error despite doing that. What we discovered is that we also need to remove all /etc/hosts entries that have ::1 in them. If a reverse DNS entry converts to an IPv6 entry, rpcbind tries to make IPv6 connections, and the error looks just as though you did not run the dracut -v -f command.
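If you want to automate that last step, the filtering itself is small. Below is a minimal Go sketch that works on the file content as a string. It is an illustration and not what we run in our automation: the function name dropIPv6Loopback is made up, and it assumes that "entries that have ::1 in them" means lines whose address field is exactly ::1. Reading and writing /etc/hosts is left out on purpose.

```go
package main

import (
	"fmt"
	"strings"
)

// dropIPv6Loopback (hypothetical helper) returns the hosts-file content
// without entries whose address field is ::1. Comments, blank lines and
// all other entries are kept as they are.
func dropIPv6Loopback(hosts string) string {
	var kept []string
	for _, line := range strings.Split(hosts, "\n") {
		fields := strings.Fields(line)
		// Assumption: the address is the first whitespace-separated field.
		if len(fields) > 0 && fields[0] == "::1" {
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	sample := "127.0.0.1 localhost\n::1 localhost ip6-localhost\n"
	fmt.Print(dropIPv6Loopback(sample)) // prints only the 127.0.0.1 line
}
```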

My Personal Productivity System

With everyone having a phone, there’s a whole bunch of online todo lists and tools that help you keep track of your life and improve your productivity. I’ve tried a whole bunch of them. I’ve actually cycled through the entire lot. I’ve used almost all the tools, I’ve done a bunch of paper methods and the one that’s stuck so far is a modified version of Bullet Journal + Chunk Scheduling my day according to the strategies that Deep Work suggests. It’s still a challenge to do it every day and to get it right.

I’m a serial procrastinator. I need some system to make my life work. The bit I’m most likely to put off is anything that involves needing to talk to someone else. Even more so when it involves having to go to a government office, talk to someone, and get a task done. This makes for fun when I want to travel, because all the visa stuff requires a series of in-person interactions to make it happen.

Image: My bullet journal notebook and schedule notebook

My system works by listing the tasks I want to do at the start of the day. Then I schedule each one into one of the 2-hour blocks I’ve divided my day into. Some blocks are fixed and will always have the same things. The first 2-hour block after I wake up is when I meditate and write. This is also the block where I work out. The next 2-hour block is where I shower, have breakfast, and start work. Then the work blocks begin. At some point I take a 30-minute lunch break, and after lunch I sit down, look at what I’ve done so far, and rejig my day if it needs to happen.

During the day, any incoming request gets added to my Backlog on Google Keep. I decide after my current task if I’m doing it today, this week, or later. I will then catch it during a weekly review.

I’ve been using this system for a few months now. It’s worked great on days when I’m at 100%. On some days, I’m sleep deprived or I wake up really late. That throws a spanner into the whole thing. I end up not being able to focus or really get anything done. At that point, I usually start with a Pomodoro timer. I like using it for about 2 hours, after which I can focus.

Pomodoro does not work for me throughout the day because if I’m coding or debugging, I’d rather not step away from the problem. It helps to be in that state of mind throughout, rather than take breaks. If I’m doing a task that I find tedious, I find Pomodoro very useful. In September, when I was working on submitting assignments for my degree program, I found having a Pomodoro timer made sure that I would have structured breaks in my schedule.

I’m still learning and tweaking this system. At the end of the year, I’ll write another review of how well this has been going for me.

Filling the Gaps in My Knowledge

I started working as a sysadmin just as cloud really took off. I wasn’t really exposed to a lot of the networking minutiae. That was over 9 years ago. I never had to deal with something complicated in the world of networking. I stuck to my Linux lane and never wandered over to the networking lane. I knew some of the basics, but nothing further. It’s been 8 years and I’ve realized that it’s held me back a bit. One of the changes I’ve made ever since I read the Google SRE book is how I approach technical problems. I’m no longer happy to stop at, “Look, I got it working” or “The bug is not in my code, it’s in the library or a layer above”. I want to figure out the root cause.

Recently, I read Julia Evans’ post about learning skills and it reminded me that networking is something I don’t know very well yet. I’ve looked at books that explain some basics, but I haven’t really gone in depth to understand how the pieces fit together. I don’t have the pressure of learning to pass a competitive exam. I just want to learn so I can fill in the gaps in my knowledge. Just in time, LinkedIn offered me free premium for a month, which also gives me access to LinkedIn Learning. I spent some time looking for a reasonably good course on networking. It’s been a great watch! The course is actually for CompTIA’s Network+ exam, which I have no intention of writing at the moment. However, it presented a good explanation of networking and TCP/IP. I knew some of the topics individually, but I couldn’t tie all of my knowledge together yet. The few days of watching networking videos have been great. I don’t understand everything in great depth, but I know most of it and I know where to look for more details.

I’ve been reading and listening a great deal about growth mindset and deliberate practice. I’ve been in the tech industry for the last 9 years. I don’t have a degree yet, and even when I finish my current degree, it will not be in computers. It was humbling at first to sit down and learn something from scratch. At the same time, it’s very relieving. I’m more confident that I can understand networks better. I had most of the networking debugging skills I needed, but now I understand the theory better as I debug problems. Similarly, as a Python programmer, I’ve barely ever looked deeply into the Unix kernel. However, as a sysadmin, when I debug problems, I need a more in-depth grasp of what goes on behind the scenes. In the last year, I spent some time reading Advanced Programming in the Unix Environment. I don’t have it committed to memory, but I’ve read it broadly enough to understand where to look. I’ve understood a lot about what happens for IO/Networking/Process Management in Unixes. It helps me appreciate what goes on in Gluster better. It also helps debug some of the weirder errors that I might run into.

I write this out as a note to myself. There is no shame in sitting down to learn something that you don’t know. It’s not going to be easy, but I’m going to be grateful for it.

Writing Golang as a Python Dev

I’ve gone through the Golang tutorial once before, but in the last month or so, I fully dove into it. I started by writing a simple hello world web application. I found the implementation of the webserver so neat that most of the uses I’d have for a framework are redundant. The built-in libraries already take care of most of the use cases I have. I did a couple of views and a couple of templates. It seems to be working well.
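For a sense of how little is needed, here is a minimal sketch of one view and one template using only net/http and html/template from the standard library. It is not my actual app: the template text, the name query parameter and the render helper are all made up for this example. To keep the sketch runnable without opening a port, main exercises the handler in-process with net/http/httptest. A real server would register the handler and call http.ListenAndServe instead, as noted in the comment (the port there is arbitrary).

```go
package main

import (
	"fmt"
	"html/template"
	"net/http"
	"net/http/httptest"
)

// page is a hypothetical template, parsed once at startup.
var page = template.Must(template.New("hello").Parse("<h1>Hello, {{.}}!</h1>"))

// helloHandler is one "view": it reads an optional ?name= parameter
// and renders the template with it. html/template escapes the value.
func helloHandler(w http.ResponseWriter, r *http.Request) {
	name := r.URL.Query().Get("name")
	if name == "" {
		name = "world"
	}
	if err := page.Execute(w, name); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

// render (hypothetical helper) calls the handler in-process and
// returns the response body, so no network listener is needed.
func render(target string) string {
	rec := httptest.NewRecorder()
	helloHandler(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Body.String()
}

func main() {
	// A real server would instead do something like:
	//   http.HandleFunc("/", helloHandler)
	//   log.Fatal(http.ListenAndServe(":8080", nil))
	fmt.Println(render("/"))
	fmt.Println(render("/?name=Gopher"))
}
```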

As someone coming from Python, I keep tripping over types. I started my professional career with PHP and then moved to Python. Neither of these languages enforces types very strictly by default. So it’s been fun to find errors and fix them. I learn more and more that I can’t be lazy.

The other thing that’s tripped me up in Go is the use of tabs vs spaces. Python’s standard has been second nature to me for years, so it’s fun to find out that’s not how you do things in Go. I’ve downloaded vim-go and that’s helped make most of it second nature.

All of this started in a fun way. I wrote a tiny Flask app to try out Kubernetes on Google Cloud. It’s a simple hello world app with some templates. I’ve figured out some k8s basics and now I see why having one binary helps. The next step in that learning process is to plug in my Go binary rather than copy in multiple files.

I’ve played around with the date libraries in Go now, and jeez, that’s a fun one 🙂 I’m used to specifying a format with placeholders. The Go method is to have you write out the layout using one particular reference date instead. That took me quite a long time to figure out. I’ll admit, it’s definitely easier, but damn, it tripped me up for a while since I couldn’t believe what I was seeing.
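Here is a small sketch of what that looks like in practice. Go’s time package uses the reference time Mon Jan 2 15:04:05 MST 2006, and you write your layout by showing how that exact moment should look. The function name toISODate and the DD/MM/YYYY input format are just choices for this example.

```go
package main

import (
	"fmt"
	"time"
)

// toISODate (hypothetical helper) parses a DD/MM/YYYY date and returns
// it as YYYY-MM-DD. Both layouts are spelled with the reference date.
func toISODate(in string) (string, error) {
	// Python equivalent: datetime.strptime(in, "%d/%m/%Y")
	t, err := time.Parse("02/01/2006", in)
	if err != nil {
		return "", err
	}
	// Python equivalent: t.strftime("%Y-%m-%d")
	return t.Format("2006-01-02"), nil
}

func main() {
	out, err := toISODate("09/03/2019")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(out) // 2019-03-09
}
```

The layout is not arbitrary: 01 is always the month, 02 the day, 15 the hour, 04 the minute, 05 the second and 2006 the year, so swapping 02 and 01 is how you switch between day-first and month-first formats.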

One of the things I’ve done quite a lot is write web scrapers. I used colly briefly to do some scraping. It’s a very cool API. I did some decent scraping. It was quite easy to write. The bit I found amusing is that I found a bug in a dependent library and sent in a pull request for a fix.

I didn’t have to wait for it to be merged. I could compile my binary with my fixes myself. Of course, the side effect is every time one of my dependencies has a security bug fixed, I have to send a bug fix myself. That kinda sucks, but in the brave new world of containers, that’s the sort of thing I expect to happen.

This tiny experiment is slowly increasing in scope as I figure out what’s the next bit of new knowledge I need to gain. My current goal is to replace my python hello world web application in my kubernetes cluster with a golang one.

At some point in this process, I can also inject figuring out multi-stage builds. That should make the process of pushing out a new build to my hello-world app easier.

The other thing I’m thinking of is converting the salary converter backend to Go. If it works well, I might move it out to be hosted on a Kubernetes cluster rather than on a full-sized server. That should help split out some of the applications I run so they are hosted in a shared but highly available cluster whose uptime I don’t have to manage.