Author: nigelb

  • Upgrading the Gluster Jenkins Server

    I’ve been wanting to work on upgrading the build.gluster.org setup for ages. There’s a lot about that setup that isn’t ideal given how people use Jenkins these days.

    We used Unix user accounts for access to Jenkins. That meant Jenkins needed to read /etc/passwd, and everyone had SSH access via passwords by default. Very often, a username wasn’t tied to an actual email address, so I had to guess who owned an account based on their usernames elsewhere. The setup was also open to brute-force attacks, and the only way to change a password was to log in to the server and run the passwd command. We fixed this a few months ago by switching our authentication to GitHub. Access control is now handled through a GitHub group, and that membership is what grants extra permissions; merely logging in gives you no more permissions than not logging in.

    Our todo list during the Jenkins upgrade

    The Jenkins community now recommends not running jobs on the master node at all, but our old setup depended on certain jobs always running on master. One by one, I’ve eliminated them so that they can run on any node agent. The last job left is our release job. We make the tarball from every release available on an FTP-like server, and in our old setup, that server and Jenkins were the same machine: the job ran on master and depended on both being the same host. We decided to split up the two systems so we could take Jenkins down without any issue. We intend to fix the release job with an SCP command at the end of it that copies the artifacts to the FTP-like server.
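
    As a rough sketch, that final step could be as simple as the following; the host name, target path, and the RELEASE_VERSION variable are placeholders I made up, not our actual download server:

    # Hypothetical last build step of the release job: push the release
    # tarball to the FTP-like download server over SSH.
    scp "glusterfs-${RELEASE_VERSION}.tar.gz" \
        "jenkins@download.example.org:/srv/downloads/glusterfs/${RELEASE_VERSION}/"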

    One of the Red Hat buildings in Brno

    Now, we have a Jenkins setup that I’m happy with. At this point, we’ve fixed the vast majority of the annoying CI-related infra issues. In a few years, we’ll rip them all out and re-do them. For now, spending a week with my colleague in Brno working on an infra sprint has been well worth our time and energy.

  • Catching up with Infrastructure Debt

    If you run an infrastructure, there’s a good chance you have some debt tucked away in your system somewhere. There’s also a good chance that you’re not getting enough time to pay that debt down. There will most likely be a good reason why something is done the way it is; that’s just how things are in general. Since I joined Gluster, I’ve worked with my fellow sysadmin to tackle our large pile of infrastructure debt over time. It goes like this:

    • We run a pretty old version of Gerrit on CentOS 5.
    • We run a pretty old version of Jenkins on CentOS 6.
    • We run CentOS 6 for all our regressions machines.
    • We run CentOS 6 for all our build machines.
    • We run NetBSD on Rackspace in a setup that is not easy to automate and is not currently part of our automation.
    • We have a bunch of physical machines in a DC, but we haven’t had time to move our VMs over and use Rackspace as burstable capacity.

    That is in no way an exhaustive list. But we’ve managed to tackle 2.5 items from the list. Here’s what we did, in order:

    • Upgraded Gerrit to the then-latest version.
    • Set up a Gerrit staging instance to regularly test newer versions ahead of scheduling a migration.
    • Created new CentOS 7 VMs on our hardware and moved the builds in there.
    • Moved Gerrit over to a new CentOS 7 host.
    • Wrote Ansible scripts to manage most of Gerrit, though they are currently deployed only to staging.
    • Upgraded Jenkins to the latest LTS.
    • Moved Jenkins to a CentOS 7 host (done last week; more details coming up!).

    If I look at it, it almost looks like I’ve failed. But, as with most infrastructure debt, you touch one thing, realize it’s broken in some way, and discover that someone depended on that breakage. What I’ve had to do is pick and prioritize what to spend my time on. At the end of the day, I have to justify my time in terms of moving the project forward. Fixing the infrastructure debt around Gerrit was a great example: I could actually focus on it with everyone’s support. Fixing Jenkins was a priority since we wanted to use some of the newer features; again, I had backing to do that. Moving things to our own hardware is where it gets tricky. There are some financial goals we can hit if we make the move, but outside of that, we have no reason to move. Long-term, though, we want to be mostly on our own hardware, since we spent money on it. This is, understandably, going slowly. There’s a subtle capacity difference, and the noisy-neighbor problem affects us quite strongly whenever we try to do anything in this regard.

  • Problems You Might Run Into Upgrading PostgreSQL on Fedora

    I was trying to test some code today and realized I needed a working PostgreSQL server. When I tried to start the server, it failed with this error:

    Aug 23 15:36:10 athena systemd[1]: Starting PostgreSQL database server...
    Aug 23 15:36:10 athena postgresql-check-db-dir[20713]: An old version of the database format was found.
    Aug 23 15:36:10 athena postgresql-check-db-dir[20713]: Use 'postgresql-setup --upgrade' to upgrade to version '9.6'
    Aug 23 15:36:10 athena postgresql-check-db-dir[20713]: See /usr/share/doc/postgresql/README.rpm-dist for more information.
    Aug 23 15:36:10 athena systemd[1]: postgresql.service: Control process exited, code=exited status=1
    Aug 23 15:36:10 athena systemd[1]: Failed to start PostgreSQL database server.
    Aug 23 15:36:10 athena systemd[1]: postgresql.service: Unit entered failed state.
    Aug 23 15:36:10 athena systemd[1]: postgresql.service: Failed with result 'exit-code'.

    Ah, I upgraded to F26 recently, and I suppose that came with a new version of PostgreSQL. I figured fixing this should be trivial. Well, not exactly. When I first ran the command, it asked me to install the postgresql-upgrade package. Once I had installed it, the command threw a strange error.
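
    For reference, those two steps were roughly the following (on Fedora, as root); a minimal sketch of the commands the error message points you at:

    # Install the helper package that postgresql-setup asks for,
    # then re-run the data directory upgrade.
    dnf install postgresql-upgrade
    postgresql-setup --upgrade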

    [root@athena pgsql]# postgresql-setup --upgrade
     * Upgrading database.
    ERROR: The pidfile '/var/lib/pgsql/data-old/postmaster.pid' exists.  Verify that there is no postmaster
           running the /var/lib/pgsql/data-old directory.
    ERROR: Upgrade failed.
     * See /var/lib/pgsql/upgrade_postgresql.log for details.

    The /var/lib/pgsql/data-old/postmaster.pid file doesn’t even exist. It took me some time to realize that it’s actually looking at /var/lib/pgsql/data/postmaster.pid, which does exist. I think at some point I had a running PostgreSQL server and didn’t shut down the computer cleanly, which left behind a stale PID file. Once I renamed the PID file, the upgrade command worked.
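
    In other words, the fix boiled down to moving the stale PID file out of the way and retrying. A minimal sketch, using the paths from the error above (the .stale suffix is just a name I picked):

    # The server isn’t actually running, so the PID file is stale;
    # move it aside and retry the upgrade.
    mv /var/lib/pgsql/data/postmaster.pid /var/lib/pgsql/data/postmaster.pid.stale
    postgresql-setup --upgrade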

  • Clang Analyze for Gluster

    Deepshika recently worked on getting a clang analyze job for Gluster set up with Jenkins. The job worked on both our laptops, but not on our build machines, which run CentOS. The problem appears to be that clang on CentOS is 3.4 versus 4.0 on Fedora 26. The build fails because one of our dependencies needs -fno-stack-protector, which wasn’t in clang until 3.8 or so. It’s been on my list of things to fix, and I realized that the right way would be to use a newer version of clang, which Fedora already ships. I could have compiled clang myself or built 4.0 packages, but I didn’t want to end up maintaining the package for our specific install. I decided to reduce complexity by doing the compilation inside a Fedora 26 chroot, which seemed the least likely to add maintenance burden. When I looked for documentation on how to go about this, I couldn’t find much. The mock man page, however, is very well written, and that’s all I needed. This is the script I used, with comments about each step.

    #!/bin/bash

    # Create a new chroot
    sudo mock -r fedora-26-x86_64 --init

    # Install the build dependencies
    sudo mock -r fedora-26-x86_64 --install langpacks-en glibc-langpack-en automake autoconf libtool flex bison openssl-devel libxml2-devel python-devel libaio-devel libibverbs-devel librdmacm-devel readline-devel lvm2-devel glib2-devel userspace-rcu-devel libcmocka-devel libacl-devel sqlite-devel fuse-devel redhat-rpm-config clang clang-analyzer git

    # Copy the Gluster source code inside the chroot at /src
    sudo mock -r fedora-26-x86_64 --copyin $WORKSPACE /src

    # Execute commands in the chroot to build with clang
    sudo mock -r fedora-26-x86_64 --chroot "cd /src && ./autogen.sh"
    sudo mock -r fedora-26-x86_64 --chroot "cd /src && ./configure CC=clang --enable-gnfs --enable-debug"
    sudo mock -r fedora-26-x86_64 --chroot "cd /src && scan-build -o /src/clangScanBuildReports -v -v --use-cc clang --use-analyzer=/usr/bin/clang make"

    # Copy the output back into the working directory
    sudo mock -r fedora-26-x86_64 --copyout /src/clangScanBuildReports $WORKSPACE/clangScanBuildReports

    # Clean up the chroot
    sudo mock -r fedora-26-x86_64 --clean
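
    If you want to try this outside of Jenkins, pointing WORKSPACE at a Gluster checkout is enough. A rough sketch, assuming the script above is saved as clang-analyze.sh (the file name is made up) and that mock is installed; the scan-build HTML reports end up under $WORKSPACE/clangScanBuildReports:

    # Clone the source, point WORKSPACE at it, and run the mock-based script.
    git clone https://github.com/gluster/glusterfs.git /tmp/glusterfs
    WORKSPACE=/tmp/glusterfs bash clang-analyze.sh
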
  • Crucial Conversations Training

    In the first week of July, I attended an internal training on Crucial Conversations. I’ve been eyeing that training ever since I started at Red Hat. It’s a skill that I’m poor at: I tend to avoid difficult conversations, and when I do have them, I let emotions get the better of me or take the path with the least amount of conflict. The training involves hands-on practice with the methods and techniques taught in the book. I’d read about half the book before I went into the training, but the training was far more effective.

    One of the possible outcomes of a crucial conversation

    I learned two important lessons from this training. One, you can get into a conversation and, very often, it turns into a question of who’s winning and who’s losing; at that point, it’s very likely you’ve lost track of the original goal of the conversation. The second is to notice when a conversation is escalating into aggression or silence, and to de-escalate it first. Otherwise, one of two things will happen: the other person gets angry and the conversation becomes a conflict, or they agree to everything while being utterly unhappy about it.

    Image Credit: Jule Falk on Flickr (license)