Catching up with Infrastructure Debt

If you run an infrastructure, there’s a good chance you have some debt tucked away in your system somewhere. There’s also a good chance that you’re not getting enough time to fix it, and there will most likely be a good reason why something is done the way it is. This is just how things are in general. Since I joined Gluster, I’ve worked with my fellow sysadmin to chip away at our large pile of infrastructure debt. The debt looks like this:

  • We run a pretty old version of Gerrit on CentOS 5.
  • We run a pretty old version of Jenkins on CentOS 6.
  • We run CentOS 6 for all our regressions machines.
  • We run CentOS 6 for all our build machines.
  • We run NetBSD on Rackspace in a setup that is neither easy to automate nor currently part of our automation.
  • We have a bunch of physical machines in a DC, but we haven’t had time to move our VMs over and use Rackspace as burstable capacity.

That is by no means an exhaustive list, but we’ve managed to tackle about 2.5 items from it. Here’s what we did, in order:

  • Upgraded Gerrit to the then latest version.
  • Set up a Gerrit staging instance to test newer versions regularly and schedule migrations.
  • Created new CentOS 7 VMs on our hardware and moved the builds in there.
  • Moved Gerrit over to a new CentOS 7 host.
  • Wrote Ansible scripts to manage most of Gerrit, currently deployed only to staging.
  • Upgraded Jenkins to the latest LTS.
  • Moved Jenkins to a CentOS 7 host (Done last week, more details coming up!)

If I look at it, it almost looks like I’ve failed. But, as with most infrastructure debt, you touch one thing and realize it’s broken in some way and that someone depended on that breakage. What I’ve had to do is pick and prioritize what I spend my time on; at the end of the day, I have to justify my time in terms of moving the project forward. Fixing the infrastructure debt around Gerrit was a great example: I could actually focus on it with everyone’s support. Fixing Jenkins was a priority because we wanted to use some of the newer features, so again I had backing to do it. Moving things to our own hardware is where it gets tricky. There are some financial goals we can hit if we make the move, but outside of that we have no pressing reason to. Long term, though, we want to be mostly on our own hardware, since we’ve spent money on it. This is, understandably, going slowly. There’s a subtle capacity difference, and the noisy neighbor problem hits us quite hard whenever we try to do anything in this regard.

Clang Analyze for Gluster

Deepshika recently worked on setting up a clang analyze job for Gluster with Jenkins. The job worked on both our laptops, but not on our build machines, which run CentOS. It turned out the problem was the clang version: 3.4 on CentOS versus 4.0 on Fedora 26. The build fails because one of our dependencies needs -fno-stack-protector, which wasn’t in clang until 3.8 or so. It had been on my list of things to fix, and I realized the right way would be to use a newer clang from Fedora. I could have just compiled clang or built 4.0 packages, but I didn’t want to end up maintaining the package for our specific install. I decided to reduce complexity by doing the compilation inside a Fedora 26 chroot, which sounded like the option least likely to add maintenance burden. When I looked for documentation on how to go about this, I couldn’t find much. The mock man page, however, is very well written, and that’s all I needed. This is the script I used, with comments about each step.

#!/bin/bash

# Create a new chroot
sudo mock -r fedora-26-x86_64 --init

# Install the build dependencies
sudo mock -r fedora-26-x86_64 --install langpacks-en glibc-langpack-en automake autoconf libtool flex bison openssl-devel libxml2-devel python-devel libaio-devel libibverbs-devel librdmacm-devel readline-devel lvm2-devel glib2-devel userspace-rcu-devel libcmocka-devel libacl-devel sqlite-devel fuse-devel redhat-rpm-config clang clang-analyzer git

# Copy the Gluster source code inside the chroot at /src
sudo mock -r fedora-26-x86_64 --copyin $WORKSPACE /src

# Execute commands in the chroot to build with clang
sudo mock -r fedora-26-x86_64 --chroot "cd /src && ./autogen.sh"
sudo mock -r fedora-26-x86_64 --chroot "cd /src && ./configure CC=clang --enable-gnfs --enable-debug"
sudo mock -r fedora-26-x86_64 --chroot "cd /src && scan-build -o /src/clangScanBuildReports -v -v --use-cc clang --use-analyzer=/usr/bin/clang make"

# Copy the output back into the working directory
sudo mock -r fedora-26-x86_64 --copyout /src/clangScanBuildReports $WORKSPACE/clangScanBuildReports

# Clean up the chroot
sudo mock -r fedora-26-x86_64 --clean

Gerrit OS Upgrade

When I started working on Gluster, Gerrit was a large piece of technical debt. We were running quite an old version on CentOS 5, and both of those needed fixing. The Gerrit upgrade happened in June, causing me a good amount of stress for a whole week as I dealt with the fallout. The OS upgrade for Gerrit happened last weekend, after a marathon working day that ended at 3 am. We ran into several hacks in the old setup and worked on redoing them in a more acceptable manner, which took quite a bit of our time and energy. At the end of it, I’m happy to say, Gerrit now runs on a machine with CentOS 7. Now, of course, it’s time to upgrade Gerrit again and start the whole cycle all over again.

There's light at the end of the tunnel; hopefully it's not a train

Michael and I managed to coordinate well across timezones. We had a shared document going where we listed out the tasks to do; as we discovered more items, they went onto the to-do list. The document also listed all the hacks we found. We fixed some of them but haven’t yet moved those fixes over to Ansible, and we left a few hacks in place because fixing them will take more time.

Things we learned the hard way:

  • Configuring the git protocol under xinetd was a trial-and-error process; it took me hours to get right. Here’s the working config file:
service git {
        disable         = no
        socket_type     = stream
        wait            = no
        user            = nobody
        server          = /usr/libexec/git-core/git-daemon
        server_args     = --export-all --reuseaddr --base-path=/path/to/git/folder --inetd --verbose --base-path-relaxed
        log_on_failure  += USERID
}
  • There was some SELinux magic we needed for cgit. The documentation had some notes on how to get it right, but that didn’t work for us. Here’s what was needed:
semanage fcontext -a -t git_user_content_t "/path/to/git/folder(/.*)?" 
  • When you set up replication to GitHub for the first time, you need to add the GitHub host keys to known_hosts. The easiest way is to try to ssh into GitHub, which will fail with a friendly error message and prompt you to accept the host key. You could also get the keys from GitHub directly.
  • Gerrit needs AllowEncodedSlashes On and ProxyPass http://127.0.0.1:8081 nocanon. Without these two bits of configuration, Gerrit returns random 404s. A minimal httpd sketch follows this list.
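For reference, here’s roughly what those two directives look like in the httpd virtual host. This is a minimal sketch, not our full proxy configuration: the ServerName and port are placeholders and the TLS setup is omitted; 8081 is the Gerrit HTTP port mentioned above.

<VirtualHost *:80>
    ServerName review.example.org
    # Gerrit URLs contain encoded slashes; without this httpd rejects or mangles them
    AllowEncodedSlashes On
    ProxyRequests Off
    # nocanon stops httpd from re-encoding the URL before handing it to Gerrit
    ProxyPass / http://127.0.0.1:8081/ nocanon
</VirtualHost>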

We’ve turned two big items from our tech debt backlog into successes over the past year or so. The next step is a tie between a Jenkins upgrade and a Gerrit upgrade 🙂

Image credit: Captain Tenneal Steam Train (license)

Test Automation on CentOS CI with Ansible

I work on Gluster, which is a distributed file system. Testing a distributed file system needs a distributed setup, but we currently run regressions by faking one. We’re planning to use Glusto for real distributed functional testing. CentOS CI gives us on-demand physical hardware to run tests on, and I’ve been working on defining our jobs on CentOS CI with Jenkins Job Builder.

In the past, we created our jobs via the UI and committed the resulting XML to a git repo. That kind of version control depends on the discipline of the person making the change, and the system is not built to enforce it, so it does not scale well. Every person who wants to add a test either needs access or has to work with someone who has access to add a new job.

Manufacturing Line

With Jenkins Job Builder, the configuration is the single source of truth for any given job. As a bonus, this reduces code duplication. With David’s help, I wrote centos-ci-sample, which establishes a better pattern for CentOS CI jobs. David pointed me to the post-task publisher, which makes sure that the nodes are returned to the pool even when a job fails. The sample is good, but it works best for jobs that need just one node.
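As a rough illustration (a sketch, not the actual contents of centos-ci-sample), a single-node JJB job looks something like this. The job name, node label, shell steps, and log-text match are placeholders; the post-tasks publisher is the piece that hands the machine back even when the build fails.

- job:
    name: gluster_example-smoke
    node: gluster-ci
    builders:
      - shell: |
          # placeholder: request a machine from the pool and run the test on it
          ./run-smoke.sh
    publishers:
      - post-tasks:
          - matches:
              # matches a line that appears in every run on an agent,
              # so the task effectively always runs
              - log-text: Building remotely
            script: |
              # placeholder: return the machine to the pool, pass or fail
              ./return-node.sh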

We’re starting to set up proper multi-node tests for functional testing with Glusto. I decided to use an Ansible playbook for the setup of the nodes, and our internal QE folks will be re-using this playbook for their setup.
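The real playbook isn’t shown here, but as a sketch of the idea, the node setup looks something like this; the group name, package, and tasks are hypothetical placeholders.

- hosts: gluster_nodes
  become: true
  tasks:
    - name: Install the GlusterFS server packages
      package:
        name: glusterfs-server
        state: present

    - name: Make sure glusterd is running on every node
      service:
        name: glusterd
        state: started
        enabled: true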

Converting a Jenkins XML job to JJB YAML is fun. It’s best to look at the UI and read the XML to get an idea of what the job does, then write a YAML job that does something close to that. Once you have the YAML, convert it back to XML and diff it against the existing job. I use xmllint -c14n to canonicalize both XML files and then colordiff to compare them, which gives me an idea of what I’ve added or removed. There will always be some differences, because JJB fills in some sane defaults.
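Concretely, the comparison workflow looks roughly like this. The file and job names are placeholders; jenkins-jobs test only writes the generated XML locally and never touches the live Jenkins.

# Generate XML from the JJB YAML definitions
jenkins-jobs test jobs/ -o generated/

# Canonicalize both versions so whitespace and attribute order don't add noise
xmllint -c14n existing-job.xml > existing.canon.xml
xmllint -c14n generated/my-job > generated.canon.xml

# The remaining differences are real changes plus JJB's defaults
colordiff -u existing.canon.xml generated.canon.xml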

Image credit: aldenjewell 1951 Plymouth Assembly Line (license)

Migrating Gerrit from H2 to PostgreSQL

The last 3 months have been busy and challenging. I’ve moved cities and changed jobs. I now work for Red Hat on the Gluster project as a CI/Automation Engineer. After 2 years in Delhi, I’ve moved to Mumbai, right when the monsoons hit. I feel like I haven’t seen the city properly dry ever since we moved. On the plus side, I’ve gotten back into running again. Despite the crazy rains, I see other crazy people like me out running every weekend 🙂

Gateway of India

One of the first things I did when I started in May was to make a list of things I thought we needed to fix. The first thing I noticed was that we were running quite an old version of Gerrit on H2. For some reason, it fell over every couple of days, and at that point someone had to log in to the server and restart Gerrit.

The top two potential causes were a large H2 database file and an old version of Gerrit, so I decided to upgrade Gerrit. The first step was to move from H2 to PostgreSQL. I looked all over the internet for how to convert from H2. Eventually, I decided the best way to go about it was to export everything to CSV and import the CSV files into PostgreSQL. Here’s a rough idea of how it went:

  1. Get the list of tables.
  2. Use regular expressions in vim to generate the SQL to export all the tables (a sketch of the generated statements follows this list).
  3. Create the PostgreSQL database and change Gerrit settings.
  4. Initialize Gerrit again so it will create the appropriate tables in PostgreSQL.
  5. Import the CSV files into PostgreSQL.
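For step 2, the generated export statements look roughly like this. H2’s CSVWRITE writes a query result out as a CSV file; the table names and paths here are only examples, since the actual tables depend on your Gerrit schema.

-- One CSVWRITE call per table, generated from the table list with a vim regex
CALL CSVWRITE('/tmp/export/accounts.csv', 'SELECT * FROM ACCOUNTS');
CALL CSVWRITE('/tmp/export/changes.csv', 'SELECT * FROM CHANGES');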

Sounds suspiciously easy. Except it’s not. I learned a fun thing about PostgreSQL’s COPY: the HEADER parameter only means that the first line is a header and will be skipped. If the order of columns in the CSV file doesn’t match the order of columns in the PostgreSQL table, COPY does nothing to reconcile them.

If your CSV has the following:

id, email, name 

And your table has the following:

id, name, email 

PostgreSQL doesn’t do the intuitive thing on its own.

You have to define the column order explicitly. For some reason I didn’t run into this in staging; perhaps H2 generated the CSV in the right order there. My eventual script specified the order when importing.
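The fix is to list the columns in the COPY statement in the order they actually appear in the CSV. A minimal sketch using the example columns from above (table name and path are placeholders):

-- The column list tells COPY what order the fields appear in inside the file;
-- HEADER only skips the first line, it does not match columns by name.
COPY accounts (id, email, name)
FROM '/tmp/export/accounts.csv'
WITH (FORMAT csv, HEADER true);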

NOTE: If you’re upgrading Gerrit, the table names or columns may be different. I recommend generating them on your own based on what’s in your database.