Author: nigelb

  • Upgrading the Gluster Jenkins Server

    I’ve been wanting to work on upgrading the build.gluster.org setup for ages. There’s a lot about that setup that isn’t ideal given how people use Jenkins these days.

    We used Unix user accounts for access to Jenkins. That meant Jenkins needed to read /etc/passwd, and everyone had SSH access via passwords by default. Very often, a username wasn’t tied to an actual email address, so I had to guess who owned an account based on their usernames elsewhere. The setup was also open to brute-force attacks, and the only way to change a password was to log in to the server and run the passwd command. We fixed this a few months ago by switching our authentication to GitHub. Access control is now handled through a GitHub group, and that membership is what grants extra permissions; merely logging in gives you no more permissions than not logging in.

    Our todo list during the Jenkins upgrade

    The Jenkins community now recommends not running jobs on the master node at all, but our old setup depended on certain jobs always running on master. One by one, I’ve eliminated them so that they can run on any node agent. The last job left is our release job. We make the tarball from every release available on an FTP-like server, and in our old setup, that server and Jenkins were the same machine: the job ran on master and depended on both being the same host. We decided to split up the two systems so we could take Jenkins down without any issue. We intend to fix the release job with an SCP command at the end of it that copies the artifacts to the FTP-like server.
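
    As a rough sketch, that final step could be as simple as the following; the host name, target path, and the RELEASE_VERSION variable are placeholders I made up, not our actual download server:

    # Hypothetical last build step of the release job: push the release
    # tarball to the FTP-like download server over SSH.
    scp "glusterfs-${RELEASE_VERSION}.tar.gz" \
        "jenkins@download.example.org:/srv/downloads/glusterfs/${RELEASE_VERSION}/"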

    One of the Red Hat buildings in Brno

    Now, we have a Jenkins setup that I’m happy with. At this point, we’ve fixed the vast majority of the annoying CI-related infra issues. In a few years, we’ll rip them all out and re-do them. For now, spending a week with my colleague in Brno working on an infra sprint has been well worth our time and energy.

  • Catching up with Infrastructure Debt

    If you run an infrastructure, there’s a good chance you have some debt tucked away in your system somewhere. There’s also a good chance that you’re not getting enough time to pay that debt down. There will most likely be a good reason why something is done the way it is; that’s just how things are in general. Since I joined Gluster, I’ve worked with my fellow sysadmin to tackle our large pile of infrastructure debt over time. It goes like this:

    • We run a pretty old version of Gerrit on CentOS 5.
    • We run a pretty old version of Jenkins on CentOS 6.
    • We run CentOS 6 for all our regressions machines.
    • We run CentOS 6 for all our build machines.
    • We run NetBSD on Rackspace in a setup that is not easy to automate and is not currently part of our automation.
    • We have a bunch of physical machines in a DC, but we haven’t had time to move our VMs over and use Rackspace as burstable capacity.

    That is in no way an exhaustive list. But we’ve managed to tackle 2.5 items from the list. Here’s what we did, in order:

    • Upgraded Gerrit to the then-latest version.
    • Set up a Gerrit staging instance to regularly test newer versions ahead of scheduling a migration.
    • Created new CentOS 7 VMs on our hardware and moved the builds in there.
    • Moved Gerrit over to a new CentOS 7 host.
    • Wrote Ansible scripts to manage most of Gerrit, though they are currently deployed only to staging.
    • Upgraded Jenkins to the latest LTS.
    • Moved Jenkins to a CentOS 7 host (done last week; more details coming up!).

    If I look at it, it almost looks like I’ve failed. But, as with most infrastructure debt, you touch one thing, realize it’s broken in some way, and discover that someone depended on that breakage. What I’ve had to do is pick and prioritize what to spend my time on. At the end of the day, I have to justify my time in terms of moving the project forward. Fixing the infrastructure debt around Gerrit was a great example: I could actually focus on it with everyone’s support. Fixing Jenkins was a priority since we wanted to use some of the newer features; again, I had backing to do that. Moving things to our own hardware is where it gets tricky. There are some financial goals we can hit if we make the move, but outside of that, we have no reason to move. Long-term, though, we want to be mostly on our own hardware, since we spent money on it. This is, understandably, going slowly. There’s a subtle capacity difference, and the noisy-neighbor problem affects us quite strongly whenever we try to do anything in this regard.

  • Problems You Might Run Into Upgrading PostgreSQL on Fedora

    I was trying to test some code today and realized I needed a working PostgreSQL server. When I tried to start the server, it failed with this error:

    Aug 23 15:36:10 athena systemd[1]: Starting PostgreSQL database server...
    Aug 23 15:36:10 athena postgresql-check-db-dir[20713]: An old version of the database format was found.
    Aug 23 15:36:10 athena postgresql-check-db-dir[20713]: Use 'postgresql-setup --upgrade' to upgrade to version '9.6'
    Aug 23 15:36:10 athena postgresql-check-db-dir[20713]: See /usr/share/doc/postgresql/README.rpm-dist for more information.
    Aug 23 15:36:10 athena systemd[1]: postgresql.service: Control process exited, code=exited status=1
    Aug 23 15:36:10 athena systemd[1]: Failed to start PostgreSQL database server.
    Aug 23 15:36:10 athena systemd[1]: postgresql.service: Unit entered failed state.
    Aug 23 15:36:10 athena systemd[1]: postgresql.service: Failed with result 'exit-code'.

    Ah, I upgraded to F26 recently, and I suppose that came with a new version of PostgreSQL. I figured fixing this should be trivial. Well, not exactly. When I first ran the command, it asked me to install the postgresql-upgrade package. Once I had installed it, the command threw a strange error.
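
    For reference, those two steps were roughly the following (on Fedora, as root); a minimal sketch of the commands the error message points you at:

    # Install the helper package that postgresql-setup asks for,
    # then re-run the data directory upgrade.
    dnf install postgresql-upgrade
    postgresql-setup --upgrade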

    [root@athena pgsql]# postgresql-setup --upgrade
     * Upgrading database.
    ERROR: The pidfile '/var/lib/pgsql/data-old/postmaster.pid' exists.  Verify that there is no postmaster
           running the /var/lib/pgsql/data-old directory.
    ERROR: Upgrade failed.
     * See /var/lib/pgsql/upgrade_postgresql.log for details.

    The /var/lib/pgsql/data-old/postmaster.pid file doesn’t even exist. It took me some time to realize that it’s actually looking at /var/lib/pgsql/data/postmaster.pid, which does exist. I think at some point I had a running PostgreSQL server and didn’t shut down the computer cleanly, which left behind a stale PID file. Once I renamed the PID file, the upgrade command worked.
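
    In other words, the fix boiled down to moving the stale PID file out of the way and retrying. A minimal sketch, using the paths from the error above (the .stale suffix is just a name I picked):

    # The server isn’t actually running, so the PID file is stale;
    # move it aside and retry the upgrade.
    mv /var/lib/pgsql/data/postmaster.pid /var/lib/pgsql/data/postmaster.pid.stale
    postgresql-setup --upgrade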

  • Clang Analyze for Gluster

    Deepshika recently worked on getting a clang analyze job for Gluster set up with Jenkins. The job worked on both our laptops, but not on our build machines, which run CentOS. The problem appears to be that clang on CentOS is 3.4 versus 4.0 on Fedora 26. The build fails because one of our dependencies needs -fno-stack-protector, which wasn’t in clang until 3.8 or so. It’s been on my list of things to fix, and I realized that the right way would be to use a newer version of clang, which Fedora already ships. I could have compiled clang myself or built 4.0 packages, but I didn’t want to end up maintaining the package for our specific install. I decided to reduce complexity by doing the compilation inside a Fedora 26 chroot, which seemed the least likely to add maintenance burden. When I looked for documentation on how to go about this, I couldn’t find much. The mock man page, however, is very well written, and that’s all I needed. This is the script I used, with comments about each step.

    #!/bin/bash

    # Create a new chroot
    sudo mock -r fedora-26-x86_64 --init

    # Install the build dependencies
    sudo mock -r fedora-26-x86_64 --install langpacks-en glibc-langpack-en automake autoconf libtool flex bison openssl-devel libxml2-devel python-devel libaio-devel libibverbs-devel librdmacm-devel readline-devel lvm2-devel glib2-devel userspace-rcu-devel libcmocka-devel libacl-devel sqlite-devel fuse-devel redhat-rpm-config clang clang-analyzer git

    # Copy the Gluster source code inside the chroot at /src
    sudo mock -r fedora-26-x86_64 --copyin $WORKSPACE /src

    # Execute commands in the chroot to build with clang
    sudo mock -r fedora-26-x86_64 --chroot "cd /src && ./autogen.sh"
    sudo mock -r fedora-26-x86_64 --chroot "cd /src && ./configure CC=clang --enable-gnfs --enable-debug"
    sudo mock -r fedora-26-x86_64 --chroot "cd /src && scan-build -o /src/clangScanBuildReports -v -v --use-cc clang --use-analyzer=/usr/bin/clang make"

    # Copy the output back into the working directory
    sudo mock -r fedora-26-x86_64 --copyout /src/clangScanBuildReports $WORKSPACE/clangScanBuildReports

    # Clean up the chroot
    sudo mock -r fedora-26-x86_64 --clean
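
    If you want to try this outside of Jenkins, pointing WORKSPACE at a Gluster checkout is enough. A rough sketch, assuming the script above is saved as clang-analyze.sh (the file name is made up) and that mock is installed; the scan-build HTML reports end up under $WORKSPACE/clangScanBuildReports:

    # Clone the source, point WORKSPACE at it, and run the mock-based script.
    git clone https://github.com/gluster/glusterfs.git /tmp/glusterfs
    WORKSPACE=/tmp/glusterfs bash clang-analyze.sh
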
  • Crucial Conversations Training

    In the first week of July, I attended an internal training on Crucial Conversations. I’ve been eyeing that training ever since I started at Red Hat. It’s a skill that I’m poor at: I tend to avoid difficult conversations, and when I do have them, I let emotions get the better of me or take the path with the least amount of conflict. The training involves hands-on practice with the methods and techniques taught in the book. I’d read about half the book before I went into the training, but the training was far more effective.

    One of the possible outcomes of a crucial conversation

    I learned two important lessons from this training. One, you can get into a conversation and, very often, it turns into a question of who’s winning and who’s losing; at that point, it’s very likely you’ve lost track of the original goal of the conversation. The second is to notice when a conversation is escalating into aggression or silence, and to de-escalate it first. Otherwise, one of two things will happen: the other person gets angry and the conversation becomes a conflict, or they agree to everything while being utterly unhappy about it.

    Image Credit: Jule Falk on Flickr (license)