Catching up with Infrastructure Debt

If you run an infrastructure, there’s a good chance you have some debt tucked in your system somewhere. There’s also a good chance that you’re not getting enough time to fix those debts. There will most likely be a good reason why something is done in the way it is. This is just how things are in general. After I joined Gluster, I’ve worked with my fellow sysadmin to tackle our large infrastructure technical debt over the course of time. It goes like this:

  • We run a pretty old version of Gerrit on CentOS 5.
  • We run a pretty old version of Jenkins on CentOS 6.
  • We run CentOS 6 for all our regression machines.
  • We run CentOS 6 for all our build machines.
  • We run NetBSD on Rackspace in a setup that is not easy to automate nor is it currently part of our automation.
  • We have a bunch of physical machines in a DC, but we haven’t had time to move our VMs over and use Rackspace as burstable capacity.

That is in no way an exhaustive list. But we’ve managed to tackle 2.5 items from the list. Here’s what we did, in order:

  • Upgraded Gerrit to the then latest version.
  • Set up a Gerrit staging instance to test newer versions regularly, so we can schedule migrations.
  • Created new CentOS 7 VMs on our hardware and moved the builds in there.
  • Moved Gerrit over to a new CentOS 7 host.
  • Wrote Ansible scripts to manage most of Gerrit, though they are currently deployed only to staging.
  • Upgraded Jenkins to the latest LTS.
  • Moved Jenkins to a CentOS 7 host (Done last week, more details coming up!)

If I look at it, it almost looks like I’ve failed. But, as with most infrastructure debt, you touch one thing and realize it’s broken in some way and someone depended on that breakage. I’ve had to pick and prioritize what I spend my time on. At the end of the day, I have to justify my time in terms of moving the project forward. Fixing the infrastructure debt for Gerrit was a great example. I could actually focus on it with everyone’s support. Fixing Jenkins was a priority since we wanted to use some of the newer features, and again I had backing to do that. Moving things to our hardware is where things get tricky. There are some financial goals we can hit if we make the move, but outside of that, we have no reason to move. Long-term, though, we want to be mostly on our own hardware, since we spent money on it. This is, understandably, going slowly. There’s a subtle capacity difference, and the noisy neighbor problem affects us quite strongly when we try to do anything in this regard.

Problems You Might Run Into Upgrading PostgreSQL on Fedora

I was trying to test some code today and I realized I needed a working PostgreSQL server. When I tried to start the server, it failed with this error.

Aug 23 15:36:10 athena systemd[1]: Starting PostgreSQL database server...
Aug 23 15:36:10 athena postgresql-check-db-dir[20713]: An old version of the database format was found.
Aug 23 15:36:10 athena postgresql-check-db-dir[20713]: Use 'postgresql-setup --upgrade' to upgrade to version '9.6'
Aug 23 15:36:10 athena postgresql-check-db-dir[20713]: See /usr/share/doc/postgresql/README.rpm-dist for more information.
Aug 23 15:36:10 athena systemd[1]: postgresql.service: Control process exited, code=exited status=1
Aug 23 15:36:10 athena systemd[1]: Failed to start PostgreSQL database server.
Aug 23 15:36:10 athena systemd[1]: postgresql.service: Unit entered failed state.
Aug 23 15:36:10 athena systemd[1]: postgresql.service: Failed with result 'exit-code'.

Ah, I upgraded to F26 recently and I suppose that came with a new version of PostgreSQL. I figured fixing this should be trivial. Well, not exactly. When I first ran the command, it asked me to install the postgresql-upgrade package. Once I did install it, the command threw a strange error.

[root@athena pgsql]# postgresql-setup --upgrade
 * Upgrading database.
ERROR: The pidfile '/var/lib/pgsql/data-old/postmaster.pid' exists.  Verify that there is no postmaster
       running the /var/lib/pgsql/data-old directory.
ERROR: Upgrade failed.
 * See /var/lib/pgsql/upgrade_postgresql.log for details.

The /var/lib/pgsql/data-old/postmaster.pid file doesn’t even exist. It took me some time to realize that it’s actually looking at /var/lib/pgsql/data/postmaster.pid, which does exist. I think at some point, I had a running PostgreSQL server and I didn’t shut down the computer cleanly. This led to a stale PID file. Once I renamed the PID file, the upgrade command worked.
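Before renaming a PID file like this, it may be worth checking that it really is stale. Here is a minimal sketch in Python, assuming the first line of the file holds the PID of the server process; the function names are my own and not part of any PostgreSQL tooling.

```python
import os
from pathlib import Path


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        # os.kill treats zero and negative values as process groups.
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale_pidfile(pidfile: Path) -> bool:
    """Return True if the PID file exists but its process is gone.

    Assumption: the first line of the file is the PID.
    """
    if not pidfile.exists():
        return False
    lines = pidfile.read_text().splitlines()
    if not lines:
        raise ValueError(f"{pidfile} is empty")
    return not pid_is_running(int(lines[0].strip()))


if __name__ == "__main__":
    print(is_stale_pidfile(Path("/var/lib/pgsql/data/postmaster.pid")))
```

A PID can be reused by an unrelated process after a reboot, so a "not stale" answer here is only a hint to look closer, not proof that a PostgreSQL server is running.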

Pycon Pune 2017

I haven’t attended a Pycon since 2013. Now that I’ve started writing this post, I realize it’s been nearly 4 years, and Python is the language I use the most. The last Pycon was a great place to meet people and make friends. Among others, I recall clearly that I met Sankarshan, my current manager, for the first time there. Pycon Pune is also the first time I’m speaking at a single-track event. There’s something scary about having so many people pay attention to you and having to make sure they’re not bored.

The venue for the event was gorgeous (as evidenced by the group picture that nearly looks photoshopped!) and the event was well organized, I have to say. My only critical feedback is the lack of a space outside the main conference area for a hallway track. The auditorium had air conditioning and everyone went in because of it. A little bit of space with power and air conditioning, for anyone who wanted to have a conversation, would have been highly beneficial. I like attending large events, but sometimes, the introvert in me takes over and I want to spend more time either alone or with less interaction. Linuxcon EU was great about this, going so far as to have a quiet space, which I found useful.

I had trepidations about my talk. It wasn’t exactly about solving a problem with Python. It was about problems I’ve faced throughout my career and how I’ve seen other projects solve them. Occasionally, those problems or solutions were related to Python, sometimes they were related to my work on Gluster, and often to Mozilla. I’m glad it was well received, and I had a lot of conversations with people after the talk about the pains they face at their own organizations. I’ll be the first to admit that I don’t practice what I preach. We’re still working on getting our release management to a better place.

Some of the memorable sessions include – Hanza’s keynote about his open source life, Katie’s talk about accessibility, Dr. Terri’s talk about security, Noufal’s talk about CFFI. All videos should be online on the Pycon Pune channel, including mine.

Scraping the Indian Judicial System

This blog post has been sitting in my drafts folder for a long time. It’s time I finished it. A while ago, I did some work for Vidhi, scraping the Supreme Court of India website. Later on, I started some parts of the work to scrape a couple of High Courts. Here are a few quick lessons from my experience:

  • Remember to be a good citizen. Scrape with a delay between each request and a unique user-agent. This may not always work, but as far as possible, make it easy for them to figure out you’re scraping.
  • ASP-based websites are difficult to scrape. A bunch of Indian court websites are built on ASP and you can’t submit forms without JavaScript. I couldn’t get phantomjs or any of those libs to work either. If you can get them working, please talk to me! Sandeep has taken over from me and I’m sure he’ll find it useful.
  • Data is inconsistently inconsistent. This is a problem. You can make no assumptions about the data while scraping. The best you can do is collect everything and find patterns later. For example, a judge’s name may be written in different ways from case to case. You can normalize them later.
  • These sites aren’t highly available, so plan for retrying and backing off in your code. In fact, I’d recommend running the scraper overnight and never in the morning from 8 am to 12 pm.
  • Assume failure. Be ready for it. The first few times you run the code, you have to keep a close watch. It will fail in many different ways and you should be ready to add another except clause to your code 🙂
  • Get more data than you need, because re-scraping will cost time.
  • Gujarat High Court has a JavaScript-based frontend. There’s an XHR endpoint that returns JSON. It’s the only site I’ve scraped which had a pleasant developer experience.
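The points above about delays, a recognizable user-agent, and retrying with backoff can be sketched roughly as follows. This is not the code I used for the court websites; it is a minimal illustration using only the Python standard library, and the user-agent string, the delay values, and the function names are all placeholders.

```python
import time
import urllib.request

# Hypothetical user-agent; use one that identifies you and how to reach you.
USER_AGENT = "example-court-scraper/0.1 (contact: you@example.org)"


def backoff_delays(retries, base=2.0, cap=300.0):
    """Yield waits that double each time: base, 2*base, 4*base, up to cap."""
    for attempt in range(retries):
        yield min(cap, base * (2 ** attempt))


def fetch(url, retries=5, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, waiting longer after each failed attempt."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    last_error = None
    for delay in backoff_delays(retries):
        try:
            with opener(request, timeout=30) as response:
                return response.read()
        except OSError as error:
            # URLError, timeouts and connection resets are all OSError subclasses.
            last_error = error
            sleep(delay)
    raise RuntimeError(f"giving up on {url}") from last_error


def scrape(urls, pause=5.0, fetcher=fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to stay polite."""
    pages = {}
    for url in urls:
        pages[url] = fetcher(url)
        sleep(pause)
    return pages
```

The `opener` and `sleep` parameters exist only so the retry logic can be exercised without touching the network. In a real run you would also save every raw response to disk, since re-scraping costs time.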

Upgrading Dependencies on a Django Website

Our website has been running Django 1.6 since it was built in 2014. Django has moved on to newer versions since then. I’ve been contemplating updating it, but never found enough time. At one point, we decided to scrap the Django website and move to WordPress. Eventually, the convenience of Django won out. This meant I had the unenviable task of upgrading Django. It took a good 2 weeks of work in total. I took a few breaks to solve problems that I ran into. Here’s a summary of the problems I found and how I solved them.

Django-CMS is only compatible with Django 1.8 and not Django 1.9. I didn’t catch it the first time. That was my first mistake. After that, I pinned it to the latest version of Django 1.8.

We were using South and I had to convert from South to Django’s migrations. When I did this migration, I got a traceback from Django-CMS at the python manage.py migrate --fake-initial step. Turns out one of those migrations needs to be faked. So, I ended up doing this:

python manage.py migrate --fake-initial cms --app
python manage.py migrate --fake cms 0002_auto_20140816_1918
python manage.py migrate --fake-initial cms
python manage.py migrate --fake-initial

We had a custom submenu plugin. This just plain stopped working. Turns out Django CMS made a few backward incompatible changes causing this breakage. This caused most pages to plain fail. It took me a long time to realize I should turn off all the plugins and enable them one at a time to discover the failure. The traceback in this instance didn’t help me pinpoint the error at all!
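The "enable them one at a time" approach can be written down as a small loop. This is only a sketch of the idea in Python: `works_with` is a hypothetical callable that you would have to write yourself, for example one that starts the site with the given plugins enabled and requests a page.

```python
def find_failing_plugins(plugins, works_with):
    """Enable plugins one at a time and return the ones that break the site.

    `works_with` is a hypothetical check: it takes the list of enabled
    plugins and returns True if the site still works with them.
    """
    enabled = []
    failing = []
    for plugin in plugins:
        candidate = enabled + [plugin]
        if works_with(candidate):
            enabled = candidate  # keep it on and move to the next plugin
        else:
            failing.append(plugin)  # leave it off so later checks stay clean
    return failing
```

Leaving a failing plugin switched off before trying the next one keeps each check focused on a single new plugin, which is what makes the culprit obvious even when the traceback is not.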

We shipped a bunch of dependencies with our code instead of installing them. A few plugins had blocker bugs, which we fixed in our “fork” and shipped. The bugs were now fixed in the plugins themselves, so we could remove the in-code forks and just use them as dependencies. This bit was annoying but not too painful. Once I removed them from the code, we had a lighter footprint for our code and an easier path to upgrades.

This took me about one full month of work on and off. I would often run into problems. I forced myself to take a break when I was stuck. It often made me think of different approaches to the problem at hand. I’ll be handing over this codebase to someone else soon, and I’m much happier at the state I’m leaving it. It’s better than what it used to be when I started. After all, that’s pretty much what our job is, right?