Blog

  • Migrating Gerrit from H2 to PostgreSQL

    The last 3 months have been busy and challenging. I’ve moved cities and changed jobs. I now work for Red Hat on the Gluster project as a CI/Automation Engineer. After 2 years in Delhi, I’ve moved to Mumbai, right when the monsoons hit. I feel like I haven’t seen the city properly dry ever since we moved. On the plus side, I’ve gotten back into running again. Despite the crazy rains, I see other crazy people like me out running every weekend 🙂

    Gateway of India

    One of the first things I did when I started in May was to make a list of things I thought needed fixing. The first thing I noticed was that we were running quite an old version of Gerrit on top of H2. For some reason, it fell over every couple of days, and at that point someone had to log in to the server and restart Gerrit.

    The top two potential causes were a large H2 database file and an old version of Gerrit. So I decided to upgrade Gerrit, and the first step was to move from H2 to PostgreSQL. I looked all over the internet for ways to convert from H2. Eventually, I decided the best approach was to export everything to CSV and import the CSV files into PostgreSQL. Here's a rough idea of how it went (with a sketch after the list):

    1. Get the list of tables.
    2. Use regular expressions in vim to generate the SQL to export all the tables.
    3. Create the PostgreSQL database and change Gerrit settings.
    4. Initialize Gerrit again so it will create the appropriate tables in PostgreSQL.
    5. Import the CSV files into PostgreSQL.
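
    Step 2 lends itself to a small script just as well as vim regexes. Here's a minimal sketch in Python, assuming hypothetical table names (the real list for step 1 can come from H2's INFORMATION_SCHEMA.TABLES); H2's CSVWRITE function does the actual export:

    tables = ["accounts", "changes", "patch_sets"]  # illustrative; not Gerrit's real schema

    for table in tables:
        # CSVWRITE runs the query and writes the result, header row included,
        # to a CSV file on the server's filesystem.
        print(f"CALL CSVWRITE('/tmp/{table}.csv', 'SELECT * FROM {table}');")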

    Sounds suspiciously easy. Except it's not. I learned a fun thing about PostgreSQL's COPY. The HEADER option means that the first line of the file is a header and will be ignored. If the order of columns in the CSV file doesn't match the order in the table, COPY does nothing to reconcile them.

    If your CSV has the following:

    id, email, name 

    And your table has the following:

    id, name, email 

    PostgreSQL doesn’t do the intuitive thing on it’s own.

    You have to specify the column mapping explicitly. For some reason, I didn't run into this in staging; perhaps H2 generated those CSVs in the right order. My eventual script specified the column order when importing, along the lines of the sketch below.
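
    To make that concrete, here's a hedged sketch of the import side. The psycopg2 call, the table, and the columns are my stand-ins for illustration, not the actual script or Gerrit's schema:

    import psycopg2

    # Sketch only: connection string, table, and column names are illustrative.
    conn = psycopg2.connect("dbname=reviewdb")
    with conn, conn.cursor() as cur, open("/tmp/accounts.csv") as f:
        # HEADER merely skips the CSV's first line; it does NOT match columns
        # by name, so the column list must mirror the CSV's column order.
        cur.copy_expert(
            "COPY accounts (id, email, name) FROM STDIN WITH (FORMAT csv, HEADER)",
            f,
        )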

    NOTE: If you’re upgrading Gerrit, the table names or columns may be different. I recommend generating them on your own based on what’s in your database.

  • Scraping the Indian Judicial System

    This blog post has been sitting in my drafts folder for a long time. It's time I finished it. A while ago, I did some work for Vidhi, scraping the Supreme Court of India website. Later on, I started some parts of the work to scrape a couple of High Courts. Here are a few quick lessons from my experience:

    • Remember to be a good citizen. Scrape with a delay between requests and use a distinctive user-agent. This may not always work, but as far as possible, make it easy for them to figure out that you're scraping.
    • ASP-based websites are difficult to scrape. A bunch of Indian court websites are built on ASP, and you can't submit forms without JavaScript. I couldn't get phantomjs or any of those libs to work either. If you can get them working, please talk to me! Sandeep has taken over from me and I'm sure he'll find it useful.
    • Data is inconsistently inconsistent. This is a problem. You can make no assumptions about the data while scraping. The best you can do is collect everything and find patterns later. For example, a judge’s name may be written in different ways from case to case. You can normalize them later.
    • These sites aren’t highly available, so plan for retrying and backing off in your code. In fact, I’d recommend running the scraper overnight and never in the morning from 8 am to 12 pm.
    • Assume failure. Be ready for it. The first few times you run the code, you have to keep a close watch. It will fail in many different ways and you should be ready to add another except clause to your code 🙂
    • Get more data than you need, because re-scraping will cost time.
    • Gujarat High Court has a JavaScript-based frontend. There’s an XHR endpoint that returns JSON. It’s the only site I’ve scraped which had a pleasant developer experience.
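
    Putting a few of these lessons together, here's a rough Python sketch of the courtesy and resilience points above. The URL, delays, and User-Agent string are placeholders, not what any of the real scrapers used:

    import time
    import requests

    # Identify yourself so site admins can tell what's hitting them.
    HEADERS = {"User-Agent": "court-scraper (contact: you@example.com)"}  # placeholder

    def fetch(url, retries=5, base_delay=2.0):
        """Fetch a page, retrying with exponential backoff on any failure."""
        for attempt in range(retries):
            try:
                resp = requests.get(url, headers=HEADERS, timeout=30)
                resp.raise_for_status()
                return resp.text
            except requests.RequestException:
                # These sites fall over often; back off and try again.
                time.sleep(base_delay * (2 ** attempt))
        return None  # give up for now; queue the URL for a later run

    for url in ["https://example.org/case?id=1"]:  # placeholder URL list
        page = fetch(url)
        time.sleep(2)  # be a good citizen: pause between requests
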
  • Cooked

    I’ve been reading Michael Pollan’s Cooked and watching the Netflix show. This is my favorite line from the book:

    Easy. You want Americans to eat less? I have the diet for you. Cook it for yourself. Eat anything you want–just as long as you’re willing to cook it yourself.

    There’s a very similar line in the show too:

    Eat anything you want. Enjoy all of your food. Anything you want. You want apple pie? Have a whole apple pie tonight. You want cookies with that apple pie? And ice cream with that apple pie? I’ll allow you to eat all the cookies, all the ice cream, and all the pie you can have tonight. I’m just gonna ask you to do one thing. Make all of them. Make the apple pie, make the ice cream, make the cookies. And you know what I know is gonna happen? You’re not having apple pie, ice cream, or cookies tonight.

    Image Credit: Justin making an apple pie by Justin Leonard on Flickr.

  • Upgrading Dependencies on a Django Website

    Our website has been running Django 1.6 since it was built in 2014. Django has moved on to newer versions since then. I'd been contemplating updating it, but never found enough time. At one point, we decided to scrap the Django website and move to WordPress. Eventually, the convenience of Django won out. This meant I had the unenviable task of upgrading Django. It took a good two weeks of work in total, and I took a few breaks to solve problems that I ran into. Here's a summary of the problems I found and how I solved them.

    Django-CMS was only compatible with Django 1.8 and not Django 1.9. I didn't catch that the first time, which was my first mistake. After that, I pinned us to the latest release of Django 1.8, as sketched below.
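
    For reference, the pin amounts to something like this in a pip requirements file (a sketch; I won't reconstruct the exact django-cms version here):

    Django>=1.8,<1.9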

    We were using South, and I had to convert from South to Django's migrations. When I did this migration, I got a traceback from Django-CMS at the python manage.py migrate --fake-initial step. Turns out one of those migrations needs to be faked. So, I ended up doing this:

    python manage.py migrate --fake-initial cms
    python manage.py migrate --fake cms 0002_auto_20140816_1918
    python manage.py migrate --fake-initial cms
    python manage.py migrate --fake-initial

    We had a custom submenu plugin, which just plain stopped working. Turns out Django CMS had made a few backward-incompatible changes that caused the breakage, and most pages failed outright. It took me a long time to realize I should turn off all the plugins and enable them one at a time to isolate the failure. The traceback in this instance didn't help me pinpoint the error at all!

    We shipped a bunch of dependencies with our code instead of installing them as packages. A few plugins had blocker bugs, which we fixed in our “fork” and shipped. Those bugs have since been fixed upstream, so we could remove the in-code forks and just use the plugins as dependencies. This bit was annoying but not too painful. Once I removed them from the code, we had a lighter footprint and an easier path to upgrades.

    This took me about one full month of work, on and off. I would often run into problems, and I forced myself to take a break whenever I was stuck; it often made me think of different approaches to the problem at hand. I'll be handing over this codebase to someone else soon, and I'm much happier with the state I'm leaving it in. It's better than what it was when I started. After all, that's pretty much what our job is, right?

  • New Delhi Marathon 2016

    On Sunday, I finished my first marathon, the New Delhi Marathon, in 5:47:13. It was 42.195 km of fun, pain, and runner's high. If the bib numbers are sequential in order of registration (I suspect they are), I was the 8th person to register for the full marathon. That's how excited I was about a full marathon right in the heart of Delhi. The criticism in this post is there because we want you to do better next year. You guys managed great things for the first edition, and we'd love to see an even better event next year.

    From the outset, the quality of this race was going to depend on its route. Having it pass through some of the major landmarks of Delhi was great. You guys pulled this one off, hats off to you. Extra points for the course being AIMS-approved.

    At the start of the race, there were a good number of race marshals and police; they stopped traffic and guided us. This was a great feeling. The people in cars and bikes on the route were cheering us on too!

    Rocking on Vandemataram Marg

    From about the 30K mark (at around 9 am), I saw only three aid stations with water. The rest of them seemed to have run dry, and the volunteers were just sitting in chairs and chatting. To be clear, until this point, the race marshals and police were extremely helpful and cheery. As I was preparing for the race on Saturday night, I decided I'd rather carry extra weight than run out of water. In retrospect, that was the best decision I made. My advice to fellow runners: when in doubt, carry your own hydration.

    The website seemed to say that roads would reopen by 11 am, but I'm pretty sure we were navigating traffic around 9 am. I understand that this is not in your control, but an early warning would have been nice. What could have been in your control, though, is having route markers and/or race marshals until 11 am. That did not happen, as far as I could see.

    For my first marathon, I'm not happy with the last 2 hours of the race, which is what the final 12.2K took me. That's when I was reaching my breaking point, which is my fault, thanks to less-than-ideal training. It's also when the support from the race organizers dwindled to much less than I expected. I look forward to the next edition, where I can better my own time at the New Delhi Marathon. I hope the organizers will also give me a better experience.