This blog post has been sitting in my drafts folder for a long time. It’s time
I finished it. A while ago, I did some work for Vidhi, scraping the
Supreme Court of India website. Later on, I started some parts of the work
to scrape a couple of High Courts. Here are a few quick lessons from my
experience:
- Remember to be a good citizen. Scrape with a delay between each request and a
unique user-agent. This may not always work, but as far as possible, make it
easy for them to figure out you’re scraping (there’s a minimal sketch of this
after the list).
- ASP based websites are difficult to scrape. A bunch of Indian court websites
are built on ASP, and I couldn’t get phantomjs or any of those libs to work
either. If you can get them working, please talk to me! Sandeep has taken over
from me and I’m sure he’ll find it useful.
- Data is inconsistently inconsistent. This is a problem. You can make no
assumptions about the data while scraping. The best you can do is collect
everything and find patterns later. For example, a judge’s name may be
written in different ways from case to case. You can normalize them later
(there’s a small sketch of this after the list).
- These sites aren’t highly available, so plan for retrying and backing off in
your code (see the retry sketch after the list). In fact, I’d recommend running
the scraper overnight and never in the morning from 8 am to 12 pm.
- Assume failure. Be ready for it. The first few times you run the code, you
have to keep a close watch. It will fail in many different ways, and you
should be ready to add another except clause to your code :)
- Get more data than you need, because re-scraping will cost time.
- One of the sites I scraped exposes an endpoint that returns JSON. It’s the
only site I’ve scraped which had a pleasant API to work with.
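
Here’s a minimal sketch of the “good citizen” point above, assuming a Python
scraper built on `requests`. The URL, user-agent string, and delay are
placeholders, not the values I actually used:

```python
import time

import requests

# Placeholder values: substitute the site you're scraping and your own contact details.
BASE_URL = "https://example-court.gov.in/case_status"
HEADERS = {"User-Agent": "court-scraper/0.1 (contact: you@example.org)"}
DELAY_SECONDS = 5  # be generous; these servers are slow and fragile


def fetch(params):
    """Fetch one page, identifying ourselves and pausing after every request."""
    response = requests.get(BASE_URL, params=params, headers=HEADERS, timeout=30)
    response.raise_for_status()
    time.sleep(DELAY_SECONDS)  # fixed delay between requests
    return response.text
```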
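
The “normalize later” point can be as simple as a pass over the collected data
once you’ve seen how the names actually vary. This is a sketch with made-up
honorifics and an invented example; build the real rules from the variations in
your own dump:

```python
import re


def normalize_judge_name(raw):
    """Reduce the different spellings of a judge's name to one canonical form."""
    name = raw.strip().upper()
    # Strip honorifics that appear inconsistently (illustrative list only).
    name = re.sub(r"\b(HON'BLE|HONOURABLE|JUSTICE|MR|MS|SHRI|SMT)\b\.?", " ", name)
    # Drop leftover punctuation and collapse whitespace.
    name = re.sub(r"[.,]", " ", name)
    return re.sub(r"\s+", " ", name).strip()


# Both of these hypothetical variants normalize to "A K SHARMA":
#   "HON'BLE MR. JUSTICE A. K. SHARMA"
#   "Justice A K Sharma"
```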
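
And a sketch of the retry-and-backoff point, which is also where the extra
except clauses tend to accumulate. The exception list and timings here are a
starting point, not what the court sites actually require:

```python
import time

import requests


def fetch_with_retries(url, params=None, max_attempts=5, base_delay=10):
    """Retry a flaky request with exponential backoff before giving up."""
    for attempt in range(max_attempts):
        try:
            # In real code, reuse the polite headers and per-request delay from above.
            response = requests.get(url, params=params, timeout=60)
            response.raise_for_status()
            return response.text
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError) as exc:
            wait = base_delay * 2 ** attempt  # 10s, 20s, 40s, ...
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")
```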