This blog post has been sitting in my drafts folder for a long time. It’s time I finished it. A while ago, I did some work for Vidhi, scraping the Supreme Court of India website. Later on, I extended parts of that work to scrape a couple of High Courts. Here are a few quick lessons from my experience:
- Remember to be a good citizen. Scrape with a delay between each request and a unique user-agent. This won’t always be enough, but as far as possible, make it easy for the site’s operators to see that you’re scraping and to reach you if needed.
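A minimal sketch of what "good citizen" scraping looks like in practice, using only the standard library. The user-agent string and delay are illustrative values you’d tune for the site:

```python
import time
import urllib.request

# Identify the scraper clearly; the name and contact address are hypothetical examples
USER_AGENT = "court-scraper/1.0 (contact: you@example.org)"
DELAY_SECONDS = 5  # pause between requests; tune to what the site tolerates

def fetch(url: str) -> bytes:
    """Fetch a single page with an identifying user-agent, then pause."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    time.sleep(DELAY_SECONDS)  # be a good citizen between requests
    return body
```

The point of the unique user-agent is that if your scraper misbehaves, the operators can identify and contact you instead of silently blocking your IP.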
- Data is inconsistently inconsistent. This is a problem. You can make no assumptions about the data while scraping. The best you can do is collect everything and find patterns later. For example, a judge’s name may be written in different ways from case to case. You can normalize them later.
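As an illustration of normalizing after collection, here’s a sketch that collapses common variants of a judge’s name. The honorific patterns are examples I’ve made up for illustration; real court data will have many more variants than this:

```python
import re

def normalize_judge_name(raw: str) -> str:
    """Collapse common variants of a judge's name to one canonical form.

    The honorific list below is illustrative, not exhaustive.
    """
    name = raw.strip()
    # Strip leading honorifics (e.g. "HON'BLE MR. JUSTICE ...") -- illustrative patterns
    name = re.sub(r"^(HON'BLE\s+)?(MR\.|MS\.|MRS\.|DR\.)?\s*(JUSTICE\s+)?",
                  "", name, flags=re.I)
    name = re.sub(r",?\s*J\.?$", "", name)  # trailing ", J." suffix
    name = re.sub(r"\s+", " ", name)        # collapse runs of whitespace
    return name.title()
```

Because you collected the raw strings verbatim, you can rerun and refine this normalization as often as you like without touching the website again.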
- These sites aren’t highly available, so plan for retrying and backing off in your code. In fact, I’d recommend running the scraper overnight and never in the morning from 8 am to 12 pm.
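Retrying with exponential backoff can be sketched like this. `fetch` here stands in for any page-fetching callable that raises on error; the attempt count and base delay are assumptions to tune:

```python
import random
import time

def fetch_with_retry(fetch, url, max_attempts=5, base_delay=2.0):
    """Call fetch(url), retrying with exponential backoff on any failure.

    Assumes failures are transient; after max_attempts the last error
    propagates so the caller can record the URL for a later pass.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with a little jitter: 2s, 4s, 8s, ...
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The jitter matters if you ever run more than one scraper: it keeps them from retrying in lockstep and hammering the site at the same instant.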
- Assume failure. Be ready for it. The first few times you run the code, keep a close watch: it will fail in many different ways, and you should be ready to add another `except` clause to your code 🙂
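In practice that loop of adding `except` clauses looks something like this sketch. `scrape_one` is a hypothetical per-case scraper; each new failure mode you spot in the logs earns its own, more specific clause:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")

def scrape_all(case_ids, scrape_one):
    """Scrape every case, logging failures instead of crashing the whole run."""
    results, failures = {}, []
    for case_id in case_ids:
        try:
            results[case_id] = scrape_one(case_id)
        except ValueError as exc:   # e.g. a malformed page you've already seen
            log.warning("case %s: malformed page: %s", case_id, exc)
            failures.append(case_id)
        except Exception as exc:    # anything new: log it and keep going
            log.error("case %s: unexpected failure: %s", case_id, exc)
            failures.append(case_id)
    return results, failures
```

Keeping the failed IDs means a later run can re-scrape just those cases instead of starting over.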
- Get more data than you need, because re-scraping will cost time.