Better Problem Definition

I’m a core developer at Open Knowledge working on CKAN, the most widely used data catalog software. Early this year, we released version 2.2 of CKAN with a complete overhaul of the filestore. Amusingly, right after that, we started getting more and more complaints on the ckan-dev list about data loss from the old filestore. One of the many folks who reported it helped narrow it down to a particular file called persisted_state.json.

This file is created by a library called ofs. Every time a new file is added to the filestore, OFS does the following:

  • Read the persisted_state.json file.
  • Convert the JSON to a Python dict.
  • Add an element to this dict with the metadata of the new file.
  • Convert the dict back to JSON.
  • Write this new JSON back to the persisted_state.json file.
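
As a rough illustration, the five steps above amount to a read-modify-write cycle like the one below. This is a minimal sketch, not the actual ofs code: the function name add_file_metadata and the shape of the metadata are made up for the example.

```python
import json


def add_file_metadata(state_path, label, metadata):
    """Hypothetical helper mirroring the five steps; not the real ofs API."""
    # Steps 1 and 2: read persisted_state.json and convert the JSON to a dict.
    with open(state_path) as f:
        state = json.load(f)

    # Step 3: add an element holding the new file's metadata.
    state[label] = metadata

    # Steps 4 and 5: convert the dict back to JSON and overwrite the file.
    with open(state_path, "w") as f:
        json.dump(state, f)

    return state
```

Nothing here stops a second caller from reading the file between the first caller's read and write.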

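This caused concurrency problems when files were added to the filestore at high frequency: two uploads could read the same state, and whichever wrote last silently discarded the other’s entry. Eventually, that led to data loss. Oh joy.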
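
To make the failure concrete, here is a small sketch that replays the unlucky interleaving by hand instead of using real threads, so the result is the same on every run. The helpers read_state and write_state and the file names are assumptions for the example, not ofs functions.

```python
import json
import os
import tempfile


def read_state(path):
    """Read the state file and return it as a dict."""
    with open(path) as f:
        return json.load(f)


def write_state(path, state):
    """Overwrite the state file with the given dict."""
    with open(path, "w") as f:
        json.dump(state, f)


path = os.path.join(tempfile.mkdtemp(), "persisted_state.json")
write_state(path, {})

# Two uploads arrive together: both read the state before either writes.
state_a = read_state(path)
state_b = read_state(path)

# Upload A records its file and writes the state back.
state_a["a.csv"] = {"size": 10}
write_state(path, state_a)

# Upload B does the same with its stale copy, overwriting A's entry.
state_b["b.csv"] = {"size": 20}
write_state(path, state_b)

final_state = read_state(path)
print(sorted(final_state))  # ['b.csv']
```

The file a.csv may still be sitting on disk, but the index no longer knows about it.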

Technically, this wasn’t a bug in CKAN’s codebase. We had already solved the core problem at this point by switching to a new filestore that did not use ofs. We couldn’t abandon our users though, and I volunteered to find a fix. I read through the ofs code and thought of solving the problem there. After an hour or two of reading up on concurrency and the Python documentation, I still didn’t have a working solution. Eventually, I asked myself what I was looking to solve.

My original problem: “OFS is not thread-safe, causing data loss.” I then realized that’s not what I wanted to solve. A better problem to solve was: “OFS is not thread-safe, causing data loss. Our users need their data.” So, I wrote a script that would regenerate the persisted_state.json file with just enough metadata to start working. It wasn’t a complete fix, but it was a productive fix. The script was “dramatically” called ofs-hero.
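
The general idea of such a recovery script can be sketched as follows. This is not the ofs-hero code, and the directory layout and metadata fields here are assumptions made for illustration: it simply walks a storage directory and records a minimal entry for every file it finds.

```python
import json
import os


def rebuild_state(storage_dir, state_name="persisted_state.json"):
    """Rebuild a minimal index from the files found on disk.

    Hypothetical sketch: the real ofs layout and metadata fields may differ.
    """
    state_path = os.path.join(storage_dir, state_name)
    state = {}

    for root, _dirs, files in os.walk(storage_dir):
        for name in files:
            full_path = os.path.join(root, name)
            if full_path == state_path:
                continue  # do not index the state file itself
            label = os.path.relpath(full_path, storage_dir)
            # "Just enough" metadata (assumed fields): name and size in bytes.
            state[label] = {
                "filename": name,
                "size": os.path.getsize(full_path),
            }

    # Write to a temporary file, then rename it over the old state file,
    # so a crash midway does not leave a half-written index behind.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp_path, state_path)

    return state
```

Because the index is derived entirely from what is on disk, entries lost to earlier overwrites reappear, although any metadata that lived only in the old file is gone.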

Lesson Learnt: Defining the problem properly helps you solve it better.






