I’m a core developer on CKAN at Open Knowledge, the most widely used data
catalog software. Early this year, we released version 2.2 of CKAN with
a complete overhaul of the filestore. Amusingly, right after that, we started
getting more and more complaints on the ckan-dev list about data loss from the
old filestore. One of the many helpful folks narrowed it down to a particular
file: persisted_state.json. This file is created by a library called
ofs. Every time a new file is
added to the filestore, OFS does the following:
* Read the existing persisted_state.json file.
* Convert the JSON to a Python dict.
* Add an element to this dict with the metadata of the new file.
* Convert the dict back to JSON.
* Write this new JSON back to persisted_state.json.
This caused concurrency problems when things were added to the filestore at high frequency, and it eventually led to data loss. Oh joy.
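The read-modify-write cycle above can be sketched roughly as follows. This is an illustration of the pattern, not OFS’s actual code; the function name and metadata shape are made up for the example:

```python
import json

STATE_FILE = "persisted_state.json"  # the metadata file OFS maintains

def add_file_metadata(label, metadata):
    """Illustrative sketch of the read-modify-write cycle described above.

    Nothing here is atomic: two processes can both read the same state,
    each add their own entry, and the second writer's dump silently
    discards the first writer's entry.
    """
    # Step 1: read the current JSON state
    with open(STATE_FILE) as f:
        state = json.load(f)      # step 2: JSON -> Python dict
    state[label] = metadata       # step 3: add the new file's metadata
    # Steps 4 and 5: dict -> JSON, written back over the old file
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
```

With no locking around the read and the write, two concurrent uploads interleave freely, which is exactly the data loss people were reporting.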
Technically, this wasn’t a bug in CKAN’s codebase. We had already solved the core problem at this point by switching to a new filestore that did not use ofs. We couldn’t abandon our users, though, so I volunteered to find a fix. I read through the ofs code and thought of solving the problem there. After an hour or two of reading up on concurrency and the Python documentation, I still didn’t have a working solution. Eventually, I asked myself what I was actually trying to solve.
My original problem: “OFS is not thread-safe, causing data loss.” I then
realized that’s not what I wanted to solve. A better problem to solve was:
“OFS is not thread-safe, causing data loss. Our users need their data.” So,
I wrote a script that would regenerate the persisted_state.json file with
just enough metadata to start working again. It wasn’t a complete fix, but it
was a productive one. The script was “dramatically” called ofs-hero.
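A recovery script along those lines might look like the sketch below. This is not the real ofs-hero, just the shape of the approach: walk the filestore directory and rebuild the state file with minimal metadata. The metadata fields recorded here are assumptions for illustration:

```python
import json
import os

def regenerate_state(storage_dir, state_path="persisted_state.json"):
    """Rebuild a minimal persisted_state.json from the files on disk.

    Simplified sketch of the recovery idea: record just enough metadata
    (here, only the file size) per file to make the store usable again.
    The fields the real ofs-hero wrote are not shown here.
    """
    state = {}
    for root, _dirs, files in os.walk(storage_dir):
        for name in files:
            path = os.path.join(root, name)
            # Use the path relative to the store as the file's label
            label = os.path.relpath(path, storage_dir)
            state[label] = {"size": os.path.getsize(path)}
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)
    return state
```

The point is that the script sidesteps the concurrency question entirely: instead of fixing ofs, it treats the files on disk as the source of truth and derives the metadata from them.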
Lesson Learnt: Defining the problem properly helps you solve it better.