I’m a core developer on CKAN at Open Knowledge, the most widely used data
catalog software. Early this year, we released version 2.2 of CKAN with
a complete overhaul of the filestore. Amusingly, right after that, we started
getting more and more complaints on the ckan-dev list about data loss from the
old filestore. One of the many helpful folks on the list narrowed it down to
a particular file, persisted_state.json. This file is created by a library
called ofs. Every time a new file is added to the filestore, OFS does the
following:

1. Reads persisted_state.json and parses it into a dict.
2. Updates the dict with the metadata of the new file.
3. Serializes the dict back to JSON and writes it over the file.
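That read-modify-write cycle is not atomic, so two concurrent writers can clobber each other's updates. Here is a minimal, deterministic sketch of the lost-update pattern; the file layout and metadata fields are illustrative, not ofs's actual API or schema:

```python
import json
import os
import tempfile

# An empty state file, standing in for persisted_state.json.
state_path = os.path.join(tempfile.mkdtemp(), "persisted_state.json")
with open(state_path, "w") as f:
    json.dump({}, f)

# Writers A and B both read the state BEFORE either writes it back.
with open(state_path) as f:
    snapshot_a = json.load(f)      # A sees {}
with open(state_path) as f:
    snapshot_b = json.load(f)      # B also sees {}

snapshot_a["a.csv"] = {"size": 10}     # illustrative metadata, not ofs's schema
with open(state_path, "w") as f:
    json.dump(snapshot_a, f)       # A writes {"a.csv": ...}

snapshot_b["b.csv"] = {"size": 20}
with open(state_path, "w") as f:
    json.dump(snapshot_b, f)       # B overwrites the whole file

with open(state_path) as f:
    final = json.load(f)
print(sorted(final))               # → ['b.csv']  — a.csv's metadata is lost
```

With many uploads happening at once, every overwrite silently discards any metadata added since that writer's read, which is exactly the kind of loss reported on the list.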
This caused concurrency problems when files were added to the filestore at high frequency, and eventually led to data loss. Oh joy.
Technically, this wasn’t a bug in CKAN’s codebase, and we had already solved the core problem by switching to a new filestore that did not use ofs. We couldn’t abandon our users, though, so I volunteered to find a fix. I read through the ofs code and thought about solving the problem there. After an hour or two of reading up on concurrency and the Python documentation, I still didn’t have a working solution. Eventually, I asked myself what I was actually trying to solve.
My original problem: “OFS is not thread-safe, causing data loss.” Then I
realized that’s not what I wanted to solve. A better problem statement was:
“OFS is not thread-safe, causing data loss. Our users need their data.” So,
I wrote a script that would re-generate the persisted_state.json file with
just enough metadata to get the filestore working again. It wasn’t a complete
fix, but it was a productive one. The script was “dramatically” called
ofs-hero.
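The recovery idea can be sketched as a walk over the storage directory that rebuilds a minimal persisted_state.json from what the filesystem itself still knows. The storage layout and the metadata keys below are assumptions for illustration, not ofs’s real schema or the actual ofs-hero code:

```python
import json
import os
import time

def rebuild_state(storage_root, state_path):
    # Hypothetical sketch: walk every file under the storage root and
    # record the metadata recoverable from the filesystem alone. The
    # key names ("size", "modified") are illustrative, not ofs's schema.
    state = {}
    for dirpath, _dirnames, filenames in os.walk(storage_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, storage_root)
            info = os.stat(full)
            state[rel] = {
                "size": info.st_size,
                "modified": time.strftime(
                    "%Y-%m-%dT%H:%M:%SZ", time.gmtime(info.st_mtime)),
            }
    # Write the regenerated state file in one pass, ideally while the
    # filestore is quiesced so no writer races the rebuild.
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)
    return state
```

A one-shot run against the storage root yields a state file with just enough metadata to get going again — matching the “productive, not complete” spirit of the fix.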
Lesson Learnt: Defining the problem properly helps you solve it better.