I’m a core developer on CKAN at Open Knowledge, the most widely used data catalog software. Early this year, we released version 2.2 of CKAN with a complete overhaul of the filestore. Amusingly, right after that, we started getting more and more complaints on the ckan-dev list about data loss from the old filestore. One of the affected users helped narrow it down to a particular file called `persisted_state.json`.
This file is created by a library called `ofs`. Every time a new file is added to the filestore, OFS does the following:
- Read the `persisted_state.json` file.
- Convert the JSON to a Python `dict`.
- Add an element to this `dict` with the metadata of the new file.
- Convert the `dict` back to JSON.
- Write this new JSON to the `persisted_state.json` file.
This read-modify-write cycle caused concurrency problems when files were added to the filestore at high frequency: concurrent writers would read the same old state, and the last write would silently drop the others’ entries, eventually leading to data loss. Oh joy.
Technically, this wasn’t a bug in CKAN’s codebase. We had already solved the core problem at this point by switching to a new filestore that did not use `ofs`. We couldn’t abandon our users though, and I volunteered to find a fix. I read through the `ofs` code and thought of solving the problem there. After an hour or two of reading up on concurrency and the Python documentation, I still didn’t have a working solution. Eventually, I asked myself what I was actually trying to solve.
My original problem: “OFS is not thread-safe, causing data loss.” I then realized that’s not what I wanted to solve. A better problem to solve was: “OFS is not thread-safe, causing data loss. Our users need their data.” So, I wrote a script that would re-generate the `persisted_state.json` file with just enough metadata to start working. It isn’t a complete fix, but it was a productive one. The script was “dramatically” called ofs-hero.
Lesson Learnt: Defining the problem properly helps you solve it better.