Spec: DataStore and FileStore Consolidation

Originally based on this Google doc: https://docs.google.com/a/okfn.org/document/d/1cBW89bWtT2uMovasxHOHZ9hFcE0Gy612I9JjwPKTG5E/edit

Related issues: https://github.com/okfn/ckan/issues?labels=Datastore+Filestore+consolidation

Problem

The problems with the DataStore and FileStore in CKAN 2.0 include:

  • There's a big Data API button at the top of resource pages that is disabled in almost all cases, prompting lots of users to ask "How can I enable this button?"
  • Users don't understand or care about details such as whether their data has been added to the datastore, the filestore, or both. Users should be protected from this complexity. (Here "users" probably does not include sysadmins, who need to deploy CKAN and therefore need to know about the filestore, the datastore, and everything else.)
  • It's possible for the source file (uploaded or linked to) and the corresponding data in the datastore to diverge from each other, so that the data seen in the data preview or Data API differs from what you get if you download the file, which is confusing. And if someone then uploads a new copy of the source file, it will overwrite the edits in the datastore!
  • Currently on most (all?) of our sites no resource files are getting pulled into the datastore, because we're not deploying the datastorer extension: it's too hard to deploy and maintain. As a result the datastore API is not available, and previews work via the dataproxy, which is unreliable.
  • If someone uploads an Excel file containing multiple sheets, only the first sheet goes into the datastore. This needs to be communicated to the user, or the behaviour improved, e.g. by creating a resource for each sheet.

Decision

Note: the flow below is pretty much how CKAN currently behaves! The only difference is that resources are read-only by default; read-write mode has to be enabled explicitly.

...

1. User uploads a file                  2. User links to a file        3. User links to an API
        |                                    |                                    |
        V                                    |                                    |
2. File goes into CKAN's FileStore           |                                    |
        |                                    |                                    |
        |        ____________________________|                                    |
        |        |                                                                |
        V        V                                                                v
4. File goes into CKAN's DataStore.                                    9. Only link to API, don't import
   Data preview and data query API (read-only) are enabled.               anything, disable preview. 
   "Download" button to download original file is shown.
   The datastore resource is *read-only*.
    |
    V
5. At this point, the user can upload a new version of                 10. User creates a *read-write* 
   the file and the data in the datastore will be replaced.                enabled resource through the
    |                                                                      API without a URL.
    |                                                                                |
    V                                                                                |
6. User chooses to enable the *data update API*/ *read-write mode*.     <-------------
   This has to be done through the API. The user can now make edits
   to the datastore.
    |
    V
7. "Download" button now downloads the current version of the data from the
   datastore, exported as a CSV file (#628). The url of the resource has been
   changed automatically.
    |
    V
8. At this point if the user tries to upload a new version of the file,
   they get a warning that the data currently in CKAN will be overwritten
   with the data from the file, and can choose to continue if they wish.

Notes:

At 2, the user never needs to know about or see the word "FileStore"; all they know is that they can upload the file.

At 4, it would be the datastorer service or a paster command/cron job that pushes the data into the datastore, but again the user never needs to know about this or see the word "DataStore". Read-only here means that datastore_upsert is not available.

At 5, maybe it would be nice to one day support versioning of uploaded files in the filestore, so that users can preview and download older versions of files.

The datapusher is triggered (and overwrites the data in the datastore) whenever:

  • A resource is created
  • The URL of a resource is changed
  • A cron job triggers the datapusher
  • A user triggers the datapusher through paster
  • A user presses the "reload data" button on a resource

It should not be triggered for read-write enabled resources.
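
The trigger rule above can be sketched as a small predicate. This is only an illustration: the event names and the `read_write` flag are hypothetical, since the spec lists the triggers but does not name them or say how read-write mode is stored.

```python
# Hypothetical event names, one per trigger listed in the spec.
TRIGGER_EVENTS = {
    "resource_created",
    "resource_url_changed",
    "cron_job",
    "paster_command",
    "reload_button",
}


def should_trigger_datapusher(event, resource):
    """Return True if this event should push the resource's file into the datastore.

    `resource` is a plain dict; the "read_write" key is an assumed flag.
    """
    if resource.get("read_write", False):
        # Read-write resources hold user edits, so a push would overwrite them.
        return False
    return event in TRIGGER_EVENTS


read_only = {"id": "example", "read_write": False}
editable = {"id": "example", "read_write": True}
print(should_trigger_datapusher("resource_url_changed", read_only))
print(should_trigger_datapusher("resource_url_changed", editable))
```

Keeping the read-write check first means that no trigger, including a manual "reload data", can silently wipe edits; the explicit overwrite in step 8 would need its own confirmed code path.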

At 6, maybe the user would have to explicitly click a button that says something like "Enable data editing for this resource" or maybe data editing is just always enabled and the following changes simply happen automatically after the first time the user does a data editing action.

At 7, this is where the user would get the web-based data editor and the data versioning and data history viewing, if we ever implement such features. (But for now data editing just means using the datastore update API)

Optional: a new "Download original file" button appears that downloads the original file from the FileStore (maybe hidden behind the "download current version" button, e.g. in small text or in a dropdown).
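
As a rough sketch of how the download links could be chosen in steps 4 and 7: the `read_write` and `original_url` keys are hypothetical names, and the dump path is the one given in the Implementation section of this spec.

```python
def download_links(site_url, resource):
    """Pick the download link(s) to show for a resource (illustrative only).

    `resource` is a plain dict. "read_write" and "original_url" are
    assumed keys; the spec does not define how this state is stored.
    """
    if not resource.get("read_write", False):
        # Step 4: read-only resources download the original file.
        return {"download": resource["url"]}

    # Step 7: read-write resources download the current data as CSV.
    links = {
        "download": "{}/datastore/dump/{}".format(
            site_url.rstrip("/"), resource["id"]
        )
    }
    if resource.get("original_url"):
        # Optional secondary link to the file as originally uploaded.
        links["download_original"] = resource["original_url"]
    return links
```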

At 8, if we ever implement datastore versioning then after uploading a new version of their file and wiping out their data, they could still use the data history view to get back the previous version of their data.

We need to prevent the data in the datastore and filestore from ever diverging if there are simultaneous edits.

When a user creates a new resource by uploading a file, don't activate the datastore until the file has finished uploading and all the data from the file has been successfully pushed into the datastore.

When a user updates an existing resource by uploading a new copy of the file, the datastore's editing functions should be temporarily disabled as soon as they accept the "this will overwrite data in CKAN" warning and start uploading the file, and not enabled again until the file has finished uploading and all the data from the file has been successfully pushed into the datastore.

What do we do when a file uploads successfully, but pulling the data from the file into the datastore fails? In this case it seems to me that we should disable the datastore (both read and write) for that resource and show an error.
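
The rules above amount to a small state machine. The sketch below is one possible way to model it; the class and method names are hypothetical and not part of CKAN.

```python
class ResourceDataState:
    """Tracks whether a resource's datastore may be read or edited.

    Hypothetical model of the rules in this section, not CKAN code.
    """

    def __init__(self, read_write=False):
        self.read_write = read_write  # has the user enabled read-write mode?
        self.readable = False         # inactive until the first successful push
        self.writable = False
        self.error = None

    def start_upload(self):
        # Editing is disabled while the new file is uploaded and pushed.
        self.writable = False

    def push_succeeded(self):
        # Activate the datastore; restore editing only in read-write mode.
        self.readable = True
        self.writable = self.read_write
        self.error = None

    def push_failed(self, message):
        # File uploaded but the import failed: disable read and write.
        self.readable = False
        self.writable = False
        self.error = message


state = ResourceDataState(read_write=True)
state.start_upload()
state.push_failed("could not parse file")
print(state.readable, state.writable, state.error)
```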

Considerations

  • We'll disable data previews for data that is not in the datastore, because the dataproxy is too unreliable. All data previews will work via the datastore. The dataproxy will not be disabled but we won't use it any more.

  • We'll move the Data API button on the resource pages to make it much less prominent. Move it to the bottom of the page. Also make it mention all our APIs not just the Data API. Could also indicate here whether the datastore is enabled for the resource. Examples:

  • From the technical side, I think we want to get the datapusher service finished and have all our sites configured to use it. This means that all resource files (linked-to or uploaded) will get pushed into the datastore automatically soon after the file is uploaded or updated. In the meantime, we're working on a paster command/cron job to pull them in "manually". This is much simpler to implement than the datastorer service, but after an upload or update, files don't get pulled into the datastore until the next time a sysadmin or cron job runs the command. The paster command will probably also come in handy in situations where the more complex datastorer service is not set up or has fallen over for some reason. I'm guessing we're going to throw away the ckanext-datastorer extension and never use it again. See Implementation below.

  • There was also a "Catalog Only" option mentioned, where we would deploy a CKAN for someone with no data preview (no preview at all, not even the dataproxy) and no datastore or datastorer. The resource download link would be more prominent and the resource page much simpler, showcasing the URL, download link and additional info.

User Stories

Mark says he thinks these are common user stories:

  • I am a publisher with a large dataset. I want to easily make a small correction to my data so that anyone who uses the data in the future gets the correct version.

  • I am an experimental scientist. I want data from my field instruments to be recorded automatically and incrementally, so my calculations which use the data get all the most recent data included.

  • I am a local council. I want to publish a dataset of locations of street furniture, but include a mechanism for citizens to make/suggest corrections or additions, so that the data is as accurate as possible for anyone who uses it subsequently. (I probably need to be able to approve updates, so that malicious or spam changes are not presented to users.)

  • I am a researcher/data wrangler/journalist. I want to know which version of the data I have and how it has been processed, so that I understand the data I am working with.

  • I am a researcher/data wrangler/journalist. I want to be sure I have the most recent/accurate version of the data so that my results are as up-to-date / accurate as possible.

  • I work for a data publisher, and have some data in a spreadsheet. I want to understand clearly what I need to do to publish my data, so that I don't get paralysed by a confusing choice and give up.

Note the common feature of wanting the data to be accurate for anyone who uses it, which implies that there shouldn't be a particular way of accessing the data that gives a wrong or out-of-date version - at least, not unless there is a very clear health warning.

Implementation

DataPusher

Service to replace the former datastorer. The service will not accept packages, groups or organizations. This will be implemented in the CKAN extension (potentially the datastore itself).

The API for the service will roughly look like this:

{
    "api_key": "my-secret-key",
    "job_type": "push_to_datastore",
    "result_url": "https://www.ckan.org/datapusher/callback",
    "metadata": {
        "ckan_url": "http://www.ckan.org/",
        "resource_id": "3b2987d2-e0e8-413c-92f0-7f9bfe148adc"
    }
}
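
A minimal sketch of how a client might assemble that request body in Python. The helper name `build_push_job` is hypothetical; the field names are taken from the example above. The spec does not name the service endpoint, so the sketch stops at serializing the body.

```python
import json


def build_push_job(api_key, ckan_url, resource_id, result_url):
    """Build a job description shaped like the example in this spec."""
    return {
        "api_key": api_key,
        "job_type": "push_to_datastore",
        "result_url": result_url,
        "metadata": {
            "ckan_url": ckan_url,
            "resource_id": resource_id,
        },
    }


job = build_push_job(
    api_key="my-secret-key",
    ckan_url="http://www.ckan.org/",
    resource_id="3b2987d2-e0e8-413c-92f0-7f9bfe148adc",
    result_url="https://www.ckan.org/datapusher/callback",
)
body = json.dumps(job, indent=4)
print(body)
```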

Datastore dump

Get a full resource as CSV. Dumps are available at /datastore/dump/<resource_id>.
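
For illustration, a client could build the dump URL and parse the returned CSV roughly as follows. The helper names and the sample column names are hypothetical; only the /datastore/dump/<resource_id> path comes from this spec.

```python
import csv
import io
from urllib.parse import quote


def dump_url(site_url, resource_id):
    """Return the CSV dump URL for a resource on the given CKAN site."""
    return "{}/datastore/dump/{}".format(
        site_url.rstrip("/"), quote(resource_id, safe="")
    )


def parse_dump(csv_text):
    """Parse dump text into a list of row dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


url = dump_url("http://www.ckan.org/", "3b2987d2-e0e8-413c-92f0-7f9bfe148adc")
print(url)

# Made-up sample standing in for a downloaded dump.
rows = parse_dump("name,count\nbench,3\nlamp,7\n")
print(rows[1]["count"])
```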

Datastore extension

Some additions are required, especially a paster command to trigger the DataPusher. A first version of this can be found at https://github.com/okfn/ckanext-datastorer/tree/36-datapusher.