CKAN Import App #18

Open
rufuspollock opened this issue Mar 12, 2014 · 2 comments

@rufuspollock

Summary: super-simple one click (automated) import of data into CKAN (and its DataStore)

And/or integration with specific services e.g.

Aside: may want to split this into individual ideas for each import source


User Stories

Personas:

  • Data User - less sophisticated (uses Excel but may not know what an API is)
  • Data Wrangler - more sophisticated (knows what an API is)

Import File and get Data API

As a Data Wrangler I want to provide my file and have it imported into CKAN so that I get a Data API

What kind of file?

  • CSV file
  • Excel file
  • GeoJSON file
  • ...

How do I provide it?

  • web interface
  • API (POST/GET url string or POST file content)

Questions:

  • Do we validate the file?
  • Do we have some process for e.g. tweaking the field types?
  • What is the mapping between file and Dataset / Resource?
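One way the "tweak the field types" question could play out, as a rough sketch only: guess a type per column, let the user override the guesses, and build a payload shaped for CKAN's `datastore_create` action (`resource_id`, `fields`, `records`). The helper names `infer_type` and `build_datastore_payload`, and the limited set of guessed types (`int`, `float`, `text`), are assumptions for illustration, not anything DataPusher does today.

```python
import csv
import io

# Casts for the small set of types this sketch knows how to guess.
CASTS = {"int": int, "float": float, "text": str}


def infer_type(values):
    """Return the narrowest of int/float/text that fits all non-empty values."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "text"
    for type_name in ("int", "float"):
        try:
            for v in non_empty:
                CASTS[type_name](v)
            return type_name
        except ValueError:
            continue
    return "text"


def build_datastore_payload(csv_text, resource_id, overrides=None):
    """Build a dict shaped like a datastore_create request body.

    `overrides` maps column name -> type and stands in for a user
    correcting the guessed types before the import runs.
    """
    overrides = overrides or {}
    reader = csv.DictReader(io.StringIO(csv_text), restval="")
    rows = list(reader)
    fields = []
    for name in reader.fieldnames or []:
        guessed = infer_type([row[name] for row in rows])
        fields.append({"id": name, "type": overrides.get(name, guessed)})

    records = []
    for row in rows:
        record = {}
        for field in fields:
            raw = row[field["id"]]
            cast = CASTS.get(field["type"], str)
            # Empty cells become None; a bad override raises ValueError here.
            record[field["id"]] = cast(raw) if raw.strip() != "" else None
        records.append(record)
    return {"resource_id": resource_id, "fields": fields, "records": records}
```

The resulting dict would then be POSTed to the CKAN action API with the user's API key; that step is left out here so the sketch stays free of network calls.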

Implementation

  • DataPusher already does most of this
    • What's missing is any kind of edit metadata step
    • No user interface

As a XXX I want to push my data file to GitHub and have it automatically create/update the CKAN DataStore so that my Data API is up to date

  • This is very similar to import file - the only difference is that we get push notifications (GitHub webhooks), so merge this with that example.
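A minimal sketch of the webhook side, assuming the hook is configured with a shared secret: check the `X-Hub-Signature-256` header that GitHub sends (an HMAC-SHA256 of the raw request body) and pull the added or modified data files out of a push payload's `commits` list. The function names and the list of file extensions are assumptions for illustration.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a GitHub 'X-Hub-Signature-256' header against the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature_header)


def changed_data_files(push_payload: dict,
                       extensions=(".csv", ".xls", ".xlsx", ".geojson")):
    """Return sorted paths of data files added or modified by a push."""
    changed = set()
    for commit in push_payload.get("commits", []):
        for key in ("added", "modified"):
            for path in commit.get(key, []):
                if path.lower().endswith(extensions):
                    changed.add(path)
    return sorted(changed)
```

Each returned path could then be handed to the same import step as a manually provided file, which is why the two stories merge so easily.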

GitHub import

As a XXX I want to push my tabular data package to GitHub and have it automatically create/update the CKAN DataStore so that my Data API is up to date

  • As it is already a data package importing should be very simple
  • If the file is large we may need to worry about queues etc., but we should probably keep it simple for now
  • How do we determine dataset to associate this with in CKAN?
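To illustrate why a data package should be simple to import, here is a rough sketch that reads the `schema.fields` of each resource in a `datapackage.json` descriptor and turns them into DataStore-style field definitions. The type mapping in `TYPE_MAP` and the function name are assumptions, and unknown types fall back to `text`.

```python
# Assumed mapping from data package field types to DataStore field types.
TYPE_MAP = {
    "string": "text",
    "integer": "int",
    "number": "float",
    "boolean": "bool",
    "date": "date",
    "datetime": "timestamp",
}


def datapackage_to_datastore_fields(descriptor: dict) -> dict:
    """Map each resource (by name, else path) to a list of DataStore fields."""
    result = {}
    for resource in descriptor.get("resources", []):
        schema = resource.get("schema") or {}
        fields = [
            {
                "id": field["name"],
                # Fields without a declared type are treated as strings.
                "type": TYPE_MAP.get(field.get("type", "string"), "text"),
            }
            for field in schema.get("fields", [])
        ]
        key = resource.get("name") or resource.get("path")
        result[key] = fields
    return result
```

This covers only the field mapping. Which CKAN dataset the package belongs to is still the open question above.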

One-Click Create a Dataset

As a XXX I want to provide my file and have it imported into CKAN so that I get a nice Dataset

  • What distinguishes this from the existing system? Answer: its one-click nature

Automated regular import

As a Data Wrangler I want to have my data file automatically re-imported at regular intervals so that the DataStore (and Data API) stays up to date with my data.
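For regular re-imports, one cheap approach would be to re-import only when the file content has actually changed. A minimal sketch, assuming the scheduler keeps the hash from the previous run somewhere (the `needs_reimport` helper and the idea of storing a hash are illustrative assumptions):

```python
import hashlib


def needs_reimport(content: bytes, last_hash):
    """Return (changed, new_hash) for the freshly fetched file content.

    `last_hash` is the hex digest stored after the previous import,
    or None if the file has never been imported.
    """
    new_hash = hashlib.sha256(content).hexdigest()
    return new_hash != last_hash, new_hash
```

A scheduled job would fetch the file, call this, and only trigger the import (and store the new hash) when `changed` is true.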


Discussion

The DataStore is a great feature of CKAN, and it would be great to support getting data into it. In fact, one could go as far as to say that the DataStore is the "killer" feature, since having data in the DataStore gives you several major value-adds such as:

  • A Data API
  • Improved quality

In addition, for data to get into the DataStore it has to be of reasonable quality, so data from datasets in the DataStore is likely to be of higher quality. Whilst obviously not a feature of the DataStore itself, this is an indirect benefit, as the DataStore can help both "label" and drive data quality.

How Should It Work?

Import: Automatic or User Initiated

In getting data into the DataStore there are various choices about how it works. One key choice is whether:

  • Import happens automatically once a dataset resource (of an appropriate type, e.g. Excel or CSV) is added to the "Catalog"
  • Import is user initiated in one way or another (though the actual process once initiated may be fairly automatic)

I would argue that the second option is best, i.e. import should be user initiated.

Why? There is huge variability in data quality. Without data of reasonable quality (e.g. no blank lines at the top of a CSV), import is likely to result in a poor outcome and/or be very hard to automate.
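As a sketch of the kind of check that could run before a user-initiated import, the function below flags blank lines at the top of a CSV, empty or duplicate column names, and rows whose column count does not match the header. The function name and the exact set of checks are assumptions; row numbers assume no cell contains an embedded newline.

```python
import csv
import io


def validate_csv(text: str) -> list:
    """Return a list of human-readable problems; empty means it looks importable."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    def is_blank(row):
        return all(cell.strip() == "" for cell in row)

    # Count blank rows before the first row with content.
    leading = 0
    for row in rows:
        if not is_blank(row):
            break
        leading += 1
    if leading == len(rows):
        return ["file contains only blank lines"]

    problems = []
    if leading:
        problems.append(f"{leading} blank line(s) before the header")

    header = rows[leading]
    if any(name.strip() == "" for name in header):
        problems.append("header has empty column names")
    if len(set(header)) != len(header):
        problems.append("header has duplicate column names")

    # Row numbers are 1-based and count from the top of the file.
    for number, row in enumerate(rows[leading + 1:], start=leading + 2):
        if len(row) != len(header):
            problems.append(
                f"row {number}: expected {len(header)} columns, found {len(row)}"
            )
    return problems
```

Reporting problems back to the user, rather than silently importing, is what makes the user-initiated route more robust than a fully automatic one.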

Assuming import is user initiated, there are still several options (not mutually exclusive):

  1. Integrated into DataHub UI (e.g. "import to DataStore")
  2. import.datahub.io: create a bespoke UI outside of the primary DataHub site for doing imports
  3. Leave it to users to push data into the DataStore via their own tools (tools we could help create)

Implementation

Any implementation with a UI (i.e. ignoring API usage) likely has 3 parts:

  • UI for import initiation, reporting and management
  • Validation prior to import
  • Importer worker (actually runs the import)

Import worker - DataPusher

We probably want to use DataPusher (https://github.com/ckan/datapusher), though it would not be hard to roll one's own.

Basic steps on how to do this are in the docs: http://docs.ckan.org/en/latest/maintaining/datastore.html#datapusher-automatically-add-data-to-the-datastore and http://docs.ckan.org/projects/datapusher/en/latest/

I got DataPusher deployed on Heroku about a month ago - see ckan/datapusher#23

@rufuspollock

Added (substantial) set of user stories.

@rufuspollock

Some work in progress on a new Node.js app focused on data package import at https://github.com/rgrp/ckan-import
