Collecting open government data for maximum reuse

There are lots of lists of government data out there. How do we avoid making yet another list of lists of lists, and make something a little more reusable?

The steps I see we need to take are:

  1. Discover what existing formats there are for describing datasets with metadata. What are NZ government agencies already using, and where are they heading? The point of this step is to make sure our work can be reused easily by agencies.

There are two standards for this: RDF, which is used in its "by attributes" variant (RDFa); and microformats, which do essentially the same job but were invented by people who didn't know how to do research. Google parse both of them.

Looks great, will investigate further, thanks Dave!
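To make the "by attributes" idea a bit more concrete, here is a minimal sketch of how one dataset record could be written out as HTML carrying RDFa attributes. It assumes the schema.org `Dataset` vocabulary, which is only one possible choice and not something any agency has confirmed. The function name `dataset_to_rdfa`, the record keys, and the sample record are all made up for illustration.

```python
from html import escape

# Assumption: schema.org's Dataset type is the vocabulary.
# Keys are our own (hypothetical) field names; values are schema.org properties.
TEXT_PROPERTIES = {
    "title": "name",
    "summary": "description",
}

def dataset_to_rdfa(record):
    """Render one dataset record as an HTML fragment with RDFa attributes."""
    lines = ['<div vocab="https://schema.org/" typeof="Dataset">']
    for field, prop in TEXT_PROPERTIES.items():
        value = record.get(field)
        if value:
            lines.append(f'  <span property="{prop}">{escape(value)}</span>')
    if record.get("url"):
        # For a link, RDFa takes the property value from the href attribute.
        href = escape(record["url"], quote=True)
        lines.append(f'  <a property="url" href="{href}">Download</a>')
    lines.append("</div>")
    return "\n".join(lines)

# Hypothetical record, for illustration only.
example = {
    "title": "Example dataset",
    "summary": "Counts of things & stuff",
    "url": "https://example.org/data.csv",
}
print(dataset_to_rdfa(example))
```

The same record could be expressed with a different vocabulary by swapping the `vocab` URL and the property names, so the choice of standard stays separate from the data itself.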

  2. Pick a standard and use it to set up a master Google spreadsheet for recording datasets. The point of this step is to have a single master list that anybody can update, which conforms to the chosen standard.

Like this?

Yes, but more targeted to the still-emerging themes of GovHackNZ, and also covering the many datasets that are available in some shape or form, but which aren't yet on data.govt.nz
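Whichever standard gets picked, the master list only conforms to it if every row carries the expected fields. A rough sketch of a row check might look like the following. The column names are placeholders invented here, not fields taken from any chosen standard, and `check_row` is a hypothetical helper.

```python
# Hypothetical column names; the real ones depend on the standard chosen.
REQUIRED_COLUMNS = ("title", "url", "agency")
OPTIONAL_COLUMNS = ("description", "licence", "format", "on_data_govt_nz")

def check_row(row):
    """Return a list of problems found in one spreadsheet row (a dict)."""
    problems = []
    # Every required column must be present and non-blank.
    for column in REQUIRED_COLUMNS:
        if not (row.get(column) or "").strip():
            problems.append(f"missing {column}")
    # A very light sanity check on the link, not full URL validation.
    url = (row.get("url") or "").strip()
    if url and not url.startswith(("http://", "https://")):
        problems.append("url is not an http(s) link")
    # Flag columns that are not part of the agreed layout.
    known = set(REQUIRED_COLUMNS) | set(OPTIONAL_COLUMNS)
    for column in row:
        if column not in known:
            problems.append(f"unknown column {column}")
    return problems
```

An empty list means the row passes. Anything else can be reported back to whoever added the row, which keeps the spreadsheet editable by anybody without letting it drift away from the agreed layout.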

  3. Set up a Google form to collect datasets, including metadata. The point of this step is to collect as much metadata as we can at the moment we learn about a dataset.

  4. Define a mechanism to regularly export the master Google spreadsheet to a CSV on GitHub. The point of this step is to get it out of Google and into a place and format that people can access programmatically.

  5. Add additional code and components (e.g. Docker, PostgreSQL, deployment instructions) to this GitHub repository that will pull in the CSV, serve it up from a database, and possibly also download all the datasets referenced in this database. The point of this is to make the metadata on these datasets accessible to anybody who wants it (for GovHackNZ and beyond), and optionally, to make the datasets themselves available.
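The export and load steps could be sketched roughly as below. This is not a finished design: it assumes the spreadsheet is shared so that anyone with the link can view it, it uses the commonly seen `export?format=csv` link pattern for Google Sheets (worth verifying against the real sheet), and it uses SQLite from the Python standard library as a stand-in for Postgres so the example runs without any setup. The function names and the `datasets` table name are hypothetical.

```python
import csv
import io
import sqlite3
import urllib.request

def export_url(sheet_id, gid="0"):
    """Build the CSV export link for one tab of a Google Sheet (assumed pattern)."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/export?format=csv&gid={gid}"
    )

def fetch_csv(sheet_id, gid="0"):
    """Download the sheet as CSV text. Needs network access and a shared sheet."""
    with urllib.request.urlopen(export_url(sheet_id, gid)) as response:
        return response.read().decode("utf-8")

def load_csv(csv_text, conn, table="datasets"):
    """Recreate the table from the CSV header and insert the rows.

    Column names come straight from the header row, so the CSV is
    assumed to be trusted. Returns the number of rows inserted.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    # Skip rows whose length does not match the header (e.g. blank lines).
    rows = [row for row in reader if len(row) == len(header)]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)

# Usage with an in-memory database and a made-up CSV, so no network is needed.
conn = sqlite3.connect(":memory:")
sample = "title,url\nExample dataset,https://example.org/data.csv\n"
load_csv(sample, conn)
print(conn.execute('SELECT title, url FROM "datasets"').fetchall())
```

For the regular export, the output of `fetch_csv` could be written to a file in the repository and committed by a scheduled job. How that job is scheduled is left open here.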