Data archival with Zenodo

Archival of datasets published to GBIF

Datasets should be archived directly into Zenodo.

  • The metadata published to GBIF should link to the Zenodo archive, and the Zenodo archive should link back to the DOI provided to the dataset by GBIF.

  • Add any archived datasets to the GBIF-Norway group on Zenodo.

  • Use the [email protected] account to upload datasets so they are not linked to your personal account (see the upload sketch after this list).

  • NB! When the source data are updated in the IPT, a new version should also be uploaded to Zenodo.
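
For reference, the upload can also be scripted against the Zenodo REST API. The sketch below is a minimal example rather than the established workflow: the access token, file name, GBIF DOI, relation type and the "gbif-norway" community identifier are all placeholder assumptions.

```python
# Minimal sketch: archive a dataset on Zenodo via its REST API.
# ZENODO_TOKEN, the file name, the GBIF DOI and the community
# identifier are placeholder assumptions.
import os
import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"
TOKEN = os.environ["ZENODO_TOKEN"]

# 1. Create a new, empty deposition.
r = requests.post(ZENODO_API, params={"access_token": TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# 2. Upload the Darwin Core archive file to the deposition's file bucket.
bucket_url = deposition["links"]["bucket"]
with open("dwca-mydataset-v1.1.zip", "rb") as fh:
    requests.put(f"{bucket_url}/dwca-mydataset-v1.1.zip",
                 data=fh, params={"access_token": TOKEN}).raise_for_status()

# 3. Add metadata: link back to the GBIF DOI and add the record
#    to the GBIF-Norway community.
metadata = {
    "metadata": {
        "title": "Example dataset archived from GBIF.no",
        "upload_type": "dataset",
        "description": "Darwin Core archive and source data for ...",
        "creators": [{"name": "GBIF Norway"}],
        "communities": [{"identifier": "gbif-norway"}],  # assumed identifier
        "related_identifiers": [
            {"relation": "isIdenticalTo",  # or another suitable relation
             "identifier": "https://doi.org/10.15468/xxxxxx"}  # GBIF DOI
        ],
    }
}
requests.put(deposition["links"]["self"],
             params={"access_token": TOKEN}, json=metadata).raise_for_status()

# 4. Publish, which mints the Zenodo DOI to link from the IPT metadata.
requests.post(deposition["links"]["publish"],
              params={"access_token": TOKEN}).raise_for_status()
```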

Previously, data archiving was done in GitHub and Zenodo, and worked as follows:


Data archiving via GitHub & Zenodo requires each dataset to be in a separate repository: https://guides.github.com/activities/citable-code/

Archive dataset with GitHub and Zenodo

GitHub

  • Create a new repository for the dataset to be published via GBIF.no.
  • Configure write permissions for the team "GBIF.no" under Settings → Collaborators & teams.
  • Add the original source datafile provided by the data owner.
  • Add the Darwin Core archive file created by IPT.
  • Consider using Git LFS for large files: GitHub's per-file size limit is 100 MB, and Git in general does not handle large files well (see the size-check sketch after this list).
  • Create a simple README.md for the dataset repository.
  • Remember to include a text file LICENSE.txt (in GitHub).
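
Before committing, it can help to check for files that exceed GitHub's 100 MB limit and therefore need Git LFS. A small sketch; the "data" directory name follows the proposed layout further down and is an assumption.

```python
# Sketch: list files larger than GitHub's 100 MB per-file limit so they
# can be tracked with Git LFS before committing.
from pathlib import Path

LIMIT_BYTES = 100 * 1024 * 1024  # GitHub's hard per-file limit

def files_over_limit(root: str = "data") -> list[Path]:
    """Return paths under `root` whose size exceeds the GitHub limit."""
    return [p for p in Path(root).rglob("*")
            if p.is_file() and p.stat().st_size > LIMIT_BYTES]

if __name__ == "__main__":
    for path in files_over_limit():
        print(f"{path} is larger than 100 MB; track it with `git lfs track`")
```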

Zenodo

  • Log in to Zenodo and configure webhooks for the repositories (one-time setup).
  • Select your user profile, GitHub, and settings: https://zenodo.org/account/settings/github/
  • Toggle the switch from "off" to "on" for the GitHub repository of the dataset to archive.

Return to GitHub

  • Select the dataset repository and make a release (menu item "Releases"); a scripted alternative is sketched after this list.
  • Release-tag: v1.1
  • Release-title: [organization_short_dataset_name_version]

  • Add a DOI badge to the GitHub README.md (Zenodo provides the badge markdown for each archived release).
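
The release can also be created programmatically with the GitHub REST API; once the tag exists, the Zenodo webhook archives it as usual. A hedged sketch, with the owner, repository name and token as placeholders:

```python
# Sketch: create the dataset release (tag v1.1) via the GitHub REST API
# instead of the web UI. GITHUB_TOKEN, OWNER and REPO are placeholders.
import os
import requests

OWNER, REPO = "gbif-norway", "my-dataset-repo"  # assumed names
url = f"https://api.github.com/repos/{OWNER}/{REPO}/releases"
headers = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}
payload = {
    "tag_name": "v1.1",
    "name": "organization_short_dataset_name_version",  # release title
    "body": "Darwin Core archive and source data, archived to Zenodo.",
}
r = requests.post(url, headers=headers, json=payload)
r.raise_for_status()
print("Release created:", r.json()["html_url"])
# The Zenodo webhook picks up the release and creates the archive/DOI.
```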

GBIF.no IPT

  • Add the Zenodo DOI under Metadata → External links:
  • Name: Zenodo data archive
  • Download URL: https://doi.org/10.5281/zenodo....
  • Data format: Darwin Core archive

Dataset structure

Proposed layout (a small scaffolding sketch follows the tree):

├── LICENSE
├── README.md          <- README including basic metadata, or perhaps a separate file for the final metadata?
├── data
│   ├── raw            <- The original, immutable data dump, possibly also including scanned field forms etc.
│   ├── interim        <- Intermediate data that have been transformed into a machine-interpretable form
│   └── DwC-A          <- The final, mapped data (the Darwin Core archive)
│
├── docs               <- Supporting information, e.g. raw metadata and descriptions from the data owners in text form
│
└── code               <- The code used to transform and map the data
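
If it helps, the proposed layout can be scaffolded with a few lines of Python; the directory names simply mirror the tree above, and the repository name is a placeholder.

```python
# Sketch: scaffold the proposed dataset repository layout shown above.
from pathlib import Path

DIRS = ["data/raw", "data/interim", "data/DwC-A", "docs", "code"]
FILES = ["LICENSE", "README.md"]

def scaffold(root: str) -> None:
    """Create the proposed directory layout under `root`."""
    base = Path(root)
    for d in DIRS:
        (base / d).mkdir(parents=True, exist_ok=True)
    for f in FILES:
        (base / f).touch(exist_ok=True)

if __name__ == "__main__":
    scaffold("my-dataset-repo")  # placeholder repository name
```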