Data collection and packaging pipelines
Data packaging
The schematic provides an overview of the processing pipeline for data generated by nodes (including data from field and controlled environment sites and mobile phenotyping units):
- Each node collects data following its own practices (informed by APPN best practices developed via the APPN Expert Working Groups) and using its own instruments. Although subsequent stages of the pipeline should accommodate any sensible folder structure, nodes are encouraged to follow the APPN guidelines for folder structure. Not shown here is the process for nodes to register the terms used in their local data (identifiers for platforms, sensors, traits, variables, plant varieties, etc.). These will be published as vocabulary terms mapped to public URIs that resolve to definitions and associated property values (see the vocabulary sketch after this list).
- The APPN central data team is developing a Python library to assist with data packaging as each study progresses. The goal is to supply tools that can be plugged into each node's workflows and systems to simplify and automate construction of well-described RO-Crate datasets. This will include scripts for transforming different local data tables and spreadsheets into JSON-LD objects and CSV data that conform to the APPN Domain Model (see the packaging sketch after this list).
- RO-Crates will grow and be populated as each study is executed.
- At the end of the study, the RO-Crate will be archived as a ZIP/TAR representing a valid and FAIR dataset containing all metadata required for downstream processing and aggregation.
- The APPN central data team is developing tools for publishing these assets into shared APPN S3 storage at Pawsey and (in future) NCI (see the publishing sketch after this list). Nodes will have the alternative of publishing to their own local repositories where this is more appropriate or meets local obligations. The publishing step will include validation and the generation of a JSON document containing all information required on the landing page for the dataset. This document will be produced through automated processing of the JSON-LD metadata and other content in each RO-Crate. As APPN proceeds, this step may also apply policies to extract certain key assets (e.g. orthomosaics) and make these directly accessible via URLs, and may automatically generate STAC metadata records to maintain a STAC catalogue; the latter component could alternatively be generated at a later point in the flow. The central APPN storage will follow OCFL guidelines for organisation and fixity.
- S3 buckets will be organised to support access control at node granularity at least, and likely finer, to ensure compliance with data ownership. In future, APPN-wide policies will be applied to migrate older or less-used assets onto colder (less expensive, lower-carbon, slower-access) S3 storage (see the lifecycle sketch after this list).
- JSON metadata documents will be indexed in a database to support query and access. Each will be associated with a DOI with DataCite metadata including all relevant identifiers and URLs for direct access to the RO-Crate and any key assets that have been extracted separately. It may be possible to develop a relational data model for these metadata, but a NoSQL alternative is likely to offer greater flexibility as data diversity increases (see the metadata-index sketch after this list).
- All elements represented in the APPN Domain Model (primarily the JSON-LD and CSV components) will further be indexed into a database to support aggregated and faceted access.
- This data index could again be structured as a relational data model, but retaining the data as a graph may make it easier to accommodate the full richness of the data as the model expands and to support dynamic integration with external data sources (see the graph-query sketch after this list).
- The future APPN data portal will use the metadata database and the data index to enable users to search the APPN data collection and to access, export and download data from individual datasets or (via linkages made in the data index) to download aggregated subsets of the collection.
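Vocabulary sketch: the following is a minimal illustration of how a node-local term might be published as a vocabulary entry mapped to a public URI. The URIs, the `sameAs` target and the field names are placeholders; the actual APPN vocabulary service and its term structure are still being defined.

```python
# Hypothetical node-local vocabulary term, expressed as a JSON-LD entity.
# The @id, sameAs target and property names are illustrative only.
import json

local_term = {
    "@id": "https://example.org/appn/vocab/plant_height",   # placeholder APPN vocabulary URI
    "@type": "DefinedTerm",
    "name": "plant_height",
    "description": "Height of the plant from soil surface to highest point (cm)",
    "sameAs": "http://purl.obolibrary.org/obo/TO_0000207",  # example public trait ontology URI
}

print(json.dumps(local_term, indent=2))
```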
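Packaging sketch: a minimal example of building and archiving an RO-Crate with the ro-crate-py library (`pip install rocrate`), assuming the APPN packaging library adopts a similar approach. The file names, metadata values and CSV content are placeholders, not part of the APPN Domain Model.

```python
# Minimal sketch of packaging study outputs as an RO-Crate using ro-crate-py.
# Paths and metadata values are placeholders; the real APPN packaging library
# will transform node tables and spreadsheets into Domain Model entities.
from pathlib import Path
from rocrate.rocrate import ROCrate

# Create a tiny placeholder data table so the example is self-contained.
Path("observations.csv").write_text("plot,plant_height_cm\nA1,52.3\nA2,48.9\n")

crate = ROCrate()
crate.root_dataset["name"] = "Example wheat drought trial"
crate.root_dataset["description"] = "Phenotyping data collected at an APPN node"

# Attach a data table and describe it with JSON-LD properties.
crate.add_file(
    "observations.csv",
    properties={
        "name": "Plot-level trait observations",
        "encodingFormat": "text/csv",
    },
)

# The crate grows incrementally during the study, then is archived at the end.
crate.write("example-crate")          # directory layout
crate.write_zip("example-crate.zip")  # archived, ready for publication
```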
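Publishing sketch: one possible shape of the publishing step, uploading the archived crate and a derived landing-page document to S3 with boto3. The bucket name, object keys, landing-page fields and the Pawsey endpoint URL are assumptions for illustration; credentials are taken from the usual boto3 configuration (environment variables or `~/.aws/credentials`).

```python
# Hypothetical publishing step: upload the archived RO-Crate and a landing-page
# JSON document derived from its metadata to shared APPN S3 storage.
import json
import boto3

s3 = boto3.client("s3", endpoint_url="https://projects.pawsey.org.au")  # assumed endpoint

bucket = "appn-node-example"                                 # hypothetical per-node bucket
crate_key = "studies/wheat-drought-2024/example-crate.zip"   # hypothetical object key

# 1. Upload the archived RO-Crate produced in the packaging sketch.
s3.upload_file("example-crate.zip", bucket, crate_key)

# 2. Upload a landing-page document generated from the crate's JSON-LD metadata.
landing = {
    "title": "Example wheat drought trial",
    "crate": crate_key,
    "license": "https://creativecommons.org/licenses/by/4.0/",
}
s3.put_object(
    Bucket=bucket,
    Key="studies/wheat-drought-2024/landing.json",
    Body=json.dumps(landing).encode("utf-8"),
    ContentType="application/json",
)
```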
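Lifecycle sketch: a hedged illustration of migrating older assets to colder storage via an S3 lifecycle rule. The AWS-style API shown here may not be supported in the same form on the Pawsey or NCI object stores, and the rule name, prefix, age threshold and storage class are all illustrative.

```python
# Sketch of an S3 lifecycle rule that transitions older study assets to a
# colder storage class. All values below are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://projects.pawsey.org.au")  # assumed endpoint

s3.put_bucket_lifecycle_configuration(
    Bucket="appn-node-example",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-studies",
                "Filter": {"Prefix": "studies/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 365, "StorageClass": "GLACIER"}  # placeholder cold tier
                ],
            }
        ]
    },
)
```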
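Metadata-index sketch: an illustration of the NoSQL alternative, indexing landing-page documents in MongoDB via pymongo. The connection string, database, collection and field names are assumptions; the actual APPN metadata store has not been finalised, and the DOI shown is a placeholder.

```python
# Illustrative NoSQL index of dataset metadata documents using MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection
collection = client["appn"]["dataset_metadata"]

collection.insert_one({
    "doi": "10.xxxx/example",                        # placeholder DOI minted via DataCite
    "title": "Example wheat drought trial",
    "crate_key": "studies/wheat-drought-2024/example-crate.zip",
    "node": "example-node",
    "keywords": ["wheat", "drought", "UAV"],
})

# Field and text indexes support the query and faceted access described above.
collection.create_index("node")
collection.create_index([("title", "text"), ("keywords", "text")])
```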
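Graph-query sketch: a small example of the graph alternative for the data index, loading RO-Crate JSON-LD metadata into an RDF graph and querying it with SPARQL. It assumes rdflib 6 or later (which includes JSON-LD support) and uses the crate directory from the packaging sketch; the query itself is illustrative, not part of the APPN Domain Model.

```python
# Load RO-Crate JSON-LD metadata into an RDF graph and query it with SPARQL.
# Requires rdflib >= 6; resolving the RO-Crate @context needs network access.
from rdflib import Graph

g = Graph()
g.parse("example-crate/ro-crate-metadata.json", format="json-ld")

# Find every file described in the crate, with its name and format.
query = """
    PREFIX schema: <http://schema.org/>
    SELECT ?file ?name ?format WHERE {
        ?file a schema:MediaObject ;
              schema:name ?name ;
              schema:encodingFormat ?format .
    }
"""
for row in g.query(query):
    print(row.file, row.name, row.format)
```

Retaining the metadata as a graph in this way keeps the full richness of the JSON-LD and leaves room for linking out to external vocabularies and data sources without schema migrations.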