Folder structure - aus-plant-phenomics-network/appn-implementation GitHub Wiki

NOTE: Following call on 2025-07-08, references in this document to "levels" and "Level0_raw", etc. were changed to refer to "tiers" and "T0_raw", etc.

APPN will document the contents of every data package using JSON-LD metadata. This approach should accommodate almost any internal organisation for the contents of the package. APPN tools, landing pages and APIs will use the metadata to surface assets according to the interests and needs of users.

Nevertheless, we recommend that nodes follow common practices for organising data inside each dataset, since this will make the data readily interpretable by any future user, even one who is unaware of APPN's data model and unable to interpret the JSON-LD metadata.

Different nodes will naturally face different tensions in how they organise their data internally.

Given the central data team's focus on delivering a data package that meets the needs of the clients who requested a study, we naturally prioritise the study-first view, with a sub-organisation that makes the processing tiers clear. Something like this feels to me like the default:

  • Study A
    • Metadata, etc.
    • T0_raw
      • Sensor 1
      • Sensor 2
    • T1_proc
      • Sensor 1
      • Sensor 2
    • T2_traits
  • Study B
    • ...

This would mean that any future user will immediately see the raw/processed/trait division and be able to orient themselves to their own interests.

However, this may be a dreadful way for a node to manage data during the execution of a study. Data from different sensors may feed into computers that process just that part of the study, in which case the data on each computer may be organised by sensor or platform. A side benefit is that write access to each folder can then easily be restricted to the systems and tools that collect that type of data, which can enable real performance optimisation, especially for high-bandwidth data. That leads to something like the following, with the various root folders possibly on completely different file systems (bold text explained below):

  • Studies
    • **Study A**
      • Metadata, etc. including references to sensor data locations and time periods
      • T2_traits
    • Study B
      • Metadata, etc. including references to sensor data locations and time periods
      • T2_traits
  • Sensor 1 <-- Sensor commissioned explicitly for each project, i.e. with metadata identifying the study
    • **Study A**
      • Metadata, etc.
      • T0_raw
      • T1_proc
    • Study B
      • Metadata, etc.
      • T0_raw
      • T1_proc
  • Sensor 2 <-- Sensor producing continuous time series without reference to any study
    • T0_raw
      • **Date 1**
      • Date 2
      • ...
    • T1_proc <-- May not exist if sensor is simple and reliable
      • **Date 1**
      • Date 2
      • ...

The idea with Sensor 2 is that it is something like a weather station that simply collects data as a time series. People don't "tell" it that it is now contributing to Study A. They just use the data as part of the analyses for Study A, so they need a way to refer to the subset of the time series that relates to the study.

In this case, packaging the data for Study A (which took place entirely on Date 1) in a canonical form for publication (as in the first bulleted list) would involve aggregating and reorganising the elements with bold text into a new structure with Study A as the root.
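Mechanically, that repackaging is mostly tree copies. A minimal sketch in Python, where the layout, paths and helper name are my own illustrative assumptions rather than an actual APPN tool:

```python
import shutil
from pathlib import Path


def package_study(study: str, studies_root: Path,
                  sensor_roots: dict, dest: Path) -> None:
    """Assemble a canonical study-first package from node-internal folders.

    `sensor_roots` maps a sensor label to the folder holding that sensor's
    per-study subfolders (the Sensor 1 pattern above). Hypothetical helper,
    not part of any APPN tooling.
    """
    dest.mkdir(parents=True, exist_ok=True)
    # Study-level metadata and T2_traits come straight from the Studies tree.
    shutil.copytree(studies_root / study, dest, dirs_exist_ok=True)
    # Per-study sensor subtrees slot in under the matching tier folders.
    for sensor, root in sensor_roots.items():
        for tier in ("T0_raw", "T1_proc"):
            src = root / study / tier
            if src.is_dir():
                shutil.copytree(src, dest / tier / sensor, dirs_exist_ok=True)
```

Study-scoped data from a continuous sensor like Sensor 2 would need an extra step to select the relevant dated folders first (see the time-window configs below).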

The data from Sensor 2 could instead be published as its own (regularly updated) dataset or series of datasets (e.g. monthly); in that case, the final packaged Study A dataset could simply refer to the relevant Sensor 2 datasets and time periods from its metadata. Any tier 2 results derived from these data in Study A would be part of the Study A dataset, but a user who wants to view or reanalyse the tier 0/1 data from Sensor 2 will need to access the relevant Sensor 2 dataset. Either approach will work, and APPN tools should be able to reorganise the data and facilitate subsequent re-slicing by users.

I think we have to allow this kind of flexibility, but we can make some recommendations (note that I have used "should" throughout rather than "must"):

  • Any folder containing primary sensor readings or images should have the name "T0_raw" (case-insensitive, so "t0_raw", "T0_Raw", etc. are valid)
  • Any folder containing cleaned, normalised sensor readings or images should have the name "T1_proc"
  • Any folder containing derived plant (and environmental) trait data should have the name "T2_traits"
  • Any folder containing other intermediate data products may be given any appropriate name and will not be interpreted in the final data-packaging steps; we aim to document the tools used to get from tier 0 to tier 1 and from tier 1 to tier 2 in such a way that only these tiers need to be retained
  • Each folder that brings together the data for a particular study or from a particular platform/sensor (or associated with some other common element, like a growth facility or variety) should contain a simple metadata file called (off the top of my head) "config_metadata.json" or "config_metadata.yaml" (case-insensitive). This file identifies the common elements in a way that assists data and metadata integration, basically using the id or a documented label for each study, sensor, etc.
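The tier-naming recommendation above can be checked mechanically. A small sketch, assuming the case-insensitivity stated for "T0_raw" applies to all three tier names (the patterns and function name are mine, not APPN tooling):

```python
import re
from typing import Optional

# Case-insensitive patterns for the recommended tier folder names.
TIER_PATTERNS = {
    "raw": re.compile(r"^t0_raw$", re.IGNORECASE),
    "processed": re.compile(r"^t1_proc$", re.IGNORECASE),
    "traits": re.compile(r"^t2_traits$", re.IGNORECASE),
}


def classify_folder(name: str) -> Optional[str]:
    """Return the tier a folder name denotes, or None for other folders.

    Folders that match no pattern are intermediate products and are
    ignored by the final packaging steps.
    """
    for tier, pattern in TIER_PATTERNS.items():
        if pattern.match(name):
            return tier
    return None
```

For example, `classify_folder("T0_Raw")` and `classify_folder("t0_raw")` both resolve to the raw tier, while an intermediate folder such as "scratch" resolves to `None`.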

The final bullet means that, in the second folder structure above, many of the folders should contain a small file with one or more key-value pairs. Using YAML examples, and assuming we use URIs as identifiers for everything (in reality, we will also need to handle well-managed text labels that uniquely identify each element locally), something like this:

In Studies > Study A and in Sensor1 > Study A:

config:
    study:
        id: https://ld.plantphenotyping.org.au/uq/study/ozbarley_winter_2025

In Sensor 1:

config:
    sensor:
        id: https://ld.plantphenotyping.org.au/uq/sensor/gobi_0001

In Sensor 2:

config:
    sensor:
        id: https://ld.plantphenotyping.org.au/uq/sensor/weather_station_6

In Sensor 2 > T0_raw > Date1 and Sensor 2 > T1_proc > Date1:

config:
    startTime: "2025-07-03T00:00:00+10:00"
    endTime: "2025-07-03T23:59:59+10:00"
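These time windows would let tooling select which dated Sensor 2 folders overlap a study period. A hedged sketch (the helper is illustrative; it assumes ISO 8601 timestamps with offsets, as in the example above):

```python
from datetime import datetime


def overlaps(folder_cfg: dict, study_start: str, study_end: str) -> bool:
    """True if a dated folder's [startTime, endTime] window intersects
    the study period. All values are ISO 8601 strings with UTC offsets,
    e.g. "2025-07-03T00:00:00+10:00".
    """
    f_start = datetime.fromisoformat(folder_cfg["startTime"])
    f_end = datetime.fromisoformat(folder_cfg["endTime"])
    # Two intervals intersect iff each starts before the other ends.
    return (f_start <= datetime.fromisoformat(study_end)
            and f_end >= datetime.fromisoformat(study_start))
```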

That way, we can walk the folder hierarchy for any file and collect all the metadata elements that need to be inferred for that file. For example, for all files in Sensor 1 > Study A, we can infer:

config:
    study:
        id: https://ld.plantphenotyping.org.au/uq/study/ozbarley_winter_2025
    sensor:
        id: https://ld.plantphenotyping.org.au/uq/sensor/gobi_0001
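That walk-and-merge step can be sketched as follows. This is a minimal assumption-laden example, not APPN tooling: it handles only the JSON variant for brevity (a YAML variant would use `yaml.safe_load` from PyYAML), and lets deeper folders override shallower ones on key conflicts:

```python
import json
from pathlib import Path

# The recommended config file names, matched case-insensitively.
CONFIG_NAMES = {"config_metadata.json", "config_metadata.yaml"}


def load_config(folder: Path) -> dict:
    """Read the config_metadata file in `folder`, if any (JSON only here)."""
    for entry in folder.iterdir():
        if entry.is_file() and entry.name.lower() in CONFIG_NAMES:
            if entry.suffix.lower() == ".json":
                return json.loads(entry.read_text()).get("config", {})
    return {}


def collect_config(root: Path, leaf: Path) -> dict:
    """Merge the config of every folder from `root` down to `leaf`.

    Deeper folders win on key conflicts, so e.g. a study-level config
    can refine a sensor-level one.
    """
    merged = {}
    current = root
    merged.update(load_config(current))
    for part in leaf.relative_to(root).parts:
        current = current / part
        merged.update(load_config(current))
    return merged
```

Calling `collect_config` on Sensor 1 > Study A would then yield both the sensor id (from the Sensor 1 folder) and the study id (from the Study A folder), as in the merged example above.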