Overview of PyFluxPro - OzFlux/PyFluxPro GitHub Wiki

General

PyFluxPro takes data recorded at a flux tower and processes it to a final, gap-filled product with Net Ecosystem Exchange (NEE) partitioned into Gross Primary Productivity (GPP) and Ecosystem Respiration (ER). The starting point is an Excel workbook containing flux tower data, typically averaged over 30 or 60 minutes. The end point is a collection of netCDF and Excel files containing gap-filled and partitioned data at the tower time step, along with daily, monthly, annual and cumulative summaries. PyFluxPro can be used interactively, allowing users to fine-tune the processing at each stage after reviewing plots of the intermediate data, or in batch mode, where multiple processing stages can be linked together and performed without supervision.

Further descriptions of PyFluxPro are contained in 2 documents included in the PyFluxPro/docs directory: the PowerPoint presentation "An_Introduction_to_PyFluxPro.pptx" and a paper from the 2017 Biogeosciences OzFlux Special Issue, "bg-14-2903-2017.pdf".

Processing Levels

PyFluxPro is built around 6 processing levels:

  1. Level 1 (L1) - reads data from an Excel workbook, combines it with user-specified metadata and writes the data and metadata to an L1 netCDF file.
  2. Level 2 (L2) - reads the L1 netCDF file, applies the user-specified quality control measures to the data, plots the data before and after quality control (described in the Level 2 section) and writes the quality-controlled data to an L2 netCDF file.
  3. Level 3 (L3) - reads the L2 netCDF file, applies the user-specified post-processing to the quality-controlled data (described in the Level 3 section), plots the post-processed data and writes it to an L3 netCDF file.
  4. Level 4 (L4) - reads the L3 netCDF file, gap fills the meteorological data using ancillary data from automated weather stations (AWS), numerical weather prediction (NWP) models and climatology, plots the gap-filled data and writes it to an L4 netCDF file.
  5. Level 5 (L5) - reads the L4 netCDF file, applies the u* filter to the selected flux data, gap fills the flux data using a neural network (SOLO) or Marginal Distribution Sampling (MDS), plots the gap-filled data and writes it to an L5 netCDF file.
  6. Level 6 (L6) - reads the L5 netCDF file, derives ER from the nocturnal, u*-filtered NEE data, trains 3 respiration models on the observed ER (SOLO, Lloyd-Taylor and Lasslop et al.) and partitions the gap-filled NEE into GPP and ER using the observed and modelled ER.

Levels 1 to 3 are typically done in chunks of a single year. Yearly L3 files can then be concatenated to a single file containing multiple years before applying the L4, L5 and L6 processing. However, PyFluxPro is flexible enough to allow users to process single years through all steps or to use multiple-year files for all stages. The user is free to process their data in whatever time chunks make sense to them.

Control Files

PyFluxPro uses text-based control files to specify which quality control and processing options are to be applied to which variables. The control files can be edited by the user within the application, with a rich set of context-sensitive menus to guide the user's choice of options.

The typical workflow for all processing levels is to open the appropriate control file in PyFluxPro, edit the control file to reflect the user's processing choices using the right-click, context-sensitive menu or by direct editing in the GUI, save any changes and then run PyFluxPro with the edited control file.

Detailed explanations of the control files and how to edit them are given in the sections for each processing level.
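To give a flavour of what a control file looks like, the fragment below is a schematic sketch of an L2-style control file using the nested, INI-like section layout. The section names, variable name and keywords here are illustrative; consult the templates in PyFluxPro/controlfiles/templates for the exact syntax.

```ini
[Files]
    file_path = ../MySite/Data/
    in_filename = MySite_2021_L1.nc
    out_filename = MySite_2021_L2.nc
[Variables]
    [[Ta_HMP_2m]]
        [[[RangeCheck]]]
            lower = -10
            upper = 50
```

Nesting depth is indicated by the number of square brackets, so options can be attached to an individual variable and to an individual check on that variable.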

Typical Workflow using PyFluxPro

PyFluxPro is designed to be very flexible so there are a large number of possible data paths through the application. Here we describe the most typical data path, which is also the one used by the example files (see https://github.com/OzFlux/PyFluxPro/wiki/Using-the-Examples). The diagram below shows the basic workflow and is taken from the PowerPoint presentation An_Introduction_to_PyFluxPro.pptx in the PyFluxPro/docs folder.

The starting point for PyFluxPro is usually data direct from the data logger (fluxes calculated by PyFluxPro from the covariances) or the output from a program for calculating fluxes such as EddyPro (fluxes calculated by EddyPro from the 10 or 20 Hz turbulence data). These can be read as CSV files, but it is usually more convenient to assemble multiple CSV files into a single L1 Excel workbook with each CSV as a separate worksheet.
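Assembling the workbook can be scripted rather than done by hand. Below is a minimal sketch with pandas (assuming pandas and an Excel writer such as openpyxl are installed); the file names and CSV contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

def combine_csvs_to_workbook(csv_paths, workbook_path):
    """Write each CSV file to the workbook as a separate worksheet,
    named after the CSV file (Excel limits sheet names to 31 characters)."""
    with pd.ExcelWriter(workbook_path) as writer:
        for path in csv_paths:
            sheet = os.path.splitext(os.path.basename(path))[0][:31]
            pd.read_csv(path).to_excel(writer, sheet_name=sheet, index=False)

# Build two tiny example CSVs (hypothetical names and contents) and combine them.
tmp = tempfile.mkdtemp()
for name, text in [("slow_met", "TIMESTAMP,Ta\n2021-01-01 00:30,21.3\n"),
                   ("fluxes", "TIMESTAMP,Fc\n2021-01-01 00:30,-2.1\n")]:
    with open(os.path.join(tmp, name + ".csv"), "w") as f:
        f.write(text)

csvs = [os.path.join(tmp, name + ".csv") for name in ("slow_met", "fluxes")]
workbook = os.path.join(tmp, "MySite_2021_L1.xlsx")
combine_csvs_to_workbook(csvs, workbook)
```

Each worksheet then plays the same role as one logger or EddyPro output table when the workbook is read at L1.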

PyFluxPro doesn't care what period of data you use at a particular level. However, for long-term sites it is convenient to use yearly files for L1 to L3 processing and then to concatenate the yearly L3 files into a single, multi-year file for processing beyond L3.

So, a typical workflow, for each year, for a site active over many years would be:

  1. Create an L1 workbook for the current year. Usually this is simply a matter of starting with an example L1 workbook (e.g. last year's) as a template and copying this year's data into that template file.
  2. Run the L1 workbook through to L3. You can use the L1 to L3 control files from the previous year as templates for this year's processing.
  3. Update the L1 workbook with new data as it is received from the site and repeat step 2 above.
  4. When a year is finished, or when you want to push the data collected so far through gap filling and partitioning:
    1. Concatenate all years collected to date.
    2. Do the climatology using the Utilities/Climatology menu option. The Excel workbook output by this step will be used at L4.
    3. Do the u* threshold detection using the Utilities/u* threshold menu option. The Excel workbook output by this step will be used at L5.
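The climatology step amounts to averaging each variable by season and time of day across all available years, so that every (month, time-of-day) bin has a typical value that can later be used as a last-resort gap-filling source. A minimal pandas sketch of the idea, using synthetic data (this is not the Climatology utility's actual algorithm or output format):

```python
import numpy as np
import pandas as pd

# Two years of synthetic half-hourly air temperature with an artificial gap.
idx = pd.date_range("2019-01-01 00:30", "2020-12-31 23:30", freq="30min")
doy = idx.dayofyear.to_numpy()
hod = idx.hour.to_numpy() + idx.minute.to_numpy() / 60.0
ta = pd.Series(15 + 10 * np.sin(2 * np.pi * doy / 365)
               + 5 * np.sin(2 * np.pi * hod / 24), index=idx)
ta.iloc[1000:1100] = np.nan  # a 50 hour gap

# Climatology: the mean for each (month, time-of-day) bin across all years.
clim = ta.groupby([ta.index.month, ta.index.time]).mean()

# Estimate values for the gap from the matching climatology bins.
gap = ta.index[ta.isna()]
estimates = [clim.loc[(ts.month, ts.time())] for ts in gap]
```

Because every timestamp maps to a (month, time-of-day) bin, the climatology can supply an estimate for any gap, which is why it is always available as a fallback at L4.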

The typical workflow for PyFluxPro between L3 and L4 would be as shown below.

The typical steps in moving between L3 and L4 involve:

  1. Concatenating the yearly L3 files into a single, multi-year L3 file. See the concatenation control file examples in PFP_examples or in PyFluxPro/controlfiles/templates. Concatenation must be done through a control file because the user has to tell PyFluxPro which files are to be concatenated and what output file should be created.
  2. Running the climatology on the multi-year L3 file. This process produces a climatology Excel workbook that can be used as a source of alternate data when gap filling meteorological data at L4. This can be done through the Utilities/Climatology menu option or by opening a climatology control file in the PyFluxPro GUI, editing the control file and running it.
  3. Running the u* threshold detection routines on the multi-year L3 files to estimate the u* threshold for each year. This can be done through the Utilities/u* threshold menu option. There are 4 different methods available for estimating the u* threshold:
    1. CPD (Barr) - a Python implementation of the original Barr et al. MATLAB scripts.
    2. CPD (McHugh) - an implementation of the Barr et al. CPD technique based, more or less, on what was presented in the original paper.
    3. CPD (McNew) - a reimplementation of the CPD method of Barr et al. that is closer to the original MATLAB scripts but not a line-by-line translation from MATLAB to Python.
    4. MPT - an implementation of the Moving Point Threshold technique from Papale et al. using the same C code as used in the FluxNet 2015 synthesis.
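All four methods share a common idea: nocturnal NEE increases with u* until turbulence is sufficient, then levels off, and the threshold is the change point between the two regimes. The toy sketch below illustrates change-point detection on binned data by fitting a rising line followed by a plateau; it is a simplified illustration, not any of the four implementations above.

```python
import numpy as np

def cpd_threshold(ustar_bins, nee_bins):
    """Toy change-point detector: choose the u* bin where a rising-line-then-
    plateau model gives the smallest total squared error. Illustrative only,
    not the Barr et al. or Papale et al. algorithms."""
    best_sse, best_u = np.inf, None
    for i in range(2, len(ustar_bins) - 1):
        # Left segment: least-squares line through the rising limb.
        coef = np.polyfit(ustar_bins[:i], nee_bins[:i], 1)
        sse = np.sum((np.polyval(coef, ustar_bins[:i]) - nee_bins[:i]) ** 2)
        # Right segment: a constant (the plateau above the threshold).
        sse += np.sum((nee_bins[i:] - nee_bins[i:].mean()) ** 2)
        if sse < best_sse:
            best_sse, best_u = sse, ustar_bins[i]
    return best_u

# Synthetic binned nocturnal NEE: rises with u*, then plateaus near u* = 0.3.
u = np.arange(1, 13) * 0.05            # u* bin centres, 0.05 to 0.60 m/s
nee = np.where(u < 0.3, 8.0 * u, 3.0)
print(cpd_threshold(u, nee))  # ~0.3, the simulated threshold
```

The real methods add bootstrapping, seasonal and temperature stratification, and significance tests on the detected change points, which is why they can disagree on the same data.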

Processing from L4 to L6 is relatively straightforward:

  1. L4 is designed to fill gaps in the meteorological data from the flux tower, see https://github.com/OzFlux/PyFluxPro/wiki/Level-4 for details. This is a necessary prerequisite to gap filling the fluxes at L5 since that process depends on having no gaps in the meteorological data. To fill the gaps in the meteorological data we can use:
    1. Climatological data for the site calculated from the Utilities/Climatology step above.
    2. Automatic weather station (AWS) data.
    3. Numerical weather prediction (NWP) models.
    4. Global reanalysis products such as ERA5.
  2. L5 is designed to fill gaps in the fluxes from the flux tower, see https://github.com/OzFlux/PyFluxPro/wiki/Level-5. This is necessary to produce budgets of carbon (NEE) and water exchange at the site. PyFluxPro offers 3 strategies for filling gaps in the flux data:
    1. A neural network (SOLO) optimised to fill gaps less than ~14 days (short gaps) using primarily site data.
    2. Marginal Distribution Sampling (MDS) as described in Reichstein et al. and used for the FluxNet 2015 synthesis.
    3. A neural network (SOLO) optimised to fill gaps longer than ~14 days (long gaps) using data from remote sensing e.g. MODIS.
  3. L6 is for partitioning Net Ecosystem Exchange (NEE) into Ecosystem Respiration (ER) and Gross Primary Productivity (GPP):
    1. Here be Dragons ...
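The arithmetic behind the partitioning is simple once ER has been modelled: with the convention that positive NEE is a flux to the atmosphere, NEE = ER - GPP, so GPP = ER - NEE. The sketch below uses the Lloyd and Taylor (1994) respiration model, one of the three respiration models trained at L6; the rb and E0 parameter values are illustrative, not fitted.

```python
import numpy as np

def lloyd_taylor(T_K, rb, E0, T0=227.13, Tref=283.15):
    """Lloyd & Taylor (1994) respiration model: ER as a function of
    temperature in Kelvin, with base respiration rb at Tref (10 degC)."""
    return rb * np.exp(E0 * (1.0 / (Tref - T0) - 1.0 / (T_K - T0)))

# Daytime NEE (umol m-2 s-1, positive = release to atmosphere) and air
# temperature (K); rb and E0 below are illustrative, not fitted, values.
nee = np.array([-12.0, -8.5, -3.0])
T_K = np.array([295.0, 298.0, 301.0])

er = lloyd_taylor(T_K, rb=2.0, E0=200.0)  # modelled ecosystem respiration
gpp = er - nee                            # NEE = ER - GPP  =>  GPP = ER - NEE
```

The dragons live in fitting rb and E0 from noisy nocturnal data and in the systematic differences between the three ER models, not in the arithmetic itself.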

A Note About Time

The whole area of data logger time stamps and time zones is complicated and can get messy very quickly. This section explains a few things about the way PyFluxPro uses time:

  1. PyFluxPro never alters the relationship between your data and its timestamp. That is sacrosanct! The timestamp that your data has when it is read in at L1 is always preserved.
  2. The time zone of your site is entered in the Global section of the L1 control file and becomes a global attribute in the L1 and subsequent netCDF files. PyFluxPro does not use the time zone but it does check the specified time zone against the site latitude and longitude to make sure they are consistent. A full list of time zones in the format expected by PyFluxPro is available at https://en.wikipedia.org/wiki/List_of_tz_database_time_zones.
  3. L1 to L3, L5 and L6 processing make no assumptions about the timestamp of your data. When using alternate data to gap fill meteorological data at L4, PyFluxPro assumes that the timestamp of the flux tower data and the alternate data are in the same time zone. TERN Ecosystem Processes Central Node supplies alternate data from automatic weather stations, numerical weather prediction models and reanalyses for each TERN/OzFlux site in local time for that site.
  4. PyFluxPro makes no assumption about the timestamp referring to the start or end of the averaging period. However, the alternate data supplied by TERN EPCN for gap filling at L4 uses the convention that the timestamp refers to the end of the averaging period. To be consistent with this, the timestamp of the data at L1 should also refer to the end of the averaging period. NOTE: the gap filling at L4 removes any phase difference between the alternate and flux tower data by doing a lagged correlation before the least squares fit. This should deal with any timestamp inconsistencies, but it is best not to rely on this feature.
  5. Dates and times used internally by PyFluxPro do not have a time zone associated with them. But we think we should make PyFluxPro timezone aware as soon as possible. What do you think?
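The lagged correlation mentioned in point 4 can be illustrated with a simple sketch: shift one series against the other and keep the lag with the highest correlation. This is a generic illustration of the idea, not PyFluxPro's actual routine.

```python
import numpy as np

def best_lag(tower, alternate, max_lag=4):
    """Return the shift (in time steps) that best aligns `alternate` with
    `tower`; positive means `alternate` is delayed relative to the tower data."""
    n = len(tower)
    best_r, best = -np.inf, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = tower[:n - lag], alternate[lag:]
        else:
            a, b = tower[-lag:], alternate[:n + lag]
        r = np.corrcoef(a, b)[0, 1]
        if r > best_r:
            best_r, best = r, lag
    return best

# The "alternate" series is the tower series delayed by two time steps.
t = np.sin(np.linspace(0.0, 8.0 * np.pi, 200))
alt = np.roll(t, 2)
print(best_lag(t, alt))  # prints 2
```

Once the best lag is found, the alternate series can be shifted before the least squares fit, which is how a timestamp convention mismatch between the two sources gets absorbed.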