Level 2 - OzFlux/PyFluxPro GitHub Wiki
Level 2 - Quality control of data
Overview
Level 2, or L2 for short, is the stage where PyFluxPro allows you to remove what you believe to be bad data and to plot the results of the quality control process so you can make sure you are doing something sensible. L2 reads an L1 netCDF file, produces an L2 netCDF and plots of the variables before and after quality control.
Eddy covariance is an inherently noisy measurement technique and the "noisiness", or signal-to-noise ratio, varies a great deal from site to site and even from season to season within a single site. In general, it's not possible to use flux tower data without some sort of quality control and this is especially true when you get to the later stages of processing: gap filling and partitioning NEE into GPP and ER. More complex processing steps often act as "noise amplifiers".
Programs that calculate the average fluxes from 10 or 20 Hz data, such as EddyPro, often have very good schemes for detecting bad data based on various statistics compiled from the 10 or 20 Hz data and these can be a good indicator of data quality. Often, though, data which is obviously bad is still present even when these objective methods are applied. Also, we often have to clean up the non-flux data as well. Sensors fail in many ways and sometimes they continue to produce data that looks similar to good data but is, in fact, bad data. We need some general approaches to cleaning up the data from flux towers and PyFluxPro provides these approaches.
A word of caution. Many of these quality control techniques are subjective, not objective. They allow you to remove what you consider to be bad data so that later processing stages, particularly those based on statistical techniques such as u* threshold detection and gap filling, will work effectively. However, their subjective nature makes them dangerous. It is possible to produce very clean-looking, smooth data sets using these quality control techniques but in the process you may remove a lot of good data simply because it doesn't conform to your idea of how your ecosystem should behave. Use these techniques carefully, not blindly, and always ask yourself if your processing methods support your intended analysis.
Before we leave the overview of quality control, it is worth restating Isaac's First Law (quality control version):
- It is far easier to collect good data from the start than to clean up bad data.
The Quality Control Checks
The general quality control method is as follows:
- The user specifies the quality control checks they want to apply in the L2 control file. The quality control checks must be applied to each individual variable.
- When L2 is run, PyFluxPro will check each data value against the quality control checks you have specified in the L2 control file.
- If the data value passes the quality control check, it is left as is and the quality control flag for that data value (see Overview of PyFluxPro) is left as 0 (good data).
- If the data value fails the quality control check, the data value is set to -9999 and the quality control flag for that data value is set to the unique value for that quality control check.
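The pass/fail logic in the list above can be sketched in a few lines of Python. This is an illustration only, not PyFluxPro's own code: the function name `apply_check` and the `passes` callback are hypothetical, and the flag value 2 is used because this page identifies it as the RangeCheck flag.

```python
MISSING = -9999  # value given to rejected data, as described above

def apply_check(values, flags, passes, check_flag):
    """Return new (values, flags) lists after applying one check.

    passes is a hypothetical callback that returns True for good data;
    check_flag is the flag value that identifies this check.
    """
    out_values, out_flags = [], []
    for value, flag in zip(values, flags):
        if flag != 0 or passes(value):
            # already flagged, or passed this check: leave unchanged
            out_values.append(value)
            out_flags.append(flag)
        else:
            # failed: reject the value and record which check rejected it
            out_values.append(MISSING)
            out_flags.append(check_flag)
    return out_values, out_flags

values = [12.5, 40.0, -3.0, 20.1]
flags = [0, 0, 0, 0]
new_values, new_flags = apply_check(values, flags, lambda v: 0.0 <= v <= 35.0, 2)
print(new_values)  # [12.5, -9999, -9999, 20.1]
print(new_flags)   # [0, 2, 2, 0]
```

Returning new lists rather than modifying the inputs keeps the pre-check data available for comparison, which mirrors the before and after plots produced at L2.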
The quality control checks available in PyFluxPro are described below:
- DependencyCheck rejects data values for this variable depending on the state of one or more prerequisite variables. If any of the prerequisite variables fail their quality control checks then the data value for this variable is also rejected.
- DiurnalCheck rejects data values that are more than a specified number of standard deviations away from the mean value for this hour of the day. The user specifies the number of standard deviations. If you enter a single value for the number of standard deviations, the single value is used for the whole data set. You can enter 12 values separated by commas to specify different values for each month of the year.
- ExcludeDates allows the user to reject data based on one or more date ranges.
- ExcludeHours allows the user to reject data for specified hours of the day for one or more date ranges.
- LowerCheck rejects data that lies below a lower bound for an arbitrary number of date ranges. The lower bound is piece-wise linear.
- RangeCheck rejects data values that lie outside the lower and upper limits specified by the user. If you enter a single value for each of the lower and upper limits, those values are used for the whole data set. Alternatively, you can enter 12 values separated by commas to specify different limits for each month of the year.
- UpperCheck rejects data that lies above an upper bound for an arbitrary number of date ranges. The upper bound is piece-wise linear.
In addition to the quality control checks, PyFluxPro can apply 2 functions at L2:
- CorrectWindDirection allows the user to add an offset to wind directions for an arbitrary number of date ranges.
- Linear allows the user to apply a linear correction to any variable for an arbitrary number of date ranges.
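As a rough sketch of what these two functions do to a single data value, consider the Python below. It is not PyFluxPro's implementation: the function and parameter names (`slope`, `offset`) are assumptions, wrapping the corrected wind direction back into the range 0 to 360 degrees is also an assumption, and the handling of date ranges is left out.

```python
def correct_wind_direction(wd, offset):
    """Add an offset in degrees and wrap the result into [0, 360)."""
    return (wd + offset) % 360.0

def linear_correction(value, slope, offset):
    """Apply a linear correction: value * slope + offset."""
    return value * slope + offset

print(correct_wind_direction(350.0, 15.0))   # 5.0
print(correct_wind_direction(10.0, -20.0))   # 350.0
print(linear_correction(10.0, 1.5, 0.5))     # 15.5
```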
PyFluxPro can also automatically apply quality control checks based on the sonic anemometer data (SONIC_Check) and the IRGA data (IRGA_Check). Both of these are types of dependency checks, see below:
- SONIC_Check uses the sonic anemometer diagnostic word and the standard deviations of vertical wind speed and virtual temperature. If any of these quantities is outside the ranges specified by the user then all variables that use data from the anemometer are rejected. These include the covariances between the wind speed components (Ux, Uy and Uz) and the scalars (AH, CO2 and Ta).
- IRGA_Check uses the IRGA diagnostic word, the H2O and CO2 signal strength (or the AGC) and the standard deviations of H2O and CO2 concentrations. If any of these quantities is outside the ranges specified by the user then all variables that use data from the IRGA are rejected. These include the covariances between the wind speed components (Ux, Uy and Uz) and the scalars (AH, CO2 and Ta).
The SONIC_Check and IRGA_Check can be disabled, see the description of the Options section below.
The L2 Control File
The L2 control file consists of the following sections:
- Files
- Options
- Variables
- Plots
The contents of these sections and how to edit them are described below.
The Files Section
Description of the Files section
The Files section allows the user to specify the path to the input and output files, the names of the input and output files and the path for plots generated by the L2 processing, see the screenshot below.
The entries in the Files section are as follows:
- file_path - the path to the data files
- in_filename - the input file name
- out_filename - the output file name
- plot_path - the path for plots generated by the L2 processing
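To make the roles of the four entries concrete, the Python sketch below combines them into full file names. The entry names are those listed above, but the values are made up for illustration and the assumption that the file names are simply joined onto `file_path` is ours, not taken from the PyFluxPro source.

```python
import os

# Hypothetical values; only the entry names come from the Files section
files = {
    "file_path": "/data/Loxton/Processed",
    "in_filename": "Loxton_L1.nc",
    "out_filename": "Loxton_L2.nc",
    "plot_path": "/data/Loxton/Plots",
}

# Assumed: input and output file names are relative to file_path
in_path = os.path.join(files["file_path"], files["in_filename"])
out_path = os.path.join(files["file_path"], files["out_filename"])

print(in_path)
print(out_path)
```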
Editing the Files Section
The entries in the Files section can be edited by right clicking on the entry in the Value column and using the Browse... feature or by double clicking on the entry in the Value column and manually entering the required text.
The Options Section
Description of the Options Section
The Options section allows the user to specify the options that control some aspects of the L2 processing.
The Options section contains an entry called irga_type with the default value of Li-7500. You can right click on the irga_type value to display a list of options for irga_type, see below.
A list of the options available can be displayed by right clicking on the Options section title in the Parameter column, see the screenshot below.
The SONIC_Check and IRGA_Check options allow the user to turn off the default SONIC and IRGA checks.
The Variables Section
Description of the Variables section
The Variables section is where the user specifies the quality control checks to be applied to the variables read in at L1. An example of a variable subsection from the Loxton control file is shown below.
The name of the variable sub-section in the Parameter column, AH_HMP_10m in this case, is the variable name in the netCDF file.
Each quality control check forms a sub-section within the variable section and each of the quality control checks has additional entries that specify how the check will be used. In the example above, the RangeCheck will be applied with a lower limit of 0.0 and an upper limit of 35.0. You can specify a different limit for each month of the year by entering 12 values separated by commas.
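The single value versus 12 monthly values behaviour of RangeCheck can be sketched as follows. This is a minimal illustration under stated assumptions, not the PyFluxPro code: the helper names `expand_limits` and `range_check` are hypothetical, the monthly upper limits are invented, and the flag value 2 is the RangeCheck flag mentioned in the description of the L2 plots.

```python
MISSING = -9999

def expand_limits(text):
    """Turn '35.0' or 12 comma separated values into a list of 12 floats."""
    parts = [float(p) for p in text.split(",")]
    if len(parts) == 1:
        return parts * 12  # one limit used for every month
    if len(parts) != 12:
        raise ValueError("expected 1 or 12 values")
    return parts

def range_check(months, values, lower_text, upper_text, flag=2):
    """Reject values outside the limits for their month (1 to 12)."""
    lower = expand_limits(lower_text)
    upper = expand_limits(upper_text)
    out_values, out_flags = [], []
    for month, value in zip(months, values):
        if lower[month - 1] <= value <= upper[month - 1]:
            out_values.append(value)
            out_flags.append(0)
        else:
            out_values.append(MISSING)
            out_flags.append(flag)
    return out_values, out_flags

months = [1, 1, 7, 7]
values = [10.0, 36.0, 10.0, 22.0]
monthly_upper = "35,35,30,25,20,20,20,20,25,30,35,35"  # invented limits
checked_values, checked_flags = range_check(months, values, "0.0", monthly_upper)
print(checked_values)  # [10.0, -9999, 10.0, -9999]
print(checked_flags)   # [0, 2, 0, 2]
```

In this example the value 22.0 is accepted by the January limit but rejected in July, where the invented upper limit is 20.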
You can add quality control checks to a variable by right clicking on the variable name to display a context menu with the available quality control checks. Select one and it will be added to the variable section.
The sections below describe the options for each of the quality control checks.
Description of the Quality Control Checks
DependencyCheck
The example above makes the AH_HMP_10m variable dependent on the Ta_HMP_10m variable. Any records rejected from Ta_HMP_10m will also be rejected from AH_HMP_10m.
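The idea behind a dependency check can be sketched in Python as below. The function name is hypothetical and the flag value 99 is a placeholder chosen for the example, not the value PyFluxPro assigns to DependencyCheck.

```python
MISSING = -9999
EXAMPLE_FLAG = 99  # placeholder, not PyFluxPro's DependencyCheck flag

def dependency_check(values, flags, source_flags, check_flag):
    """Reject a value wherever the prerequisite variable was rejected."""
    out_values, out_flags = [], []
    for value, flag, source_flag in zip(values, flags, source_flags):
        if flag == 0 and source_flag != 0:
            out_values.append(MISSING)
            out_flags.append(check_flag)
        else:
            out_values.append(value)
            out_flags.append(flag)
    return out_values, out_flags

ah_values = [10.0, 11.0, 12.0]   # stands in for AH_HMP_10m
ah_flags = [0, 0, 0]
ta_flags = [0, 2, 0]             # second Ta_HMP_10m record failed a check
ah_checked, ah_flags_checked = dependency_check(
    ah_values, ah_flags, ta_flags, EXAMPLE_FLAG)
print(ah_checked)        # [10.0, -9999, 12.0]
print(ah_flags_checked)  # [0, 99, 0]
```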
DiurnalCheck
The example above will reject any AH_HMP_10m data that lies outside of the range +/- 5 standard deviations about the mean for the time of day. For example, the mean and standard deviation are calculated for all data at 06:00 and any points at 06:00 that are outside +/- 5 standard deviations of that mean are rejected.
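A minimal Python sketch of this kind of check is given below. It is not the PyFluxPro implementation: the function name is hypothetical, the use of the population standard deviation is an assumption, and the data are synthetic. The flag value 5 is the DiurnalCheck flag mentioned in the description of the L2 plots.

```python
from collections import defaultdict
from statistics import mean, pstdev

MISSING = -9999

def diurnal_check(times, values, n_sd, flag=5):
    """Reject values more than n_sd standard deviations from the
    mean of all values that share the same time of day."""
    groups = defaultdict(list)
    for time_of_day, value in zip(times, values):
        groups[time_of_day].append(value)
    # mean and (population) standard deviation for each time of day
    stats = {t: (mean(v), pstdev(v)) for t, v in groups.items()}
    out_values, out_flags = [], []
    for time_of_day, value in zip(times, values):
        mu, sd = stats[time_of_day]
        if abs(value - mu) > n_sd * sd:
            out_values.append(MISSING)
            out_flags.append(flag)
        else:
            out_values.append(value)
            out_flags.append(0)
    return out_values, out_flags

# 40 days of synthetic data at two times of day, with one spike at 06:00
times = ["06:00"] * 40 + ["18:00"] * 40
values = [10.0] * 39 + [100.0] + [20.0] * 40
out_values, out_flags = diurnal_check(times, values, n_sd=5)
print(out_values.count(MISSING))  # 1
print(out_flags[39])              # 5
```

Note that the outlier itself contributes to the mean and standard deviation, so with only a few values at each time of day even a large spike may stay within 5 standard deviations.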
ExcludeDates
The example above will exclude all data between 2021-01-01 00:30 and 2021-02-01 00:00 i.e. the month of January. You can add additional date ranges by right clicking on ExcludeDates and selecting Add date range.
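The date range test can be sketched in Python as follows. The function name is hypothetical, the flag value 99 is a placeholder, and treating both ends of the range as inclusive is an assumption based on the January example above.

```python
from datetime import datetime

MISSING = -9999
EXAMPLE_FLAG = 99  # placeholder, not PyFluxPro's ExcludeDates flag
FMT = "%Y-%m-%d %H:%M"

def exclude_dates(timestamps, values, date_ranges, check_flag):
    """Reject every value whose timestamp falls inside any date range.

    date_ranges is a list of (start, end) strings, assumed inclusive.
    """
    ranges = [(datetime.strptime(s, FMT), datetime.strptime(e, FMT))
              for s, e in date_ranges]
    out_values, out_flags = [], []
    for ts, value in zip(timestamps, values):
        if any(start <= ts <= end for start, end in ranges):
            out_values.append(MISSING)
            out_flags.append(check_flag)
        else:
            out_values.append(value)
            out_flags.append(0)
    return out_values, out_flags

timestamps = [datetime(2020, 12, 31, 23, 30), datetime(2021, 1, 1, 0, 30),
              datetime(2021, 2, 1, 0, 0), datetime(2021, 2, 1, 0, 30)]
values = [1.0, 2.0, 3.0, 4.0]
january = [("2021-01-01 00:30", "2021-02-01 00:00")]
kept_values, kept_flags = exclude_dates(timestamps, values, january, EXAMPLE_FLAG)
print(kept_values)  # [1.0, -9999, -9999, 4.0]
```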
ExcludeHours
The example above will exclude all data for the hours of 12:30 and 13:00 between 2021-01-01 00:30 and 2021-02-01 00:00 i.e. the month of January. You can add additional date ranges by right clicking on ExcludeHours and selecting Add hour range.
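A sketch of the same idea in Python, restricted to one date range, is shown below. The function name and the flag value 99 are placeholders for illustration and the inclusive date range is an assumption.

```python
from datetime import datetime

MISSING = -9999
EXAMPLE_FLAG = 99  # placeholder, not PyFluxPro's ExcludeHours flag

def exclude_hours(timestamps, values, start, end, hours, check_flag):
    """Reject values at the given times of day ('HH:MM' strings)
    for timestamps between start and end (assumed inclusive)."""
    out_values, out_flags = [], []
    for ts, value in zip(timestamps, values):
        in_range = start <= ts <= end
        if in_range and ts.strftime("%H:%M") in hours:
            out_values.append(MISSING)
            out_flags.append(check_flag)
        else:
            out_values.append(value)
            out_flags.append(0)
    return out_values, out_flags

timestamps = [datetime(2021, 1, 15, 12, 0), datetime(2021, 1, 15, 12, 30),
              datetime(2021, 1, 15, 13, 0), datetime(2021, 3, 15, 12, 30)]
values = [1.0, 2.0, 3.0, 4.0]
hour_values, hour_flags = exclude_hours(
    timestamps, values,
    datetime(2021, 1, 1, 0, 30), datetime(2021, 2, 1, 0, 0),
    {"12:30", "13:00"}, EXAMPLE_FLAG)
print(hour_values)  # [1.0, -9999, -9999, 4.0]
```

The last record is at 12:30 but falls in March, outside the date range, so it is kept.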
LowerCheck and UpperCheck
The example above will apply:
- A lower limit to the data that starts at 0.0 on 2021-01-01 and rises to 10.0 on 2021-02-01.
- An upper limit to the data that starts at 30.0 on 2021-01-01 and drops to 20.0 on 2021-02-01.
You can add additional date ranges by right clicking on LowerCheck or UpperCheck and selecting Add date range.
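The piece-wise linear limits can be illustrated with a short Python sketch. It uses the limits from the example above and assumes that the date range runs from midnight on 2021-01-01 to midnight on 2021-02-01; the function name `bound_at` is hypothetical and this is not PyFluxPro's own code.

```python
from datetime import datetime

def bound_at(ts, start, end, start_value, end_value):
    """Linearly interpolate a bound between the two ends of a date range."""
    fraction = (ts - start) / (end - start)  # 0.0 at start, 1.0 at end
    return start_value + fraction * (end_value - start_value)

start = datetime(2021, 1, 1)
end = datetime(2021, 2, 1)
midpoint = datetime(2021, 1, 16, 12, 0)  # halfway through January

lower = bound_at(midpoint, start, end, 0.0, 10.0)
upper = bound_at(midpoint, start, end, 30.0, 20.0)
print(lower, upper)  # 5.0 25.0

# Values at the midpoint are kept only if they lie between the two bounds
accepted = [lower <= v <= upper for v in (3.0, 10.0, 27.0)]
print(accepted)  # [False, True, False]
```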
Editing of the Variables Section
Editing the contents of the Variables section is similar to editing other sections in the L2 control file. Items can be added to or removed from the section using a context-sensitive menu that is displayed when the user right clicks on the section or sub-section titles in the Parameter column. Entries in the Value column can be edited by double clicking on the text in the Value column and editing the text.
Removing a Variable
Variables can be removed from the Variables section by right clicking on the variable name and selecting Remove variable, see the screenshot below.
Adding a New Variable
New variables can be added to this section by right clicking on the Variables section title in the Parameter column and selecting Add variable from the displayed context menu. The new variable is added after the last entry in the Variables section and is given the name New variable, see the screenshot below.
You can also add a new variable immediately above an existing variable by right clicking on the variable name and selecting New variable.
The Plots Section
The Plots section is used to specify the number and type of plots to be produced at the end of the L2 processing. Visualising data through plots is an important part of processing data using PyFluxPro. At L2, the plots are designed to show the pre- and post-quality control data so the user can assess how well the quality control checks have worked. It is often necessary to repeat the L2 processing several times while changing the quality control limits or trying different combinations of quality control before achieving the desired results.
The Plots section consists of an arbitrary number of sub-sections and each sub-section name is the title of a plot. Each plot sub-section has a single entry called variables which is a comma separated list of variables to be plotted, see the screenshot below.
You can edit the list of variables to be plotted by double clicking on the variables entry in the Value column.
Adding and Removing Plots
Plots can be added to the Plots section by right clicking on the Plots section title and selecting the type of plot to be added, see below.
To remove a plot, right click on the plot in the Plots section and select Remove plot, see the screenshot below.
Disable and Enable Plots
It can take a long time to render all of the requested plots to the screen and this can be frustrating if the user is iterating around a particular set of quality control checks for a small number of variables.
Plots defined in the Plots section can be disabled or enabled as required. Disabled plots won't be rendered to the screen and this speeds up the process of examining a small group of variables. To disable or enable a plot, right click on the plot name and select the appropriate option, see the screenshot above. Disabled plots are labelled with (disabled), see below.
Running L2
Once the user has finished editing the L2 control file, it can be run by using the Current option of the Run entry on the main menu. The shortcut to run the current control file is Ctrl+R (press and hold down the control key and press the R key).
Output from Running L2
Files Produced During L2
The L2 processing produces 2 output files:
- The L2 netCDF file, which contains the same variables read in at L1 but with data identified as bad by the quality control checks set to -9999. Each variable has a corresponding QC flag whose value represents the quality status of each data value.
- An Excel workbook containing summaries of the quality control flag values for each variable.
Plots Produced During L2
PyFluxPro produces plots of the quality controlled data at L2. An example of one of these plots is given below.
The L2 plots have several components. Each plot has 3 columns and can have one or more rows with each row representing data for a particular variable.
The left-most column is time series plots of the L1 (plotted on the left axis in blue) and L2 (plotted on the right axis in red) data. This allows the user to quickly see the effect of the quality control checks. In the example above, it is clear that the scales of the L1 time series (left Y axis labels in blue) are quite different from the scales of the L2 time series (right Y axis in red) as a result of the rejection of some L1 data by the quality control checks. In particular, the L1 time series for Fe_EP and Fh_EP (plotted in blue) show no seasonal variation. In contrast, the L2 time series for these variables (in red) show the expected seasonal pattern. The quality control checks have removed outliers in the L1 data that obscured the expected seasonal patterns.
Each time series panel has a legend entry in blue in the top left corner of the plot and a legend entry in red in the top right corner of the plot. The blue legend (top left) relates to the L1 data and the red legend (top right) relates to the L2 data. For the L1 legend, the numbers in blue after the variable label and units are the total number of points for this variable read from the L1 file, the number of points present in the L1 file and the number of points missing from the L1 file. For the L2 legend, the numbers in red are the number of points that have passed the L2 quality control checks and the number of points that are either missing or have failed the L2 quality control checks. The difference between the blue and red numbers is an indication of how many points the quality control checks have removed.
The second column shows diel plots (value as a function of time of day) of the variable after the L2 quality control checks. The yellow line is the average value as a function of time of day for the whole data set. The red and blue lines are the maximum and minimum values respectively, also as functions of the time of day. If DiurnalCheck has been used as a quality control check for the variable, the specified range about the mean value is shown as dots (red for the upper limit, blue for the lower limit) on the diel plot, see the top row, Fco2, as an example.
The right-most column is histograms of the quality control flags where the X axis is the quality control flag value and the Y axis is the number of each flag value after the L2 quality control checks. With respect to the example plot above, a flag value of 2 indicates data rejected by RangeCheck and a flag value of 5 indicates data rejected by DiurnalCheck.
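The counts shown in the flag histogram can be reproduced from a list of flag values with a few lines of Python. The flag values below are invented for illustration; only the meanings of 0 (good data), 2 (RangeCheck) and 5 (DiurnalCheck) come from the text above.

```python
from collections import Counter

# Invented QC flags for ten records of one variable
flags = [0, 0, 2, 0, 5, 5, 0, 2, 0, 0]

counts = Counter(flags)
for flag_value in sorted(counts):
    print(flag_value, counts[flag_value])
# 0 6
# 2 2
# 5 2
```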