Level 4 - OzFlux/PyFluxPro GitHub Wiki

Level 4 - Gap filling of meteorology

Overview

Level 4, or L4 for short, is the stage where PyFluxPro fills gaps in the meteorological data. This is a precursor to gap filling the turbulent fluxes at L5, since most methods for gap filling fluxes rely on there being no gaps in the meteorological variables used as drivers. Prior to gap filling the meteorological data at L4, it is usual to concatenate L3 files containing individual years into a single L3 file containing multiple years (see the Wiki section on concatenation). It is also often convenient to generate a climatology file from the concatenated L3 file (see the Wiki section on climatology). The L3 climatology file can then be used to fill gaps at L4 when no alternative data are available.

As with the other processing levels, options for the gap filling process are specified in a control file that can be edited in the PyFluxPro GUI. Templates for L4 control files are in the PyFluxPro/controlfiles/templates/L4 folder.

PyFluxPro uses 2 types of data when filling gaps in the meteorological variables:

Data from alternative sources. This can be data from a paired site, data from a nearby automated weather station (AWS), data from a numerical weather prediction (NWP) model (e.g. ACCESS in Australia) or data from a meteorological reanalysis product (e.g. ERA5 for the globe).
Climatological data from the site.

The primary source of data for gap filling at L4 is the alternative sources, e.g. AWS, ACCESS and ERA5. However, there may be times when all 3 are missing, for example, when gaps in their own data align (very rare but not unknown) or when the AWS, ACCESS and ERA5 data end before the tower data (due to the monthly update cycle). In these cases, the fallback is climatology, which is used to make sure there are no gaps left in the meteorological data.

Before we leave the overview of gap filling, it is worth restating Isaac's First Law (gap filling version):

It is far easier to collect good data from the start than to gap fill bad data.

Background

Gap Filling Meteorological Data

General

Climatological methods for filling gaps in meteorological data work well for gaps of a few days but may work poorly for longer gaps because they are unable to reproduce the variability in meteorological data at synoptic time scales of around 5 to 7 days. Data from alternate sources can work extremely well if the alternate site is within several kilometres of the site being gap filled. Unfortunately, most sites are not part of a pair and, even when an AWS is located near the site, it does not usually measure important quantities such as radiation, soil temperature and soil moisture.

NWP models and reanalysis products have made significant advances in recent years and are now capable of producing time series of meteorological quantities for a specified location that correlate very highly with measurements taken at that location. NWP and reanalysis data is now of such quality that, when used with caution and an appreciation of the increased uncertainty that using this data will cause, it can be used as a source of alternate data when gap filling flux tower records of meteorological quantities. However, it is no substitute for measurements taken at the site, and the best way to ensure good quality data is to avoid gaps in the data. Monitor your site every day, have spare instruments available and budget for emergency trips to your site to replace failed instruments. It is much easier to collect good quality data in the first place than it is to clean up bad data and fill the gaps later.

PyFluxPro can use data from the following alternate sources:

AWS sites;
1. Australia - the Bureau of Meteorology (BoM) has a network of approximately 650 AWS sites that provide measurements of air pressure, air temperature, humidity, wind speed, wind direction and precipitation every 30 minutes. The TERN Ecosystem Processes Central Node (EPCN) makes this data from the nearest 4 AWS available for all active OzFlux sites with monthly updates.
2. Global - the National Oceanic and Atmospheric Administration (NOAA) makes data from approximately 14,000 global AWS sites available in near-real time as the Integrated Surface Database (ISD) with data at intervals of between 30 minutes and 3 hours. EPCN can provide scripts to harvest this data and reformat it for use with PyFluxPro.
NWP model;
1. Australia - the BoM makes data from its ACCESS-G forecast model (12.5 km horizontal resolution at 1 hour time step) available to registered users in near-real time. EPCN makes this data available for all active OzFlux sites with monthly updates. This data set contains radiation, meteorological and soil data for a 3 x 3 cut out (i.e. 9 variables for each quantity) centred on the flux tower.
2. Global - the local weather organisation in many countries has similar NWP forecast data available in near-real time.
ERA5 global reanalysis;
1. Australia - EPCN harvests this product (30 km horizontal resolution at 1 hour time step) every month and makes it available for all active OzFlux sites. There is a 5-day lag behind real-time in the ERA5 data. This data set contains radiation, meteorological and soil data.
2. Global - EPCN can provide scripts to harvest this data and reformat it for use with PyFluxPro.

Removing bias from alternate data

Meteorological data from an alternate source, whether it is from a nearby AWS, an NWP model or the ERA5 global reanalysis, may show significant differences to the same quantities measured at a site. This can be caused by differences in height, by smoothing due to the horizontal resolution of the model or reanalysis, or by local flows caused by terrain. It is important to reduce these biases before using the alternate data to fill gaps in the meteorological data from a flux tower.

PyFluxPro tries to reduce any bias between the alternate and site data as follows:

1. PyFluxPro selects a period, typically 3 months, of data for analysis.
2. PyFluxPro then selects the AWS site that has the highest correlation with the flux tower data for this period. Note that, theoretically, different AWS sites can be chosen for different quantities, e.g. air temperature, humidity, wind speed etc., but this is very rare.
3. PyFluxPro then does a lagged correlation to remove any phase differences between the AWS and the flux tower data. This step is usually omitted for precipitation.
4. PyFluxPro then fits a user specified least squares regression to the AWS and flux tower data and applies this best fit to the AWS data to reduce any bias. This step is usually omitted for precipitation.
5. Steps 2 to 4 are repeated for the ACCESS-G and ERA5 data.
6. The fitted alternate data is then merged with the observations in the order specified by the user.
7. PyFluxPro moves to the next non-overlapping window.
8. Any window with less than a user specified amount of good data (default is 50%) is put aside until all windows are done. PyFluxPro then returns to the unfilled periods, finds the remaining gaps and widens each window, moving the window start 1 day earlier and the window end 1 day later at a time, until the minimum good data criterion is met. It then does the regression between the alternate and flux tower data using this period.

The above methodology is designed to fit the alternate data to the flux tower data over a relatively short period (3 months by default). This improves the correlation between the alternate and flux tower data which in turn reduces the tendency for the regression to flatten the diurnal variation in quantities like radiation, air temperature etc. See Isaac et al 2016 for a detailed description of this process and an analysis of the effect of window length on the correlation between the flux tower and alternate data.
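The lag and regression steps for a single window and a single alternate source can be sketched as follows. This is a minimal illustration, not PyFluxPro's own code: the function names (`shift`, `best_lag`, `fit_alternate`), the use of NaN for missing data, the maximum lag of 6 time steps and the choice of an ordinary least squares fit with slope and offset are all assumptions made for the example.

```python
import numpy as np

def shift(series, lag):
    """Shift a series by lag time steps, marking wrapped samples as missing."""
    shifted = np.roll(np.asarray(series, dtype=float), lag)
    if lag > 0:
        shifted[:lag] = np.nan
    elif lag < 0:
        shifted[lag:] = np.nan
    return shifted

def best_lag(tower, alt, max_lag=6):
    """Return the lag (in time steps) that maximises the correlation with the tower data."""
    best, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        shifted = shift(alt, lag)
        valid = np.isfinite(tower) & np.isfinite(shifted)
        if valid.sum() < 3:
            continue
        r = np.corrcoef(tower[valid], shifted[valid])[0, 1]
        if r > best_r:
            best, best_r = lag, r
    return best, best_r

def fit_alternate(tower, alt, max_lag=6):
    """Remove the phase difference, then fit the alternate data to the tower data."""
    lag, _ = best_lag(tower, alt, max_lag)
    shifted = shift(alt, lag)
    valid = np.isfinite(tower) & np.isfinite(shifted)
    slope, offset = np.polyfit(shifted[valid], tower[valid], 1)
    return slope * shifted + offset, lag, slope, offset

# Synthetic 10-day window at 30 minute resolution (48 steps per day).
t = np.arange(480)
truth = 20.0 + 5.0 * np.sin(2.0 * np.pi * t / 48.0)
alt = 0.8 * np.roll(truth, -2) + 3.0   # biased alternate data that leads by 2 steps
tower = truth.copy()
tower[100:150] = np.nan                # a gap in the tower record
fitted, lag, slope, offset = fit_alternate(tower, alt)
```

The fitted series, not the raw alternate data, is what would be used to fill the gap. Fitting over a short window keeps the correlation high, which is the reason the text gives for using windows of about 3 months.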

The L4 Control File

The L4 control file consists of the following sections:

Files
Options
GUI (optional, batch mode only)
Drivers

The contents of these sections and how to edit them are described below.

The Files Section

Description of the Files section

The Files section allows the user to specify the path to the input and output files, the names of the input and output files and the path for plots generated by the L4 processing. The screenshot below shows the L4 control file for the Calperum example site. This example uses alternate data from AWS sites, the ACCESS-G NWP model, the ERA5 reanalysis and the site climatology to fill gaps in the meteorological drivers.

The entries in the Files section are as follows:

file_path - the path to the data files
in_filename - the input file name
out_filename - the output file name
plot_path - the path for plots generated by the L4 processing
aws - the full path and name of the AWS alternate data file
access - the full path and name of the ACCESS-G alternate data file
era5 - the full path and name of the ERA5 alternate data file
climatology - the full path and name of the site climatology data file

Editing the Files Section

The entries in the Files section can be edited by right clicking on the entry in the Value column or by double clicking on the entry in the Value column and manually entering the required text.

The Options Section

Description of the Options section

The Options section allows the user to specify the options that control some aspects of the L4 processing. A list of the options available can be displayed by right clicking on the Options section title in the Parameter column, see the screenshot below.

The options are as follows:

MaxGapInterpolate - the maximum length, in hours, of gaps to be filled by interpolation, default is 3 hours.
InterpolateType - the type of interpolation, options are Akima (default) and linear.
KeepIntermediateSeries - keep the intermediate series generated during the L4 processing, the default is no.

The KeepIntermediateSeries option can be useful when assessing how well the alternate data matches the flux tower data. Setting this option to yes will result in all of the alternate data series being written to the L4 output file. The user can then open the L4 output file in PyFluxPro and plot time series of the tower and alternate data on the same axes.
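The effect of MaxGapInterpolate can be illustrated with a small sketch that fills only gaps no longer than the limit. The function name, the use of NaN for missing data and the 30 minute time step are assumptions for the example, and linear interpolation is used here for simplicity even though the default InterpolateType is Akima.

```python
import numpy as np

def interpolate_short_gaps(values, max_gap_hours=3, time_step_minutes=30):
    """Linearly interpolate gaps no longer than max_gap_hours; leave longer gaps alone."""
    values = np.asarray(values, dtype=float)
    filled = values.copy()
    good = np.isfinite(values)
    if good.sum() < 2:
        return filled
    max_len = int(max_gap_hours * 60 / time_step_minutes)
    idx = np.arange(len(values))
    n = len(values)
    i = 0
    while i < n:
        if good[i]:
            i += 1
            continue
        start = i
        while i < n and not good[i]:
            i += 1
        end = i  # index of the first good value after the gap, or n
        # Only fill gaps that have good data on both sides and are short enough.
        if start > 0 and end < n and (end - start) <= max_len:
            filled[start:end] = np.interp(idx[start:end], idx[good], values[good])
    return filled
```

With a 30 minute time step, the default of 3 hours corresponds to gaps of up to 6 records in this sketch. Longer gaps are left for the alternate data and climatology methods.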

The Drivers Section

Description of the Drivers section

The Drivers section is where the user specifies the variables to be gap filled, the methods to be used to gap fill each variable and the options for each gap filling method. Each variable to be gap filled is a separate sub-section under the Drivers section and each gap filling method to be used for the variable (GapFillFromAlternate, GapFillFromClimatology) is a separate sub-section under the variable sub-section. Every variable sub-section must also contain an instruction to merge the gap fill data with the original variable to produce the gap filled data (MergeSeries). The following sections describe each entry in the variable sub-section in detail.
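The nesting described above can be pictured as a nested dictionary. The sketch below is only an illustration of the structure, not the control file syntax itself: the variable names follow the Calperum AH example used on this page, the `source` values ("aws", "access", "era5") are assumed to be the names of the matching Files entries, and `check_drivers` is a hypothetical helper that is not part of PyFluxPro.

```python
# Hypothetical nested-dict view of one variable in the Drivers section.
drivers = {
    "AH": {
        "GapFillFromAlternate": {
            "AH_aws": {"source": "aws"},
            "AH_access": {"source": "access"},
            "AH_era5": {"source": "era5"},
        },
        "GapFillFromClimatology": {
            "AH_cli": {"method": "interpolated daily"},
        },
        "MergeSeries": {"source": "AH,AH_aws,AH_access,AH_era5,AH_cli"},
    },
}

GAP_FILL_METHODS = ("GapFillFromAlternate", "GapFillFromClimatology")

def check_drivers(drivers):
    """Return a list of problems: missing MergeSeries or merged series nobody produces."""
    problems = []
    for name, section in drivers.items():
        if "MergeSeries" not in section:
            problems.append(f"{name}: no MergeSeries entry")
            continue
        # The observations plus every series produced by a gap filling method.
        produced = {name}
        for method in GAP_FILL_METHODS:
            produced.update(section.get(method, {}))
        merge_list = [s.strip() for s in section["MergeSeries"]["source"].split(",")]
        for item in merge_list:
            if item not in produced:
                problems.append(f"{name}: {item} is merged but never produced")
    return problems
```

A check like this reflects the rule that every variable sub-section needs a MergeSeries entry, and additionally flags a merge list that refers to a series no method creates.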

GapFillFromAlternate

The screenshot below shows the use of the GapFillFromAlternate method for absolute humidity, AH, from the Calperum example site.

The entries under the GapFillFromAlternate sub-section are as follows:

AH_aws - the name of the variable that will contain the gap filling data generated from the AWS data;
1. source - the source of the gap filling data. In this example, the source of the gap filling data is the file (../PFP_examples/Calperum/Data/AWS/Calperum_AWS.nc) corresponding to the aws entry in the Files section.
AH_access - the name of the variable that will contain the gap filling data generated from the ACCESS data;
1. source - the source of the gap filling data. In this example, the source of the gap filling data is the file (../PFP_examples/Calperum/Data/ACCESS/Calperum_ACCESS.nc) corresponding to the access entry in the Files section.
AH_era5 - the name of the variable that will contain the gap filling data generated from the ERA5 data;
1. source - the source of the gap filling data. In this example, the source of the gap filling data is the file (../PFP_examples/Calperum/Data/ERA5/Calperum_ERA5.nc) corresponding to the era5 entry in the Files section.

In this example, gap filling data will be prepared from the AWS, ACCESS and ERA5 data by fitting the AWS, ACCESS and ERA5 data to the tower data and writing the fitted data to the variables AH_aws, AH_access and AH_era5 respectively. These variables will be merged with the observations later in the gap filling process, see the MergeSeries section.

GapFillFromClimatology

The screenshot below shows the use of the GapFillFromClimatology method for absolute humidity, AH, from the Calperum example site.

The entries under the GapFillFromClimatology sub-section are as follows:

AH_cli - the name of the variable that will contain the gap filling data generated from the site climatology;
1. method - the method used to produce the climatological data. The only option available at present is interpolated daily, see the Climatology section for a description of this method.

MergeSeries

The screenshot below shows the use of the MergeSeries method for absolute humidity, AH, from the Calperum example site.

The entry under the MergeSeries sub-section is as follows:

source - a comma separated list of variables to be merged to form the AH variable. The variables in this list are merged in order from left to right. In this example, the first variable is AH, the observations. Missing data in the observations are then replaced with AH_aws, or AH_access if the AWS data is missing, or AH_era5 if both AWS and ACCESS data are missing or AH_cli if all 3 are missing.
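The left to right merge can be sketched in a few lines. This is an illustration under assumptions, not PyFluxPro's implementation: missing data are represented by NaN, all series share the same time axis, and `merge_series` is a hypothetical name.

```python
import numpy as np

def merge_series(series, order):
    """Fill gaps in the first series using the remaining series, left to right."""
    merged = np.array(series[order[0]], dtype=float)
    for name in order[1:]:
        candidate = np.asarray(series[name], dtype=float)
        gaps = ~np.isfinite(merged)
        # Only points that are still missing are taken from the next source.
        merged[gaps] = candidate[gaps]
    return merged

nan = float("nan")
series = {
    "AH":        [8.0, nan, nan, nan, nan],
    "AH_aws":    [8.1, 7.9, nan, nan, nan],
    "AH_access": [8.3, 8.2, 7.5, nan, nan],
    "AH_era5":   [8.4, 8.0, 7.7, 7.2, nan],
    "AH_cli":    [8.0, 8.0, 8.0, 8.0, 8.0],
}
order = "AH,AH_aws,AH_access,AH_era5,AH_cli".split(",")
merged = merge_series(series, order)
```

In this example each point is taken from the first series in the list that has data there, so the observations always take priority and climatology is only used when everything else is missing.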

Editing of the Drivers section

Editing the contents of the Drivers section is similar to editing other sections in the L4 control file. Items can be added to or removed from the section using a context-sensitive menu that is displayed when the user right clicks on the section or sub-section titles in the Parameter column. Entries in the Value column can be edited by double clicking on the text in the Value column and editing the text.

Removing a Variable

Variables can be removed from the Drivers section by right clicking on the variable name and selecting Remove variable, see the screenshot below.

Adding a New Variable

New variables can be added to this section by right clicking on the Drivers section title in the Parameter column and selecting Add variable from the displayed context menu, see the screenshot below.

The new variable is added after the last entry in the Drivers section and is given the name <var>. You can change the position of the new variable in the Drivers section by selecting it and dragging to the new location. The new variable is added with all gap filling methods and MergeSeries, see the screenshot below.

Editing a New Variable

Unwanted gap filling methods can be removed by right clicking on the gap filling method section (GapFillFromAlternate or GapFillFromClimatology) and selecting Remove method, see the screenshot below.

Adding a Method to an Existing Variable

A gap filling method can be added to an existing variable when required. To add a gap filling method to an existing variable, right click on the variable name in the Drivers section and select the gap filling method from the context menu, see the screenshot below.

Running L4

Once the user has finished editing the L4 control file, it can be run by using the Current option of the Run entry on the main menu. The shortcut to run the current control file is Ctrl+R (press and hold down the control key and press the R key).

The GapFillFromAlternate method has a second stage of user input when it is run. The second stage allows the user to specify options for this method, these are explained in the next section.

The GapFillFromAlternate GUI

Running the L4 control file with the GapFillFromAlternate gap filling method will bring up a small GUI that allows the user to specify the options for this gap filling method, see the screenshot below. All of the option fields are filled out with default values that are expected to work well in most situations but it is recommended that the user check the results of this gap filling method and change the options if required.

The top 2 rows show the start and end date of the data set being gap filled.

The bottom row contains 3 buttons. The Run button starts the gap fill process once the user has modified the options if required. The Quit button allows the user to quit from the L4 gap filling process (before Run is pressed). Once the L4 gap filling process has completed and the user is happy with the results, pressing the Done button completes the gap filling process, merges the gap fill data with the observations and writes the L4 output file.

The GapFillFromAlternate GUI allows the user to run the L4 gap filling in either manual mode, where the user specifies the periods to be processed, or automated mode, where PyFluxPro steps sequentially through the data set using the options specified in the GUI.

The Manual radio button controls whether the L4 processing will run in manual or automatic mode.

Manual mode

To run in manual mode:

Check the Manual radio button.
Enter the start and end dates of the period to be gap filled in the Start date and End date text entry boxes.
Click the Run button. PyFluxPro will run the L4 gap filling for the specified period.
You can change the start and end dates and run the L4 processing for as many periods as required.
Click the Done button when all required periods have been gap filled.

Automatic mode

To run in automatic mode:

Check either the Months or the Days radio button depending on the window size you want to specify. The default is Months.
Change the text in the Months or Days text entry boxes to the required window size. The default is 3 months.
Click the Run button. PyFluxPro will run through the whole data set using the specified window size.
When PyFluxPro has finished the gap filling and you're happy with the results, click the Done button.
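Stepping through a data set in non-overlapping windows of a given number of months can be sketched as below. The helper names are hypothetical, and the handling of the final, shorter window (simply truncated at the end of the data) is an assumption rather than a description of what PyFluxPro does.

```python
import calendar
from datetime import datetime

def add_months(dt, months):
    """Return dt moved forward by a number of calendar months, clamping the day."""
    index = dt.month - 1 + months
    year, month = dt.year + index // 12, index % 12 + 1
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)

def month_windows(start, end, months=3):
    """Split [start, end) into consecutive, non-overlapping windows."""
    windows = []
    left, k = start, 0
    while left < end:
        k += 1
        # Window edges are computed from the original start to avoid drift.
        right = min(add_months(start, k * months), end)
        windows.append((left, right))
        left = right
    return windows

windows = month_windows(datetime(2014, 1, 1), datetime(2014, 8, 15))
```

Each window would then be processed in turn using the lag, regression and merge steps described in the Background section.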

Remaining options

The remaining options on the GapFillFromAlternate GUI are as follows:

Min pts (%) - this is the minimum percentage of good data that must be present in the window being filled. Specifying this value ensures the regression between the flux tower and alternate data is done on an adequate sample of good data.
Auto complete - specifies how to treat periods that do not contain sufficient good data (see Min pts (%)). When Auto complete is checked, PyFluxPro will set aside any periods with less than Min pts (%) good data and continue with the next period. Once all periods have been processed, PyFluxPro will return to those set aside and will extend these in increments of 2 days (start 1 day earlier, end 1 day later) until Min pts (%) is satisfied. This ensures that the shortest window length is used while still satisfying the Min pts (%) criterion.
Show plots - controls display of the plots produced by GapFillFromAlternate. Plots are not displayed to the screen when this box is unchecked but hard copies of the plots (PNG files) are still produced. By default, PyFluxPro only displays plots when the data period being processed contains gaps. Checking the Plot all check box will cause plots to be displayed for all periods regardless of whether the target contains gaps or not.
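The interaction between Min pts (%) and Auto complete can be sketched as a loop that widens a window until it holds enough good data. The function name, the boolean `good` array and the 48 records per day (30 minute data) are assumptions made for the example.

```python
import numpy as np

def extend_window(good, start, end, min_pct=50.0, steps_per_day=48):
    """Widen [start, end) by 1 day at each end until min_pct of the data are good."""
    n = len(good)
    while True:
        pct = 100.0 * good[start:end].mean()
        if pct >= min_pct:
            return start, end, pct
        if start == 0 and end == n:
            # The whole record has been used and the criterion still fails.
            return start, end, pct
        start = max(0, start - steps_per_day)
        end = min(n, end + steps_per_day)

# 20 days of 30 minute data with a 5 day gap from day 5 to day 10.
good = np.ones(48 * 20, dtype=bool)
good[48 * 5:48 * 10] = False
start, end, pct = extend_window(good, 48 * 5, 48 * 10)
```

Because the window grows by only 2 days per iteration, the loop stops at the shortest window that satisfies the criterion. In this example that is an 11 day window with 6 days of good data.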

Output from Running L4

Files Produced During L4

The L4 processing produces 2 output files:

The L4 netCDF file containing the same variables as the L3 netCDF file but with the gaps in the variables specified in the control file filled using alternate data or climatology.
An Excel workbook containing a sheet for each alternate variable with fit statistics for each period in the gap filling process.

Plots Produced During L4

Plots of the intermediate data produced during the gap filling process are displayed on the screen and, by default, hard copies of the plots are saved as PNG files. The type of plot depends on the gap filling method; the plots are described below.

Coverage Plot

The L4 processing always displays a coverage plot of the variables being gap filled before applying the gap fill methods, see the screenshot below.

The coverage plot shows a time line for each variable with gaps in the data represented by gaps in the time lines. The variable names are shown on the left hand Y axis and the percentage of good data present in each variable being gap filled is shown on the right hand Y axis. The X axis is the date. The example shown above, from the Calperum OzFlux site, shows 9 years of data with a large gap (~3 months) in early 2014 when the flux tower and instruments were damaged by a wildfire.
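The percentages on the right hand Y axis are simply the fraction of good records in each variable. A minimal sketch, assuming missing data are stored as NaN and using a hypothetical helper name:

```python
import numpy as np

def coverage_percent(series):
    """Return the percentage of non-missing records for each variable."""
    return {
        name: 100.0 * np.isfinite(np.asarray(values, dtype=float)).mean()
        for name, values in series.items()
    }

nan = float("nan")
coverage = coverage_percent({
    "Ta": [20.1, 20.3, nan, 19.8],
    "AH": [8.0, nan, nan, 7.5],
})
```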

The progress of the L4 gap filling process is shown by thick lines plotted over the thin time lines of the original coverage plot, see the screenshot below.

Variable Plot for Each Window

PyFluxPro produces plots of the flux tower and alternate data for each period used in the gap filling process. A single plot window is updated as the results become available for each variable, see the example below.

The L4 plot consists of several elements:

The top section consists of time series plots of the alternate data (blue lines). The bottom time series plot shows the tower data (red dots) and the fitted alternate data (green line). The example shown is for Ta and has the ERA5 data in the top time series, the ACCESS data in the time series second from the top and the AWS data in the time series third from the top.
The bottom left plot is a scatter plot of the observations (Y axis) versus the gap filling data (X axis). The line of best fit is shown as a dashed, red line and the equation of best fit is given at the top of the scatter plot.
The centre plot in the bottom row is the diel variation of the data for the period shown in the time series. The red dots are the observations, the blue line is the alternate data and the green line is the best fit of the alternate data to the observations.
The bottom right section of the plot gives details of the gap filling parameters used for this variable and period and statistics for the agreement between the gap fill data and the observations.