Using EAT for data assimilation with GOTM

WARNING: this page describes a pre-release (alpha) version of EAT. Functionality, names of executables and APIs, and the arguments they take can all still change before the final public release.

Installing

Requirements

Instead of manually installing these different components on your system, it is often easier to install all at once using Anaconda. This has the additional benefit of guaranteeing all packages are compatible. Installation with Anaconda is discussed in the following section. If you have already made sure your system meets all above requirements, you can skip most of the steps below and start with bash ./install.

Using conda

You will need the Anaconda Python distribution. On many systems that is already installed: try running conda --version. If that fails, you may need to load an anaconda module first: try module load anaconda or module load anaconda3. If that still does not give you a working conda command, you may want to install Miniconda.

Now make sure your conda environment is initialized:

conda init bash

This needs to be done just once, as it modifies your .bashrc, which is sourced every time you log in. After this, restart your shell by logging out and back in.

Now obtain the repository with setups and scripts:

git clone --recursive https://github.com/BoldingBruggeman/eat.git
cd eat
conda env create -f environment.yml
conda activate eat
bash ./install

Note that the above compiles EAT with GOTM and FABM in their default configuration. If you want to customize this, e.g., by including external codebases with additional biogeochemical models in FABM, you can specify additional arguments to cmake by editing the install script. Typically you would change the line with cmake ${REPO_DIR} to something like:

cmake ${REPO_DIR} -DFABM_INSTITUTES="<NAME1>;<NAME2>;..." -DFABM_<NAME1>_BASE=<DIR1> -DFABM_<NAME2>_BASE=<DIR2> ...

If required, you can also point EAT to use another FABM codebase by adding -DFABM_BASE=<FABMDIR> at the end of the same line.

Staying up to date

To update this repository including its submodules (PDAF, GOTM), make sure you are in the eat directory and execute:

git pull --recurse-submodules
git submodule update --init --recursive
conda env update -f environment.yml
conda activate eat
bash ./install

If you do not use Anaconda, you can skip the two conda commands in the above.

Performing a GOTM simulation with data assimilation

To begin, you need a working GOTM setup and observations that you would like to assimilate.

If you have used conda to set up EAT, always start with activating the EAT Python environment: conda activate eat.

EAT is MPI-based and uses separate processes for observation handling, the data assimilation filter, and the ensemble members that perform the model simulation. You start all of these at once with the mpiexec command:

mpiexec -n 1 eatpy-gotm-obs --start "2016-01-01 12:00:00" --stop "2019-12-31 12:00:00" -o temp[-1] cci_sst.dat  : -n 1 eatpy-filter : -n 10 eat_model_gotm

The different components in this line are explained in the sections below.

If you run on an HPC cluster, you typically will want to put the above statement in a job submission script, so that the simulations run on the compute nodes rather than on the login nodes. An example of a SLURM script that does this:

#!/bin/bash
#SBATCH --account=tra21_seamless
#SBATCH --partition=g100_usr_prod
#SBATCH --ntasks 12
#SBATCH --time=01:00:00
#SBATCH --output=%j.out
#SBATCH --error=%j.err

conda activate eat
export PYTHONUNBUFFERED=TRUE
mpiexec -n 1 eatpy-gotm-obs --start "2016-01-01 12:00:00" --stop "2019-12-31 12:00:00" -o temp[-1] cci_sst.dat  : -n 1 eatpy-filter : -n 10 eat_model_gotm

Note that the allocated number of tasks (#SBATCH --ntasks 12) must match the number of processes started by mpiexec (twelve in total: 1 for eatpy-gotm-obs, 1 for eatpy-filter, 10 for eat_model_gotm).

Setting up an initial ensemble

Ensemble members can differ in their initial state, which is set via GOTM's restart file (restart.nc in a stand-alone simulation), or in their configuration (gotm.yaml in a stand-alone simulation). The model is told which of these two options (or both!) is to be used with a namelist file:

&nml_eat_model
   !default verbosity is info
   !verbosity=2048 ! silent=64, fatal=128, error=256, warn=512, info=1024, debug=2048
   !all model members are verbose
   !all_verbose=.true. ! default is .false.
   shared_gotm_yaml=.true.
   shared_restart_file=.true.
/

This file must be inside your GOTM setup directory and be named eat_gotm.nml.

In the above, shared_gotm_yaml specifies whether all ensemble members share the same configuration (gotm.yaml), whereas shared_restart_file specifies whether all ensemble members share the same initial state. If you set shared_gotm_yaml to .false., GOTM will look for one configuration file per ensemble member; these need to be named gotm_<NNNN>.yaml, where <NNNN> is 0001 for the first ensemble member, 0002 for the second, etc. If you set shared_restart_file to .false., GOTM will look for one restart file per ensemble member; these need to be named restart_<NNNN>.nc, where <NNNN> is 0001 for the first ensemble member, 0002 for the second, etc.

Note that if you want to use different restarts for the different ensemble members, GOTM needs to be configured to use restarts: set restart/load to true in gotm.yaml.
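
If you set shared_gotm_yaml to .false., the per-member configuration files can be generated with a short script. The sketch below is illustrative only: it assumes the PyYAML package is available and perturbs a hypothetical wind scale factor; replace the key path with whatever setting you want to vary between members.

import copy

import numpy
import yaml  # PyYAML, assumed to be installed

N = 10  # ensemble size
rng = numpy.random.default_rng(42)

with open('gotm.yaml') as f:
    config = yaml.safe_load(f)

for i in range(1, N + 1):
    member = copy.deepcopy(config)
    # hypothetical example: perturb a wind scale factor; adjust the key path to your needs
    member['surface']['u10']['scale_factor'] = float(rng.lognormal(mean=0.0, sigma=0.1))
    with open(f'gotm_{i:04d}.yaml', 'w') as f:
        yaml.safe_dump(member, f)

Note that PyYAML does not preserve comments in gotm.yaml; a package such as ruamel.yaml can be used if you want to keep them.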

Generating restart files

If you want to use different initial states per ensemble member (shared_restart_file=.false. in eat_gotm.nml), you first need a restart file (restart.nc) for your stand-alone simulation. If you do not have this file yet, you can generate it by first running (part of) your setup with the normal GOTM executable. If you want the restart to reflect the model state at the very start, you should (temporarily!) set the stop time equal to the start time in gotm.yaml for the simulation in which you generate the restart file.

After you have the initial restart file, you can generate perturbed versions of it for each ensemble member. There are many different ways this could be done. EAT contains a simple example in which independent scale factors are applied to every variable that is part of the model state (and at every depth, for depth-explicit variables). These scale factors are drawn from a log-normal distribution. To try this, use the eatpy-gotm-gen tool, like this:

eatpy-gotm-gen restart.nc <N> --exclude z --exclude zi --exclude h

In the above, <N> specifies the number of ensemble members, which would be 10 for the example above. --exclude (or -e) specifies variables that should not be perturbed. Typically you will want to do this for variables associated with GOTM's vertical coordinate, that is, z, zi and h as in the above.
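
If you prefer to script the perturbation yourself, for instance to use a different perturbation scheme, the sketch below does something similar. It assumes the netCDF4 Python package is available; the file names, the excluded variables and the log-normal parameters are illustrative.

import shutil

import numpy
import netCDF4  # assumed to be installed

N = 10                              # ensemble size
exclude = {'z', 'zi', 'h', 'time'}  # coordinate-like variables that should not be perturbed
rng = numpy.random.default_rng(42)

for i in range(1, N + 1):
    name = f'restart_{i:04d}.nc'
    shutil.copyfile('restart.nc', name)
    with netCDF4.Dataset(name, 'r+') as nc:
        for varname, ncvar in nc.variables.items():
            if varname in exclude or not numpy.issubdtype(ncvar.dtype, numpy.floating):
                continue  # skip excluded and non-floating-point variables
            values = ncvar[:]
            # independent log-normal scale factor per element (e.g., per depth level)
            ncvar[:] = values * rng.lognormal(mean=0.0, sigma=0.1, size=values.shape)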

WARNING: upon completion, GOTM simulations overwrite the initial restart file(s) with the state of the model at the end of the simulation! Thus, you typically want to store your restart files separately, for instance in a subdirectory within your GOTM setup. You then copy these to your setup directory every time you want to start a simulation.
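
A small helper like the one below (the restarts/ subdirectory name is just an example) can copy the pristine per-member files back into the setup directory before each run:

import glob
import os
import shutil

# copy the stored per-member restart files back into the current (setup) directory
for path in glob.glob(os.path.join('restarts', 'restart_*.nc')):
    shutil.copy(path, os.path.basename(path))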

Observations

File format

Each type of observed variable (e.g., temperature, chlorophyll, oxygen concentration) must be described in a separate file. The format of this file depends on whether the variable is depth-independent (e.g., sea surface temperature, benthic detritus) or depth-dependent.

Depth-independent variables

Each line of the observation file must contain tab-separated values for the time (GMT), the observed variable, and its standard deviation. For instance:

2015-05-01 12:00:00	11.12	0.32

This could describe an observed sea surface temperature of 11.12 °C (s.d. 0.32 °C) on 1 May 2015, at noon.

Depth-dependent variables

Each line of the observation file must contain tab-separated values for the time (GMT), the depth at which the variable was observed (in meters above mean sea level), the observed variable, and its standard deviation. For instance:

2015-05-01 12:00:00	-15.	11.12	0.32

This could describe an observed temperature of 11.12 °C (s.d. 0.32 °C) at a depth of 15 m below mean sea level, on 1 May 2015, at noon. Note that as in GOTM, the depth axis is directed upwards, with a reference depth equal to mean sea level. The observed depth will therefore nearly always be negative (i.e., below mean sea level); the only exception is when sampling near the surface during high tide.
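
Such files are plain text, so they can be written with any tool. The sketch below writes a depth-dependent file in the format described above; the file name and the observation values are made up for illustration (for a depth-independent variable, simply omit the depth column):

import datetime

# (time in GMT, depth in m above mean sea level, observed value, standard deviation)
observations = [
    (datetime.datetime(2015, 5, 1, 12, 0, 0), -15.0, 11.12, 0.32),
    (datetime.datetime(2015, 5, 1, 12, 0, 0), -30.0, 10.54, 0.32),
]

with open('temp_profiles.dat', 'w') as f:
    for time, z, value, sd in observations:
        f.write(f'{time:%Y-%m-%d %H:%M:%S}\t{z}\t{value}\t{sd}\n')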

Specifying the set of observed variables

When you perform data assimilation, the set of observations to use is set via command line arguments to eatpy-gotm-obs. Each observed variable is specified with a separate --obs argument (or -o in short), which must be followed by the name of the variable as it is known to GOTM-FABM (the same name that would be used for that variable in GOTM's own NetCDF output), and then the path of the observation file. For instance:

-o temp[-1] cci_sst.dat

Note that in this case, the variable temp is followed by a depth index ([-1]). As usual in Python, negative indices count down from the end of the array. Thus -1 specifies the last depth index. Since GOTM orders its depth axis from bottom to surface, this indicates the value of temp closest to the surface - i.e., the sea surface temperature. Note that because of this depth indexing, the cci_sst.dat file is expected to be depth-independent.

Additional options

You can direct the observation handler (eatpy-gotm-obs) to only use observations from a specific time period. This can be useful if your observation file contains a time series that extends beyond the time period covered by your GOTM setup, or if you want to include a period during the simulation in which the model runs freely, for instance, in forecast-only mode after a period of data assimilation. To configure this, you can specify --start and/or --stop:

eatpy-gotm-obs --start "2016-01-01 12:00:00" --stop "2016-12-31 12:00:00"

The data assimilation method

Currently the PDAF data assimilation method is configured in the file eat_pdaf.nml, which must be present in your GOTM setup directory. By default, this file contains:

&nml_filter_pdaf
   !default verbosity is info
   !verbosity=2048 ! silent=64, fatal=128, error=256, warn=512, info=1024, debug=2048
/

&nml_config_pdaf
   screen=1,
   filtertype=1,
   subtype=0
/

3DVar

To use PDAF with 3D-Var, you need to provide custom routines for forward and adjoint covariance transformation (or their ensemble counterparts). This is done through a plugin, implemented in a stand-alone Python file with the following template:

from typing import Mapping, Any
import numpy
import eatpy.filter

class Cvt(eatpy.filter.CvtHandler):

    def initialize(self, variables: Mapping[str, Any], ensemble_size: int):
        # Here you might add routines that read the square root of the error covariance matrix (Vmat_p) from file
        self.Vmat_p = ...

    def cvt(self, iter: int, state: numpy.ndarray, v_p: numpy.ndarray) -> numpy.ndarray:
        """Forward covariance transformation for parameterized 3D-Var"""
        return self.Vmat_p @ v_p

    def cvt_adj(self, iter: int, state: numpy.ndarray, Vv_p: numpy.ndarray) -> numpy.ndarray:
        """Adjoint covariance transformation for parameterized 3D-Var"""
        return self.Vmat_p.T @ Vv_p

    def cvt_ens(self, iter: int, state: numpy.ndarray, v_p: numpy.ndarray) -> numpy.ndarray:
        """Forward covariance transformation for ensemble 3D-Var"""
        if iter == 1:
            Vcv_p = state.mean(axis=0)
            # mcols_cvec_ens (the number of control-vector columns per ensemble member)
            # is assumed to be defined by your plugin, e.g., in initialize
            self.Vmat_ens_p = numpy.repeat((state - Vcv_p) / numpy.sqrt(v_p.size - 1), mcols_cvec_ens, axis=0)
        return self.Vmat_ens_p @ v_p

    def cvt_adj_ens(self, iter: int, state: numpy.ndarray, Vv_p: numpy.ndarray) -> numpy.ndarray:
        """Adjoint covariance transformation for ensemble 3D-Var"""
        return self.Vmat_ens_p.T @ Vv_p

To make EAT use your plugin, you need to specify it on the command line when calling eatpy-filter:

eatpy-filter --plugin cvt.Cvt

Here, we assume the plugin is implemented in the file cvt.py, placed in the same directory from where eatpy-filter is run.

Plugging into the filter to add new functionality

The observations and the model state come together in the filter component (eatpy-filter). This is therefore the natural place to insert new functionality, for instance, applying transforms to selected variables (log, logit, etc.), saving or analyzing the model state before and after the data assimilation step, or preventing selected model variables from being changed by the filter. This is possible through plugins.

The filter component is written in Python (it in turn wraps PDAF for the actual filtering algorithms) and allows you to hook in your own custom Python code that receives all information on the observations, the model state and the filter - during every data assimilation step. This custom code is allowed to modify the model state and/or the observations in place; it can thus be used for a wide range of purposes.

An example of a relatively simple plug-in that prints statistics of temperature across the ensemble to screen:

from typing import Mapping, Any
import datetime
import numpy
import eatpy.shared

class TestPlugin(eatpy.shared.Plugin):
    def initialize(self, variables: Mapping[str, Any], ensemble_size: int):
        self.logger.info('The model state consists of the following variables: {}'.format(', '.join(variables)))
        self.temp_start = variables['temp']['start']
        self.temp_stop = self.temp_start + variables['temp']['length']

    def before_analysis(self, time: datetime.datetime, state: numpy.ndarray, iobs: numpy.ndarray, obs: numpy.ndarray, filter: eatpy.shared.Filter):
        temp = state[:, self.temp_start:self.temp_stop]  # note: the first axis is for the ensemble members
        sd = numpy.std(temp, axis=0)
        self.logger.info('{} before analysis:'.format(time))
        self.logger.info('  Temperature range: {} - {}'.format(temp.min(), temp.max()))
        self.logger.info('  Temperature s.d. across the ensemble: {} - {}'.format(sd.min(), sd.max()))

    def after_analysis(self, state: numpy.ndarray):
        temp = state[:, self.temp_start:self.temp_stop]  # note: the first axis is for the ensemble members
        sd = numpy.std(temp, axis=0)
        self.logger.info('after analysis:')
        self.logger.info('  Temperature range: {} - {}'.format(temp.min(), temp.max()))
        self.logger.info('  Temperature s.d. across the ensemble: {} - {}'.format(sd.min(), sd.max()))

The above methods in short: initialize is called once, before the simulation starts, and receives a description of the model state (a mapping with, per variable, metadata such as its start offset and length within the state vector) and the ensemble size. before_analysis is called at every data assimilation time, just before the filter runs, and receives the current time, the forecast state of all ensemble members, the observations, and the filter object. after_analysis is called directly after the filter has updated the state and receives the analysis state of all ensemble members.

You should place your plug-in code in a separate Python file (i.e., outside EAT), in a location that can be accessed by Python when run from your GOTM setup directory. You can then direct the filter component to activate your plugin using the --plugin (-p) command line argument. For instance, if your script is called test_plugin.py and it is located inside your GOTM setup directory, you can run the filter with:

eatpy-filter --plugin test_plugin.TestPlugin

Here, test_plugin is the name of your Python module (the name of your Python file), whereas TestPlugin is the name of your plug-in class inside this file. The latter must inherit from eatpy.shared.Plugin. In the above example, the plug-in is specified as a Python class. You can also specify it as an instance of such a class, in which case you can provide arguments to your class. Examples of this are given in the next section.

If you use multiple plugins, their before_analysis methods will be called in the order in which you specify the plug-ins on the command line. This matters, as individual plugins may modify the model state or the observations; the next plug-in in line then sees that modified state. The order of the calls is reversed for after_analysis. This means you could use an "inner" plugin (specified last on the command line) to, e.g., log-transform a state variable: the forecast state and observations are transformed in before_analysis, and the transform is reversed for the analysis state in after_analysis, while all other plugins see the untransformed values of the variable.
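
As an illustration, a minimal sketch of such a log-transforming "inner" plugin is shown below. It follows the plugin interface used in the example above; the variable name passed to the constructor is hypothetical, and the interpretation of iobs as the indices of the observed elements within the state vector is an assumption made for this sketch.

from typing import Mapping, Any
import datetime
import numpy
import eatpy.shared

class LogTransform(eatpy.shared.Plugin):
    def __init__(self, name: str):
        self.name = name  # name of the state variable to transform

    def initialize(self, variables: Mapping[str, Any], ensemble_size: int):
        self.start = variables[self.name]['start']
        self.stop = self.start + variables[self.name]['length']

    def before_analysis(self, time: datetime.datetime, state: numpy.ndarray, iobs: numpy.ndarray, obs: numpy.ndarray, filter: eatpy.shared.Filter):
        # log-transform the forecast state of the selected variable for all ensemble members
        state[:, self.start:self.stop] = numpy.log(state[:, self.start:self.stop])
        # also transform observations of this variable
        # (assumption: iobs holds the indices of the observed elements in the state vector)
        # note: observation uncertainties are not adjusted in this simple sketch
        affected = (iobs >= self.start) & (iobs < self.stop)
        obs[affected] = numpy.log(obs[affected])

    def after_analysis(self, state: numpy.ndarray):
        # reverse the transform so the analysis state is returned in its original units
        state[:, self.start:self.stop] = numpy.exp(state[:, self.start:self.stop])

You would specify such a plugin last on the command line, after any plugins that should operate on the untransformed values of the variable.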

Example plug-ins

EAT already includes two plug-ins: