awips-ml Wiki

This wiki contains information about using awips-ml.

Interacting with the containers

Sometimes it is useful to step into the containers to check on logs, files, etc. To do this, run the following exec command (this command assumes bash is installed in the container being accessed):

docker exec -it [container_name] bash

To view running containers, run docker ps which (if containers are running) should return something like this:

CONTAINER ID   IMAGE              COMMAND                  CREATED          STATUS          PORTS                                                                                                  NAMES
59abc5e97a8a   awips-ml_edex      "/usr/sbin/init"         31 minutes ago   Up 31 minutes   0.0.0.0:388->388/tcp, :::388->388/tcp, 0.0.0.0:9581-9582->9581-9582/tcp, :::9581-9582->9581-9582/tcp   edexc
e060cf1bb651   awips-ml_tf        "/usr/bin/tf_serving…"   31 minutes ago   Up 31 minutes   0.0.0.0:8500-8501->8500-8501/tcp, :::8500-8501->8500-8501/tcp                                          tfc
f3bfa05deec5   awips-ml_process   "python server/edex_…"   31 minutes ago   Up 31 minutes                                                                                                          processc

You can find [container_name] in the NAMES column of the output.

Modifying the containers

In general, anytime you modify any files, the containers need to be rebuilt. To do this, run the following commands:

docker-compose down
docker-compose build
docker-compose up

A heavier-duty, more "blunt" approach is docker system prune, which deletes several types of data stored by Docker. See the Docker documentation here.

Customization

awips-ml provides several interfaces for users to customize data pre/post-processing within the processc container. These interfaces are found in the /usr/ folder and their functionality is described below:

  • environment.yml: This file lets users include their own custom conda environment within the container.
  • preproc.py: This is a custom pre-processing script that is invoked each time a new file arrives in the processc container, before the data is sent to the model. The script receives the numpy array for the relevant variable from the transmitted netCDF file. Its output should be a numpy array that matches the expected input dimensions of the TensorFlow model hosted in tfc (see the sketch after this list).
  • postproc.py: This is a custom post-processing script that is invoked each time model output is returned from the machine learning model in the tfc container. The script receives the numpy array output by the model. Its output should be a numpy array whose dimensions match those of the original netCDF file downloaded from the EDEX container.
  • custom_processc_script.sh: This is a bash script that lets users run their own custom commands to modify the processc container to accommodate their data processing workflows.
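
As a rough illustration, a pre/post-processing pair might look like the sketch below. The function names and signatures here are assumptions for illustration only; the actual interface is defined by the template scripts shipped in the /usr/ folder.

# Hypothetical preproc.py/postproc.py sketch -- the function names below are
# assumptions; match them to the interface of the template scripts in /usr/.
import numpy as np

def preprocess(data: np.ndarray) -> np.ndarray:
    # Example: normalize the values and add batch/channel dimensions so the
    # array matches the input shape expected by the model served in tfc.
    data = (data - np.nanmin(data)) / (np.nanmax(data) - np.nanmin(data))
    return data[np.newaxis, ..., np.newaxis]

def postprocess(prediction: np.ndarray) -> np.ndarray:
    # Example: drop the batch/channel dimensions so the result matches the
    # dimensions of the original netCDF variable downloaded from edexc.
    return np.squeeze(prediction)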

awips-ml also offers two ways for users to customize the machine learning model endpoint that is deployed in tfc. Users can:

  1. Include a model generating script in tfc/etc that generates the model from scratch, or
  2. Include the pre-trained model weights from a model.save() command in the tfc/user_model folder.

More instructions on how to use this functionality are given in tfc/Dockerfile.
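
As a rough example, a minimal model-generating script (option 1 above) might look like the sketch below. The architecture and output path are placeholders rather than required values; consult tfc/Dockerfile for the paths the build actually expects.

# Sketch of a model-generating script; the architecture and save path below
# are placeholders only.
import tensorflow as tf

# Trivial stand-in model -- replace with your own architecture, or load a
# pre-trained model instead (option 2 above).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, None, 1)),
    tf.keras.layers.Conv2D(1, kernel_size=3, padding="same"),
])

# TensorFlow Serving requires a version subdirectory (see the tfc section below).
model.save("tfc/user_model/1/")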

Sometimes it is useful to expose the TensorFlow model for testing purposes. Currently awips-ml uses a docker network for all inter-container networking, which means that no ports are visible outside of the docker network namespace by default. To expose the model ports in the tfc container, add these lines to the docker-compose.yml file under the tf section:

ports:
  - 8500:8500
  - 8501:8501

This will allow users to send data from the host OS (outside of the docker network namespace) over these ports: 8500 for gRPC and 8501 for the REST API. awips-ml uses the REST API.
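
With the ports exposed, the model can be queried directly from the host for testing. The sketch below assumes the model is served under TensorFlow Serving's default name model (matching the /models/model path used in tfc/Dockerfile) and uses a placeholder input shape; adjust both to your deployment.

# Minimal REST test against the exposed TensorFlow Serving endpoint.
# The model name "model" and the input shape are assumptions; adjust them
# to match your deployment.
import json
import numpy as np
import requests

payload = json.dumps({"instances": np.zeros((1, 64, 64, 1)).tolist()})
response = requests.post(
    "http://localhost:8501/v1/models/model:predict",
    data=payload,
    headers={"Content-Type": "application/json"},
)
print(response.json())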

Configuration

awips-ml is composed of three containers plus some supporting directories and files, all of which are configurable according to user needs.

  • edexc: this is the container that runs the actual EDEX server.
  • processc: this is the container that takes data ingested by edexc, preprocesses it before sending it to tfc, and post-processes the data received from tfc before sending it back to edexc.
  • tfc: this is the container where the TensorFlow machine learning model exists.
  • server: this directory has several common utilities used by different containers. Unless noted below, files in this directory should not be modified by users.
  • docker-compose.yml: This file controls how edexc, processc, and tfc are launched and interact with each other. User configuration of this file should generally not be necessary; where possible, user-facing configuration lives in the specific files described below. Users should generally not need to modify any Dockerfile files either.

edexc

This container has several configuration files that control the type of data ingested by EDEX as well as CAVE-specific configuration. These files are all found in edexc/etc/conf; files in edexc/etc/systemd should not be modified by users. Unless noted below, files in edexc/etc/conf should also not be edited by users:

ldmd.conf

This file controls the type of data ingested by the EDEX container. Note that several example entries are commented out. Users should modify this file so that the EDEX container ingests relevant data.

Modifications can be made by uncommenting an existing line or adding a new one. The string in quotes is a regular expression that matches product patterns on the upstream LDM. For example:

REQUEST UNIWISC|NIMAGE "OR_ABI-L2-CMIPM1-M6C09_G17.*" iddc.unidata.ucar.edu      # GOES Channel 9 Mesoscale 1

This entry requests (via OR_ABI-L2-CMIPM1-M6C09_G17.*) all GOES-17 (G17) Advanced Baseline Imager (ABI) Level 2 (L2) Cloud & Moisture Imagery (CMIP) products in the Mesoscale 1 (M1) ABI scene. Channel 09 (M6C09) is the specific channel being requested, which corresponds to mid-level water vapor. Info on file naming conventions for ldmd.conf can be found at the following links:

The upstream LDM which the EDEX container gets data from is specified by iddc.unidata.ucar.edu. Users must select an upstream LDM that is willing to serve them data.

pqact.conf

The pqact.conf file handles actions as the EDEX container ingests new data from the upstream LDM. Documentation on this file can be found here. A relevant example for GOES cloud and moisture data is:

NIMAGE  ^/data/ldm/pub/native/satellite/GOES/([^/]*)/Products/CloudAndMoistureImagery/([^/]*)/([^/]*)/([0-9]{8})/([^/]*)(c[0-9]{7})(..)(.....)_ml.nc
    FILE    -close -edex    /awips2/data_store/GOES/\4/\7/CMI-IDD/\5\6\7\8_ml.nc4  # handle inputs for awips-ml

NIMAGE  ^/data/ldm/pub/native/satellite/GOES/([^/]*)/Products/CloudAndMoistureImagery/([^/]*)/([^/]*)/([0-9]{8})/([^/]*)(c[0-9]{7})(..)(.....).nc
    EXEC    /home/awips/anaconda3/envs/grpc_env/bin/python /server/trigger.py /awips2/data_store/GOES/\4/\7/CMI-IDD/\5\6\7\8.nc4 edex_container

Note that the two entries have similar pattern matching with different commands, as described in the pqact.conf documentation linked above. The major difference here is the inclusion of the EXEC entry, which calls a python script that alerts the EDEX container of a newly received file and sends it to the tfc container.
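
To see how the capture groups feed the FILE and EXEC actions, the sketch below runs the same pattern through Python's re module against a made-up product ID. The product ID is hypothetical and only meant to show which pieces of the path the backreferences \4-\8 pick out.

# Illustration of the pqact.conf capture groups (the product ID is made up).
import re

pattern = (r"^/data/ldm/pub/native/satellite/GOES/([^/]*)/Products/"
           r"CloudAndMoistureImagery/([^/]*)/([^/]*)/([0-9]{8})/"
           r"([^/]*)(c[0-9]{7})(..)(.....)_ml\.nc")

product_id = ("/data/ldm/pub/native/satellite/GOES/GOES17/Products/"
              "CloudAndMoistureImagery/Mesoscale-1/Channel09/20211021/"
              "OR_ABI-L2-CMIPM1-M6C09_G17_s20212941830276_c20212941830385_ml.nc")

g = re.match(pattern, product_id).groups()
for i, group in enumerate(g, start=1):
    print(f"\\{i} = {group}")

# The FILE action assembles the destination path from \4, \5, \6, \7, and \8:
print(f"/awips2/data_store/GOES/{g[3]}/{g[6]}/CMI-IDD/{g[4]}{g[5]}{g[6]}{g[7]}_ml.nc4")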

registry.xml

Use this file to change the hostname:

<hostname>[name].docker</hostname>

Additionally, the line

<time-offset>0</time-offset>

has a different value than the default (3600; more information here) because the default was causing inconsistent behavior for EDEX file ingestion.

tfc

The tfc container is designed to be lightweight in the sense that users only need to point to the location of their trained model. Users can do this by modifying tfc/Dockerfile:

COPY ./tfc/models/[saved_model] /models/model

Where [saved_model] is the location of the model they'd like to serve with the tfc container. Note that [saved_model] must conform to this directory structure:

[saved_model]/[version_number]/

because the underlying TensorFlow docker image in tfc needs a version number to run.
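
As a quick sanity check before building the container, the sketch below (with a placeholder path) verifies that each version subdirectory contains a saved_model.pb file, which is what the SavedModel format produced by model.save() provides.

# Check that [saved_model] follows the layout TensorFlow Serving expects:
# [saved_model]/[version_number]/saved_model.pb (plus a variables/ directory).
# The path below is a placeholder; point it at your own model directory.
import os

saved_model = "tfc/models/my_model"  # placeholder for [saved_model]

for version in sorted(os.listdir(saved_model)):
    version_dir = os.path.join(saved_model, version)
    has_pb = os.path.isfile(os.path.join(version_dir, "saved_model.pb"))
    print(f"{version_dir}: saved_model.pb present = {has_pb}")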

processc

This container does not have any configuration options associated with it; user customization of its data processing is handled through the interfaces described in the Customization section above.

server

This folder contains several configuration files/scripts used for handling data I/O from the edexc/EDEX server. Users do not need to modify the container_servers.py or trigger.py files directly, as these can be controlled with config.yaml.

config.yaml

The main parameter to change in this file is variable_spec - this is the netCDF variable that is passed between edexc and processc (and eventually tfc).
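
For reference, the sketch below shows what variable_spec amounts to conceptually: selecting a single variable's array from an incoming netCDF file. The file name is a placeholder and this is not the actual server code, which lives in container_servers.py and trigger.py.

# Conceptual illustration of variable_spec (placeholder file name; the real
# logic lives in server/container_servers.py and server/trigger.py).
import yaml
import netCDF4

with open("server/config.yaml") as f:
    config = yaml.safe_load(f)

variable_spec = config["variable_spec"]  # assumes a top-level variable_spec key

with netCDF4.Dataset("example_input.nc") as nc:  # placeholder file name
    data = nc.variables[variable_spec][:]  # the array passed between containers
    print(variable_spec, data.shape)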

Besides this, config.yaml controls several aspects of the inter-container networking, including the ports over which the edexc and processc containers communicate with each other. In general these ports do not need to be modified; they are restricted to the docker network namespace, so they should not interfere with the host OS's network namespace.

Troubleshooting

This section covers common problems. If your question is not answered here, feel free to open a new issue for help.

What should I do if:

No data is available in the CAVE Product Browser:
  • Try waiting for a few minutes to see if data loads - sometimes there is a lag between launching the EDEX container and when data is available.
  • If data still does not appear, step into the container and check the LDM log:
docker exec -it edexc bash
less /awips2/ldm/logs/ldmd.log

Within this log file, you should see something similar to:

20211021T183357.073945Z iddc.unidata.ucar.edu[1111] requester6.c:make_request:311       NOTE  Upstream LDM-6 on iddc.unidata.ucar.edu is willing to be a primary feeder

If you do not see a message like this, the upstream LDM specified in the ldmd.conf file is rejecting your requests. Generally this means your IP address is being rejected. Contact the upstream LDM administrator for more information. In the case of Unidata LDMs, your IP address needs to be associated with a .edu domain.

If the LDM feed looks healthy but data still does not appear, try entering the container and running edex status. If the output looks like:

[edex status]
 postgres    :: running :: pid 188
 pypies      :: running :: pid 268
 qpid        :: running :: pid 305
 EDEXingest  :: running :: pid 777 1906
 EDEXgrib    :: not running
 EDEXrequest :: running :: pid 739 1919
 ldmadmin    :: not running

This could indicate that the container was not shut down properly before being restarted. In this case, bring down the containers (docker-compose down), delete any stored data (docker system prune), and then rebuild and relaunch them (docker-compose build && docker-compose up).

My containers keep crashing

Generally it is convenient to launch the containers in detached mode (docker-compose up -d); however, this means you cannot see the container output. If a container is crashing, it can be helpful to launch normally (docker-compose up) and watch the output (especially for the processc/tfc containers).

Additionally it can be useful to look at the outputs of the containers themselves by attaching to the container process launched by docker-compose; you can do this via:

docker attach [container_name]

File Loads in Product Browser but doesn't Display

If your data has been successfully transformed by your machine learning model and shows up in CAVE's Product Browser but nothing displays in the map view (except possibly a color bar), you may need to clear CAVE's cache. This can be done by deleting the caveData directory (locations listed below; more information here).

  • macOS: /Users/[username]/Library/caveData
  • Linux: /home/[username]/caveData
  • Windows: C:\Users\[username]\caveData

docker-compose error: Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted

This occurs if you are using Docker Desktop version >4.3.0, which introduced a breaking change for awips-ml. awips-ml uses CentOS 7 with no current plans to upgrade to CentOS 8. Docker Desktop versions <4.3.0 should work; awips-ml was developed using Docker Desktop v3.5.2. Downgrading to a different Docker Desktop version is possible via the Docker website; the download link for v3.5.2 is here.

Stuff just doesn't work

File an issue (ideally with a link to your forked awips-ml repository). Useful places to look for logs within the edexc container are:

  • /awips2/edex/logs/edex-ingest-[product_type]-[date].log
  • /awips2/ldm/logs/ldmd.log
  • The output of the python script handling communication between edexc and tfc can be viewed via the following command within the edexc container:
sudo journalctl -fu listener_start.service