awips ml usage guide - rmcsqrd/awips-ml Wiki

awips-ml Wiki

This wiki contains information about using awips-ml.

Interacting with the containers

Sometimes it is useful to step into the containers to check on logs, files, etc. To do this, run the following exec command (this command assumes bash is installed in the container being accessed):

docker exec -it [container_name] bash

To view running containers, run docker ps, which (if any containers are running) should return something like this:

CONTAINER ID   IMAGE              COMMAND                  CREATED          STATUS          PORTS                                                                                                  NAMES
59abc5e97a8a   awips-ml_edex      "/usr/sbin/init"         31 minutes ago   Up 31 minutes   0.0.0.0:388->388/tcp, :::388->388/tcp, 0.0.0.0:9581-9582->9581-9582/tcp, :::9581-9582->9581-9582/tcp   edexc
e060cf1bb651   awips-ml_tf        "/usr/bin/tf_serving…"   31 minutes ago   Up 31 minutes   0.0.0.0:8500-8501->8500-8501/tcp, :::8500-8501->8500-8501/tcp                                          tfc
f3bfa05deec5   awips-ml_process   "python server/edex_…"   31 minutes ago   Up 31 minutes                                                                                                          processc

You can find [container_name] in the NAMES column of the output.
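
If only the names are needed, `docker ps --format '{{.Names}}'` prints one container name per line. The following is a minimal sketch of collecting those names from Python; the helper names `parse_container_names` and `running_container_names` are illustrative and not part of awips-ml.

```python
import subprocess
from typing import List


def parse_container_names(ps_output: str) -> List[str]:
    """Split the output of `docker ps --format '{{.Names}}'` into names."""
    return [line.strip() for line in ps_output.splitlines() if line.strip()]


def running_container_names() -> List[str]:
    """Ask the local Docker CLI for the names of running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,  # raise if the docker command itself fails
    )
    return parse_container_names(result.stdout)
```

With the three awips-ml containers running, `running_container_names()` would be expected to include edexc, tfc, and processc.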

Modifying the containers

In general, anytime you modify any files, the containers need to be rebuilt. To do this, run the following commands:

docker-compose down
docker-compose build
docker-compose up

A heavier duty/more "blunt" way to do this is docker system prune, which deletes stopped containers, unused networks, dangling images, and unused build cache. See documentation here.

Customization

awips-ml provides several interfaces for users to customize the data pre/post-processing within the processc container. These interfaces are found in the /usr/ folder and their functionality is described below:

awips-ml also offers two ways for users to customize the machine learning model endpoint that is deployed in tfc. Users can:

  1. Include a model generating script in tfc/etc that generates the model from scratch, or
  2. Include the pre-trained model weights from a model.save() command in the tfc/user_model folder.

More instructions on how to use this functionality are given in tfc/Dockerfile.

Sometimes it is useful to expose the TensorFlow model for testing purposes. Currently awips-ml uses a docker network for all inter-container networking. This means that no ports are visible outside of the docker network namespace by default. To expose the model ports in the tfc container, add these lines to the docker-compose.yml file under the tf section:

ports:
  - 8500:8500
  - 8501:8501

This will allow users to send data from the host OS (outside of the docker network namespace) over these ports: 8500 for gRPC, 8501 for the REST API. awips-ml uses the REST API.
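
To check an exposed endpoint from the host, a request can be built against TensorFlow Serving's REST predict route (`/v1/models/<name>:predict`). The sketch below is an illustration only and is not part of awips-ml. It assumes the REST API is reachable on localhost at TensorFlow Serving's default REST port, 8501, and that the model is served under the name `model` (suggested by the /models/model path used in tfc/Dockerfile, but not confirmed here). The helper names are hypothetical, and the shape of `instances` depends on your model.

```python
import json
import urllib.request


def build_predict_request(instances, host="localhost", port=8501, model_name="model"):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(instances, **kwargs):
    """Send the request and return the "predictions" field of the reply."""
    with urllib.request.urlopen(build_predict_request(instances, **kwargs)) as resp:
        return json.loads(resp.read())["predictions"]
```

Calling `predict(...)` only works while the tfc container is running with the ports above exposed; `build_predict_request(...)` can be inspected without a running server.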

Configuration

awips-ml is composed of three containers and some other directories which are all configurable according to user needs.

edexc

This container has several configuration files that control the type of data ingested by EDEX, along with CAVE-specific configuration. These files are all found in edexc/etc/conf; files in edexc/etc/systemd should not be modified by users. Unless noted below, files in edexc/etc/conf should not be edited by users:

ldmd.conf

This file controls the type of data ingested by the EDEX container. Note that several example entries are commented out. Users should modify this file so that the EDEX container ingests relevant data.

Modifications can be made by uncommenting an existing line or adding a new one. The string in quotes is a regular expression that is matched against product names on the upstream LDM. For example:

REQUEST UNIWISC|NIMAGE "OR_ABI-L2-CMIPM1-M6C09_G17.*" iddc.unidata.ucar.edu      # GOES Channel 9 Mesoscale 1

This entry requests products matching OR_ABI-L2-CMIPM1-M6C09_G17.*, that is, all GOES-17 (G17) Advanced Baseline Imager (ABI) Level 2 (L2) Cloud & Moisture Imagery (CMIP) products in the Mesoscale 1 (M1) ABI scene. M6C09 selects ABI scan mode 6 (M6) and channel 09 (C09), which corresponds to mid-level water vapor. Info on file naming conventions for ldmd.conf can be found at the following links:

The hostname after the pattern (here iddc.unidata.ucar.edu) specifies the upstream LDM from which the EDEX container gets data. Users must select an upstream LDM that is willing to serve them data.
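
Before editing ldmd.conf, it can help to try a pattern against a few product names. The sketch below uses Python's `re` module as a stand-in for the LDM's own regular-expression matching, which is close enough for simple patterns like this one but is not guaranteed to behave identically. The sample product names in the usage note are hypothetical and only follow the naming scheme described above.

```python
import re

# Pattern copied from the quoted string in the REQUEST entry above.
REQUEST_PATTERN = re.compile(r"OR_ABI-L2-CMIPM1-M6C09_G17.*")


def is_requested(product_name: str) -> bool:
    """Return True if the product name matches the REQUEST pattern."""
    # search() is used because the pattern is not anchored to the start.
    return REQUEST_PATTERN.search(product_name) is not None
```

For example, `is_requested("OR_ABI-L2-CMIPM1-M6C09_G17_s20212941833570_e20212941833570_c20212941834050.nc")` is True, while the same name with `M6C10` (channel 10) or `G16` (GOES-16) does not match.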

pqact.conf

The pqact.conf file handles actions as the EDEX container ingests new data from the upstream LDM. Documentation on this file can be found here. A relevant example for GOES cloud and moisture data is:

NIMAGE  ^/data/ldm/pub/native/satellite/GOES/([^/]*)/Products/CloudAndMoistureImagery/([^/]*)/([^/]*)/([0-9]{8})/([^/]*)(c[0-9]{7})(..)(.....)_ml.nc
    FILE    -close -edex    /awips2/data_store/GOES/\4/\7/CMI-IDD/\5\6\7\8_ml.nc4  # handle inputs for awips-ml

NIMAGE  ^/data/ldm/pub/native/satellite/GOES/([^/]*)/Products/CloudAndMoistureImagery/([^/]*)/([^/]*)/([0-9]{8})/([^/]*)(c[0-9]{7})(..)(.....).nc
    EXEC    /home/awips/anaconda3/envs/grpc_env/bin/python /server/trigger.py /awips2/data_store/GOES/\4/\7/CMI-IDD/\5\6\7\8.nc4 edex_container

Note that the two entries have similar pattern matching with different commands, as described in the pqact.conf documentation linked above. The major difference here is the inclusion of the EXEC entry, which calls a python script that alerts the EDEX container of a newly received file and sends it to the tfc container.
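
The capture groups in these patterns determine the data_store path that each entry produces (\4 is the date directory, \7 the two characters after the creation year and day, and \5\6\7\8 rebuild the file name). The sketch below reproduces that substitution with Python's `re` module as an approximation of pqact's matching. The product identifiers used in the usage note are hypothetical, and `data_store_path` is an illustrative helper, not part of awips-ml.

```python
import re

PREFIX = (
    r"^/data/ldm/pub/native/satellite/GOES/([^/]*)/Products/"
    r"CloudAndMoistureImagery/([^/]*)/([^/]*)/([0-9]{8})/"
    r"([^/]*)(c[0-9]{7})(..)(.....)"
)
# Patterns and path templates copied from the two pqact.conf entries above.
ML_PATTERN = re.compile(PREFIX + r"_ml.nc")
RAW_PATTERN = re.compile(PREFIX + r".nc")
ML_TEMPLATE = r"/awips2/data_store/GOES/\4/\7/CMI-IDD/\5\6\7\8_ml.nc4"
RAW_TEMPLATE = r"/awips2/data_store/GOES/\4/\7/CMI-IDD/\5\6\7\8.nc4"


def data_store_path(product_id: str):
    """Return the path an entry would build for product_id, or None."""
    for pattern, template in ((ML_PATTERN, ML_TEMPLATE), (RAW_PATTERN, RAW_TEMPLATE)):
        match = pattern.search(product_id)
        if match:
            # expand() substitutes \4, \5, ... with the captured groups.
            return match.expand(template)
    return None
```

For a hypothetical identifier ending in `.../Channel09/20211021/OR_ABI-L2-CMIPM1-M6C09_G17_s20212941833570_e20212941833570_c20212941834050.nc`, only the second pattern matches and the resulting path is `/awips2/data_store/GOES/20211021/18/CMI-IDD/OR_ABI-L2-CMIPM1-M6C09_G17_s20212941833570_e20212941833570_c20212941834050.nc4`. The same name with an `_ml.nc` suffix matches only the first pattern.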

registry.xml

Use this file to change the hostname:

<hostname>[name].docker</hostname>

tfc

The tfc container is designed to be lightweight in the sense that users only need to point to the location of their trained model. Users can do this by modifying tfc/Dockerfile:

COPY ./tfc/models/[saved_model] /models/model

Where [saved_model] is the location of the model they'd like to serve with the tfc container. Note that [saved_model] must conform to this directory structure:

[saved_model]/[version_number]/

because the underlying TensorFlow docker image in tfc needs a version number to run.
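
A quick way to confirm the layout before building the image is to list the version subdirectories. This is a minimal sketch assuming that version directories are named with plain integers (for example `1`); `model_versions` is an illustrative helper and not part of awips-ml.

```python
from pathlib import Path
from typing import List


def model_versions(saved_model_dir) -> List[int]:
    """Return the numeric version subdirectories found in saved_model_dir."""
    root = Path(saved_model_dir)
    return sorted(
        int(entry.name)
        for entry in root.iterdir()
        if entry.is_dir() and entry.name.isdecimal()
    )
```

An empty list means the directory has no [version_number] subdirectory, so the layout does not match the structure described above.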

processc

This container does not have any configuration options associated with it.

server

This folder contains several configuration files/scripts used for handling data I/O from the edexc/EDEX server. Users do not need to modify the container_servers.py or trigger.py files directly, as these can be controlled with config.yaml.

config.yaml

The main parameter to change in this file is variable_spec, the netCDF variable that is passed between edexc and processc (and eventually tfc).

Besides this, config.yaml controls several aspects of the inter-container networking, including which ports the edexc and processc containers use to communicate with each other. In general these ports do not need to be modified: they are restricted to the docker network namespace, so they shouldn't interfere with the host OS's network namespace.

Troubleshooting

This section covers common problems. If your question is not answered here, feel free to open a new issue for help.

What should I do if:

No data is available in the CAVE Product Browser:

First, check the LDM log inside the edexc container:

docker exec -it edexc bash
less /awips2/ldm/logs/ldmd.log

Within this log file, you should see something similar to:

20211021T183357.073945Z iddc.unidata.ucar.edu[1111] requester6.c:make_request:311       NOTE  Upstream LDM-6 on iddc.unidata.ucar.edu is willing to be a primary feeder

If you do not see a message like this, the upstream LDM specified in the ldmd.conf file is rejecting your requests. Generally this means your IP address is being rejected. Contact the upstream LDM administrator for more information. In the case of Unidata LDMs, your IP address needs to be associated with a .edu domain.
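
Scanning a long log by eye can be tedious. The sketch below searches log text for the acceptance message shown above and returns the upstream hosts it names. It assumes the message wording matches the sample line exactly; other LDM versions may phrase it differently, and `accepting_upstreams` is an illustrative helper, not part of awips-ml.

```python
import re
from typing import List

# Wording taken from the sample ldmd.log line above.
FEEDER_RE = re.compile(r"Upstream LDM-6 on (\S+) is willing to be a primary feeder")


def accepting_upstreams(log_text: str) -> List[str]:
    """Return the upstream hosts that the log reports as willing to feed data."""
    return FEEDER_RE.findall(log_text)
```

An empty result for the contents of /awips2/ldm/logs/ldmd.log suggests that no upstream has accepted the requests listed in ldmd.conf.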

If the upstream LDM is accepting your requests and data is still unavailable, try entering the container and running edex status. If the output looks like:

[edex status]
 postgres    :: running :: pid 188
 pypies      :: running :: pid 268
 qpid        :: running :: pid 305
 EDEXingest  :: running :: pid 777 1906
 EDEXgrib    :: not running
 EDEXrequest :: running :: pid 739 1919
 ldmadmin    :: not running

This may indicate that the container was not shut down properly before being restarted. In this case, bring down the container (docker-compose down), delete any stored data (docker system prune), and then rebuild/relaunch the container (docker-compose build && docker-compose up).
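
To pick out the stopped services from the edex status output automatically, the `::`-separated columns can be parsed. This is a minimal sketch assuming the output format shown above; `stopped_services` is an illustrative helper, not part of awips-ml.

```python
from typing import List


def stopped_services(status_output: str) -> List[str]:
    """Return the names of services that `edex status` lists as not running."""
    stopped = []
    for line in status_output.splitlines():
        # Each service line looks like: "name :: state [:: pid ...]".
        parts = [part.strip() for part in line.split("::")]
        if len(parts) >= 2 and parts[1] == "not running":
            stopped.append(parts[0])
    return stopped
```

For the sample output above, this returns EDEXgrib and ldmadmin.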

My containers keep crashing

It is generally convenient to launch containers in detached mode (docker-compose up -d), but this means that you can't see their output. If a container is crashing, it can help to launch in the foreground instead (docker-compose up) and watch the output (especially for the processc/tfc containers).

Additionally it can be useful to look at the outputs of the containers themselves by attaching to the container process launched by docker-compose; you can do this via:

docker attach [container_name]

File Loads in Product Browser but doesn't Display

If your data has been successfully transformed by your machine learning model and shows up in CAVE's Product Browser, but nothing displays in the map view (except potentially a color bar), then you may need to clear CAVE's cache. This can be done by deleting the caveData directory (listed below, more information on this here).

docker-compose error: Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted

This occurs if you are using Docker Desktop v4.3.0 or later, which introduced a breaking change for awips-ml. awips-ml uses CentOS 7 with no current plans to upgrade to CentOS 8. Docker Desktop versions earlier than v4.3.0 should work; awips-ml was developed using v3.5.2. Older versions can be downloaded from the Docker website; the download link for v3.5.2 is here.

Stuff just doesn't work

File an issue (ideally with a link to your forked awips-ml repository). Useful places to look for logs within the edexc container are:

sudo journalctl -fu listener_start.service