
awips-ml Design Document

This document is targeted at developers and provides an overview of how awips-ml works internally. See the diagram below for a visual description of how the pieces fit together.

Container Overview

awips-ml is composed of three different Docker containers described below. The quickstart and awips-ml guide describe how to run the containers using docker-compose. The three containers are:

  - edexc: runs the EDEX server, which ingests data from the LDM and serves it to CAVE
  - tfc: serves the machine learning model via TensorFlow Serving
  - processc: brokers data between edexc and tfc, handling the request/response and trigger logic

Container Configuration

Several user-specific customization options are included for awips-ml. There are two primary locations:

  - usr/config.yaml, which defines the ports/hostnames the containers communicate over
  - edexc/etc/conf/, which contains the EDEX configuration files

Both locations and how to customize their contents are described in the awips-ml usage guide.

Container Networking

This section describes how data is transferred between containers. Generally, awips-ml has two inter-container networking methods that need to be dealt with: pygcdm/gRPC between edexc and processc, and HTTP between processc and tfc. Data transfer in/out of the edexc container itself is handled internally by the EDEX server.

For inter-container networking, docker-compose launches a Docker network called awipsml-net for all network communication. The ports that the containers communicate via are defined in /usr/config.yaml. Note that all ports/hostnames in this file are defined in mirrored pairs: the port one container sends on is the port the other container receives on. Generally this file does not need to be changed unless something precludes using the default ports. The only exception to this is:

edexc ↔ processc networking

Data is transferred between edexc and processc using pygcdm, which is based on gCDM, the Java implementation of gRPC for the Common Data Model (more on this is discussed in the README for pygcdm). At a high level, this allows netCDF data to be transferred more efficiently than via HTTP.

pygcdm (and gCDM) is implemented such that a transaction is initiated by sending a request message to a server, which then responds with a response message. As shown in the diagram above, edexc sends header/data requests to processc and receives header/data responses (and vice-versa).
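To make this pattern concrete, below is a minimal sketch of one request/response round trip. It uses plain asyncio rather than pygcdm's actual API, and the message strings are hypothetical:

```python
import asyncio

async def handle_request(reader, writer):
    # Server side: wait for a request message, reply with a response message.
    request = (await reader.readline()).decode().strip()
    writer.write(f"HEADER_RESPONSE for {request.split()[-1]}\n".encode())
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_request, "localhost", 50051)
    async with server:
        # Client side: the transaction is always initiated by a request.
        reader, writer = await asyncio.open_connection("localhost", 50051)
        writer.write(b"HEADER_REQUEST /data/new_file.nc\n")
        await writer.drain()
        print(await reader.readline())  # b'HEADER_RESPONSE for /data/new_file.nc\n'
        writer.close()

asyncio.run(main())
```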

Because response messages can only be sent in response to a request, the edexc container needs a way of prompting processc to request data whenever new data is ingested from the LDM. To accomplish this, edexc sends a string containing the local file path of the newly ingested data (via socket). This message is received into an asynchronous queue in processc. When processc is ready, it pops the file path off the queue and uses it as part of its request message to edexc. When processc has the transformed data from tfc, it uses a similar process to prompt edexc to request the data for visualization via CAVE.
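A minimal sketch of the receiving side of this trigger mechanism is below (the port and framing are illustrative assumptions; the real values come from usr/config.yaml):

```python
import asyncio

trigger_queue = asyncio.Queue()

async def handle_trigger(reader, writer):
    # edexc sends the local file path of the newly ingested data as a string.
    file_path = (await reader.read(-1)).decode().strip()
    await trigger_queue.put(file_path)  # consumed later, when processc is ready
    writer.close()

async def main():
    # Port 4000 is a placeholder, not the configured trigger port.
    server = await asyncio.start_server(handle_trigger, "0.0.0.0", 4000)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```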

edexc starts the socket "trigger" process when new data is ingested from the LDM. Ingestion triggers an EXEC function (defined here) that calls a Python trigger function (defined here), which sends the file path via socket. Data being ingested into the EDEX is handled by a utility included with the EDEX install (located within the edexc container at /awips2/ldm/dev/notifyAWIPS2-unidata.py).

The mechanism for the processc trigger process is simpler. It asynchronously waits to pop a file path off the trigger queue, requests the corresponding data, sends it to tfc, then immediately sends the file path of the transformed data back to edexc. It doesn't rely on any existing EDEX internals the way edexc does, which keeps the implementation simple.
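Sketched in Python, the loop looks roughly like the following (the helper functions are placeholders for the real pygcdm and HTTP calls):

```python
import asyncio

async def request_data(file_path):
    return f"netcdf bytes for {file_path}"   # placeholder for the pygcdm request to edexc

def transform_via_tfc(data):
    return f"transformed {data}"             # placeholder for the HTTP round trip to tfc

async def process_loop(queue):
    while True:
        file_path = await queue.get()          # pop the next trigger off the queue
        data = await request_data(file_path)   # request the file from edexc
        output = transform_via_tfc(data)       # run it through the ML model
        print(f"prompt edexc to request: {output}")

async def main():
    queue = asyncio.Queue()
    await queue.put("/tmp/example.nc")
    task = asyncio.create_task(process_loop(queue))
    await asyncio.sleep(0.1)  # let the loop process the queued item
    task.cancel()

asyncio.run(main())
```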

processc ↔ tfc networking

The processc networking with tfc is much less complex than with edexc. It transfers numpy arrays via HTTP to a specified port on the tfc container and receives the ML model output via HTTP.
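This exchange follows TensorFlow Serving's REST predict API. A sketch is below; the hostname, port, model name, and array shape are assumptions rather than awips-ml's actual configuration:

```python
import json
import numpy as np
import requests

# A numpy array serialized to nested lists for the JSON payload.
array = np.random.rand(1, 64, 64)
payload = json.dumps({"instances": array.tolist()})

# TensorFlow Serving's REST predict endpoint; "model" and port 8501
# are placeholders for whatever the tfc container is configured with.
response = requests.post("http://tfc:8501/v1/models/model:predict", data=payload)
output = np.array(response.json()["predictions"])
```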

/server/

This folder is the "bread and butter" of awips-ml and is the location where the networking and processing code lives. When docker-compose starts the containers and Docker network, it also starts these server processes within the appropriate containers. The three files in this folder are discussed in this section.

container_servers.py

This code is what allows edexc and processc to listen for and respond to data requests via pygcdm. The code contains a BaseServer class which implements the low-level gRPC and trigger functionality. ProcessContainerServer and EDEXContainerServer inherit from the BaseServer class and implement container-specific functionality.
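A hypothetical skeleton of the hierarchy (the method names are illustrative, not the actual ones):

```python
class BaseServer:
    """Low-level gRPC and trigger plumbing shared by both containers."""
    def __init__(self, config):
        self.config = config  # parsed contents of usr/config.yaml

    async def listen(self):
        ...  # accept pygcdm requests and send responses

class EDEXContainerServer(BaseServer):
    async def handle_trigger(self, file_path):
        ...  # edexc-specific behavior

class ProcessContainerServer(BaseServer):
    async def handle_trigger(self, file_path):
        ...  # processc-specific behavior
```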

Both classes read from the usr/config.yaml file when instantiated and use its contents to decide which ports to request/respond on, etc. Networking is generally implemented using Python's asynchronous built-ins where possible so that edexc and processc could theoretically be deployed on different machines.
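For illustration, reading the config might look like the sketch below; the keys shown are hypothetical, so refer to usr/config.yaml for the actual structure:

```python
import yaml

with open("usr/config.yaml") as f:
    config = yaml.safe_load(f)

# Hypothetical keys for illustration only.
host = config["edex"]["host"]
request_port = config["edex"]["request_port"]
```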

grpc_api.py

This code is a low-level wrapper around the pygcdm library. container_servers.py imports this code and uses it for making header/data requests and sending header/data responses.

trigger.py

This code is a small convenience function invoked by the edexc container whenever a new file is ingested via EDEX; it sends a message containing the file path of the newly ingested data. processc has this trigger functionality implemented in the ProcessContainerServer class. The reason it is broken out for edexc is that trigger events originate from the EDEX server calling the EXEC command (described previously) when a new file is ingested.
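A sketch of what such a trigger function might look like (the host and port are placeholders for the values in usr/config.yaml):

```python
import socket
import sys

def send_trigger(file_path, host="processc", port=4000):
    # Send the newly ingested file path as a plain string over a socket.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(file_path.encode())

if __name__ == "__main__":
    send_trigger(sys.argv[1])  # e.g. invoked by the EDEX EXEC function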

Docker-Specific Stuff

This section describes the nuances of the specific containers, including a discussion of the Dockerfiles. Discussion of what is going on in the docker-compose.yml file is also provided.

docker-compose.yml

This file is generally pretty simple to understand. At a high level, it launches all three awips-ml containers by collating all the individual run instructions. Theoretically each container could be launched independently via a docker run command with lots of options; the docker-compose command abstracts this into one file. Reference the docker-compose documentation for general information; specific nuances in this file are:

edexc

This is the most complex docker container in awips-ml. The Dockerfile does several things, listed below in approximate order:

  1. Install an EDEX server into the container
  2. Install conda into the container
  3. Create a conda environment with the necessary dependencies. This environment is used to run server/container_servers.py edex_container; the default container Python executable is not used because the EDEX server relies on Python 2 while pygcdm requires Python 3.
  4. Modify EDEX config files (described in the user guide)
  5. Set up the container to run systemd
  6. Copy in the awips-ml specific systemd init service files
  7. Clean up the container

Besides the config files found in /edexc/etc/conf/, which are discussed in the awips-ml user guide, the other config files are systemd-specific and can be found in edexc/etc/systemd/. The EDEX install script (awips_install.sh) relies on having CentOS 7 as the operating system, and no workaround was found to emulate CentOS 7 in a container without including systemd. As such, systemd is leveraged to launch some services on init; running systemd in a container is generally frowned upon in Docker, but no other workaround was discovered. These service files are:

tfc

This is a simple container that starts from the TensorFlow Serving development image and deploys a model. See the usage guide for how to specify a non-dummy model. The build_dummy_model.py script is intended as a placeholder, generating a dummy model at build time to cut down on repo size compared to committing a saved model.
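For reference, a minimal identity SavedModel can be built in a few lines; this is a sketch of the idea, not necessarily what build_dummy_model.py actually does:

```python
import tensorflow as tf

# An identity model: output equals input. The shape and dtype are arbitrary.
module = tf.Module()
module.fn = tf.function(
    lambda x: x,
    input_signature=[tf.TensorSpec(shape=[None, None], dtype=tf.float32)],
)

# TensorFlow Serving expects a versioned directory, e.g. models/model/1.
tf.saved_model.save(module, "models/model/1",
                    signatures={"serving_default": module.fn})
```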

processc

This is another simple container. It installs the dependencies for the server/container_servers.py code and also installs conda so that users can use their own custom conda environment. Additionally, it runs any custom shell script defined by the user. Launching server/container_servers.py process_container is handled by the docker-compose.yml file.

Known Issues

"Queue: \'external.dropbox\' not found"

There is a lag between the startup of the LDM and the EDEX server within the edexc container, during which data is downloaded from the upstream LDM but is not ingested into the EDEX server. Because the server/container_servers.py script starts instantly and the trigger messages are sent upon download from the LDM, edexc will try to ingest data (using /awips2/ldm/dev/notifyAWIPS2-unidata.py) before the EDEX server is running; these attempts result in an error until the EDEX is started. To avoid losing any data, a queue was implemented so that any files downloaded from the LDM prior to EDEX being fully operational are queued up, then ingested once the EDEX is started. The script determines this by checking the EDEX logs to see whether EDEX has started.
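A sketch of that startup check and queue (the log marker string and the ingest helper are assumptions, not the actual implementation):

```python
import glob

def edex_started(log_glob="/awips2/edex/logs/*"):
    # Scan the EDEX logs for evidence that the server is up; the marker
    # string here is an assumption about what the logs contain.
    for log_file in glob.glob(log_glob):
        with open(log_file, errors="ignore") as f:
            if "EDEX ESB is now operational" in f.read():
                return True
    return False

def notify_edex(file_path):
    # Placeholder for calling /awips2/ldm/dev/notifyAWIPS2-unidata.py.
    print(f"ingesting {file_path}")

pending = []  # file paths downloaded before EDEX was ready

def ingest_or_queue(file_path):
    pending.append(file_path)
    if edex_started():
        while pending:
            notify_edex(pending.pop(0))
```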

Originally, this lag caused the ingestion into EDEX to fail. The failure could be checked by running grep nc4 /awips2/edex/logs/* within the edexc container: if files are being successfully ingested, the log files should show INFO messages; if ingestion is failing they will show WARN messages. This was due to the "Invalid Metadata" issue described below; the bash script workaround described there appears to fix the problem. A simpler version of the code without the queuing system is kept in the no_queue branch.

Invalid Metadata

Sometimes files will be downloaded from the LDM but not ingested into EDEX. Running grep nc4 /awips2/edex/logs/* from within the edexc container will return a list of all ingestion attempts within the logs. Sometimes this will return WARN ... No valid records were found in file. Digging into the actual error shows:

Caused by: org.hibernate.NonUniqueResultException: query did not return a unique result: 2
    at org.hibernate.internal.AbstractQueryImpl.uniqueElement(AbstractQueryImpl.java:918) ~[hibernate-core-4.2.15.Final.jar:4.2.15.Final]
    at org.hibernate.internal.CriteriaImpl.uniqueResult(CriteriaImpl.java:396) ~[hibernate-core-4.2.15.Final.jar:4.2.15.Final]
    at com.raytheon.edex.plugin.satellite.dao.SatMapCoverageDao.query(SatMapCoverageDao.java:149) ~[com.raytheon.edex.plugin.satellite.jar:na]
    at com.raytheon.edex.plugin.satellite.dao.SatMapCoverageDao.getOrCreateCoverage(SatMapCoverageDao.java:95) ~[com.raytheon.edex.plugin.satellite.jar:na]
    at com.raytheon.uf.edex.plugin.goesr.geospatial.GoesrProjectionFactory.getCoverage(GoesrProjectionFactory.java:178) ~[com.raytheon.uf.edex.plugin.goesr.jar:na]
    ... 53 common frames omitted

This shows that the query did not return a unique result. Discussions with the AWIPS team indicate that the rate of ingestion within EDEX is quick enough to produce overlapping time stamps, which results in two duplicate records being generated in the satellite_spatial database. These records can be viewed from within the edexc container by running: psql -U awips -c "SELECT * FROM satellite_spatial;" metadata

When EDEX finds these duplicate records it causes an error. Discussions with the AWIPS team indicate that a proper fix requires changes on the AWIPS side. A temporary fix has been added that goes in and modifies the SQL records to resolve this error. These changes are:

COPY /edexc/etc/systemd/psql_duplicate_fix.service /etc/systemd/system/multi-user.target.wants/psql_duplicate_fix.service
COPY /edexc/etc/systemd/psql_duplicate_remover.sh /psql_duplicate_remover.sh
RUN chmod 777 /psql_duplicate_remover.sh

Currently the AWIPS team is looking into solutions on the AWIPS side. If those solutions are implemented, removing the two files listed above and the relevant lines of the Dockerfile will remove this temporary solution.