awips-ml Design Document

This document is targeted at developers and provides an overview of how awips-ml works internally. See the diagram below for a visual description of how it works.

Container Overview

awips-ml is composed of three Docker containers, described below. The quickstart and awips-ml guide describe how to run the containers using docker-compose. The three containers are:

  • edexc: This is a containerized EDEX server. It ingests data from an upstream LDM and is responsible for responding to requests from CAVE. When new data is ingested from the upstream LDM, this container transmits the file via pygcdm to the processc container. Additionally, when transformed data is available in the processc container, the edexc container receives it via pygcdm.

  • processc: This container is responsible for receiving new netCDF data from edexc via pygcdm, pre-processing it using custom user scripts, and sending the resulting numpy arrays via HTTP to the tfc container. When data (in the form of a numpy array) is returned via HTTP from the tfc container, the processc container performs any user-defined post-processing before sending the resulting netCDF file via pygcdm to the edexc container.

  • tfc: This container is responsible for hosting the machine learning model and is based on the TensorFlow-provided Docker image. It receives numpy arrays from processc and returns the model's output numpy arrays to the processc container, using HTTP in both cases.
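A rough text sketch of this flow (transport labeled on each hop; the pygcdm and HTTP hops are detailed in the networking sections below):

upstream LDM
      │ (data ingest)
      ▼
   edexc ◄── pygcdm (netCDF) ──► processc ◄── HTTP (numpy) ──► tfc
(EDEX server)             (pre/post-processing)           (ML model)
      │
      ▼
    CAVE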

Container Configuration

Several user-specific customization options are included in awips-ml. There are two primary locations:

  • /usr/: This folder has the most forward-facing customization options, including pre/post-processing script locations and a place for users to include their custom conda environment.yml file.
  • /edexc/etc/conf/: This folder contains EDEX-specific configuration files. Modifying these files changes what data is ingested from the upstream LDM and how the data is displayed in CAVE. EDEX-specific customization questions should be sent to the AWIPS team at [email protected].

Both locations and how to customize their contents are described in the awips-ml usage guide.

Container Networking

This section describes how data is transferred between containers. Generally, awips-ml has two inter-container networking methods that need to be dealt with; data transfer in/out of the edexc container itself is handled internally by the EDEX server.

For inter-container networking, docker-compose launches a Docker network called awipsml-net for all network communication. The ports that the containers communicate over are defined in /usr/config.yaml. Note that the ports/hostnames in this file are defined in mirrored pairs: the port one container sends requests to is the port the other container responds on. Generally this file does not need to be changed unless something precludes using the default ports (a sketch of reading this file follows the list below). The only exceptions are:

  • the ml_model_location field which specifies the ML model path. This is discussed in the usage guide.
  • the variable_spec field which specifies the netCDF variable to transfer via pygcdm.
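As a sketch of how a container-side script might consume these two fields (assuming they sit at the top level of config.yaml; parsing is done with PyYAML):

import yaml

# /usr/config.yaml is the path named above; the two fields are the
# user-facing options called out in the list.
with open("/usr/config.yaml") as f:
    config = yaml.safe_load(f)

model_path = config["ml_model_location"]  # path to the served ML model
variable = config["variable_spec"]        # netCDF variable sent via pygcdm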

edexc ↔ processc networking

Data is transferred between edexc and processc using pygcdm, which is based on gCDM, the Java implementation of gRPC for the Common Data Model (discussed further in the README for pygcdm). At a high level, this allows netCDF data to be transferred more efficiently than via HTTP.

pygcdm (and gCDM) is implemented such that a transaction is initiated by sending a request message to a server; the server then responds with a response message. As shown in the diagram above, edexc sends header/data requests to processc and receives header/data responses (and vice-versa).

Because response messages can only be sent in response to a request, the edexc container needs a way of prompting processc to request data whenever new data is ingested from the LDM. To accomplish this, edexc sends a string containing the local file path of the newly ingested data (via socket). This message is received into an asynchronous queue in processc. When processc is ready, it pops the file path off the queue and uses it as part of its request message to edexc. When processc has the transformed data from tfc, it uses a similar process to prompt edexc to request the data for visualization via CAVE. A minimal sketch of this trigger/queue pattern is shown below.
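The following sketch assumes an illustrative port and plain-string messages; it is not the actual awips-ml implementation:

import asyncio

TRIGGER_PORT = 9000  # illustrative; the real ports come from /usr/config.yaml

async def handle_trigger(reader, writer, queue):
    # edexc sends the local file path of newly ingested data as a raw string
    data = await reader.read(1024)
    await queue.put(data.decode().strip())
    writer.close()

async def consume(queue):
    while True:
        file_path = await queue.get()  # pop the next ingested file path
        # ... build a pygcdm request for file_path and send it to edexc ...
        print(f"requesting {file_path}")

async def main():
    queue = asyncio.Queue()
    server = await asyncio.start_server(
        lambda r, w: handle_trigger(r, w, queue), "0.0.0.0", TRIGGER_PORT)
    async with server:
        await asyncio.gather(server.serve_forever(), consume(queue))

asyncio.run(main())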

edexc starts the socket "trigger" process when new data is ingested from the LDM. When this data is ingested, it triggers an EXEC function (defined here) that calls a Python trigger function (defined here) that sends the file path via socket. Ingesting data into EDEX is handled by a utility included with the EDEX install (located within the edexc container at /awips2/ldm/dev/notifyAWIPS2-unidata.py).

The mechanism for the processc trigger process is simpler. It asynchronously waits to pop data from the trigger queue, requests the data, sends it to tfc, then immediately sends the file path of the transformed data back to edexc. It doesn't rely on any existing internals like edexc does, so it is a simpler implementation.

processc ↔ tfc networking

The processc networking with tfc is much less complex than with edexc: processc transfers numpy arrays via HTTP to a specified port on the tfc container and receives the ML model output via HTTP. A sketch of this hop is shown below.
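As a hedged sketch, the request can be expressed against the REST endpoint implied by the tensorflow_model_server flags shown later in this document; the tfc hostname and array shape are illustrative assumptions:

import json
import urllib.request
import numpy as np

# Stand-in for a pre-processed numpy array produced by the user's scripts
array = np.zeros((1, 64, 64), dtype="float32")

# --model_name=model and --rest_api_port=8501 imply this predict endpoint
payload = json.dumps({"instances": array.tolist()}).encode()
req = urllib.request.Request(
    "http://tfc:8501/v1/models/model:predict",
    data=payload, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    predictions = np.array(json.loads(resp.read())["predictions"])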

/server/

This folder is the "bread and butter" of awips-ml and is where the networking and processing code lives. When docker-compose starts the containers and the Docker network, it also starts these functions within the appropriate containers. The three files in this folder are discussed in this section.

container_servers.py

This code is what allows edexc and processc to listen for and respond to data requests via pygcdm. The code contains a BaseServer class which implements the low-level gRPC and trigger functionality. ProcessContainerServer and EDEXContainerServer inherit from the BaseServer class and implement container-specific functionality.

Both classes read from the usr/config.yaml file when instantiated and use its contents to decide which ports to request/respond on, etc. Networking is generally implemented using asynchronous Python built-ins where possible so that edexc and processc could theoretically be deployed on different machines. Structurally, the file looks roughly like the sketch below.
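In this sketch only the class names come from container_servers.py; everything else is illustrative:

import yaml

class BaseServer:
    """Low-level gRPC and trigger plumbing shared by both containers."""
    def __init__(self, config_path="usr/config.yaml"):
        with open(config_path) as f:
            self.config = yaml.safe_load(f)  # ports/hostnames to request/respond on

class ProcessContainerServer(BaseServer):
    """processc behavior: request from edexc, hand arrays to tfc."""

class EDEXContainerServer(BaseServer):
    """edexc behavior: respond to requests, trigger on new ingests."""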

grpc_api.py

This code is a low-level wrapper around the pygcdm library. container_servers.py imports this code and uses it for making header/data requests and sending header/data responses.

trigger.py

This code is a small convenience function that is invoked by the edexc container to send a message containing the file path of newly ingested data whenever a new file is ingested via EDEX. processc has this trigger functionality implemented in the ProcessContainerServer class; the reason it is broken out for edexc is that trigger events occur via the EDEX server calling the EXEC command (described previously) when a new file is ingested. A sketch of such a trigger is shown below.
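At its core, a trigger along these lines is a single socket send; the host/port here are illustrative, not the actual trigger.py values:

import socket
import sys

def send_trigger(file_path, host="processc", port=9000):
    # Tell the listener on the other container that a new file has landed
    with socket.create_connection((host, port)) as sock:
        sock.sendall(file_path.encode())

if __name__ == "__main__":
    # EDEX's EXEC hook passes the path of the newly ingested file
    send_trigger(sys.argv[1])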

Docker Specific Stuff

This section describes the nuances of the specific containers, including a discussion of the Dockerfiles. Discussion of what is going on in the docker-compose.yml file is also provided.

docker-compose.yml

This file is generally pretty simple to understand. At a high level, it launches all three awips-ml containers by collating all the individual run instructions. Theoretically each container could be launched independently via a docker run command with lots of options; the docker-compose command abstracts this into one file. Reference the docker-compose documentation for general information; specific nuances in this file are:

  • edexc must be started with the privileged: true flag because it uses systemd as its init process, a byproduct of basically running CentOS in a container (necessary for running EDEX). Certain systemd functionality does not work without this flag, meaning awips-ml will not work without it. Additionally, as discussed in the troubleshooting guide, Docker Desktop versions >3.5.2 may not work due to breaking changes introduced by Docker.
  • The command directive provides the starting command for each of the containers:
    • edexc uses command: ["/usr/sbin/init"] which kicks off systemd as PID 1. More discussion of awips-ml specific systemd functionality is provided below.
    • processc uses command: python server/container_servers.py process_container, which starts server/container_servers.py with the process_container argument. This starts listening for pygcdm or trigger requests and runs indefinitely.
    • tfc uses command: tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=model --model_base_path=/models/model which starts hosting the specified TensorFlow model on the specified ports.

edexc

This is the most complex docker container in awips-ml. The Dockerfile does several things, listed below in approximate order:

  1. Installs an EDEX server into the container
  2. Installs conda into the container
  3. Creates a conda environment with the necessary dependencies. This conda environment is used to run server/container_servers.py edex_container; the default container Python executable is not used in order to avoid conflicts, because the EDEX server relies on Python 2 while pygcdm requires Python 3.
  4. Modifies EDEX config files (described in the user guide)
  5. Sets up the container to run systemd
  6. Copies in the awips-ml specific systemd init service files
  7. Cleans up the container

Besides the config files found in /edexc/etc/conf/, which are discussed in the awips-ml user guide, the other config files are systemd-specific and can be found in edexc/etc/systemd/. The EDEX install script (awips_install.sh) relies on having CentOS 7 as the operating system, and no workaround was found to emulate CentOS 7 in a container without including systemd. As such, systemd is leveraged to launch some services on init; running systemd in a container is generally frowned upon in Docker, but no other workaround was discovered. These service files are:

  • edex_start.service: This is an init service that starts the EDEX server by calling /usr/bin/edex start.
  • listener_start.service: This is an init service that starts the server/container_servers.py edex_container function and listens/responds to triggers and pygcdm requests.
  • logger_redirect.service: This is an init service that simply redirects stdout output from listener_start.service to a specific process that is viewable in the docker logs (this output is viewable as output from the docker-compose up command).

tfc

This is a simple container that just starts from the TensorFlow model serving development image and deploys a model. See the usage guide for how to specify a non-dummy model. The build_dummy_model.py script is intended as a placeholder to cut down on repo size instead of committing a saved model; a sketch of such a placeholder is shown below.
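For illustration only (this is not the actual build_dummy_model.py), a placeholder can be as small as an identity model saved into the versioned layout that tensorflow_model_server expects:

import tensorflow as tf

# Identity "model": returns its input unchanged, just so tfc has something
# to serve; the input shape and output path are illustrative assumptions.
class Identity(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None, 64, 64], tf.float32)])
    def serve(self, x):
        return {"output": x}

module = Identity()
# tensorflow_model_server loads versioned subdirectories under the base path
tf.saved_model.save(module, "/models/model/1",
                    signatures={"serving_default": module.serve})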

processc

This is another simple container that installs the dependencies for the server/container_servers.py code and also installs conda so that users can use their own custom conda environment. Additionally it runs any custom shell script defined by the user. Launching server/container_servers.py process_container is handled by the docker-compose.yml file.

Known Issues

"Queue: \'external.dropbox\' not found"

There is a lag between the startup of the LDM and the EDEX server within the edexc container; this means that data is downloaded from the upstream LDM but is not ingested into the EDEX server. Because the server/container_servers.py script starts instantly and the trigger messages are sent upon download from the LDM, edexc will try to ingest data (using /awips2/ldm/dev/notifyAWIPS2-unidata.py) before the EDEX server is running; these attempts result in an error until EDEX is started. To avoid losing any data, a queue was implemented so that any files downloaded from the LDM prior to EDEX being fully operational are queued up, then ingested once EDEX is started. The queue decides when EDEX is ready by checking the EDEX logs to see if it has started; an illustrative version of that check is sketched below.
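The readiness check amounts to polling the logs; the log location matches the grep commands below, while the exact startup string here is an assumption for illustration:

import glob
import time

def edex_is_running(log_dir="/awips2/edex/logs"):
    # Scan the EDEX logs for evidence that the server finished starting
    for log in glob.glob(f"{log_dir}/*"):
        with open(log, errors="ignore") as f:
            if "EDEX ESB is now operational" in f.read():
                return True
    return False

while not edex_is_running():
    time.sleep(5)  # hold queued file paths until EDEX can actually ingest them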

Originally, this caused the ingestion into EDEX to fail. The failure could be checked by running grep nc4 /awips2/edex/logs/* within the edexc container: if files are being successfully ingested, the log files show INFO messages; if ingestion is failing, they show WARN messages. The failure was due to the "Invalid Metadata" issue described below, and the bash script workaround described below appears to fix it. A version of the code without the queuing system is kept in the no_queue branch.

Invalid Metadata

Sometimes during ingestion, files will be downloaded from the LDM but not ingested into EDEX. Running grep nc4 /awips2/edex/logs/* from within the edexc container will return a list of all ingestion attempts within the logs. Sometimes this will return WARN ... No valid records were found in file. Digging into the actual error, it shows:

Caused by: org.hibernate.NonUniqueResultException: query did not return a unique result: 2
    at org.hibernate.internal.AbstractQueryImpl.uniqueElement(AbstractQueryImpl.java:918) ~[hibernate-core-4.2.15.Final.jar:4.2.15.Final]
    at org.hibernate.internal.CriteriaImpl.uniqueResult(CriteriaImpl.java:396) ~[hibernate-core-4.2.15.Final.jar:4.2.15.Final]
    at com.raytheon.edex.plugin.satellite.dao.SatMapCoverageDao.query(SatMapCoverageDao.java:149) ~[com.raytheon.edex.plugin.satellite.jar:na]
    at com.raytheon.edex.plugin.satellite.dao.SatMapCoverageDao.getOrCreateCoverage(SatMapCoverageDao.java:95) ~[com.raytheon.edex.plugin.satellite.jar:na]
    at com.raytheon.uf.edex.plugin.goesr.geospatial.GoesrProjectionFactory.getCoverage(GoesrProjectionFactory.java:178) ~[com.raytheon.uf.edex.plugin.goesr.jar:na]
    ... 53 common frames omitted

This shows that the query did not return a unique result. Discussions with the AWIPS team indicate that the rate of ingestion within EDEX is quick enough to produce overlapping time stamps, which results in two duplicate records being generated in the satellite_spatial database. These records can be viewed within the edexc container by running: psql -U awips -c "SELECT * FROM satellite_spatial;" metadata.

When EDEX finds these duplicate records it causes an error. Discussions with the AWIPS team indicate that a permanent fix requires changes on the AWIPS side. A temporary fix has been added that goes in and modifies the SQL records to avoid this error. These changes are:

  • awips-ml/edexc/etc/systemd/psql_duplicate_remover.sh: bash script that runs indefinitely to change the records to avoid the ingestion error
  • awips-ml/edexc/etc/systemd/psql_duplicate_fix.service: init service that starts the bash script
  • awips-ml/edexc/Dockerfile: the following lines in the Dockerfile include the init service and bash script:
COPY /edexc/etc/systemd/psql_duplicate_fix.service /etc/systemd/system/multi-user.target.wants/psql_duplicate_fix.service
COPY /edexc/etc/systemd/psql_duplicate_remover.sh /psql_duplicate_remover.sh
RUN chmod 777 /psql_duplicate_remover.sh

Currently the AWIPS team is looking into solutions on the AWIPS side. If those solutions are implemented, removing the two files listed above and the relevant lines of the Dockerfile will remove this temporary solution.