awips ml design document - Unidata/awips-ml GitHub Wiki
awips-ml Design Document
This document is targeted at developers and provides an overview of how awips-ml works internally. See diagram below for visual description of how it works.
Container Overview
awips-ml is composed of three docker containers, described below. The quickstart and awips-ml guide describe how to run the containers using `docker-compose`. The three containers are:
- `edexc`: This is a containerized EDEX server. It ingests data from an upstream LDM and is also responsible for responding to requests from CAVE. When new data is ingested from the upstream LDM, this container transmits the file via pygcdm to the `processc` container. Additionally, when transformed data is available in the `processc` container, the `edexc` container receives it via pygcdm.
- `processc`: This container receives new netCDF data from `edexc` via pygcdm, pre-processes it using custom user scripts, and sends the resulting numpy arrays via HTTP to the `tfc` container. When data (in the form of a numpy array) is returned via HTTP from the `tfc` container, the `processc` container performs any user-defined post-processing before sending the resulting netCDF file via pygcdm to the `edexc` container.
- `tfc`: This container hosts the machine learning model and is based on the TensorFlow-provided docker image. It receives numpy arrays from `processc` and returns the model's output numpy arrays to the `processc` container, using HTTP in both cases.
Container Configuration
Several user-specific customization options are included in awips-ml. There are two primary locations:

- `/usr/`: This folder has the most user-facing customization options, including pre/post-processing script locations and a location for users to include their custom conda `environment.yml` file.
- `/edexc/etc/conf/`: This folder contains EDEX-specific configuration files. Modifying these files changes what data is ingested from the upstream LDM and how the data is displayed in CAVE. EDEX-specific customization questions should be sent to the AWIPS team at [email protected].
Both locations and how to customize their contents are described in the awips-ml usage guide.
Container Networking
This section describes how data is transferred between containers. Generally, awips-ml has two inter-container networking methods that need to be dealt with: pygcdm (between `edexc` and `processc`) and HTTP (between `processc` and `tfc`). Data transfer in/out of the `edexc` container itself (LDM ingest and CAVE requests) is handled internally by the EDEX server.
For inter-container networking, `docker-compose` launches a docker network called `awipsml-net` that carries all network communication. The ports the containers communicate on are defined in `/usr/config.yaml`. Note that all ports/hostnames in this file are defined as opposites: each container's request port is the other container's response port. Generally this file does not need to be changed unless something precludes using the default ports. The only exceptions are:

- the `ml_model_location` field, which specifies the ML model path (discussed in the usage guide);
- the `variable_spec` field, which specifies the netCDF variable to transfer via pygcdm.
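As an illustration of that mirrored layout, a hypothetical `config.yaml` might load into something like the following. Only `ml_model_location` and `variable_spec` are real awips-ml fields; the other key names and port values are assumptions for illustration, not the actual schema:

```python
# Hypothetical sketch of /usr/config.yaml after loading; only
# ml_model_location and variable_spec are real awips-ml fields.
config = {
    "edexc":    {"hostname": "edexc",    "request_port": 50051, "response_port": 50052},
    "processc": {"hostname": "processc", "request_port": 50052, "response_port": 50051},
    "ml_model_location": "/models/model",     # ML model path (see usage guide)
    "variable_spec": "some_netcdf_variable",  # netCDF variable sent via pygcdm
}

# "Defined as opposites": the port edexc requests on is the port
# processc responds on, and vice versa.
assert config["edexc"]["request_port"] == config["processc"]["response_port"]
assert config["processc"]["request_port"] == config["edexc"]["response_port"]
```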
`edexc` ⇄ `processc` networking
Data is transferred between `edexc` and `processc` using pygcdm, which is based on gCDM, the Java implementation of gRPC for the Common Data Model (more of this is discussed in the README for pygcdm). At a high level, this allows netCDF data to be transferred more efficiently than via HTTP.
pygcdm (and gCDM) is implemented such that a transaction is initiated by sending a request message to a server, which then responds with a response message. As shown in the diagram above, `edexc` sends header/data requests to `processc` and receives header/data responses (and vice versa).
Because response messages can only be sent in response to a request, the `edexc` container needs a way of prompting `processc` to request data whenever new data is ingested from the LDM. To accomplish this, `edexc` sends a string containing the local file path of the newly ingested data (via socket). This message is received into an asynchronous queue in `processc`. When `processc` is ready, it pops the file path off the queue and uses it as part of its request message to `edexc`. When `processc` has the transformed data from `tfc`, it uses a similar process to prompt `edexc` to request the data for visualization via CAVE.
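The trigger-then-request pattern described above can be sketched with Python's asyncio streams. This is a minimal illustration, not the actual awips-ml code; the host, port, file path, and newline framing are assumptions:

```python
import asyncio

# Sketch of the trigger pattern: "edexc" writes the newly ingested file's
# path to a socket, "processc" parks it in an asyncio.Queue, then pops it
# off when it is ready to issue its pygcdm request back to edexc.

async def demo():
    queue = asyncio.Queue()

    async def handle(reader, writer):
        # receive the file path string and stash it until the consumer is ready
        path = (await reader.readline()).decode().strip()
        await queue.put(path)
        writer.close()

    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # "edexc" side: send the local file path of the newly ingested data
    _, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"/awips2/data/new_file.nc\n")
    await writer.drain()
    writer.close()

    # "processc" side: pop the path off the queue when ready; the real code
    # would now build a pygcdm header/data request for this path
    path = await queue.get()
    server.close()
    return path

result = asyncio.run(demo())
```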
`edexc` starts the socket "trigger" process when new data is ingested from the LDM. Ingest triggers an `EXEC` function (defined here) that calls a python trigger function (defined here), which sends the file path via socket. Data being ingested into the EDEX is handled by a utility included with the EDEX install (located within the `edexc` container at `/awips2/ldm/dev/notifyAWIPS2-unidata.py`).
The mechanism for the `processc` trigger process is simpler. It asynchronously waits to pop data from the trigger queue, requests the data, sends it to `tfc`, then immediately sends the file path of the transformed data back to `edexc`. It doesn't rely on any existing internals the way `edexc` does, so it is a simpler implementation.
`processc` ⇄ `tfc` networking
The `processc` networking with `tfc` is much less complex than with `edexc`. It transfers numpy arrays via HTTP to a specified port on the `tfc` container and receives the ML model output via HTTP.
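That round trip can be sketched as a standard TensorFlow Serving REST call. The `tfc` hostname comes from the `awipsml-net` network, and the port and model name come from the `tfc` start command in `docker-compose.yml` (`--rest_api_port=8501 --model_name=model`); the payload shape assumes a model taking a single 2-D input:

```python
import json

def build_predict_request(array_2d, host="tfc", port=8501, model="model"):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call."""
    url = f"http://{host}:{port}/v1/models/{model}:predict"
    body = json.dumps({"instances": [array_2d]})  # numpy arrays go in as nested lists
    return url, body

url, body = build_predict_request([[1.0, 2.0], [3.0, 4.0]])
# processc would then POST this, e.g.:
#   req = urllib.request.Request(url, body.encode(),
#                                {"Content-Type": "application/json"})
#   predictions = json.load(urllib.request.urlopen(req))["predictions"]
```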
/server/
This folder is the "bread and butter" of awips-ml and is where the networking and processing code lives. When `docker-compose` starts the containers and docker network, it also starts these functions within the appropriate containers. The three files in this folder are discussed in this section.
container_servers.py
This code is what allows `edexc` and `processc` to listen for and respond to data requests via pygcdm. The code contains a `BaseServer` class which implements the low-level gRPC and trigger functionality. `ProcessContainerServer` and `EDEXContainerServer` inherit from the `BaseServer` class and implement container-specific functionality.
Both classes read from the `usr/config.yaml` file when instantiated and use its contents to decide which ports to request/respond on, etc. Networking is generally implemented using asynchronous python built-ins where possible so that `edexc` and `processc` could theoretically be deployed on different machines.
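A skeletal view of that hierarchy is below. The three class names are real, but the constructor argument, attribute, and method bodies are illustrative guesses, not the actual implementation:

```python
class BaseServer:
    """Shared low-level gRPC (pygcdm) and trigger plumbing."""
    def __init__(self, config):
        # both subclasses pass in the parsed contents of usr/config.yaml
        self.config = config
        self.request_port = config["request_port"]

class ProcessContainerServer(BaseServer):
    def on_trigger(self, file_path):
        # would request `file_path` from edexc and forward it to tfc
        return f"requesting {file_path} from edexc"

class EDEXContainerServer(BaseServer):
    def on_trigger(self, file_path):
        # would request the transformed file back for visualization in CAVE
        return f"requesting {file_path} from processc"

server = ProcessContainerServer({"request_port": 50051})
```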
grpc_api.py
This code is a low-level wrapper around the pygcdm library. `container_servers.py` imports this code and uses it for making header/data requests and sending header/data responses.
trigger.py
This code is a small convenience function invoked by the `edexc` container to send a message containing the file path of newly ingested data whenever a new file is ingested via EDEX. `processc` has this trigger functionality implemented in the `ProcessContainerServer` class. The reason it is broken out for `edexc` is that the trigger event occurs via the EDEX server calling the `EXEC` command (described previously) when a new file is ingested.
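The idea behind `trigger.py` can be sketched as a plain TCP send of the file path string. The host, port, path, and newline framing here are assumptions for illustration; the demonstration runs against a throwaway local listener:

```python
import socket
import threading

def send_trigger(path, host, port):
    """Send a newly ingested file's path as a plain string over TCP."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(path.encode() + b"\n")

# Demonstration: a throwaway listener standing in for the processc side.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
received = []

def accept_once():
    conn, _ = listener.accept()
    received.append(conn.recv(1024).decode().strip())
    conn.close()

t = threading.Thread(target=accept_once)
t.start()
send_trigger("/awips2/data/ingested.nc", "127.0.0.1", port)
t.join()
listener.close()
```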
Docker Specific Stuff
This section describes the nuances of the specific containers, including a discussion of their `Dockerfile`s. A discussion of what is going on in the `docker-compose.yml` file is also provided.
docker-compose.yml
This file is generally pretty simple to understand. At a high level, it launches all three awips-ml containers by collating the individual run instructions. Theoretically each container could be launched independently via a `docker run` command with many options; the `docker-compose` command abstracts this into one file. Reference the docker-compose documentation for general information; the specific nuances in this file are:
- `edexc` must be started with the `privileged: true` flag because it uses systemd as its init process, a byproduct of essentially running centos in a container (necessary for running EDEX). Certain systemd functionality does not work without this flag, meaning awips-ml will not work without it. Additionally, as discussed in the troubleshooting guide, Docker versions >3.5.2 may not work due to breaking changes introduced by Docker.
- The `command` field provides the starting command for each of the containers:
  - `edexc` uses `command: ["/usr/sbin/init"]`, which kicks off systemd as PID 1. More discussion of awips-ml specific systemd functionality is provided below.
  - `processc` uses `command: python server/container_servers.py process_container`, which starts `server/container_servers.py` with the `process_container` argument. This starts listening for pygcdm or trigger requests and runs indefinitely.
  - `tfc` uses `command: tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=model --model_base_path=/models/model`, which hosts the specified TensorFlow model on the specified ports.
edexc
This is the most complex docker container in awips-ml. The `Dockerfile` does several things, listed below in approximate order:

- Installs an EDEX server into the container
- Installs conda into the container
- Creates a conda environment with the necessary dependencies. This conda environment is used to run `server/container_server.py edex_container`; the default container python executable is not used in order to avoid conflicts, because the EDEX server relies on python 2 while pygcdm requires python 3.
- Modifies EDEX config files (described in the user guide)
- Sets up the container to run systemd
- Copies in the awips-ml specific systemd init service files
- Cleans up the container
Besides the config files found in `/edexc/etc/conf/`, which are discussed in the awips-ml user guide, the other config files are systemd-specific and can be found in `edexc/etc/systemd/`. The EDEX install script (`awips_install.sh`) relies on having centos7 as the operating system, and no workaround was found to emulate centos7 in a container without including systemd. As such, systemd is leveraged to launch some services on init; running systemd in a container is generally frowned upon in Docker, but no other workaround was discovered. These service files are:

- `edex_start.service`: An init service that starts the EDEX server by calling `/usr/bin/edex start`.
- `listener_start.service`: An init service that starts the `server/container_servers.py edex_container` function and listens/responds to triggers and pygcdm requests.
- `logger_redirect.service`: An init service that simply redirects stdout output from `listener_start.service` to a specific process that is viewable in the docker logs (this output is viewable as output from the `docker-compose up` command).
tfc
This is a simple container that starts from the TensorFlow model serving development image and deploys a model. See the usage guide for how to specify a non-dummy model. The `build_dummy_model.py` script is intended as a placeholder to cut down on repo size instead of committing a saved model.
processc
This is another simple container. It installs the dependencies for the `server/container_servers.py` code and also installs conda so that users can use their own custom conda environment. Additionally, it runs any custom shell script defined by the user. Launching `server/container_servers.py process_container` is handled by the `docker-compose.yml` file.
Known Issues
"Queue: \'external.dropbox\' not found"
There is a lag between the startup of the LDM and the EDEX server within the `edexc` container; during this window data is downloaded from the upstream LDM but not ingested into the EDEX server. Because the `server/container_servers.py` script starts instantly and the trigger messages are sent upon download from the LDM, `edexc` will try to ingest data (using `/awips2/ldm/dev/notifyAWIPS2-unidata.py`) before the EDEX server is running; these attempts result in an error until the EDEX is started. To avoid losing any data, a queue was implemented so that any files downloaded from the LDM prior to EDEX being fully operational are queued up, then ingested once the EDEX is started. It does this by checking the EDEX logs to see if the server has started.
Originally, this lag caused ingestion into EDEX to fail. The failure could be checked by running `grep nc4 /awips2/edex/logs/*` within the `edexc` container: if files are being successfully ingested, the log files show `INFO` messages; if ingestion is failing they show `WARN` messages. The root cause was the "Invalid Metadata" issue described below; the bash script workaround described below appears to fix it. For brevity, a version of the code without the queuing system is in the `no_queue` branch.
Invalid Metadata
Sometimes files will be downloaded from the LDM but not ingested into EDEX. Running `grep nc4 /awips2/edex/logs/*` from within the `edexc` container returns a list of all ingestion attempts in the logs. Sometimes this returns `WARN ... No valid records were found in file`. Digging into the actual error shows:

```
Caused by: org.hibernate.NonUniqueResultException: query did not return a unique result: 2
	at org.hibernate.internal.AbstractQueryImpl.uniqueElement(AbstractQueryImpl.java:918) ~[hibernate-core-4.2.15.Final.jar:4.2.15.Final]
	at org.hibernate.internal.CriteriaImpl.uniqueResult(CriteriaImpl.java:396) ~[hibernate-core-4.2.15.Final.jar:4.2.15.Final]
	at com.raytheon.edex.plugin.satellite.dao.SatMapCoverageDao.query(SatMapCoverageDao.java:149) ~[com.raytheon.edex.plugin.satellite.jar:na]
	at com.raytheon.edex.plugin.satellite.dao.SatMapCoverageDao.getOrCreateCoverage(SatMapCoverageDao.java:95) ~[com.raytheon.edex.plugin.satellite.jar:na]
	at com.raytheon.uf.edex.plugin.goesr.geospatial.GoesrProjectionFactory.getCoverage(GoesrProjectionFactory.java:178) ~[com.raytheon.uf.edex.plugin.goesr.jar:na]
	... 53 common frames omitted
```

which shows that the `query did not return a unique result`. Discussions with the AWIPS team indicate that the rate of ingestion within EDEX is quick enough to produce overlapping time stamps, which results in two duplicate records being generated in the `satellite_spatial` database. These records can be viewed from within the `edexc` container by running `psql -U awips -c "SELECT * FROM satellite_spatial;" metadata`.
When EDEX finds these duplicate records, it causes an error. Discussions with the AWIPS team indicate that this requires changes on the AWIPS side. A temporary fix has been added that goes in and modifies the SQL records to avoid this error. These changes are:

- `awips-ml/edexc/etc/systemd/psql_duplicate_remover.sh`: bash script that runs indefinitely, changing the records to avoid the ingestion error
- `awips-ml/edexc/etc/systemd/psql_duplicate_fix.service`: init service that starts the bash script
- `awips-ml/edexc/Dockerfile`: the following lines in the Dockerfile include the init service and bash script:

```
COPY /edexc/etc/systemd/psql_duplicate_fix.service /etc/systemd/system/multi-user.target.wants/psql_duplicate_fix.service
COPY /edexc/etc/systemd/psql_duplicate_remover.sh /psql_duplicate_remover.sh
RUN chmod 777 /psql_duplicate_remover.sh
```
Currently the AWIPS team is looking into solutions on the AWIPS side. If those solutions are implemented, removing the two files listed above and the relevant lines of the Dockerfile will remove this temporary solution.