awips-ml Design Document
This document is targeted at developers and provides an overview of how awips-ml works internally. See the diagram below for a visual description of how it works.
awips-ml is composed of three different docker containers described below. The quickstart and awips-ml guide describe how to run the containers using
docker-compose. The three containers are:
- edexc: This is a containerized EDEX server. It ingests data from an upstream LDM and is responsible for responding to requests from CAVE. When new data is ingested from the upstream LDM, this container transmits the file via pygcdm to the processc container. Additionally, when transformed data is available in the processc container, the edexc container is responsible for receiving it via pygcdm.
- processc: This container is responsible for receiving new netCDF data from edexc via pygcdm, pre-processing it using custom user scripts, and sending the resulting numpy arrays via HTTP to the tfc container. When data (in the form of a numpy array) is returned via HTTP from the tfc container, the processc container is responsible for any user-defined post-processing before sending the resulting netCDF file via pygcdm to the edexc container.
- tfc: This container is responsible for hosting the machine learning model. It is based on the TensorFlow-provided docker image. It receives numpy arrays from processc and returns the model's output numpy arrays to the processc container, using HTTP in both cases.
Several user-specific customization options are included in awips-ml. There are two primary locations:
- /usr/: This folder has the most user-facing customization options, including pre/post-processing script locations and a location for users to include their custom conda environment.yml file.
- /edexc/etc/conf/: This folder contains EDEX-specific configuration files. Modifying these files changes what sort of data is ingested from the upstream LDM and how the data is displayed in CAVE. EDEX-specific customization questions should be sent to the AWIPS team at [email protected].
Both locations and how to customize their contents are described in the awips-ml usage guide.
This section describes how data is transferred between containers. Generally, awips-ml has two networking paths that need to be dealt with: traffic in/out of the edexc container, which is handled internally by the EDEX server, and traffic between the containers themselves.
For inter-container networking, docker-compose launches a docker network called awipsml-net for all network communication. The ports that the containers communicate on are defined in /usr/config.yaml. Note that the ports/hostnames in this file are defined as opposites of each other: each container's request host/port mirrors its counterpart's response host/port. Generally this file does not need to be changed unless something precludes using the default ports. The only exceptions are:
- ml_model_location: specifies the ML model path. This is discussed in the usage guide.
- variable_spec: specifies the netCDF variable to transfer via pygcdm.
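As a rough illustration, the mirrored layout might look like the sketch below. ml_model_location and variable_spec are the fields named above; every other key, hostname, port number, and value here is illustrative rather than the file's exact schema.

```yaml
# Hypothetical sketch of /usr/config.yaml -- not the real schema.
edex:
  request_host: processc   # edexc requests from processc...
  request_port: 50051
  response_port: 50052
process:
  request_host: edexc      # ...and processc requests from edexc (the mirror image)
  request_port: 50052
  response_port: 50051
  ml_model_location: /models/model     # ML model path (see usage guide)
  variable_spec: Sectorized_CMI        # illustrative netCDF variable name
```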
Data is transferred between edexc and processc using pygcdm, which is based on gCDM, the Java implementation of gRPC for the Common Data Model (more of this is discussed in the README for pygcdm). At a high level, this allows netCDF data to be transferred more efficiently than via HTTP.
pygcdm (and gCDM) is implemented such that a transaction is initiated by sending a request message to a server; the server then responds with a response message. As shown in the diagram above, edexc sends header/data requests to processc and receives header/data responses (and vice-versa).
Because response messages can only be sent in response to a request, the edexc container needs a way of prompting processc to request data whenever new data is ingested from the LDM. To accomplish this, edexc sends a string containing the local file path of the newly ingested data (via socket). This message is received into an asynchronous queue in processc. When processc is ready, it pops the file path off the queue and uses it as part of its request message to edexc. Once processc has the transformed data from tfc, it uses a similar process to prompt edexc to request the data for visualization via CAVE.
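The queue-and-pop pattern described above can be sketched with Python's asyncio built-ins. All names here are hypothetical; the real logic lives in server/container_servers.py:

```python
import asyncio

async def handle_trigger(queue: asyncio.Queue, file_path: str) -> None:
    # Stand-in for the socket callback: edexc's trigger message (a file path
    # string) lands in an asynchronous queue inside processc.
    await queue.put(file_path)

async def consume_triggers(queue: asyncio.Queue) -> str:
    # When processc is ready, it pops the next path off the queue and uses it
    # as part of its pygcdm request message.
    file_path = await queue.get()
    request = f"REQUEST {file_path}"  # stand-in for a pygcdm header/data request
    queue.task_done()
    return request

async def main() -> str:
    queue: asyncio.Queue = asyncio.Queue()
    await handle_trigger(queue, "/awips2/data_store/example.nc4")
    return await consume_triggers(queue)

print(asyncio.run(main()))  # REQUEST /awips2/data_store/example.nc4
```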
edexc starts the socket "trigger" process when new data is ingested from the LDM. When this data is ingested, it triggers an EXEC function (defined here) that calls a python trigger function (defined here) that sends the file path via socket. Data being ingested into the EDEX is handled by a utility included with the EDEX install (located within the edexc container).
The mechanism for the processc trigger process is simpler. It asynchronously waits to pop data from the trigger queue, requests the data, sends it to tfc, then immediately sends the file path of the transformed data back to edexc. It doesn't rely on any existing internals the way edexc does, so the implementation is more straightforward.
processc networking with tfc is much less complex than with edexc: it transfers numpy arrays via HTTP to a specified port on the tfc container and receives the ML model output via HTTP.
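For context, TensorFlow Serving's REST API accepts POSTs to /v1/models/&lt;name&gt;:predict with a JSON body of the form {"instances": [...]}. A minimal sketch of building such a request is below; the hostname, model name, and array contents are illustrative, and port 8501 matches the REST port used later in the docker-compose discussion:

```python
import json

def build_predict_request(instances, host="tfc", port=8501, model_name="model"):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return url, body

url, body = build_predict_request([[0.0, 1.0], [2.0, 3.0]])
# POSTing `body` to `url` (e.g. with urllib.request) returns a JSON object of
# the form {"predictions": [...]} from the serving container.
```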
The server/ folder is the "bread and butter" of awips-ml and is the location where the networking and processing code lives. When docker-compose starts the containers and docker network, it also starts these functions within the appropriate containers. There are three files in this folder, which are discussed in this section.
This code is what allows edexc and processc to listen for and respond to data requests via pygcdm. The code contains a BaseServer class which implements the low-level grpc and trigger functionality. EDEXContainerServer and ProcessContainerServer inherit from the BaseServer class and implement container-specific functionality.
Both classes read from the usr/config.yaml file when instantiated and use its contents to decide which ports to request/respond on, etc. Networking is generally implemented using asynchronous python built-ins where possible, which allows edexc and processc to theoretically be deployed on different machines.
This code is a low-level wrapper around the pygcdm library.
container_servers.py imports this code and uses it for making header/data requests and sending header/data responses.
This code is a small convenience function that is invoked by the edexc container to send a message containing the file path of the newly ingested data when a new file is ingested via EDEX. Note that processc has this trigger functionality implemented in the ProcessContainerServer class; the reason it is broken out separately for edexc is that the trigger event occurs via the EDEX server calling the EXEC command (described previously) when a new file is ingested.
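The trigger hop itself is just a string sent over a socket. A toy sketch using a local socket pair in place of the docker network follows; the function names and file paths are illustrative:

```python
import socket

def send_trigger(sock: socket.socket, file_path: str) -> None:
    # edexc side: send the newly ingested file path as a UTF-8 string.
    sock.sendall(file_path.encode("utf-8"))

def recv_trigger(sock: socket.socket) -> str:
    # processc side: receive the path, ready to be queued for a pygcdm request.
    return sock.recv(4096).decode("utf-8")

# Demonstrate with a local socket pair standing in for awipsml-net.
client, server = socket.socketpair()
send_trigger(client, "/awips2/data_store/example.nc4")
print(recv_trigger(server))  # /awips2/data_store/example.nc4
client.close()
server.close()
```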
Docker Specific Stuff
This section describes the nuances of the specific containers, including a discussion of the Dockerfiles. Discussion of what is going on in the docker-compose.yml file is also provided.
This file is generally pretty simple to understand. At a high level, it launches all three awips-ml containers by collating all the individual run instructions. Theoretically each container could be launched independently via a docker run command with lots of options; the docker-compose command abstracts this into one file. Reference the docker-compose documentation for general information; specific nuances in this file are:
- edexc must be started with the privileged: true flag because it uses systemd as its init process, a byproduct of basically running CentOS in a container (necessary for running EDEX). Certain systemd functionality does not work without this flag, meaning awips-ml will not work without it. Additionally, as discussed in the troubleshooting guide, Docker versions >3.5.2 may not work due to breaking changes introduced by Docker.
- The command field provides the starting command for each of the containers:
  - edexc: command: ["/usr/sbin/init"], which kicks off systemd as PID 1. More discussion of awips-ml-specific systemd functionality is provided below.
  - processc: command: python server/container_servers.py process_container, which starts the server/container_servers.py script with the process_container argument. This basically starts listening for pygcdm or trigger requests and runs indefinitely.
  - tfc: command: tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=model --model_base_path=/models/model, which starts hosting the specified TensorFlow model on the specified ports.
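Put together, the relevant parts of docker-compose.yml might look like the hedged sketch below. The service names, network name, privileged flag, and command lines come from this document; everything else (layout, omitted options such as volumes and ports) is illustrative:

```yaml
services:
  edexc:
    privileged: true            # required: systemd runs as the init process
    command: ["/usr/sbin/init"]
    networks: [awipsml-net]
  processc:
    command: python server/container_servers.py process_container
    networks: [awipsml-net]
  tfc:
    command: >
      tensorflow_model_server --port=8500 --rest_api_port=8501
      --model_name=model --model_base_path=/models/model
    networks: [awipsml-net]
networks:
  awipsml-net:
```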
This is the most complex docker container in awips-ml. The
Dockerfile does several things, listed below in approximate order:
- Install an EDEX server into the container
- Install conda into the container
- Create a conda environment with the necessary dependencies. This conda environment is used to run server/container_servers.py edex_container; the default container python executable is not used in order to avoid conflicts, because the EDEX server relies on python 2 while pygcdm requires python 3.
- Modify EDEX config files (described in the user guide)
- Set up the container to run systemd
- Copy in the awips-ml specific systemd init service files
- Clean up the container
Besides the config files found in /edexc/etc/conf/, which are discussed in the awips-ml user guide, the other config files are systemd-specific and can be found in edexc/etc/systemd/. The EDEX install script (awips_install.sh) relies on having CentOS 7 as the operating system, and no workaround was found to emulate CentOS 7 in a container without including systemd. As such, systemd is leveraged to launch some services on init; running systemd in a container is generally frowned upon in Docker, but no other workaround was discovered. These service files are:
- edex_start.service: an init service that starts the EDEX server.
- listener_start.service: an init service that starts the server/container_servers.py edex_container process, which listens/responds to triggers and pygcdm requests.
- logger_redirect.service: an init service that simply redirects stdout output from listener_start.service to a process viewable in the docker logs (this output is viewable as output from the edexc container).
This is a simple container that just starts from the TensorFlow model serving development image and deploys a model. See the usage guide for how to specify a non-dummy model. The build_dummy_model.py script is intended as a placeholder to cut down on repo size instead of having a saved model.
This is another simple container that installs the dependencies for the server/container_servers.py code and also installs conda so that users can use their own custom conda environment. Additionally, it runs any custom shell script defined by the user. Launching server/container_servers.py process_container is handled by the command field in docker-compose.yml (described above).
"Queue: 'external.dropbox' not found"
There is a lag between the startup of the LDM and the EDEX server within the edexc container; this means that data is downloaded from the upstream LDM but is not yet ingested into the EDEX server. Because the server/container_servers.py script starts instantly and the trigger messages are sent upon download from the LDM, edexc will try to ingest data (using /awips2/ldm/dev/notifyAWIPS2-unidata.py) before the EDEX server is running; these attempts result in an error until the EDEX is started. To avoid losing any data, a queue was implemented so that any files downloaded from the LDM prior to EDEX being fully operational are queued up, then ingested once the EDEX is started. Readiness is determined by checking the EDEX logs to see if it has started.
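The buffering behavior described above can be sketched as follows. Class and variable names are hypothetical, and the real readiness check greps the EDEX logs rather than calling a function:

```python
from collections import deque

class IngestBuffer:
    """Hold file paths that arrive before EDEX is up; drain once it is ready."""

    def __init__(self, is_edex_ready):
        self.is_edex_ready = is_edex_ready  # stand-in for the EDEX log check
        self.pending = deque()
        self.ingested = []

    def on_file(self, path):
        # Every downloaded file is queued first, preserving arrival order.
        self.pending.append(path)
        if self.is_edex_ready():
            # Once EDEX is operational, ingest the entire backlog in order.
            while self.pending:
                self.ingested.append(self.pending.popleft())

buf = IngestBuffer(is_edex_ready=lambda: False)
buf.on_file("/data/a.nc4")          # EDEX not up yet: path is queued
buf.is_edex_ready = lambda: True    # EDEX finishes starting
buf.on_file("/data/b.nc4")          # backlog drains in arrival order
print(buf.ingested)                 # ['/data/a.nc4', '/data/b.nc4']
```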
Originally, this lag caused the ingestion into EDEX to fail. The failure could be checked by running grep nc4 /awips2/edex/logs/* within the edexc container. If files are being successfully ingested, the log files show INFO messages; if ingestion is failing, they show WARN messages. The root cause was the "Invalid Metadata" issue described below, and the bash script workaround described below appears to fix it. A version of the code without the queuing system is in the no_queue branch.
Sometimes during ingestion, files will be downloaded from the LDM but not ingested into EDEX. Running grep nc4 /awips2/edex/logs/* from within the edexc container will return a list of all ingestion attempts within the logs. Sometimes this will return WARN ... No valid records were found in file. Digging into the actual error, it shows:

```
Caused by: org.hibernate.NonUniqueResultException: query did not return a unique result: 2
    at org.hibernate.internal.AbstractQueryImpl.uniqueElement(AbstractQueryImpl.java:918) ~[hibernate-core-4.2.15.Final.jar:4.2.15.Final]
    at org.hibernate.internal.CriteriaImpl.uniqueResult(CriteriaImpl.java:396) ~[hibernate-core-4.2.15.Final.jar:4.2.15.Final]
    at com.raytheon.edex.plugin.satellite.dao.SatMapCoverageDao.query(SatMapCoverageDao.java:149) ~[com.raytheon.edex.plugin.satellite.jar:na]
    at com.raytheon.edex.plugin.satellite.dao.SatMapCoverageDao.getOrCreateCoverage(SatMapCoverageDao.java:95) ~[com.raytheon.edex.plugin.satellite.jar:na]
    at com.raytheon.uf.edex.plugin.goesr.geospatial.GoesrProjectionFactory.getCoverage(GoesrProjectionFactory.java:178) ~[com.raytheon.uf.edex.plugin.goesr.jar:na]
    ... 53 common frames omitted
```
This shows that the query did not return a unique result. Discussions with the AWIPS team indicate that the rate of ingestion within EDEX is quick enough to produce overlapping time stamps, which results in two duplicate records being generated in the satellite_spatial database. These records can be viewed from within the edexc container by running:

psql -U awips -c "SELECT * FROM satellite_spatial;" metadata

When EDEX finds these duplicate records it causes an error. Discussions with the AWIPS team indicate that a proper fix requires changes on the AWIPS side. A temporary fix has been added that goes in and modifies the SQL records to avoid this error. These changes are:
awips-ml/edexc/etc/systemd/psql_duplicate_remover.sh: bash script that runs indefinitely to change the records to avoid the ingestion error
awips-ml/edexc/etc/systemd/psql_duplicate_fix.service: init service that starts the bash script
awips-ml/edexc/Dockerfile: the following lines in the Dockerfile include the init service and bash script:
COPY /edexc/etc/systemd/psql_duplicate_fix.service /etc/systemd/system/multi-user.target.wants/psql_duplicate_fix.service
COPY /edexc/etc/systemd/psql_duplicate_remover.sh /psql_duplicate_remover.sh
RUN chmod 777 /psql_duplicate_remover.sh
Currently the AWIPS team is looking into solutions on the AWIPS side. If those solutions are implemented, removing the two files listed above and the relevant lines of the Dockerfile will remove this temporary solution.