# On Premises Production Level Deployment - Kubernetes
This page covers some of the best practices for deploying DeepLynx on-premises on Kubernetes, where cloud deployment is either unwanted or not feasible. We designed DeepLynx to work equally well on-prem and in the cloud, as we expected many users of the warehouse to be working in air-gapped systems.
This guide is not comprehensive and might not apply to your situation. Please feel free to reach out to the DeepLynx development team for more information.
Note: The DeepLynx team is actively working on a Helm chart to simplify this process even further, but has not yet completed it. This guide will be updated when the chart is available.
## Hardware Requirements
Please note these are recommendations for your first deployment. We highly recommend monitoring your cluster's resource use and adjusting these initial values as you see thresholds surpassed. This serves more as a guideline for hardware ratios than a hard and fast rule of system requirements.
| Pod Role | Min Memory | Min CPUs | Storage | Notes |
|---|---|---|---|---|
| DeepLynx Server | 2 GB | 2 | 10 GB | Ships with the same workers as the DeepLynx Worker; run only this at first and scale out to a worker node if needed |
| DeepLynx Worker | 2 GB | 4 | 10 GB | Only needed if the main server cannot keep up with data ingestion workloads; you will see this happen when RabbitMQ's queues back up significantly |
| PostgreSQL (TimescaleDB extension optional) | 8 GB | 4 | 100 GB | Scale this up significantly if you're hitting CPU limits; can be clustered out - see the PostgreSQL/TimescaleDB section |
| Redis | 4 GB | 2 | 30 GB | |
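As a rough sketch, the server row above translates into a Kubernetes resource block like the following. The requests come from the table; the limits are illustrative headroom, not values from the table, so set them based on the monitoring mentioned above:

```yaml
# resource requests for the DeepLynx server container, taken from the table above;
# the limits here are assumed headroom - tune them against your own metrics
resources:
  requests:
    memory: "2Gi"
    cpu: "2"
  limits:
    memory: "4Gi"
    cpu: "4"
```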
## Cluster Preparation
It is highly recommended that you complete the following on your cluster:
- Install and use NGINX as your reverse proxy and ingress controller
- Install some kind of metrics collection setup, such as Prometheus
- Generate an RSA keypair and store it either as a secret or in a volume that can be mounted (see the sketch below). This is required for DeepLynx's encryption, and you do not want to use its autogenerated keypair
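A minimal sketch of generating the keypair with OpenSSL and storing the private key as a Kubernetes secret; the secret and file names are examples, not fixed values:

```bash
# generate a 2048-bit RSA private key and extract its public half
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out private.key
openssl rsa -in private.key -pubout -out public.key

# store the private key as a secret that can be mounted into the DeepLynx pod;
# point ENCRYPTION_KEY_PATH at the mounted file path
kubectl create secret generic deeplynx-encryption-key \
  --namespace deeplynx \
  --from-file=private.key
```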
The rest of this guide will be split into sections - each one corresponding to an essential part of the deployment.
## DeepLynx Server & Worker
### Acquisition
The DeepLynx application is completely containerized and the image is publicly available as idaholab/deeplynx. It is recommended that you use this official image, but you are also welcome to build from source using the included Dockerfile.
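For example, pulling the official image or building it from source (the local build tag is arbitrary):

```bash
# pull the official image
docker pull idaholab/deeplynx:latest

# or build from source with the included Dockerfile
git clone https://github.com/idaholab/Deep-Lynx.git
cd Deep-Lynx
docker build -t deeplynx:local .
```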
### Configuration
DeepLynx is configured by setting environment variables on the container running the DeepLynx image. Below is a list of all possible environment variables. Where appropriate, defaults have been changed to reflect what they would most likely look like in your Kubernetes cluster.
```shell
# server application configuration
SERVER_PORT=8090
ROOT_ADDRESS=http://localhost:8090
# the maximum size in megabytes of request bodies sent to DeepLynx;
# requests with payloads over this limit will return a 413. Note that
# this does not apply to file size, only raw body size
MAX_REQUEST_BODY_SIZE=50
# whether or not to use the server instance to also manage the jobs; defaults to false as the startup command listed in
# the Kubernetes deployment definition below defaults to running the core and workers separately in the single container
RUN_JOBS=false
# comma separated CORs origins, defaults to *
CORS_ORIGIN=
# valid options are blank (defaults to memory), memory, or redis. DO NOT USE MEMORY IF YOU ARE CLUSTERING DEEPLYNX
CACHE_PROVIDER=redis
# default time in seconds
CACHE_DEFAULT_TTL=21600
# redis connection string - e.g. redis://user:password@hostname:6379/
CACHE_REDIS_CONNECTION_STRING=
# these control the import id caching on staging data emission and processing, in seconds - only change
# them if you know what you're doing
INITIAL_IMPORT_CACHE_TTL=21600
IMPORT_CACHE_TTL=30
# email specific variables; these control the reset password, email validation, and container invite links
# the URLs should refer to your UI implementation's reset password, email validation, and container invite pages
[email protected]
EMAIL_ENABLED=false
EMAIL_VALIDATION_ENFORCED=false
CONTAINER_INVITE_URL=http://localhost:8080/container-invite
# debug,info,warn,error,silent
LOG_LEVEL=info
# turn on if you notice data not getting ingested or processed
LOG_JOBS=false
# should be in the format postgresql://user:password@hostname:port/database_name
# :port is optional and if included will usually be :5432
CORE_DB_CONNECTION_STRING=postgresql://postgres:deeplynxcore@localhost/deep_lynx
# whether or not you're using the TimescaleDB enabled Postgres. Timescale is recommended
TIMESCALEDB_ENABLED=true
# this must be an absolute path to an RSA private key - if one is not provided, it will be generated and saved for you at
# the project root. You don't want this to happen in an ephemeral volume, because all previously encrypted information
# will fail to decrypt if you wipe the container
ENCRYPTION_KEY_PATH=
# plaintext secret used to generate secure session markers
SESSION_SECRET=
# can also be largeobject if you wish to use the Postgres database as file storage - NOT RECOMMENDED FOR LONG TERM USE
# but great for development
FILE_STORAGE_METHOD=minio
# controls how often the export data job is run
EXPORT_INTERVAL=10m
# controls how many exports can occur at once
EXPORT_DATA_CONCURRENCY=4
# controls which queue system to use, possible values are database, rabbitmq
QUEUE_SYSTEM=rabbitmq
# RabbitMQ connection string
RABBITMQ_URL=
# queue names should not need to be changed if using RabbitMQ, as the queues are created automatically when the first events arrive
PROCESS_QUEUE_NAME=process
DATA_SOURCES_QUEUE_NAME=data_sources
EVENTS_QUEUE_NAME=events
EDGE_INSERTION_QUEUE_NAME=edge_insertion
EDGE_INSERTION_BACKOFF_MULTIPLIER=5
EDGE_INSERTION_MAX_RETRY=10
# controls whether or not DeepLynx should emit data events
EMIT_EVENTS=true
# whether or not to create a superuser on initial boot, along with the values
# for its email and password - the password will be encrypted prior to storage
# it is highly recommended that you DO NOT DO THIS - or that if you do, you
# at least closely guard the superuser information and remove it after first boot
INITIAL_SUPERUSER=true
[email protected]
SUPERUSER_PASSWORD=admin
# possible values: token (leave blank for no auth)
AUTH_STRATEGY=token
# If you're using ADFS/SAML authentication these will need to be set
SAML_ENABLED=false
# SAML 2.0 entry point URL
SAML_ADFS_ENTRY_POINT=
# Application (Client) ID
SAML_ADFS_ISSUER=
# Application callback route, registered with Identity provider beforehand
SAML_ADFS_CALLBACK=
# Self signed certificate private key (.key file)
SAML_ADFS_PRIVATE_CERT_PATH=
# x509 certificate extracted from ADFS metadata file
SAML_ADFS_PUBLIC_CERT_PATH=
SAML_ADFS_CLAIMS_EMAIL=
SAML_ADFS_CLAIMS_NAME=
# SMTP Mail server specific settings
SMTP_USERNAME=
SMTP_PASSWORD=
SMTP_HOST=
SMTP_PORT=25
SMTP_TLS=true
# SMTP OAuth2 settings
SMTP_CLIENT_ID=
SMTP_CLIENT_SECRET=
SMTP_REFRESH_TOKEN=
SMTP_ACCESS_TOKEN=
```
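The deployment example below pulls the database connection string from a Kubernetes secret rather than a plain env value. A sketch of creating that secret follows; the secret name and key are this guide's examples, and the connection string is the placeholder default from above:

```bash
# create the secret referenced by the deployment's secretKeyRef below
kubectl create secret generic deeplynx-secrets \
  --namespace deeplynx \
  --from-literal=core-db-connection-string='postgresql://postgres:deeplynxcore@localhost/deep_lynx'
```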
### Kubernetes Deployment Example
Here is an example of the Kubernetes deployment. Note that we are overwriting the initial run command; you can do this to control whether a DeepLynx instance runs as both server and worker or as a worker only. All other services (PostgreSQL, Redis, RabbitMQ) must be running before DeepLynx starts, or it will fail.
```yaml
# This first deployment is for the core deeplynx server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deeplynx
  namespace: deeplynx
  labels:
    service: deeplynx
    app: deeplynx
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: deeplynx
  template:
    metadata:
      labels:
        app: deeplynx
    spec:
      containers:
        - name: deeplynx
          image: idaholab/deeplynx:latest
          imagePullPolicy: IfNotPresent
          ## this is the same exact command that gets run on container start, we're simply listing it here for ease of use
          args:
            - /bin/sh
            - -c
            - pm2-runtime ecosystem.config.js
          # see the env file above for all the values you'll need to set here;
          # sensitive values should come from a secret, as shown with the connection string
          env:
            - name: CORE_DB_CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  # the secret name/key here match the kubectl create secret example above
                  name: deeplynx-secrets
                  key: core-db-connection-string
            - name: DB_NAME
              value: '${DB_NAME}'
          ports:
            - containerPort: 8090
          volumeMounts:
            - mountPath: "/var/deeplynx/data"
              name: nfs
      volumes:
        - name: nfs
          persistentVolumeClaim:
            claimName: deeplynx-pv
---
# This second deployment is the DeepLynx worker system. Not required if you have RUN_JOBS=true on your main instance
# and that main instance is keeping up with demand. You typically only spin this up when you need additional processing power
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deeplynx-queue-worker
  namespace: deeplynx
  labels:
    service: deeplynx-queue-worker
    app: deeplynx-queue-worker
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: deeplynx-queue-worker
  template:
    metadata:
      labels:
        app: deeplynx-queue-worker
    spec:
      containers:
        - name: deeplynx-queue-worker
          image: idaholab/deeplynx:latest
          imagePullPolicy: Always
          # note these args are different from the deployment above - this only runs the workers
          args:
            - /bin/sh
            - -c
            - pm2-runtime queue_worker.config.js
          env: # you will need the same exact ENV vars from above, minus perhaps the SAML and email specific vars
          volumeMounts:
            - mountPath: "/var/deeplynx/data"
              name: nfs
      volumes:
        - name: nfs
          persistentVolumeClaim:
            claimName: deeplynx-worker
```
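The deployments alone give the NGINX ingress controller nothing to route to. Below is a minimal Service sketch matching the labels and port of the server deployment above; it is an example, not part of the official manifests:

```yaml
# exposes the deeplynx server pods inside the cluster on port 8090;
# the ingress controller can then route external traffic to this Service
apiVersion: v1
kind: Service
metadata:
  name: deeplynx
  namespace: deeplynx
spec:
  selector:
    app: deeplynx
  ports:
    - protocol: TCP
      port: 8090
      targetPort: 8090
```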
## PostgreSQL & TimescaleDB
In order to run DeepLynx you must have either a vanilla PostgreSQL database cluster or a PostgreSQL cluster with TimescaleDB enabled. The single biggest requirement here is that your Postgres version must be 12 or higher.
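You can verify the version requirement from inside the cluster. In the sketch below, the deployment name and postgres user are placeholders for your own installation:

```bash
# SHOW server_version prints the running Postgres version; it must be 12 or higher
kubectl exec -n deeplynx deploy/timescaledb -- psql -U postgres -c "SHOW server_version;"
```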
### When to use TimescaleDB
TimescaleDB is meant to handle large amounts of timeseries or tabular data. It also helps lay the foundation of our graph tables and our raw data retention tables. It has features like auto-partitioning based on a primary key and clustering capabilities that make managing a Postgres installation easier when dealing with large amounts of data. We recommend using TimescaleDB by default, but you can run without it. If you are using the Timeseries data sources, or storing things like sensor data or other tabular information as a digital twin would need, we highly recommend you stick with TimescaleDB rather than plain PostgreSQL.
### Installing TimescaleDB into the Kubernetes Cluster
Luckily, Timescale has an excellent guide for accomplishing this. We will not go into detail here, but instead refer you to that guide for getting a TimescaleDB cluster set up and configured.
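If you follow a Helm-based route, the install looks roughly like the sketch below. The repo URL and chart name are Timescale's published ones, but verify them against the current version of their guide before use:

```bash
# add Timescale's chart repository and install a single-node TimescaleDB
helm repo add timescale https://charts.timescale.com
helm repo update
helm install timescaledb timescale/timescaledb-single --namespace deeplynx
```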
### Installing PostgreSQL into the Kubernetes Cluster
Much like TimescaleDB, many guides exist that will walk you through getting a high availability (HA) Postgres cluster installed and running. We highly recommend you go with a high availability configuration if you're doing more than evaluating the tool, because as soon as data is recorded anywhere it becomes a liability. Configuring for high availability also means you take the proper steps towards replication, data management, and disaster recovery. Some useful guides are below.
- https://www.percona.com/blog/postgresql-high-availability-and-disaster-recovery-on-kubernetes
- https://www.crunchydata.com/blog/deploy-high-availability-postgresql-on-kubernetes
- https://ralph.blog.imixs.com/2021/06/25/postgresql-ha-kubernetes/
## Redis
DeepLynx uses Redis to manage its cache and some data operations. When operating in a non-clustered environment, and without a worker, DeepLynx is able to function without Redis. However, as soon as you run multiple DeepLynx instances, or you run the DeepLynx core and a worker instance, you must use Redis. We highly recommend you start with Redis out of the box in order to avoid having to set it up later.
We recommend starting with the Bitnami Redis Helm chart. The chart's readme covers setup and deployment in great detail, so we will not duplicate that information here.
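As a minimal sketch (the repo URL is Bitnami's published one; the standalone architecture keeps things simple for a first deployment):

```bash
# add Bitnami's chart repository and install a standalone Redis instance
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install redis bitnami/redis --namespace deeplynx --set architecture=standalone
```

Once deployed, point CACHE_REDIS_CONNECTION_STRING at the resulting service, e.g. redis://:<password>@redis-master.deeplynx.svc.cluster.local:6379/ (the service name follows the chart's <release>-master convention).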