Oms in cloud

OpenM++ web-service (oms) provides basic computational resource management for your local computer or for a cluster of servers on a local network or in the cloud. It can manage a model run queue when your computational resources (CPU and memory) are limited, and it can automatically start and stop cloud servers.

The examples below assume you are familiar with the basics of Oms: openM++ web-service.

If you want a model run queue, or you are using openM++ in the cloud and want to automatically scale cloud resources up and down, e.g. start and stop virtual machines for model runs, then start oms with the job control option:

oms -oms.JobDir job

The following directory structure is expected:

./        -> oms "root" directory, by default it is current directory
    html/    -> web-UI directory with HTML, js, css, images...
    etc/     -> config files directory, contain template(s) to run models
    log/     -> recommended log files directory
    models/
          bin/  -> default model.exe and model.sqlite directory
          log/  -> default directory for models run log files
          doc/  -> models documentation directory
          home/ -> user personal home directory
              io/download  -> user directory for download files
              io/upload    -> user directory to upload files
    job/  -> model run jobs control directory
          job.ini   -> job control settings
          disk.ini  -> (optional) disk usage control settings to set storage quotas
          active/   -> active model run state files
          history/  -> model run history files
          past/     -> (optional) shadow copy of history folder, invisible to the end user
          queue/    -> model run queue files
          state/    -> jobs state and computational servers state files
               jobs.queue.paused      -> if such file exists then jobs queue is paused
               jobs.queue.all.paused  -> if such file exists then all jobs in all queues are paused
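For example, the queue can be paused or resumed by creating or removing those state files by hand; a minimal sketch, assuming oms was started with -oms.JobDir job and the layout above:

# pause the model run queue of this oms instance
touch job/state/jobs.queue.paused

# pause all jobs in all queues sharing this job directory
touch job/state/jobs.queue.all.paused

# resume: remove the state file(s)
rm -f job/state/jobs.queue.paused job/state/jobs.queue.all.paused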

Model runs queue and computational resources (servers, nodes, clusters)

By default oms assumes:

  • all models are running on localhost
  • there are no limits on CPU cores or memory usage

Model run queue on local computer

You can create a model run queue on your local computer by setting a limit on the number of CPU cores available. To do it, modify the job.ini file in the job directory, for example:

[Common]
LocalCpu      = 8       ; localhost CPU cores limit, localhost limits are applied only to non-MPI jobs
LocalMemory   = 0       ; gigabytes, localhost memory limit, zero means no limits

You don't have to set a memory limit until your models' memory requirements are known.

The number of CPU cores you specify in job.ini does not need to match the actual cores. You can have 8 cores on your PC and set LocalCpu = 16, which allows 200% overload and may significantly slow down your local machine. Or, if you set LocalCpu = 4, your models would only be able to use half of the actual cores.
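Before choosing LocalCpu it may help to check how many cores your machine actually has; standard Linux commands are enough for that:

# number of logical CPUs available to the current process
nproc

# more detail: sockets, cores per socket, threads per core
lscpu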

LAN: front-end server and back-end cluster of servers

Example of local network (LAN) cluster:

  • small front-end server with 4 cores
  • 4 back-end servers: cpc-1, cpc-2, cpc-3, cpc-4 with 16 cores each
[Common]
LocalCpu      = 4   ; localhost CPU cores limit, localhost limits are applied only to non-MPI jobs
LocalMemory   = 0   ; gigabytes, localhost memory limit, zero means no limits
MpiCpu        = 40  ; max MPI cpu cores available for each oms instance, zero means oms instances can use all cpu's available
MpiMemory     = 0   ; gigabytes, max MPI memory available for each oms instance, zero means oms instances can use all memory available
MpiMaxThreads = 8   ; max number of modelling threads per MPI process
MaxErrors     = 10  ; errors threshold for compute server or cluster

Servers   = cpc-1, cpc-2, cpc-3, cpc-4      ; computational servers or clusters

[cpc-1]
Cpu       = 16          ; default: 1 CPU core
Memory    = 0           ; zero means no limits

[cpc-2]
Cpu       = 16          ; default: 1 CPU core
Memory    = 0           ; zero means no limits

[cpc-3]
Cpu       = 16          ; default: 1 CPU core
Memory    = 0           ; zero means no limits

[cpc-4]
Cpu       = 16          ; default: 1 CPU core
Memory    = 0           ; zero means no limits

; OpenMPI hostfile (on Linux)
;
; cpm   slots=1 max_slots=1
; cpc-1 slots=2
; cpc-3 slots=4
;
[hostfile]
HostFileDir = models/log
HostName = @-HOST-@
CpuCores = @-CORES-@
RootLine = cpm slots=1 max_slots=1
HostLine = @-HOST-@ slots=@-CORES-@

; MS-MPI machinefile (on Windows with Microsoft MPI)
;
; cpm:1
; cpc-1:2
; cpc-3:4
;
; [hostfile]
; HostFileDir = models\log
; HostName = @-HOST-@
; CpuCores = @-CORES-@
; RootLine = cpm:1
; HostLine = @-HOST-@:@-CORES-@

Based on the job.ini above, oms will create an MPI hostfile with the back-end server assignment for each particular model run.

In order to use that hostfile you should modify the model run template(s) in the openM++ etc/ directory. For example, on Linux with OpenMPI:

{{/*
oms web-service:
  Template to run modelName_mpi executable on Linux using OpenMPI

It is not recommended to use root process for modelling

Oms web-service using template for exec.Command(exeName, Args...):
  - skip empty lines
  - substitute template arguments
  - first non-empty line is a name of executable to run
  - each other line is a command line argument for executable

Arguments of template:
  ModelName string            // model name
  ExeStem   string            // base part of model exe name, usually modelName
  Dir       string            // work directory to run the model
  BinDir    string            // bin directory where model exe is located
  MpiNp     int               // number of MPI processes
  HostFile  string            // if not empty then path to hostfile
  Args      []string          // model command line arguments
  Env       map[string]string // environment variables to run the model

Example of result:

  mpirun --hostfile host.ini --bind-to none --oversubscribe -wdir models/bin -x key=value ./modelName_mpi -OpenM.LogToFile false

*/}}

mpirun
--bind-to
none
--oversubscribe
{{with .HostFile}}
--hostfile
{{.}}
{{end}}
{{with .Dir}}
-wdir
{{.}}
{{end}}
{{range $key, $val := .Env}}
-x
{{$key}}={{$val}}
{{end}}
{{.BinDir}}/{{.ExeStem}}_mpi
{{range .Args}}
{{.}}
{{end}}

Note: if you are using OpenMPI then it is a good idea to include --oversubscribe --bind-to none, as above, in order to avoid MPI model run failures or performance degradation.

If you are using Microsoft MPI on Windows servers then modify the model template file(s) in etc\ to look similar to:

{{/*
oms web-service:
  Template to run modelName_mpi.exe on Windows Microsoft MPI using machinefile

To use this template rename it into:
  mpi.ModelRun.template.txt

Oms web-service using template for exec.Command(exeName, Args...):
  - skip empty lines
  - substitute template arguments
  - first non-empty line is a name of executable to run
  - each other line is a command line argument for executable

Arguments of template:
  ModelName string            // model name
  ExeStem   string            // base part of model exe name, usually modelName
  Dir       string            // work directory to run the model
  BinDir    string            // bin directory where model exe is located
  DbPath    string            // absolute path to sqlite database file: models/bin/model.sqlite
  MpiNp     int               // number of MPI processes
  HostFile  string            // if not empty then path to hostfile
  Args      []string          // model command line arguments
  Env       map[string]string // environment variables to run the model

Example of result:
  mpiexec -machinefile hosts.ini -wdir models\bin -env key value ..\bin\modelName_mpi -OpenM.LogToFile false

*/}}

mpiexec
{{with .HostFile}}
-machinefile
{{.}}
{{end}}
{{with .Dir}}
-wdir
{{.}}
{{end}}
{{range $key, $val := .Env}}
-env
{{$key}}
{{$val}}
{{end}}
{{.BinDir}}\{{.ExeStem}}_mpi
{{range .Args}}
{{.}}
{{end}}

Cloud auto scaling: automatically start and stop servers

Use oms job control abilities to organize a model run queue and, if required, automatically scale cloud resources up and down, e.g. start and stop virtual machines or nodes.

For example, if you want to have two users, Alice and Bob, who are running models, then start oms as:

bin/oms -l localhost:4050 -oms.RootDir alice -oms.Name alice -ini oms.ini
bin/oms -l localhost:4060 -oms.RootDir bob   -oms.Name bob   -ini oms.ini

where the content of oms.ini is:

[oms]
JobDir        = ../job
EtcDir        = ../etc
HomeDir       = models/home
AllowDownload = true
AllowUpload   = true
LogRequest    = true

[OpenM]
LogFilePath      = log/oms.log
LogToFile        = true
LogUseDailyStamp = true
LogToConsole     = false

The above assumes the following directory structure:

./    -> current directory
    bin/
        oms    -> oms web service executable, on Windows: `oms.exe`
        dbcopy -> dbcopy utility executable, on Windows: `dbcopy.exe`
    html/    -> web-UI directory with HTML, js, css, images...
    etc/     -> config files directory, contain template(s) to run models
    alice/   -> user Alice "root" directory
        log/     -> recommended Alice's log files directory
        models/
              bin/  -> Alice's model.exe and model.sqlite directory
              log/  -> Alice's directory for models run log files
              doc/  -> models documentation directory
              home/ -> Alice's personal home directory
                  io/download  -> Alice's directory for download files
                  io/upload    -> Alice's directory to upload files
    bob/     -> user Bob "root" directory
        log/     -> recommended Bob's log files directory
        models/
              bin/  -> Bob's model.exe and model.sqlite directory
              log/  -> Bob's directory for models run log files
              doc/  -> models documentation directory
              home/ -> Bob's personal home directory
                  io/download  -> Bob's directory for download files
                  io/upload    -> Bob's directory to upload files
    job/  -> model run jobs control directory, it must be shared between all users
          job.ini   -> (optional) job control settings
          disk.ini  -> (optional) disk usage control settings to set storage quotas for Bob and Alice
          active/   -> active model run state files
          history/  -> model run history files
          past/     -> (optional) shadow copy of history folder, invisible to the end user
          queue/    -> model run queue files
          state/    -> jobs state and computational servers state files
               jobs.queue.paused      -> if such file exists then jobs queue is paused
               jobs.queue.all.paused  -> if such file exists then all jobs in all queues are paused

You don't have to follow that directory structure; it is flexible and can be customized through oms run options.
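For example, a minimal sketch of starting oms with non-default locations; the -oms.* option names below mirror the [oms] ini keys used on this page, adjust the paths to your own layout:

bin/oms -l localhost:4040 -oms.RootDir alice -oms.Name alice -oms.HomeDir models/home -oms.JobDir ../job -oms.EtcDir ../etc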

IMPORTANT: Job directory must be in a SHARED location and accessible to all users who are using the same queue and the same computational resources (servers, nodes, clusters).

You don't need to create OS users: Alice and Bob do not need login accounts on your server (cloud, Active Directory, etc.). All you need is to set up some authentication mechanism and a reverse proxy which would allow Alice to access localhost:4050 and Bob localhost:4060 on your front-end. The actual OS user can have any name, e.g. oms:

sudo -u oms OM_ROOT=/shared/alice bash -c 'source ~/.bashrc; bin/oms -l localhost:4050 -oms.RootDir alice -oms.Name alice -ini oms.ini &'
sudo -u oms OM_ROOT=/shared/bob   bash -c 'source ~/.bashrc; bin/oms -l localhost:4060 -oms.RootDir bob   -oms.Name bob   -ini oms.ini &'
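A quick way to confirm that both instances are up and serving the web-UI (run on the front-end host itself):

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:4050/
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:4060/

Both requests should print 200.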

Cloud disk usage: limit storage space

You may want to set limits on disk space usage and enforce storage cleanup by users. It can be done through the job/disk.ini file. If job/disk.ini exists then the oms web-service will monitor and report disk usage by user(s) and can enforce storage space limits. You can set a limit for an individual user, for a group of users, and a grand total limit on the storage space used by all users. If a user exceeds the disk space quota then he or she cannot run models or upload files to the cloud; only download is available. The user can free space with Cleanup Disk Space through the UI.
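Independently of disk.ini you can always check current usage from the shell; a simple sketch, run from the oms root directory of the Alice and Bob example above (adjust the paths to your layout):

# summary of disk space used by each user "root" directory
du -sh alice bob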

Example of disk.ini:

; Example of storage usage control settings
;   "user" term below means oms instance
;   "user name" is oms instance name, for example: "localhost_4040"
;
; if job/disk.ini file exists then storage usage control is active

[Common]

; seconds, storage scan interval, if too small then default value used
;
ScanInterval  =   0

; GBytes, user storage quota, default: 0 (unlimited)
;
UserLimit     =   0

; GBytes, total storage quota for all users, default: 0 (unlimited)
;   if non-zero then it restricts the total storage size of all users
;
AllUsersLimit = 128

; Database cleanup script:
;   creates new model.sqlite database and copy model data
;
DbCleanup = etc/db-cleanup_linux.sh

; user groups can be created to simplify settings
;
Groups = Low, High, Others

[Low]
Users      = localhost_4040, bob, alice
UserLimit  = 2

[High]
Users      = king, boss, chief
UserLimit  = 20

[king]
UserLimit  = 100 ; override storage settings for oms instance "king"

; "me" is not a member of any group
;
[me]
UserLimit  = 0 ; unlimited

Google cloud: front-end server and auto scaling of multiple back-end servers

There is a small front-end server with 4 cores and there are 4 back-end servers, cpc-1, cpc-2, cpc-3, cpc-4, with 16 cores each. You are using a public cloud and want to pay only for the actual usage of the back-end servers:

  • server(s) must be started automatically when a user (Alice or Bob) wants to run the model;
  • server(s) must be stopped after the model run is completed, to reduce cloud cost

Scripts below are also available at our GitHub↗

[Common]
LocalCpu      = 4       ; localhost CPU cores limit, localhost limits are applied only to non-MPI jobs
LocalMemory   = 0       ; gigabytes, localhost memory limit, zero means no limits
MpiMaxThreads = 8       ; max number of modelling threads per MPI process
MaxErrors     = 10      ; errors threshold for compute server or cluster
IdleTimeout   = 900     ; seconds, idle time before stopping server or cluster
StartTimeout  = 180     ; seconds, max time to start server or cluster
StopTimeout   = 180     ; seconds, max time to stop server or cluster

Servers   = cpc-1, cpc-2, cpc-3, cpc-4  ; computational servers or clusters

StartExe  = /bin/bash                  ; default executable to start server
StopExe   = /bin/bash                  ; default executable to stop server
ArgsBreak = -@-                        ; arguments delimiter in StartArgs or StopArgs line
                                       ; delimiter can NOT contain ; or # chars, which are reserved for # comments
                                       ; it can be any other delimiter of your choice, e.g.: +++
; StartArgs = ../etc/compute-start.sh    ; default command line arguments to start server, server name will be appended
; StopArgs  = ../etc/compute-stop.sh     ; default command line arguments to stop server, server name will be appended

[cpc-1]
Cpu       = 16          ; default: 1 CPU core
Memory    = 0           ; zero means no limits
StartArgs = ../etc/compute-start-4.sh-@-us-zone-b-@-cpc-1
StopArgs  = ../etc/compute-stop-4.sh-@-us-zone-b-@-cpc-1

[cpc-2]
Cpu       = 16          ; default: 1 CPU core
Memory    = 0           ; zero means no limits
StartArgs = ../etc/compute-start-4.sh-@-us-zone-c-@-cpc-2
StopArgs  = ../etc/compute-stop-4.sh-@-us-zone-c-@-cpc-2

[cpc-3]
Cpu       = 16          ; default: 1 CPU core
Memory    = 0           ; zero means no limits
StartArgs = ../etc/compute-start-4.sh-@-us-zone-d-@-cpc-3
StopArgs  = ../etc/compute-stop-4.sh-@-us-zone-d-@-cpc-3

[cpc-4]
Cpu       = 16          ; default: 1 CPU core
Memory    = 0           ; zero means no limits
StartArgs = ../etc/compute-start-4.sh-@-us-zone-a-@-cpc-4
StopArgs  = ../etc/compute-stop-4.sh-@-us-zone-a-@-cpc-4

; OpenMPI hostfile
;
; cpm   slots=1 max_slots=1
; cpc-1 slots=2
; cpc-3 slots=4
;
[hostfile]
HostFileDir = models/log
HostName = @-HOST-@
CpuCores = @-CORES-@
RootLine = cpm slots=1 max_slots=1
HostLine = @-HOST-@ slots=@-CORES-@

; MS-MPI machinefile (on Windows with Microsoft MPI)
;
; cpm:1
; cpc-1:2
; cpc-3:4
;
; [hostfile]
; HostFileDir = models\log
; HostName = @-HOST-@
; CpuCores = @-CORES-@
; RootLine = cpm:1
; HostLine = @-HOST-@:@-CORES-@

Oms uses StartExe and StartArgs in order to start each server. On Linux the result of the job.ini above is:

/bin/bash etc/compute-start.sh cpc-1

On Windows you can use cmd or PowerShell in order to control servers. The related part of job.ini can look like:

StartExe  = cmd                         ; default executable to start server
StartArgs = /C-@-etc\compute-start.bat  ; default command line arguments to start server, server name will be appended
StopExe   = cmd                         ; default executable to stop server
StopArgs  = /C-@-etc\compute-stop.bat   ; default command line arguments to stop server, server name will be appended

which results in the following command to start the server:

cmd /C etc\compute-start.bat cpc-1

Start and stop scripts can look like (Google cloud version):

#!/bin/bash
#
# start computational server, run as: 
#
# sudo -u $USER-NAME compute-start.sh host-name

srv_zone="us-zone-b"
srv_name="$1"

if [ -z "$srv_name" ] || [ -z "$srv_zone" ] ;
then
  echo "ERROR: invalid (empty) server name or zone: $srv_name $srv_zone"
  exit 1
fi

gcloud compute instances start $srv_name --zone $srv_zone
status=$?

if [ $status -ne 0 ];
then
  echo "ERROR $status at start of: $srv_name"
  exit $status
fi

# wait until MPI is ready

for i in 1 2 3 4; do

  sleep 10

  echo "[$i] mpirun -n 1 -H $srv_name hostname"

  mpirun -n 1 -H $srv_name hostname
  status=$?

  if [ $status -eq 0 ] ; then break; fi
done

if [ $status -ne 0 ];
then
  echo "ERROR $status from MPI at start of: $srv_name"
  exit $status
fi

echo "Start OK: $srv_name"
#!/bin/bash
#
# stop computational server, run as: 
#
# sudo -u $USER-NAME compute-stop.sh host-name

# set -e

srv_zone="us-zone-b"
srv_name="$1"

if [ -z "$srv_name" ] || [ -z "$srv_zone" ] ;
then
  echo "ERROR: invalid (empty) server name or zone: $srv_name $srv_zone"
  exit 1
fi

for i in 1 2 3 4 5 6 7; do

  gcloud compute instances stop $srv_name --zone $srv_zone
  status=$?

  if [ $status -eq 0 ] ; then break; fi

  sleep 10
done

if [ $status -ne 0 ];
then
  echo "ERROR $status at stop of: $srv_name"
  exit $status
fi

echo "Stop OK: $srv_name"

Azure cloud: front-end server and auto scaling of multiple back-end servers

There is a small front-end server with 4 cores and there are 2 back-end servers, dc1 and dc2, with 4 cores each. You are using a public cloud and want to pay only for the actual usage of the back-end servers:

  • server(s) must be started automatically when a user (Alice or Bob) wants to run the model;
  • server(s) must be stopped after the model run is completed, to reduce cloud cost

Scripts below are also available at our GitHub↗

[Common]
LocalCpu      = 4     ; localhost CPU cores limit, localhost limits are applied only to non-MPI jobs
LocalMemory   = 0     ; gigabytes, localhost memory limit, zero means unlimited
MpiMaxThreads = 8     ; max number of modelling threads per MPI process
MaxErrors     = 10    ; errors threshold for compute server or cluster
IdleTimeout   = 900   ; seconds, idle time before stopping server or cluster
StartTimeout  = 90    ; seconds, max time to start server or cluster
StopTimeout   = 90    ; seconds, max time to stop server or cluster

Servers   = dc1, dc2     ; computational servers or clusters for MPI jobs

StartExe  = /bin/bash                       ; default executable to start server
StopExe   = /bin/bash                       ; default executable to stop server
StartArgs = ../etc/az-start.sh-@-dm_group   ; default command line arguments to start server, server name will be appended
StopArgs  = ../etc/az-stop.sh-@-dm_group    ; default command line arguments to stop server, server name will be appended

ArgsBreak = -@-                    ; arguments delimiter in StartArgs or StopArgs line
                                   ; delimiter can NOT contain ; or # chars, which are reserved for # comments
                                   ; it can be any other delimiter of your choice, e.g.: +++

[dc1]
Cpu    = 4    ; default: 1 CPU core
Memory = 0

[dc2]
Cpu    = 4    ; default: 1 CPU core
Memory = 0

; OpenMPI hostfile
;
; dm  slots=1 max_slots=1
; dc1 slots=2
; dc2 slots=4
;
[hostfile]
HostFileDir = models/log
HostName    = @-HOST-@
CpuCores    = @-CORES-@
RootLine    = dm slots=1 max_slots=1
HostLine    = @-HOST-@ slots=@-CORES-@

Oms uses StartExe and StartArgs in order to start each server. On Linux the result of the job.ini above is similar to:

/bin/bash etc/az-start.sh dm_group dc1

Start and stop scripts can look like (Azure cloud version):

#!/bin/bash
#
# start Azure server, run as: 
#
# sudo -u $USER-NAME az-start.sh resource-group host-name

# set -e

res_group="$1"
srv_name="$2"

if [ -z "$srv_name" ] || [ -z "$res_group" ] ;
then
  echo "ERROR: invalid (empty) server name or resource group: $srv_name $res_group"
  exit 1
fi

# login

az login --identity
status=$?

if [ $status -ne 0 ];
then
  echo "ERROR $status from az login at start of: $res_group $srv_name"
  exit $status
fi

# Azure VM start 

az vm start -g "$res_group" -n "$srv_name"
status=$?

if [ $status -ne 0 ];
then
  echo "ERROR $status at: az vm start -g $res_group -n $srv_name"
  exit $status
fi

# wait until MPI is ready

for i in 1 2 3 4 5; do

  sleep 10

  echo "[$i] mpirun -n 1 -H $srv_name hostname"

  mpirun -n 1 -H $srv_name hostname
  status=$?

  if [ $status -eq 0 ] ; then break; fi
done

if [ $status -ne 0 ];
then
  echo "ERROR $status from MPI at start of: $srv_name"
  exit $status
fi

echo "Start OK: $srv_name"
#!/bin/bash
#
# stop Azure server, run as: 
#
# sudo -u $USER-NAME az-stop.sh resource-group host-name

# set -e

res_group="$1"
srv_name="$2"

if [ -z "$srv_name" ] || [ -z "$res_group" ] ;
then
  echo "ERROR: invalid (empty) server name or resource group: $srv_name $res_group"
  exit 1
fi

# login

az login --identity
status=$?

if [ $status -ne 0 ];
then
  echo "ERROR $status from az login at start of: $res_group $srv_name"
  exit $status
fi

# Azure VM stop

for i in 1 2 3 4; do

  az vm deallocate -g "$res_group" -n "$srv_name"
  status=$?

  if [ $status -eq 0 ] ; then break; fi

  sleep 10
done

if [ $status -ne 0 ];
then
  echo "ERROR $status at stop of: $srv_name"
  exit $status
fi

echo "Stop OK: $srv_name"

Linux cluster in cloud

Security considerations:

In this wiki I am describing the simplest, but least secure, configuration; for your production environment you may want to:

  • use a separate web front-end server and a separate oms control server with a firewall in between
  • never use the front-end web-server OS user as the oms control server OS user
  • do not use the same OS user (like oms) for everyone; create a different one for each of your model users, like Alice and Bob in the example above.

Of course the web front-end UI of your production environment must be protected by https:// with proper authentication and authorization. All of that is out of scope of our wiki; please consult your organization's security guidelines.

Also, I am not describing here how to configure web-servers, how to create a reverse proxy, install SSL certificates, etc. There are a lot of great materials on those topics around; just please think about security in the first place.

The cloud examples here assume a Debian or Ubuntu Linux server setup; you can use them for RedHat Linux with minimal adjustment. OpenM++ does support Microsoft Windows clusters, but configuring them is a more complex task and out of scope for this wiki.

Our simple cluster consists of a front-end web-UI server with host name dm and multiple back-end computational servers: dc1, dc2, ...

Front-end server OS setup

The front-end dm server must have a web-server installed, Apache or nginx for example, as well as a static IP and DNS records for your domain.

Choose Debian-11, Ubuntu 22.04 or RedHat 9 (Rocky, AlmaLinux) as your base system and create the dm cloud virtual machine; at least 4 cores are recommended. We will create two disks on dm: a boot disk and a fast SSD data disk where all user data and models are stored.

Set the timezone, install OpenMPI and (optionally) SQLite:

sudo timedatectl set-timezone America/Toronto

sudo apt-get install openmpi-bin
sudo apt-get install sqlite3

# check result:
mpirun hostname -A

Create the SSD data disk, mount it on /mirror and use it to store all user data and models:

# init new SSD, use lsblk to find which /dev it is
lsblk

sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sda

sudo mkdir /mirror
sudo mount -o discard,defaults /dev/sda /mirror

# check results:
ls -la /mirror

# add new disk to fstab, mount by UUID:
sudo blkid /dev/sda
sudo nano /etc/fstab

# add your UUID mount:
UUID=98765432-d09a-4936-b85f-a61da123456789 /mirror ext4 discard,defaults 0 2
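It is worth verifying the new fstab entry before reboot:

# re-read /etc/fstab, any error in the new line is reported here
sudo mount -a

# confirm /mirror is mounted from the expected device
findmnt /mirror
df -h /mirror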

Create NFS shares:

sudo mkdir -p /mirror/home
sudo mkdir -p /mirror/data

sudo apt install nfs-kernel-server

# add shares into exports:
sudo nano /etc/exports

# export user homes and data, data can be exported read-only, rw is not required
/mirror/home *(rw,sync,no_root_squash,no_subtree_check)
/mirror/data *(rw,sync,no_root_squash,no_subtree_check)

sudo systemctl restart nfs-kernel-server

# check results:
/sbin/showmount -e dm

systemctl status nfs-kernel-server

Create the 'oms' service account, login disabled. I am using 1108 as the user id and group id, but it is an example only and 1108 has no special meaning:

export OMS_UID=1108
export OMS_GID=1108

sudo addgroup --gid $OMS_GID oms
sudo adduser --home /mirror/home/oms --disabled-password --gecos "" --gid $OMS_GID -u $OMS_UID oms

sudo chown -R oms:oms /mirror/data

# increase stack size for models to 64 MB (ulimit -s value is in KB: 65536)

sudo -u oms nano /mirror/home/oms/.bashrc

# ~/.bashrc: executed by bash(1) for non-login shells.
# openM++
# some models require stack size:
#
ulimit -S -s 65536

#
# end of openM++

Password-less ssh for oms service account:

sudo su -l oms
cd ~

mkdir .ssh

ssh-keygen -f .ssh/id_rsa -t rsa -N '' -C oms

# create .ssh/config with content below:
nano .ssh/config

Host *
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    LogLevel ERROR

cp -p .ssh/id_rsa.pub .ssh/authorized_keys

chmod 700 .ssh
chmod 600 .ssh/id_rsa
chmod 644 .ssh/id_rsa.pub
chmod 644 .ssh/config
chmod 644 .ssh/authorized_keys

exit    # logout from 'oms' user

# check ssh for oms user, it should work without any prompts, without any Yes/No questions:

sudo -u oms ssh dm

Check openMPI under 'oms' service account:

sudo -u oms mpirun hostname
sudo -u oms mpirun -H dm hostname

Done with the dm server OS setup; reboot it and proceed to creating the back-end servers dc1, dc2, ...

Back-end computational servers setup

I am describing it for dc1, assuming you will create a base image from it and use it for all other back-end servers. On Azure it makes sense to create a virtual machine scale set instead of individual servers.

Choose Debian-11, Ubuntu 22.04 or RedHat 9 (Rocky, AlmaLinux) as your base system and create the dc1 cloud virtual machine; at least 16 cores are recommended. It does not require a fast SSD; use a regular small HDD, because no model data are stored on the back-end, it holds only the OS boot disk and nothing else. Back-end servers should not be visible from the internet; they should be visible only from the front-end dm server.

Set the timezone and install OpenMPI:

sudo timedatectl set-timezone America/Toronto

sudo apt-get install openmpi-bin

# check result:
mpirun hostname -A

Mount NFS shares from dm server:

sudo mkdir -p /mirror/home
sudo mkdir -p /mirror/data

sudo apt install nfs-common

/sbin/showmount -e dm

sudo mount -t nfs dm:/mirror/home /mirror/home
sudo mount -t nfs dm:/mirror/data /mirror/data

systemctl status mirror-home.mount
systemctl status mirror-data.mount

# if above OK then add nfs share mounts into fstab:

sudo nano /etc/fstab

# fstab records:
dm:/mirror/home /mirror/home nfs defaults 0 0
dm:/mirror/data /mirror/data nfs defaults 0 0

# (optional) reboot node and make sure shares are mounted:

systemctl status mirror-home.mount
systemctl status mirror-data.mount

Create the 'oms' service account, login disabled. It must have exactly the same user id and group id as the oms user on dm; I am using 1108 as an example:

export OMS_UID=1108
export OMS_GID=1108

sudo /sbin/addgroup --gid $OMS_GID oms
sudo adduser --no-create-home --home /mirror/home/oms --disabled-password --gecos "" --gid $OMS_GID -u $OMS_UID oms

# check 'oms' service account access to shared files:

sudo -u oms -- ls -la /mirror/home/oms/.ssh/

Optional: if you are using an Azure virtual machine scale set then the cloud-init config can be:

#cloud-config
#
runcmd:
 - addgroup --gid 1108 oms
 - adduser --no-create-home --home /mirror/home/oms --disabled-password --gecos "" --gid 1108 -u 1108 oms

Check openMPI under 'oms' service account:

sudo -u oms mpirun hostname
sudo -u oms mpirun -H dc1 hostname
sudo -u oms mpirun -H dm hostname

Done with the dc1 OS setup; clone it for all other back-end servers. After you have created all back-end servers, check OpenMPI from the entire cluster, for example:

sudo -u oms mpirun -H dm,dc1,dc2,dc3,dc4,dc5,dc6,dc7,dc8,dc9,dc10 hostname

Now log back in to your dm front-end and create the standard openM++ directory structure at /mirror/data/, copy models and create user directories as described for "users" Alice and Bob above. Bob and Alice are your model users; they should not have an OS login, the oms user with disabled login is used to run the models on their behalf. I would also recommend having at least one "user" for your own tests, to verify system status and to test and run the models when you publish them. For that I am usually creating a "user" called test.

/mirror/data/
    bin/
        oms    -> oms web service executable
        dbcopy -> dbcopy utility executable
    html/    -> web-UI directory with HTML, js, css, images...
    etc/     -> config files directory, contain template(s) to run models
    log/     -> recommended log files directory
    alice/   -> user Alice "root" directory
        log/     -> recommended Alice's log files directory
        models/
              bin/  -> Alice's model.exe and model.sqlite directory
              log/  -> Alice's directory for models run log files
              doc/  -> models documentation directory
              home/ -> Alice's personal home directory
                  io/download  -> Alice's directory for download files
                  io/upload    -> Alice's directory to upload files
    bob/     -> user Bob "root" directory
        log/     -> recommended Bob's log files directory
        models/
              bin/  -> Bob's model.exe and model.sqlite directory
              log/  -> Bob's directory for models run log files
              doc/  -> models documentation directory
              home/ -> Bob's personal home directory
                  io/download  -> Bob's directory for download files
                  io/upload    -> Bob's directory to upload files
    job/  -> model run jobs control directory, it must be shared between all users
          job.ini   -> (optional) job control settings
          disk.ini  -> (optional) disk usage control settings to set storage quotas for Bob and Alice
          active/   -> active model run state files
          history/  -> model run history files
          past/     -> (optional) shadow copy of history folder, invisible to the end user
          queue/    -> model run queue files
          state/    -> jobs state and computational servers state files
    oms/    -> oms init.d files, see examples on our GitHub
    oms.ini -> oms config, see content above
    test/   -> user test "root" directory, for admin internal use
            -> .... user test subdirectories here
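For example, the user directories can be created from the shell; a sketch for the test "user" (repeat for every model user, adjusting the name):

sudo -u oms mkdir -p /mirror/data/test/log
sudo -u oms mkdir -p /mirror/data/test/models/bin /mirror/data/test/models/log /mirror/data/test/models/doc
sudo -u oms mkdir -p /mirror/data/test/models/home/io/download /mirror/data/test/models/home/io/upload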

Above, there is also an oms/ directory with init.d files to restart oms when the front-end dm server is rebooted. You can find examples of them at our GitHub↗.
