Oms in cloud
The OpenM++ web-service (oms) provides basic computational resource management for your local computer, for a cluster of servers on a local network, or in the cloud. It can maintain a model run queue when your computational resources (CPU and memory) are limited, and it can automatically start and stop cloud servers.
The examples below assume you are familiar with the basics of Oms: openM++ web-service.
If you want a model run queue, or you use openM++ in the cloud and want to automatically scale cloud resources up and down (e.g. start and stop virtual machines for model runs), then start oms with the job control option:
oms -oms.JobDir job
The following directory structure is expected:
./ -> oms "root" directory, by default it is current directory
html/ -> web-UI directory with HTML, js, css, images...
disk.ini -> (optional) disk usage control settings to set storage quotas
etc/ -> config files directory, contains template(s) to run models
log/ -> recommended log files directory
models/
bin/ -> default model.exe and model.sqlite directory
log/ -> default directory for models run log files
doc/ -> models documentation directory
home/ -> user personal home directory
io/download -> user directory for download files
io/upload -> user directory to upload files
job/ -> model run jobs control directory
job.ini -> job control settings
active/ -> active model run state files
history/ -> model run history files
past/ -> (optional) shadow copy of history folder, invisible to the end user
queue/ -> model run queue files
state/ -> jobs state and computational servers state files
jobs.queue.paused -> if such file exists then jobs queue is paused
jobs.queue.all.paused -> if such file exists then all jobs in all queues are paused
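For example, assuming the default layout above (the pause files live under job/state/), you can pause and resume the job queue by creating or removing the corresponding file:
# pause the model run queue of this oms instance
touch job/state/jobs.queue.paused

# resume the queue
rm job/state/jobs.queue.paused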
By default oms assumes:
- all models are running on localhost
- there are no limits on CPU cores or memory usage
You can create a model run queue on your local computer by setting a limit on the number of available CPU cores. To do that, modify the job.ini file in the job directory, for example:
[Common]
LocalCpu = 8 ; localhost CPU cores limit, localhost limits are applied only to non-MPI jobs
LocalMemory = 0 ; gigabytes, localhost memory limit, zero means no limits
You don't have to set a memory limit until model run memory requirements are known.
The CPU core limit in job.ini does not need to match the number of physical cores.
You can have 8 cores on your PC and set LocalCpu = 16, which allows a 200% overload and may significantly slow down your local machine.
Or, if you set LocalCpu = 4, your models would be able to use only half of the actual cores.
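As a minimal sketch, assuming an 8-core PC and non-MPI model runs that request 4 cores each:
[Common]
LocalCpu = 8     ; allow local (non-MPI) model runs to use at most 8 cores in total
LocalMemory = 0  ; no memory limit until model memory requirements are known

; with this limit two runs requesting 4 cores each can execute in parallel,
; while a third run submitted at the same time waits in the queue until cores free up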
Example of a local network (LAN) cluster:
- a small front-end server with 4 cores
- 4 back-end servers, cpc-1, cpc-2, cpc-3, cpc-4, with 16 cores each
[Common]
LocalCpu = 4 ; localhost CPU cores limit, localhost limits are applied only to non-MPI jobs
LocalMemory = 0 ; gigabytes, localhost memory limit, zero means no limits
MpiCpu = 40 ; max MPI cpu cores available for each oms instance, zero means oms instances can use all cpu's available
MpiMemory = 0 ; gigabytes, max MPI memory available for each oms instance, zero means oms instances can use all memory available
MpiMaxThreads = 8 ; max number of modelling threads per MPI process
MaxErrors = 10 ; errors threshold for compute server or cluster
Servers = cpc-1, cpc-2, cpc-3, cpc-4 ; computational servers or clusters
[cpc-1]
Cpu = 16 ; default: 1 CPU core
Memory = 0 ; zero means no limits
[cpc-2]
Cpu = 16 ; default: 1 CPU core
Memory = 0 ; zero means no limits
[cpc-3]
Cpu = 16 ; default: 1 CPU core
Memory = 0 ; zero means no limits
[cpc-4]
Cpu = 16 ; default: 1 CPU core
Memory = 0 ; zero means no limits
; OpenMPI hostfile (on Linux)
;
; cpm slots=1 max_slots=1
; cpc-1 slots=2
; cpc-3 slots=4
;
[hostfile]
HostFileDir = models/log
HostName = @-HOST-@
CpuCores = @-CORES-@
RootLine = cpm slots=1 max_slots=1
HostLine = @-HOST-@ slots=@-CORES-@
; MS-MPI machinefile (on Windows with Microsoft MPI)
;
; cpm:1
; cpc-1:2
; cpc-3:4
;
; [hostfile]
; HostFileDir = models\log
; HostName = @-HOST-@
; CpuCores = @-CORES-@
; RootLine = cpm:1
; HostLine = @-HOST-@:@-CORES-@
Based on the job.ini above, oms will create an MPI hostfile with the back-end server assignment for each particular model run.
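For illustration only, a generated hostfile for a run that needs 32 MPI processes on the cluster above (the actual assignment depends on what each run requests and on server availability) might look similar to:
cpm slots=1 max_slots=1
cpc-1 slots=16
cpc-2 slots=16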
In order to use that hostfile you should modify the model run template(s) in the openM++ etc/ directory.
For example, on Linux with OpenMPI:
{{/*
oms web-service:
Template to run modelName_mpi executable on Linux using OpenMPI
It is not recommended to use root process for modelling
Oms web-service using template for exec.Command(exeName, Args...):
- skip empty lines
- substitute template arguments
- first non-empty line is a name of executable to run
- each other line is a command line argument for executable
Arguments of template:
ModelName string // model name
ExeStem string // base part of model exe name, usually modelName
Dir string // work directory to run the model
BinDir string // bin directory where model exe is located
MpiNp int // number of MPI processes
HostFile string // if not empty then path to hostfile
Args []string // model command line arguments
Env map[string]string // environment variables to run the model
Example of result:
mpirun --hostfile host.ini --bind-to none --oversubscribe -wdir models/bin -x key=value ./modelName_mpi -OpenM.LogToFile false
*/}}
mpirun
--bind-to
none
--oversubscribe
{{with .HostFile}}
--hostfile
{{.}}
{{end}}
{{with .Dir}}
-wdir
{{.}}
{{end}}
{{range $key, $val := .Env}}
-x
{{$key}}={{$val}}
{{end}}
{{.BinDir}}/{{.ExeStem}}_mpi
{{range .Args}}
{{.}}
{{end}}
Note: If you are using OpenMPI then it is a good idea to include --oversubscribe --bind-to none, as above, to avoid MPI model run failures or performance degradation.
If you are using Microsoft MPI on Windows servers then modify the etc\ model template file(s) to look similar to:
{{/*
oms web-service:
Template to run modelName_mpi.exe on Windows Microsoft MPI using machinefile
To use this template rename it into:
mpi.ModelRun.template.txt
Oms web-service using template for exec.Command(exeName, Args...):
- skip empty lines
- substitute template arguments
- first non-empty line is a name of executable to run
- each other line is a command line argument for executable
Arguments of template:
ModelName string // model name
ExeStem string // base part of model exe name, usually modelName
Dir string // work directory to run the model
BinDir string // bin directory where model exe is located
DbPath string // absolute path to sqlite database file: models/bin/model.sqlite
MpiNp int // number of MPI processes
HostFile string // if not empty then path to hostfile
Args []string // model command line arguments
Env map[string]string // environment variables to run the model
Example of result:
mpiexec -machinefile hosts.ini -wdir models\bin -env key value ..\bin\modelName_mpi -OpenM.LogToFile false
*/}}
mpiexec
{{with .HostFile}}
-machinefile
{{.}}
{{end}}
{{with .Dir}}
-wdir
{{.}}
{{end}}
{{range $key, $val := .Env}}
-env
{{$key}}
{{$val}}
{{end}}
{{.BinDir}}\{{.ExeStem}}_mpi
{{range .Args}}
{{.}}
{{end}}
Use the oms job control capabilities to organize a model run queue and, if required, automatically scale cloud resources up and down, e.g. start and stop virtual machines or nodes.
For example, if you want to have two users, Alice and Bob, who are running models, then start oms as:
bin/oms -l localhost:4050 -oms.RootDir alice -oms.Name alice -ini oms.ini
bin/oms -l localhost:4060 -oms.RootDir bob -oms.Name bob -ini oms.ini
where the content of oms.ini is:
[oms]
JobDir = ../job
EtcDir = ../etc
HomeDir = models/home
AllowDownload = true
AllowUpload = true
LogRequest = true
[OpenM]
LogFilePath = log/oms.log
LogToFile = true
LogUseDailyStamp = true
LogToConsole = false
The above assumes the following directory structure:
./ -> current directory
bin/
oms -> oms web service executable, on Windows: `oms.exe`
dbcopy -> dbcopy utility executable, on Windows: `dbcopy.exe`
html/ -> web-UI directory with HTML, js, css, images...
disk.ini -> (optional) disk usage control settings to set storage quotas for Bob and Alice
etc/ -> config files directory, contains template(s) to run models
alice/ -> user Alice "root" directory
log/ -> recommended Alice's log files directory
models/
bin/ -> Alice's model.exe and model.sqlite directory
log/ -> Alice's directory for models run log files
doc/ -> models documentation directory
home/ -> Alice's personal home directory
io/download -> Alice's directory for download files
io/upload -> Alice's directory to upload files
bob/ -> user Bob "root" directory
log/ -> recommended Bob's log files directory
models/
bin/ -> Bob's model.exe and model.sqlite directory
log/ -> Bob's directory for models run log files
doc/ -> models documentation directory
home/ -> Bob's personal home directory
io/download -> Bob's directory for download files
io/upload -> Bob's directory to upload files
job/ -> model run jobs control directory, it must be shared between all users
job.ini -> (optional) job control settings
active/ -> active model run state files
history/ -> model run history files
past/ -> (optional) shadow copy of history folder, invisible to the end user
queue/ -> model run queue files
state/ -> jobs state and computational servers state files
jobs.queue.paused -> if such file exists then jobs queue is paused
jobs.queue.all.paused -> if such file exists then all jobs in all queues are paused
You don't have to follow that directory structure; it is flexible and can be customized through oms run options.
IMPORTANT: the job directory must be in a SHARED location, accessible to all users who use the same queue and the same computational resources (servers, nodes, clusters).
You don't need to create OS users: Alice and Bob do not need login accounts on your server (cloud, Active Directory, etc.). All you need is to set up some authentication mechanism and a reverse proxy which allows Alice to access localhost:4050 and Bob localhost:4060 on your front-end. The actual OS user can have any name, e.g. oms:
sudo -u oms OM_ROOT=/shared/alice bash -c 'source ~/.bashrc; bin/oms -l localhost:4050 -oms.RootDir alice -oms.Name alice -ini oms.ini &'
sudo -u oms OM_ROOT=/shared/bob bash -c 'source ~/.bashrc; bin/oms -l localhost:4060 -oms.RootDir bob -oms.Name bob -ini oms.ini &'
You may want to set limits on disk space usage and enforce storage cleanup by users. This can be done through the etc/disk.ini file.
If etc/disk.ini exists then the oms web-service will monitor and report disk usage by user(s) and can enforce storage space limits.
You can set a limit for an individual user, for a group of users, and a grand total limit on the storage space used by all users.
If a user exceeds the disk space quota then he or she cannot run models or upload files to the cloud; only download is available.
Users can free space through Cleanup Disk Space in the UI.
Example of disk.ini:
; Example of storage usage control settings
; "user" term below means oms instance
; "user name" is oms instance name, for example: "localhost_4040"
;
; if etc/disk.ini file exists then storage usage control is active
[Common]
; seconds, storage scan interval, if too small then default value used
;
ScanInterval = 0
; GBytes, user storage quota, default: 0 (unlimited)
;
UserLimit = 0
; GBytes, total storage quota for all users, default: 0 (unlimited)
; if non-zero then it restricts the total storage size of all users
;
AllUsersLimit = 128
; Database cleanup script:
; creates new model.sqlite database and copy model data
;
DbCleanup = etc/db-cleanup_linux.sh
; user groups can be created to simplify settings
;
Groups = Low, High, Others
[Low]
Users = localhost_4040, bob, alice
UserLimit = 2
[High]
Users = king, boss, chief
UserLimit = 20
[king]
UserLimit = 100 ; override storage settings for oms instance "king"
; "me" is not a member of any group
;
[me]
UserLimit = 0 ; unlimited
There is a small front-end server with 4 cores and 4 back-end servers, cpc-1, cpc-2, cpc-3, cpc-4, with 16 cores each. You are using a public cloud and want to pay only for actual usage of the back-end servers:
- server(s) must be started automatically when a user (Alice or Bob) wants to run the model;
- server(s) must be stopped after the model run completes, to reduce cloud cost
Scripts below are also available at our GitHub↗
[Common]
LocalCpu = 4 ; localhost CPU cores limit, localhost limits are applied only to non-MPI jobs
LocalMemory = 0 ; gigabytes, localhost memory limit, zero means no limits
MpiMaxThreads = 8 ; max number of modelling threads per MPI process
MaxErrors = 10 ; errors threshold for compute server or cluster
IdleTimeout = 900 ; seconds, idle time before stopping server or cluster
StartTimeout = 180 ; seconds, max time to start server or cluster
StopTimeout = 180 ; seconds, max time to stop server or cluster
Servers = cpc-1, cpc-2, cpc-3, cpc-4 ; computational servers or clusters
StartExe = /bin/bash ; default executable to start server, if empty then server is always ready, no startup
StopExe = /bin/bash ; default executable to stop server, if empty then server is always ready, no shutdown
ArgsBreak = -@- ; arguments delimiter in StartArgs or StopArgs line
; delimiter can NOT contain ; or # chars, which are reserved for # comments
; it can be any other delimiter of your choice, e.g.: +++
; StartArgs = ../etc/compute-start.sh ; default command line arguments to start server, server name will be appended
; StopArgs = ../etc/compute-stop.sh ; default command line arguments to stop server, server name will be appended
[cpc-1]
Cpu = 16 ; default: 1 CPU core
Memory = 0 ; zero means no limits
StartArgs = ../etc/compute-start-4.sh-@-us-zone-b-@-cpc-1
StopArgs = ../etc/compute-stop-4.sh-@-us-zone-b-@-cpc-1
[cpc-2]
Cpu = 16 ; default: 1 CPU core
Memory = 0 ; zero means no limits
StartArgs = ../etc/compute-start-4.sh-@-us-zone-c-@-cpc-2
StopArgs = ../etc/compute-stop-4.sh-@-us-zone-c-@-cpc-2
[cpc-3]
Cpu = 16 ; default: 1 CPU core
Memory = 0 ; zero means no limits
StartArgs = ../etc/compute-start-4.sh-@-us-zone-d-@-cpc-3
StopArgs = ../etc/compute-stop-4.sh-@-us-zone-d-@-cpc-3
[cpc-4]
Cpu = 16 ; default: 1 CPU core
Memory = 0 ; zero means no limits
StartArgs = ../etc/compute-start-4.sh-@-us-zone-a-@-cpc-4
StopArgs = ../etc/compute-stop-4.sh-@-us-zone-a-@-cpc-4
; OpenMPI hostfile
;
; cpm slots=1 max_slots=1
; cpc-1 slots=2
; cpc-3 slots=4
;
[hostfile]
HostFileDir = models/log
HostName = @-HOST-@
CpuCores = @-CORES-@
RootLine = cpm slots=1 max_slots=1
HostLine = @-HOST-@ slots=@-CORES-@
; MS-MPI machinefile (on Windows with Microsoft MPI)
;
; cpm:1
; cpc-1:2
; cpc-3:4
;
; [hostfile]
; HostFileDir = models\log
; HostName = @-HOST-@
; CpuCores = @-CORES-@
; RootLine = cpm:1
; HostLine = @-HOST-@:@-CORES-@
Oms uses StartExe and StartArgs to start each server. On Linux the result of the job.ini above is a command like:
/bin/bash etc/compute-start.sh cpc-1
On Windows you can use cmd or PowerShell to control servers. The related part of job.ini can look like:
StartExe = cmd ; default executable to start server, if empty then server is always ready, no startup
StopExe = cmd ; default executable to stop server, if empty then server is always ready, no shutdown
StartArgs = /C-@-etc\compute-start.bat ; default command line arguments to start server, server name will be appended
StopArgs = /C-@-etc\compute-stop.bat ; default command line arguments to stop server, server name will be appended
which results in the following command to start a server:
cmd /C etc\compute-start.bat cpc-1
Start and stop scripts can look like (Google cloud version):
#!/bin/bash
#
# start computational server, run as:
#
# sudo -u $USER-NAME compute-start.sh host-name
srv_zone="us-zone-b"
srv_name="$1"
if [ -z "$srv_name" ] || [ -z "$srv_zone" ] ;
then
echo "ERROR: invalid (empty) server name or zone: $srv_name $srv_zone"
exit 1
fi
gcloud compute instances start $srv_name --zone $srv_zone
status=$?
if [ $status -ne 0 ];
then
echo "ERROR $status at start of: $srv_name"
exit $status
fi
# wait until MPI is ready
for i in 1 2 3 4; do
sleep 10
echo "[$i] mpirun -n 1 -H $srv_name hostname"
mpirun -n 1 -H $srv_name hostname
status=$?
if [ $status -eq 0 ] ; then break; fi
done
if [ $status -ne 0 ];
then
echo "ERROR $status from MPI at start of: $srv_name"
exit $status
fi
echo "Start OK: $srv_name"
#!/bin/bash
#
# stop computational server, run as:
#
# sudo -u $USER-NAME compute-stop.sh host-name
# set -e
srv_zone="us-zone-b"
srv_name="$1"
if [ -z "$srv_name" ] || [ -z "$srv_zone" ] ;
then
echo "ERROR: invalid (empty) server name or zone: $srv_name $srv_zone"
exit 1
fi
for i in 1 2 3 4 5 6 7; do
gcloud compute instances stop $srv_name --zone $srv_zone
status=$?
if [ $status -eq 0 ] ; then break; fi
sleep 10
done
if [ $status -ne 0 ];
then
echo "ERROR $status at stop of: $srv_name"
exit $status
fi
echo "Stop OK: $srv_name"
There is a small front-end server with 4 cores and 2 back-end servers, dc1 and dc2, with 4 cores each. You are using a public cloud and want to pay only for actual usage of the back-end servers:
- server(s) must be started automatically when a user (Alice or Bob) wants to run the model;
- server(s) must be stopped after the model run completes, to reduce cloud cost
Scripts below are also available at our GitHub↗
[Common]
LocalCpu = 4 ; localhost CPU cores limit, localhost limits are applied only to non-MPI jobs
LocalMemory = 0 ; gigabytes, localhost memory limit, zero means unlimited
MpiMaxThreads = 8 ; max number of modelling threads per MPI process
MaxErrors = 10 ; errors threshold for compute server or cluster
IdleTimeout = 900 ; seconds, idle time before stopping server or cluster
StartTimeout = 90 ; seconds, max time to start server or cluster
StopTimeout = 90 ; seconds, max time to stop server or cluster
Servers = dc1, dc2 ; computational servers or clusters for MPI jobs
StartExe = /bin/bash ; default executable to start server, if empty then server is always ready, no startup
StopExe = /bin/bash ; default executable to stop server, if empty then server is always ready, no shutdown
StartArgs = ../etc/az-start.sh-@-dm_group ; default command line arguments to start server, server name will be appended
StopArgs = ../etc/az-stop.sh-@-dm_group ; default command line arguments to stop server, server name will be appended
ArgsBreak = -@- ; arguments delimiter in StartArgs or StopArgs line
; delimiter can NOT contain ; or # chars, which are reserved for # comments
; it can be any other delimiter of your choice, e.g.: +++
[dc1]
Cpu = 4 ; default: 1 CPU core
Memory = 0
[dc2]
Cpu = 4 ; default: 1 CPU core
Memory = 0
; OpenMPI hostfile
;
; dm slots=1 max_slots=1
; dc1 slots=2
; dc2 slots=4
;
[hostfile]
HostFileDir = models/log
HostName = @-HOST-@
CpuCores = @-CORES-@
RootLine = dm slots=1 max_slots=1
HostLine = @-HOST-@ slots=@-CORES-@
Oms uses StartExe and StartArgs to start each server. On Linux the result of the job.ini above is similar to:
/bin/bash etc/az-start.sh dm_group dc1
Start and stop scripts can look like (Azure cloud version):
#!/bin/bash
#
# start Azure server, run as:
#
# sudo -u $USER-NAME az-start.sh resource-group host-name
# set -e
res_group="$1"
srv_name="$2"
if [ -z "$srv_name" ] || [ -z "$res_group" ] ;
then
echo "ERROR: invalid (empty) server name or resource group: $srv_name $res_group"
exit 1
fi
# login
az login --identity
status=$?
if [ $status -ne 0 ];
then
echo "ERROR $status from az login at start of: $res_group $srv_name"
exit $status
fi
# Azure VM start
az vm start -g "$res_group" -n "$srv_name"
status=$?
if [ $status -ne 0 ];
then
echo "ERROR $status at: az vm start -g $res_group -n $srv_name"
exit $status
fi
# wait until MPI is ready
for i in 1 2 3 4 5; do
sleep 10
echo "[$i] mpirun -n 1 -H $srv_name hostname"
mpirun -n 1 -H $srv_name hostname
status=$?
if [ $status -eq 0 ] ; then break; fi
done
if [ $status -ne 0 ];
then
echo "ERROR $status from MPI at start of: $srv_name"
exit $status
fi
echo "Start OK: $srv_name"
#!/bin/bash
#
# stop Azure server, run as:
#
# sudo -u $USER-NAME az-stop.sh resource-group host-name
# set -e
res_group="$1"
srv_name="$2"
if [ -z "$srv_name" ] || [ -z "$res_group" ] ;
then
echo "ERROR: invalid (empty) server name or resource group: $srv_name $res_group"
exit 1
fi
# login
az login --identity
status=$?
if [ $status -ne 0 ];
then
echo "ERROR $status from az login at start of: $res_group $srv_name"
exit $status
fi
# Azure VM stop
for i in 1 2 3 4; do
az vm deallocate -g "$res_group" -n "$srv_name"
status=$?
if [ $status -eq 0 ] ; then break; fi
sleep 10
done
if [ $status -ne 0 ];
then
echo "ERROR $status at stop of: $srv_name"
exit $status
fi
echo "Stop OK: $srv_name"
Security considerations:
In this wiki I am describing the simplest, but least secure, configuration; for your production environment you may want to:
- use a separate web front-end server and a separate oms control server, with a firewall in between
- never use the front-end web-server OS user as the oms control server OS user
- do not use the same OS user (like oms) for everyone; create a different one for each of your model users, like Alice and Bob in the example above.
Of course, the web front-end UI of your production environment must be protected by https:// with proper authentication and authorization. All of that is out of scope of our wiki; please consult your organization's security guidelines.
I am also not describing here how to configure web servers, create a reverse proxy, install SSL certificates, etc. There are a lot of great materials available on those topics; just please think about security in the first place.
The cloud examples here assume a Debian or Ubuntu Linux server setup; you can apply them to RedHat Linux with minimal adjustments. OpenM++ does support Microsoft Windows clusters, but configuring them is a more complex task and out of scope for this wiki.
Our simple cluster consists of a front-end web-UI server with host name dm and multiple back-end computational servers: dc1, dc2, ...
Front-end server OS setup
The front-end dm server must have some web server installed (Apache or nginx, for example), a static IP address, and DNS records for your domain.
Choose Debian 11, Ubuntu 22.04 or RedHat 9 (Rocky, AlmaLinux) as your base system and create the dm cloud virtual machine; at least 4 cores are recommended.
We will create two disks on dm: a boot disk and a fast SSD data disk where all user data and models are stored.
Set the timezone, install OpenMPI and (optionally) SQLite:
sudo timedatectl set-timezone America/Toronto
sudo apt-get install openmpi-bin
sudo apt-get install sqlite3
# check result:
mpirun hostname -A
Create the SSD data disk and mount it on /mirror to store all user data and models:
# init new SSD, use lsblk to find which /dev it is
lsblk
sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sda
sudo mkdir /mirror
sudo mount -o discard,defaults /dev/sda /mirror
# check results:
ls -la /mirror
# add new disk to fstab, mount by UUID:
sudo blkid /dev/sda
sudo nano /etc/fstab
# add your UUID mount:
UUID=98765432-d09a-4936-b85f-a61da123456789 /mirror ext4 discard,defaults 0 2
Create NFS shares:
sudo mkdir -p /mirror/home
sudo mkdir -p /mirror/data
sudo apt install nfs-kernel-server
# add shares into exports:
sudo nano /etc/exports
# export user homes and data, data can be exported read-only, rw is not required
/mirror/home *(rw,sync,no_root_squash,no_subtree_check)
/mirror/data *(rw,sync,no_root_squash,no_subtree_check)
sudo systemctl restart nfs-kernel-server
# check results:
/sbin/showmount -e dm
systemctl status nfs-kernel-server
Create the 'oms' service account with login disabled. I am using 1108 as the user id and group id, but this is only an example; 1108 has no special meaning:
export OMS_UID=1108
export OMS_GID=1108
sudo addgroup --gid $OMS_GID oms
sudo adduser --home /mirror/home/oms --disabled-password --gecos "" --gid $OMS_GID -u $OMS_UID oms
sudo chown -R oms:oms /mirror/data
# increase soft stack size limit for models to 64 MB (65536 KB)
sudo -u oms nano /mirror/home/oms/.bashrc
# ~/.bashrc: executed by bash(1) for non-login shells.
# openM++
# some models require stack size:
#
ulimit -S -s 65536
#
# end of openM++
Password-less ssh for the oms service account:
sudo su -l oms
cd ~
mkdir .ssh
ssh-keygen -f .ssh/id_rsa -t rsa -N '' -C oms
# create .ssh/config with content below:
nano .ssh/config
Host *
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel ERROR
cp -p .ssh/id_rsa.pub .ssh/authorized_keys
chmod 700 .ssh
chmod 600 .ssh/id_rsa
chmod 644 .ssh/id_rsa.pub
chmod 644 .ssh/config
chmod 644 .ssh/authorized_keys
exit # logout from 'oms' user
# check ssh for oms user, it should work without any prompts, without any Yes/No questions:
sudo -u oms ssh dm
Check openMPI under 'oms' service account:
sudo -u oms mpirun hostname
sudo -u oms mpirun -H dm hostname
That completes the dm server OS setup. Reboot it and proceed to creating the back-end servers dc1, dc2, ...
Back-end computational servers setup
I am describing it for dc1, assuming you will create a base image from it and use it for all other back-end servers.
On Azure it makes sense to create a virtual machine scale set instead of individual servers.
Choose Debian 11, Ubuntu 22.04 or RedHat 9 (Rocky, AlmaLinux) as your base system and create the dc1 cloud virtual machine; at least 16 cores are recommended.
It does not require a fast SSD; use a regular small HDD, because no model data is stored on the back-end, only the OS boot disk and nothing else.
Back-end servers should not be visible from the internet; they should be visible only from the front-end dm server.
Set the timezone and install OpenMPI:
sudo timedatectl set-timezone America/Toronto
sudo apt-get install openmpi-bin
# check result:
mpirun hostname -A
Mount NFS shares from dm
server:
sudo mkdir -p /mirror/home
sudo mkdir -p /mirror/data
sudo apt install nfs-common
/sbin/showmount -e dm
sudo mount -t nfs dm:/mirror/home /mirror/home
sudo mount -t nfs dm:/mirror/data /mirror/data
systemctl status mirror-home.mount
systemctl status mirror-data.mount
# if above OK then add nfs share mounts into fstab:
sudo nano /etc/fstab
# fstab records:
dm:/mirror/home /mirror/home nfs defaults 0 0
dm:/mirror/data /mirror/data nfs defaults 0 0
# (optional) reboot node and make sure shares are mounted:
systemctl status mirror-home.mount
systemctl status mirror-data.mount
Create the 'oms' service account with login disabled.
It must have exactly the same user id and group id as the oms user on dm; I am using 1108 as an example:
export OMS_UID=1108
export OMS_GID=1108
sudo /sbin/addgroup --gid $OMS_GID oms
sudo adduser --no-create-home --home /mirror/home/oms --disabled-password --gecos "" --gid $OMS_GID -u $OMS_UID oms
# check 'oms' service account access to shared files:
sudo -u oms -- ls -la /mirror/home/oms/.ssh/
Optional: if you are using an Azure virtual machine scale set then the cloud-init config can be:
#cloud-config
#
runcmd:
- addgroup --gid 1108 oms
- adduser --no-create-home --home /mirror/home/oms --disabled-password --gecos "" --gid 1108 -u 1108 oms
Check openMPI under 'oms' service account:
sudo -u oms mpirun hostname
sudo -u oms mpirun -H dc1 hostname
sudo -u oms mpirun -H dm hostname
That completes the dc1 OS setup; clone it for all other back-end servers.
After you have created all back-end servers, check OpenMPI across the entire cluster, for example:
sudo -u oms mpirun -H dm,dc1,dc2,dc3,dc4,dc5,dc6,dc7,dc8,dc9,dc10 hostname
Now log back in to your dm front-end and create the standard openM++ directory structure at /mirror/data/: copy the models and create the user directories as described for "users" Alice and Bob above.
Bob and Alice are your model users; they should not have OS logins. The oms user, with login disabled, is used to run the models on their behalf.
I would also recommend having at least one "user" for your own tests, to verify system status and to test and run the models when you publish them. For that I usually create a "user" named test.
/mirror/data/
bin/
oms -> oms web service executable
dbcopy -> dbcopy utility executable
html/ -> web-UI directory with HTML, js, css, images...
etc/ -> config files directory, contains template(s) to run models
disk.ini -> (optional) disk usage control settings to set storage quotas for Bob and Alice
log/ -> recommended log files directory
alice/ -> user Alice "root" directory
log/ -> recommended Alice's log files directory
models/
bin/ -> Alice's model.exe and model.sqlite directory
log/ -> Alice's directory for models run log files
doc/ -> models documentation directory
home/ -> Alice's personal home directory
io/download -> Alice's directory for download files
io/upload -> Alice's directory to upload files
bob/ -> user Bob "root" directory
log/ -> recommended Bob's log files directory
models/
bin/ -> Bob's model.exe and model.sqlite directory
log/ -> Bob's directory for models run log files
doc/ -> models documentation directory
home/ -> Bob's personal home directory
io/download -> Bob's directory for download files
io/upload -> Bob's directory to upload files
job/ -> model run jobs control directory, it must be shared between all users
job.ini -> (optional) job control settings
active/ -> active model run state files
history/ -> model run history files
past/ -> (optional) shadow copy of history folder, invisible to the end user
queue/ -> model run queue files
state/ -> jobs state and computational servers state files
oms/ -> oms init.d files, see examples on our GitHub
oms.ini -> oms config, see content above
test/ -> user test "root" directory, for admin internal use
-> .... user test subdirectories here
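As a minimal sketch, assuming the /mirror/data location above and HomeDir = models/home as in oms.ini, the per-user directories can be created like this (run as root):
#!/bin/bash
#
# create per-user openM++ directories under /mirror/data
# illustrative sketch only: the user list and paths follow the example layout above

cd /mirror/data || exit 1

for u in alice bob test; do
  mkdir -p $u/log \
           $u/models/bin $u/models/log $u/models/doc \
           $u/models/home/io/download $u/models/home/io/upload
done

# shared job control directory, common to all users
mkdir -p job/active job/history job/past job/queue job/state

# everything under /mirror/data must be accessible to the 'oms' service account
chown -R oms:oms /mirror/data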
The tree above also includes an oms/ directory with init.d files to restart oms when the front-end dm server is rebooted.
You can find examples of them on our GitHub↗.
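For illustration, a minimal startup script, assuming the /mirror/data layout and the alice/bob/test instances above (the ports and OM_ROOT paths here are assumptions), which you can call from your init.d script or systemd unit:
#!/bin/bash
#
# start all oms instances under the 'oms' service account
# illustrative sketch only, see our GitHub for complete init.d examples

cd /mirror/data || exit 1

start_oms() {
  local name="$1" addr="$2"
  sudo -u oms OM_ROOT=/mirror/data/$name bash -c \
    "source ~/.bashrc; cd /mirror/data; bin/oms -l $addr -oms.RootDir $name -oms.Name $name -ini oms.ini &"
}

start_oms alice localhost:4050
start_oms bob   localhost:4060
start_oms test  localhost:4070   # port for the test instance is an assumption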