OptClim-UKESM: demonstration

Overview of a demo example

This is an example which generates all the files and processes of a study, except for running the model itself, so it is a lightweight test. It has dummy code for the simulated observables, which merely copies the parameter values; the optimiser then iterates them towards a defined target.

Anatomy of a study

On PUMA2

  • the base suite, as well as each clone of it, in ~/roses; these comprise the configuration used by ARCHER2
  • a script, run from ARCHER2 via ssh, that clones the base suite for each new model instance: /home/n02/n02/mjmn02/dev/ModelOptimisation/Rose/onPUMA/launchArun.sh
  • directories used by Rose/Cylc: ~/roses holds each Rose suite, and ~/cylc-run holds each running suite (a Cylc suite is derived from a Rose suite when suite-run is invoked). Each ~/cylc-run/<suite> contains log files for every task in an executed workflow

On ARCHER2

  • Cylc directory for each suite with tasks that run on ARCHER2: ~/cylc-run, itself linked into the /work/... filespace
  • definition of the OptClim study
  • files that are input to and output from the study

Please explore the directory /work/n02/shared/mjmn02/OptClim/optclim3/studies/UKESM_P2/demo

The input/setup files for OptClim comprise, in this test example:

| element | function |
| --- | --- |
| /work/n02/shared/mjmn02/OptClim/optclim2A/studies/UKESM_P2/demo | top studies' directory on ARCHER2 |
| demo1.json | JSON defining the study |

The JSON defines

  • study data, for example maxfun - the maximum number of model instances
  • parameters
  • simulated observables
  • optimisation method and options
  • base suite name, under "study": "referenceModelDirectory". This base suite exists on PUMA2 and should be in your ~/roses directory
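These entries can be inspected directly. A minimal Python sketch, assuming only the demo1.json keys quoted elsewhere on this page ("study" and "postProcess"; the full schema has many more entries):

import json

# peek at two of the study settings; only keys that appear elsewhere on
# this page are used here - the full schema has many more entries
with open("demo1.json") as fp:
    cfg = json.load(fp)

print(cfg["study"]["referenceModelDirectory"])  # base suite to clone, e.g. fake-nrun
print(cfg["postProcess"]["script"])             # user script for simulated observables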

The directories and files created by OptClim are

| role | location | notes |
| --- | --- | --- |
| directory for the study | ./dummy | comes from the study name ("Name") in the JSON config file |
| interface directory connecting each model run to OptClim | dn001, dn002, ... | the initial two letters are set in the JSON ("baseRunID"); each interface directory includes links to the corresponding Cylc directory |
| file holding a new clone's parameter values | dn00?/runParams.json | generated from code called by the optimiser, in UKESM.py; read when the cloned suite starts, so that the modifications can be made to the Rose suite config files |
| simobs for each run | dn00?/observations.json | the user provides code to generate this file for each run, from the model-generated data |
| JSON file with parameters and simobs collated | demo1_final.json | generated by the runAlgorithm top level; shows the results for each run to date, and adds costd, the closeness of fit, at the end of the study |

Note that this example is really there to test the OptClim framework. The simobs are very artificial: in this case they are not even derived from model output, but from each parameter, so a simobs called iau_nontrop_max_p_tst is derived directly from the iau_nontrop_max_p parameter value.
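A minimal Python sketch of that renaming (in the demo itself it is done with sed in a shell script, shown under "dummy calculation of simobs" below):

def fake_model(params):
    # the dummy "model": each simulated observable is just the matching
    # parameter value, with "_tst" appended to the name
    return {name + "_tst": value for name, value in params.items()}

print(fake_model({"iau_nontrop_max_p": 420000.0,
                  "diagcloud_qn_compregimelimit": 28.0}))
# -> {'iau_nontrop_max_p_tst': 420000.0, 'diagcloud_qn_compregimelimit_tst': 28.0}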

After several runs (an initial set, then further iterations chosen by the optimiser) the ..._final.json file includes:

...
    "targets": {
        "comment": "Observed targets for optimisation. Should include constraint value.",
        "comment_UKESM": "default run values less a bit",
        "iau_nontrop_max_p_tst": 40000.0,
        "diagcloud_qn_compregimelimit_tst": 20
    },
...
 "standardModel": {
        "SimulatedValues": {
            "comment": "Values from default run -- used for display",
            "iau_nontrop_max_p_tst": 40000.0,
            "diagcloud_qn_compregimelimit_tst": 20
        }
    },
    "costd": {
        "ea001": 268700.5769104339,
        "ea002": 219952.6354286304,
        "ea003": 268700.5768846063,
        "ea004": 171204.6939543423,
        "ea005": 2.976319289236364,
        "ea006": 1.9036052185533093e-10
    },
    "bestEval": "ea006",
    "parameters": {
        "iau_nontrop_max_p": {
            "ea001": 420000.0,
            "ea002": 351060.0,
            "ea003": 420000.0,
            "ea004": 282120.00000000006,
            "ea005": 39999.999874744695,
            "ea006": 40000.00000000027
        },
        "diagcloud_qn_compregimelimit": {
            "ea001": 28.0,
            "ea002": 28.0,
            "ea003": 26.020000000000003,
            "ea004": 27.99999999853747,
            "ea005": 24.209151102927052,
            "ea006": 19.999999999999996
        }
    },
    "simObs": {
        "iau_nontrop_max_p_tst": {
            "ea001": 420000.0,
            "ea002": 351060.0,
            "ea003": 420000.0,
            "ea004": 282120.00000000006,
            "ea005": 39999.999874744695,
            "ea006": 40000.00000000027
        },
        "diagcloud_qn_compregimelimit_tst": {
            "ea001": 28.0,
            "ea002": 28.0,
            "ea003": 26.020000000000003,
            "ea004": 27.99999999853747,
            "ea005": 24.209151102927052,
            "ea006": 19.999999999999996
        }
    },
...
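The costd values above are consistent with a root-mean-square misfit between each run's simobs and the targets. A rough sketch that reproduces ea001's cost (an illustration only; the framework's actual cost calculation lives in the OptClim code and may also involve the covariances mentioned later):

import math

def rms_misfit(simobs, targets):
    # root-mean-square difference over the observables
    diffs = [simobs[k] - targets[k] for k in targets]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

targets = {"iau_nontrop_max_p_tst": 40000.0,
           "diagcloud_qn_compregimelimit_tst": 20.0}
ea001 = {"iau_nontrop_max_p_tst": 420000.0,
         "diagcloud_qn_compregimelimit_tst": 28.0}
print(rms_misfit(ea001, targets))  # ~268700.58, matching "ea001" in costd above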
The parameters generated by runAlgorithm.py are held in the runParams.json files and are then applied to the atmosphere suite config file, from which they are passed (by the UM code) into the namelist. So if two such runs' namelists are compared:

cd /work/n02/shared/mjmn02/OptClim/optclim2A/studies/UKESM_P2/demo/dummy/dm001

ln03@ dm001> cat runParams.json
{
    "iau_nontrop_max_p": 420000.0,
    "diagcloud_qn_compregimelimit": 28.0
}
ln03@ dm001> cat ../dm002/runParams.json
{
    "iau_nontrop_max_p": 351060.0,
    "diagcloud_qn_compregimelimit": 28.0
}
ln03@ dm001> diff cylc_dir/app/um/rose-app.conf ../dm002/cylc_dir/app/um/rose-app.conf
452c452
< iau_nontrop_max_p=420000.0
---
> iau_nontrop_max_p=351060.0

(Suite db-168 did not use the above parameters in a proper science run, but the dummy model has them enabled to allow comparison of the dm00?/cylc_dir/app/um/rose-app.conf files.)
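The substitution itself is performed by the OPTCLIM_PARAM_EDIT utility described below. Purely as an illustration of the kind of key=value edit visible in the diff above, a simplified stand-in (not the real modeloptimisation2-create utility) might look like:

import json
import pathlib

def apply_params(conf_path, params_path):
    # rewrite key=value lines in a rose-app.conf using values from
    # runParams.json (simplified stand-in, not modeloptimisation2-create)
    with open(params_path) as fp:
        params = json.load(fp)
    out = []
    for line in pathlib.Path(conf_path).read_text().splitlines():
        key = line.split("=", 1)[0]
        if "=" in line and key in params:
            line = "%s=%s" % (key, params[key])
        out.append(line)
    pathlib.Path(conf_path).write_text("\n".join(out) + "\n")

apply_params("cylc_dir/app/um/rose-app.conf", "runParams.json")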

Outline of the base suite

This is held on PUMA2, where it is cloned for each run. It comprises a standard UKESM suite (atmosphere-only run) with amendments to allow quick testing: the model is not executed, and the named parameters are enabled.

rose-suite.conf - defines variables for the suite

The definitions in the base suite config file, added for OptClim, are the following:

$ grep OPT rose-suite.conf
TASK_OPTCLIM=true
OPT_SOURCE_SHARE='/work/n02/n02/tetts/cylc-run/u-db167/share/'
OPTCLIM_RUNDIR='xxxx'
OPTCLIM_PARAM_EDIT="/work/n02/shared/mjmn02/sw/conda/opt_4/bin/modeloptimisation2-create"

where

| variable | purpose | value |
| --- | --- | --- |
| OPT_SOURCE_SHARE | location of the pre-existing executables on ARCHER2 | '/work/n02/n02/tetts/cylc-run/u-db167/share/' |
| OPTCLIM_RUNDIR | run directory; the cloning script replaces the xxxx with the cloned suite name | 'xxxx' |
| OPTCLIM_PARAM_EDIT | Python utility, explained further below | "/work/n02/shared/mjmn02/sw/conda/opt_4/bin/modeloptimisation2-create" |

Cylc scheduling graph

This is the sequence of tasks run from Cylc for each model:

optclim_reuse => optclim_prerun => fakemodel => optclim_postrun 

In practice it is often a little messier than that, as multiple jobs (cycles) within a model instance are accommodated, and some other setup jobs run, related to ancillaries and reconfiguration. Once the demo model is running, run cylc gscan; right-clicking on one of the running models shows the state of that suite.

optclim_reuse - runs on ARCHER2

Copies the bin files that comprise the pre-existing executable.
(If the model is built with previous executables already present in the base model, remove this task from the graph in the suite.rc of the base model.)

optclim_prerun - runs on ARCHER2

Creates links from the interface directory to the Cylc directory where output data from a model instance reside.
Runs the script to amend parameters.

optclim_postrun - runs on ARCHER2

Runs the script optclim_finished, which releases the held array job for this run.
Sets up a "history" link to the location of the model output.
Updates the run's "state" to "finished".

The fakemodel prints "Hello World" and stops; see the log file on PUMA2.

In each interface directory there is also a state file that allowed recognition of the models needing to be cloned and run, dating from before PUMA was upgraded to permit ssh onto it from ARCHER2. It may be possible to drop it now, but until that is confirmed it is best left alone.

Generating simulated observables

An array job is created for each model instance by the OptClim framework, in config.py. It is released when the optclim_postrun task in the model suite runs, calling optclim_finished in that run's interface directory. The JSON config file for the study specifies the name of a shell script that is called from this array job.

The user script for calculating simulated observables:

  • sets its environment
  • calls (usually) a Python script to generate the simulated observables, written to observations.json (unless otherwise specified in the config file). The script is run by the framework with a working directory of the model interface directory (such as ea001)

Arguments are:

  • json config filepath for the study
  • output filename, usually observations.json (perhaps superfluous to define this, but we do)

Outputs:

  • simulated observables, in the file whose name was specified in the arguments (typically observations.json), written to the interface directory
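Putting the arguments and outputs together, a skeleton of such a user script might look like the following. This is a sketch only; the real study uses comp_sim_obs_UKESM1_1.py, and the computation of real observables from model data is elided here:

import argparse
import json

# parse the two arguments described above
parser = argparse.ArgumentParser(description="write simulated observables")
parser.add_argument("config", help="JSON config filepath for the study")
parser.add_argument("output", help="output filename, usually observations.json")
args = parser.parse_args()

# the framework runs this with the interface directory (e.g. ea001) as cwd,
# so model data is reachable through the "history" or "cylc_dir" symlinks
# described below; a real script would read diagnostics from there and
# reduce them to scalar observables (placeholder value used here)
simobs = {"example_obs_tst": 0.0}

with open(args.output, "w") as fp:
    json.dump(simobs, fp, indent=4)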

To access the model output data, code uses symlinks that the optclim tasks have set up in the interface directory, pointing to the Cylc suite's directory in which data derived by the model can be found. This can be either history, which links to the History_data directory in Cylc for that suite, or cylc_dir, the top directory. These are set up by the optclim_prerun and optclim_postrun tasks. So, for example, in the PP...out file in a study's jobOutput directory is:

/work/n02/shared/mjmn02/OptClim/st2024/post_process/simobs_wrapper_UKESM1_1.sh UKESM_2.json observations.json
where the pwd is /work/n02/shared/mjmn02/OptClim/optclim2A/studies/UKESM_P2/898/on/on003

A wrapper script around Python is used to set the Python environment. NOTE: these simobs scripts are called with a working directory of the interface directory, so paths to data should be coded relative to one of the symbolic links provided. The output goes into that interface directory.

Steps to take to define simobs:

In the study's JSON file set up each simulated observable with

  • target values
  • covariances - to be documented

Examples of the user-generated functions to generate simulated observations:

dummy calculation of simobs

In the study JSON:

   "postProcess": {
        "comment": "Options to control post processing. Details depend on your post processing script.",
        "script": "$OPTCLIMTOP/UKESM/simobs_dummy.sh",

where simobs_dummy.sh takes the run parameters and copies them into an observations.json results file, in a study set up for two parameters and the two corresponding simobs:

echo "------------------------------------------------------"
echo in simobs_dummy.sh for crude initial test of UKESM-OptClim workflow
echo working directory is now:
pwd
# just rename the run's parameters to be simulated observables
cp runParams.json observations.json
sed -i -e "s/iau_nontrop_max_p/iau_nontrop_max_p_tst/" observations.json
sed -i -e "s/diagcloud_qn_compregimelimit/diagcloud_qn_compregimelimit_tst/" observations.json

This was used in /work/n02/shared/mjmn02/OptClim/optclim3/studies/UKESM_P2/ex3

realistic simobs calculation

The demo.json in /work/n02/shared/mjmn02/OptClim/optclim3/studies/UKESM_P2/898/on specifies the simobs script. A shell script sets the environment (to provide, for example, a required Python environment) and then calls Python as:

cat $OPT_ST_UKESM/post_process/simobs_wrapper_UKESM1_1.sh
# run in serial queue unless we have changed:
# $OPTCLIMTOP/archer2/postProcess.slurm

# runs as a task in the array job under SLURM.

# currently have this script to create the environment for OPTCLIM
. /work/n02/shared/mjmn02/OptClim/setup_optclim2.sh 
# this is run in the model's directory, and has symlink history to the data.
echo "WD is $PWD"
cmd="$OPT_ST_UKESM/post_process/comp_sim_obs_UKESM1_1.py --verbose --clean $@"
echo calling:  $cmd
result=$($cmd) # run the cmd
echo $result

OPT_ST_UKESM is /work/n02/shared/mjmn02/OptClim/st2024

How OptClim starts a new model instance

OptClimVn2/UKESM.py:306
        pumacmd="/home/n02/n02/mjmn02/dev/ModelOptimisation/Rose/onPUMA/launchArun.sh"
        cloneCmd="ssh puma2 %s %s %s \n" %(pumacmd,self.dirPath,self.refDirPath())
        subout=subprocess.check_output(cloneCmd, shell=True)  # submit the script

If you are not using the mjmn02 code on PUMA2 (which should be OK!), the hard-coded file path needs fixing - see bug 3 in this GitHub project. The user needs the base suite on PUMA2, in their own account, and its suite name in the study's JSON:

"study": {
    "comment": "Parameters that specify the study. Used by framework and not by optimisation routines",
    "referenceModelDirectory": "fake-nrun",

Running the demo

Assuming you have already installed the system - see the page OptClim-UKESM installing.

  1. Copy the base suite to your ~/roses directory on PUMA2:
cp -Lr ~mjmn02/roses/fake-nrun ~/roses # -L means copy symlinks.

Edit ARCHER2_USERNAME in rose-suite.conf, replacing mjmn02. (If you changed the suite name in the copy, also change it in the JSON below.) If you wish to see what the suite looks like, in its first cycle, without running anything:

rose suite-run -l   # registers the suite (puts it in the Cylc directory) but does not run it, due to the -l option.
cylc graph fake-nrun

Doing this register/graph step is not necessary, but it can help to diagnose errors. (Note that a base suite will fail if executed outside OptClim: it needs to be cloned, and the clone run inside OptClim - this being automated by OptClim.)

  2. On ARCHER2 create a directory from which to start the study, and cd into it.

  3. Copy the JSON that defines the run:

cp /work/n02/shared/mjmn02/OptClim/optclim3/studies/UKESM_P2/demo/demo1.json .

and check that the path to the postprocess dummy-simobs script is correct for you. Make the template's prefix for jobs (in the JSON) unique, otherwise there can be confusion in the Cylc runs.

  4. Set up your OptClim environment:
. ~/setup_optclim.sh
  5. Start the suite.

On ARCHER2, in the study's top directory that holds the JSON config:

runAlgorithm.py demo1.json > trace.txt 2>&1 &

It takes a few minutes to load the Python libraries, run the optimiser, set up the interface directories and write a runParams.json file in each. The redirection to trace.txt is optional; it lets you watch the trace while the study is running:

tail -f trace.txt

Wait a few minutes before worrying, then:

  • On PUMA2, "cylc scan" or "cylc gscan &" shows all current suites; right-click on one to see its progress with gcylc. Note that the Cylc server polls on a fairly slow cycle, so do not expect real-time updates in the GUI for a suite, nor quick changes in the SLURM job queue.

  • On ARCHER2, "squeue -u $USER -l" shows jobs in the queue. The first to be set up will be:
    -- with prefix PP, the array of jobs corresponding to each run, each held until it is released by the postrun task using optclim_finished
    -- with prefix RE, the job released on completion of the array, which reruns runAlgorithm

For example

squeue -u mjmn02 -l
Thu Mar 28 15:58:30 2024
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
           6084151    serial    RE...   mjmn02  PENDING       0:00     20:00      1 (Dependency)
     6084150_[1-5]    serial    PP..   mjmn02  PENDING       0:00     20:00      1 (JobHeldUser)

These batch jobs direct STDOUT and STDERR output to studydir/jobOutput/PP.... and studydir/jobOutput/RE....

How to stop a suite

On ARCHER2, kill jobs with "scancel".

On PUMA2, to see what's running and then kill a selected suite:

cylc scan
cylc stop suitename

Use cylc gscan for an updated graphical presentation - you can stop suites from there too.
