access_UMResubmission - ACCESS-NRI/accessdev-Trac-archive GitHub Wiki

Climate model automatic resubmission and restarting after errors

The UM has a facility for automatically resubmitting another job after successful completion. This can be used to do long climate simulations in manageable chunks. In the vn7.3 UMUI the chunk size is set in the follow-on panel to the "Job submission, resources and re-submission" panel (via the NEXT button). In vn8.0 and later there's a separate "Re-submission pattern" panel.

A normal job processed by the UMUI has TYPE=NRUN in the SUBMIT file. This forces a new run, starting from the dump file and date specified in the UMUI. A continuation run has TYPE=CRUN. From UM vn8.1 on this can be set as a UMUI option, but in earlier versions it must be set via a hand-edit or by directly editing the SUBMIT file after processing and before submitting.

After each dump file is written (i.e. at each potential restart point), the model writes a temporary history file RUNID.thist in the run directory (where RUNID is the name of the experiment). At the successful completion of a model job the utility script qspickup adds the temporary history to a permanent history file RUNID.phist and then deletes the temporary file. Both the thist and phist files have the same format, containing five Fortran namelists:

&NLIHISTO
&NLCHISTO
&NLIHISTG
&NLCHISTG
&NLCFILES

The phist file has a set of these for each job restart, with the most recent at the top.
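As a rough illustration of this layout, the sketch below splits history-file text into one group of lines per restart. It assumes only that each set begins with a line starting with `&NLIHISTO`; the function name `split_history_sets` and the abbreviated sample text are hypothetical and are not part of the UM scripts.

```python
def split_history_sets(text):
    """Split history-file text into one list of lines per namelist set.

    Assumption: every set begins with a line starting with '&NLIHISTO'.
    """
    sets = []
    for line in text.splitlines():
        if line.lstrip().startswith("&NLIHISTO"):
            sets.append([])      # a new set starts here
        if sets:                 # ignore anything before the first set
            sets[-1].append(line)
    return sets


# Abbreviated, illustrative sample: the most recent set comes first.
example = """&NLIHISTO
&NLCFILES
 ARESTART = 'ARESTART: $DATAM/xahnea.dak8a10',
&NLIHISTO
&NLCFILES
 ARESTART = 'ARESTART: $DATAM/xahnea.dak8710',
"""
sets = split_history_sets(example)
print(len(sets), "restart sets found")
```

The first element of `sets` corresponds to the most recent restart, matching the ordering of the phist file.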

Relevant parts of a phist file from a run that has completed a 3 month block are:

&NLIHISTO
 RUN_RESUBMIT_TARGET = 30, 9, 1, 3*0,

End time of the last completed run, given as nyears, nmonths, ndays, etc.

&NLIHISTG
 H_STEPIM        =                539088, 3*0,

Number of model time steps completed (first value is atmospheric model, others are components of the HadGEM2 coupled model and so not relevant here).

&NLCHISTG
 END_DUMPIM = 'xahnea.dak8a10', 3*'              ',
&NLCFILES
 ASTART   = 'ASTART  : $DATAM/xahnea.dak8710                                                 ',
 ARESTART = 'ARESTART: $DATAM/xahnea.dak8a10                                                 ',

This specifies the start and end files for the previous block. The next run will start with the ARESTART file.
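To check which dump the next run will pick up without opening the file by hand, something like the following could be used. It is a minimal sketch that assumes the entry has the `ARESTART = 'ARESTART: path'` form shown above; `restart_dump` is a hypothetical helper name.

```python
import re


def restart_dump(nlcfiles_text):
    """Return the path after 'ARESTART:' in the first ARESTART entry, or None."""
    match = re.search(r"ARESTART\s*=\s*'ARESTART\s*:\s*([^\s']+)", nlcfiles_text)
    return match.group(1) if match else None


sample = """&NLCFILES
 ASTART   = 'ASTART  : $DATAM/xahnea.dak8710          ',
 ARESTART = 'ARESTART: $DATAM/xahnea.dak8a10          ',
"""
print(restart_dump(sample))
```

Because the phist file lists the most recent set first, applying this to the whole file returns the restart dump of the latest completed block.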

When a CRUN job starts it first checks for the existence of a RUNID.thist file. If this exists (indicating that the previous run did not complete properly), the job restarts using the information in it. Otherwise it uses the information in the RUNID.phist file.
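The selection rule described above can be mimicked in a few lines, which is handy for checking what a CRUN would do before resubmitting. This is only a sketch of the rule as stated here, not the UM scripts themselves; the function name and the run ID `abcde` are hypothetical.

```python
import tempfile
from pathlib import Path


def history_file_for_crun(run_dir, runid):
    """Return the history file a CRUN would read, following the rule above."""
    run_dir = Path(run_dir)
    thist = run_dir / f"{runid}.thist"
    phist = run_dir / f"{runid}.phist"
    # A leftover thist file means the previous job did not finish cleanly.
    return thist if thist.exists() else phist


# Demonstration in a scratch directory with a made-up run ID.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "abcde.phist").write_text("")
    without_thist = history_file_for_crun(tmp, "abcde").name
    (Path(tmp) / "abcde.thist").write_text("")
    with_thist = history_file_for_crun(tmp, "abcde").name

print(without_thist, with_thist)
```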

Sometimes it's necessary to intervene in this process, for example, to restart after a crash.

Restarting after crashes

The ACCESS climate runs occasionally crash due to large vertical velocities developing over the Himalayas (frequency is of order once per decade). These incidents could likely be prevented by running with a shorter timestep but it's more economical to simply restart the model from a perturbed dump file. Unless the crash is very close in time to the dump this is usually sufficient to avoid the problem. The normal Met Office approach is to rerun the last month with an increased number of convection calls per timestep. This is effectively perturbing the model evolution by a change in the physics and has the same effect as perturbing the initial condition.

To perturb the dump file on raijin:

% module use ~access/modules
% module load pythonlib/umfile_utils
% python ~access/apps/pythonlib/umfile_utils/perturbIC.py dumpfile

Note that this modifies the file in place. Resubmitting the job will restart from this last dump file.

The default is a perturbation of amplitude 0.01 K applied to the potential temperature. If the restarted model still crashes then this can be increased using the -a argument, e.g.

python ~access/apps/pythonlib/umfile_utils/perturbIC.py -a 0.1 dumpfile
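To give a feel for what a perturbation of a given amplitude means, here is a toy sketch that adds bounded random noise to a list of temperatures. It is not the perturbIC.py implementation: the uniform distribution, the function name `perturb` and the sample values are all assumptions made for illustration.

```python
import random


def perturb(theta, amplitude=0.01, seed=None):
    """Add noise drawn uniformly from [-amplitude, amplitude] to each value.

    Assumption: uniform noise; the source only states the amplitude.
    """
    rng = random.Random(seed)
    return [t + rng.uniform(-amplitude, amplitude) for t in theta]


theta = [300.0, 301.5, 299.2]  # made-up potential temperatures in K
perturbed = perturb(theta, amplitude=0.01, seed=1)

# Largest change applied to any point, which should not exceed the amplitude.
max_change = max(abs(p - t) for p, t in zip(perturbed, theta))
print(max_change)
```

Raising `amplitude` to 0.1 corresponds to the `-a 0.1` case above: the noise stays small relative to the field but is large enough to change the model's trajectory.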

Restarting from an earlier dump file

The cleanest way to do this is perhaps an NRUN from the required file, followed by a switch back to the usual CRUN process. However it's also possible to do it more directly by manipulating the history files.

To restart from the end of the last successfully completed run (rather than from an intermediate dump file), just remove the RUNID.thist file. To restart from an even earlier point you can change the RUNID.phist file. Remove sets of the 5 namelists to get back to the desired starting position (i.e. the first occurrence of ARESTART has the name of the file you want to start from). Be careful that the namelists are still in the correct order. The first one should be &NLIHISTO.
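The phist edit described above could be scripted roughly as follows. The sketch assumes each set starts with a line beginning `&NLIHISTO` and that the restart entry is a line beginning `ARESTART`; `trim_history` and the abbreviated sample text are hypothetical. It works on text only, so the result can be written to a copy and inspected before replacing the real file.

```python
def trim_history(text, wanted_dump):
    """Drop the newest namelist sets until the first ARESTART names wanted_dump."""
    sets = []
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith("&NLIHISTO"):
            sets.append([])          # start of a new set
        if sets:
            sets[-1].append(line)
    for i, block in enumerate(sets):
        arestart = [l for l in block if l.lstrip().startswith("ARESTART")]
        if arestart and wanted_dump in arestart[0]:
            # Keep this set and all older ones, in their original order.
            return "".join("".join(b) for b in sets[i:])
    raise ValueError(f"no set restarts from {wanted_dump}")


# Abbreviated, illustrative history text with the most recent set first.
history = """&NLIHISTO
&NLCFILES
 ARESTART = 'ARESTART: $DATAM/xahnea.dak8a10',
&NLIHISTO
&NLCFILES
 ARESTART = 'ARESTART: $DATAM/xahnea.dak8710',
"""
trimmed = trim_history(history, "xahnea.dak8710")
print(trimmed)
```

Because whole sets are removed from the top, the remaining text still begins with `&NLIHISTO` and the namelists keep their original order.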

Changing ancillary files

Again, not really recommended behaviour but it can be useful.

When runs are resubmitted automatically the model gets the names of the ancillary files from the RUNID.phist file. E.g.

&NLCFILES
 SULPEMIS = 'SULPEMIS : $CMIP5ANCIL/scycle_1850_2000_IPCCf',

If the model run fails because you've run off the end of an ancillary file you might want to switch to another one to continue the run, e.g. in an AMIP run from 1979-2010 that crosses the end of the historical emissions files. Again the correct approach is to change the files in the UMUI job and resubmit as an NRUN and then a CRUN (even better would be to create a new set of ancillary files that cover the whole of the required period). The quick fix is to change the names of the files directly in the RUNID.phist file. It's only necessary to change the most recent (top) namelist. E.g. one could set

SULPEMIS = 'SULPEMIS : $CMIP5ANCIL/sulp_RCP45_2000_2100f.N96',

and then resubmit the job.
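A minimal sketch of that quick fix, changing the ancillary name only in the most recent (top) set and leaving older sets untouched, might look like this. It assumes each set begins with a line starting with `&NLIHISTO`; `replace_in_top_set` and the abbreviated sample text are hypothetical.

```python
def replace_in_top_set(text, old, new):
    """Replace old with new only inside the first (most recent) namelist set."""
    lines = text.splitlines(keepends=True)
    starts = [i for i, l in enumerate(lines) if l.lstrip().startswith("&NLIHISTO")]
    # The top set ends where the second set begins, or at end of file.
    end = starts[1] if len(starts) > 1 else len(lines)
    top = "".join(lines[:end]).replace(old, new)
    return top + "".join(lines[end:])


# Abbreviated, illustrative history text with two restart sets.
history = """&NLIHISTO
&NLCFILES
 SULPEMIS = 'SULPEMIS : $CMIP5ANCIL/scycle_1850_2000_IPCCf',
&NLIHISTO
&NLCFILES
 SULPEMIS = 'SULPEMIS : $CMIP5ANCIL/scycle_1850_2000_IPCCf',
"""
updated = replace_in_top_set(
    history, "scycle_1850_2000_IPCCf", "sulp_RCP45_2000_2100f.N96"
)
print(updated)
```

Keeping a copy of the original phist file before writing the modified text back makes it easy to undo the change.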