Recovery from output stager failures - NOAA-GFDL/CEFI-regional-MOM6 GitHub Wiki

1. Double-check that the output stager is not currently running.

On a Gaea login node, run squeue -u $USER -o "%.18i %.2t %.9P %.50j". The output will look something like this:

             JOBID ST PARTITION                                                NAME
         207218841  R     batch                       OM4_0125_COBALTv3_jra55_const
          67717255  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218635.output.st
          67719815  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218744.output.st
          67718118  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218685.output.st
          67719886  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218744.output.st
          67718911  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218720.output.st
          67718958  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218720.output.st
          67720673  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218778.output.st
          67720747  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218778.output.st
          67721568  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218797.output.st
          67721669  R dtn_f5_f6  OM4_0125_COBALTv3_jra55_const.o207218797.output.st

All output stager jobs run on the dtn_f5_f6 partition. In this case I have ten output stager jobs running. None of them are for the experiment that I am trying to restart (NWA12_COBALT_2024_09, used in the steps below); they all belong to OM4_0125_COBALTv3_jra55_const.
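When many jobs are queued, it can help to filter the listing down to one experiment. The following is a minimal sketch, assuming a bash shell and assuming that stager job names start with the experiment name followed by .o, as in the listing above. The function name stager_jobs_for is hypothetical and not part of FRE.

```shell
# Hypothetical helper: read squeue-style lines on stdin and print only the
# output stager jobs whose name starts with the given experiment name.
# Assumes job names look like <experiment>.o<jobid>.output.st..., as above.
stager_jobs_for() {
    experiment="$1"
    # With the format string used above, field 3 is the partition and
    # field 4 is the job name. A pattern with no action prints the line.
    awk -v name="$experiment" '$3 == "dtn_f5_f6" && index($4, name ".o") == 1'
}

# Usage on a Gaea login node (-h suppresses the squeue header line):
# squeue -u "$USER" -h -o "%.18i %.2t %.9P %.50j" | stager_jobs_for NWA12_COBALT_2024_09
```

If the command prints nothing, no output stager jobs matching that experiment name are currently queued or running.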

2. cd to the fre experiment directory.

The fre experiment directory will be within your scratch space on f5 or f6, followed by the FRE stem, followed by the experiment name and platform. In the example case I'm using, the full path to the experiment directory is /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod

3. Check if there are any lock files and remove if necessary.

Within the experiment directory, run ls state/run/*.lock. If you see lock files left over from the output stager job that failed, you should remove them:

> ls state/run/*.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.A.args.lock  NWA12_COBALT_2024_09.o207177135.output.stager.19960101.R.args.lock
NWA12_COBALT_2024_09.o207177135.output.stager.19950101.H.args.lock

> rm -f state/run/*.lock

If you have actively running jobs in addition to the job that failed, use rm with more care: remove only the lock files that belong to the failed job rather than everything matching *.lock.
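One way to limit the removal is to match on the job ID that appears in the lock file names. This is a minimal sketch, assuming a bash shell and GNU find; the function name remove_locks_for_job is hypothetical, and the file name pattern is inferred from the example listing above rather than from FRE documentation.

```shell
# Hypothetical helper: remove only the lock files left by one failed run.
# $1 is the state/run directory, $2 is the job ID embedded in the lock file
# names (for example 207177135 in NWA12_COBALT_2024_09.o207177135.output.stager...).
remove_locks_for_job() {
    run_dir="$1"
    job_id="$2"
    # -print lists each file as it is deleted, so the removal can be reviewed.
    find "$run_dir" -maxdepth 1 -type f \
        -name "*.o${job_id}.output.stager.*.lock" -print -delete
}

# Usage from the experiment directory:
# remove_locks_for_job state/run 207177135
```

Lock files from other job IDs, and the .args files themselves, are left in place.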

4. Run output.retry.

Stay in the same experiment directory. Load the appropriate fre module (e.g., module load fre/test) if you haven't already. Then run output.retry state/run. This will automatically attempt to submit all output stager jobs that haven't completed.

5. If needed, remove records of previous attempts to retry the output stager.

Sometimes the output stager jobs have already been automatically retried the maximum number of times (6). In this case, output.retry will report an error. FRE keeps track of how many times it has retried a job by appending @ xferRetry++ to the file containing arguments for the job. You can delete all past records of retrying and re-run output.retry by running find state/run/ -name '*.args' | xargs sed -i '/xferRetry++/d' && output.retry state/run/.
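Before deleting the retry records, it may be useful to see how many each job has accumulated. The sketch below assumes a bash shell and assumes that each retry adds one line containing xferRetry++ to the .args file; the function name count_retries is hypothetical.

```shell
# Hypothetical helper: print "<count> <file>" for every .args file in the
# given directory, where <count> is the number of lines containing xferRetry++.
count_retries() {
    run_dir="$1"
    find "$run_dir" -maxdepth 1 -type f -name '*.args' | sort |
    while IFS= read -r args_file; do
        # grep -c prints the number of matching lines; -F treats ++ literally.
        # grep exits non-zero when the count is 0, hence the "|| true".
        count=$(grep -c -F 'xferRetry++' "$args_file" || true)
        printf '%s %s\n' "$count" "$args_file"
    done
}

# Usage from the experiment directory:
# count_retries state/run
```

Files that report a count of 6 are the ones that have hit the retry limit described above.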

6. If all else fails, transfer the files manually with gcp.

Normally, as long as the local output stager job completed, the raw files will be stored in the same FRE experiment directory. For example, my history files for this example experiment are stored within /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history. To transfer a tar file to GFDL, I would run:

cd /gpfs/f6/ira-cefi/scratch/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/ncrc6.intel23-prod/archive/history
module load gcp
gcp --batch 19940101.nc.tar gfdl:/archive/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/gfdl.ncrc6-intel23-prod/history/

Note that including --batch submits the transfer as a job to a data transfer node. If the data transfer nodes have failed completely, this option probably won't work either.
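When several tar files need to be transferred, the gcp commands can be generated and reviewed before any of them are run. This is a minimal sketch, assuming a bash shell and assuming the history files follow the <date>.nc.tar naming shown above; the function name print_gcp_commands is hypothetical, and it only prints commands rather than calling gcp.

```shell
# Hypothetical helper: print one gcp command per history tar file so the
# list can be reviewed before anything is transferred.
# $1 is the local history directory, $2 is the gcp destination directory.
print_gcp_commands() {
    history_dir="$1"
    destination="$2"
    for tar_file in "$history_dir"/*.nc.tar; do
        # Skip the unexpanded pattern when no tar files exist.
        [ -e "$tar_file" ] || continue
        printf 'gcp --batch %s %s\n' "$tar_file" "$destination"
    done
}

# Usage: review the printed commands, then paste or pipe them to a shell
# after running "module load gcp".
# print_gcp_commands archive/history gfdl:/archive/Andrew.C.Ross/fre/NWA/2024_09/NWA12_COBALT_2024_09/gfdl.ncrc6-intel23-prod/history/
```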