CMCC_CM debugging techniques - CMCC-Foundation/CMCC-CM GitHub Wiki
This page contains techniques that are useful when the model is crashing. It does not cover parallel debuggers, since their availability is often limited.
If you run into build or run-time errors that you don't understand, first try building and running the model with compiler debug flags enabled. To do this, set DEBUG to TRUE in env_build.xml in your case directory, then rebuild and rerun the model.
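As a rough sketch of the steps above, the shell function below sets DEBUG, cleans the old build, rebuilds, and resubmits the case. The function name rebuild_with_debug is hypothetical, and the --clean-all step is an assumption: CIME normally asks for a clean rebuild after DEBUG changes, but check the messages printed by case.build on your machine.

```shell
# Hypothetical helper: rebuild and rerun a case with compiler debug flags.
# Usage: rebuild_with_debug /path/to/case
rebuild_with_debug() {
    (
        cd "$1" || exit 1
        ./xmlchange DEBUG=TRUE || exit 1     # recorded in env_build.xml
        ./case.build --clean-all || exit 1   # assumed: discard objects built without debug flags
        ./case.build || exit 1
        ./case.submit
    )
}
```

The subshell keeps the cd from changing the directory of the calling shell.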
The model should still fail, but the logs should be more informative and may even point you to the exact line of code that has an error. If the model fails at run time, this "stack trace" to the problematic line will most likely be in the cesm.log.XXXX file, which is usually located in your case's run directory.
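One way to find that information quickly is to search the newest cesm.log file for typical failure keywords. The sketch below assumes the logs are uncompressed and named cesm.log.*; the function name show_run_errors and the keyword list are illustrative guesses, not anything defined by the model.

```shell
# Hypothetical helper: print the last few suspicious lines from the newest
# cesm.log.* file in a run directory.
# Usage: show_run_errors /path/to/run
show_run_errors() {
    local rundir=$1 logfile
    # ls -t sorts by modification time, newest first
    logfile=$(ls -t "$rundir"/cesm.log.* 2>/dev/null | head -n 1)
    if [ -z "$logfile" ]; then
        echo "no cesm.log.* file found in $rundir" >&2
        return 1
    fi
    echo "Searching $logfile"
    # Assumed keywords; adjust them to your compiler's runtime messages
    grep -n -i -E 'error|abort|traceback|forrtl' "$logfile" | tail -n 20
}
```

The line numbers printed by grep -n make it easy to open the log at the right place and read the surrounding context.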
Finally, if you are using the Intel compiler then you may receive a specific error code associated with your runtime failure. A description of what that error code means can be found on Intel's website.
If running with debug flags did not help you track down the issue, try running with a different Fortran compiler if possible, ideally with DEBUG still set to TRUE. Often one compiler will report an error that another compiler simply ignores or tries to manage on its own. It might even generate a proper stack trace where the first compiler failed to do so.
The easiest way to change your compiler (when on a supported machine) is to create a new CMCC-CM case using create_newcase with the --compiler flag, which specifies which compiler to use. If you aren't sure which compilers are available on your machine, run the command:
query_config --machines
and search for your machine's name; its entry should include a "compilers" line that lists all available compiler options. This command is located in the same directory as create_newcase. Again, please note that these instructions will only work for machines where CMCC-CAM, CMCC-CM, or CIME have been properly ported.
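A minimal sketch of creating such a case is shown below. The function name new_case_with_compiler is hypothetical, and the compset, resolution, and compiler values are left to the caller, since the valid choices depend on your machine and model version.

```shell
# Hypothetical helper: create a new case that uses a specific compiler.
# Usage: new_case_with_compiler SCRIPTS_DIR CASE_DIR COMPSET RES COMPILER
#   SCRIPTS_DIR is the directory that contains create_newcase
#   COMPILER    is one of the names on your machine's "compilers" line
new_case_with_compiler() {
    "$1"/create_newcase --case "$2" --compset "$3" --res "$4" --compiler "$5"
}
```

Using the same compset and resolution as the failing case keeps the comparison between compilers meaningful.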
Sometimes, it seems like you make a change to your code but nothing changes in the log or the output. Often, this is because your new code is not being compiled into the model. To check this, find these three pieces of information:
- In your case directory, look at your CaseStatus file. Search from the bottom for the message, "case.build success". Record the date on that line.
- Record the modification date of the source file you modified (ls -l <file>).
- Also in your case directory, find the source directory (./xmlquery SRCROOT).
If the date / time of your source is newer than the date / time in your CaseStatus, you may just need to rebuild your case (./case.build). Depending on what changes you have made, you might also need to do a full rebuild (rm -rf bld; ./case.setup --reset; ./case.build). Make sure you are building the correct code (see next item).
If the CMCC-CAM root directory containing your modified source does not match the SRCROOT directory, then you are modifying the wrong file (yeah, we've all done that). Either create a case from the right source directory or move your modifications into the SRCROOT directory.
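The two checks above can be sketched as small shell functions. This sketch assumes that CaseStatus lines start with a "YYYY-MM-DD HH:MM:SS" timestamp and that GNU date is available (date -r FILE prints a file's modification time); both function names are hypothetical.

```shell
# Hypothetical helper: is the source file newer than the last successful build?
# Usage: check_build_is_current CASEDIR SRCFILE
check_build_is_current() {
    local casedir=$1 srcfile=$2 build_time src_time newest
    # Assumed line format: "YYYY-MM-DD HH:MM:SS: case.build success"
    build_time=$(grep 'case.build success' "$casedir/CaseStatus" | tail -n 1 | cut -c1-19)
    if [ -z "$build_time" ]; then
        echo "no successful build recorded"
        return 1
    fi
    src_time=$(date -r "$srcfile" '+%Y-%m-%d %H:%M:%S')   # GNU date assumed
    # Timestamps in this format sort chronologically as plain text
    newest=$(printf '%s\n%s\n' "$build_time" "$src_time" | sort | tail -n 1)
    if [ "$src_time" = "$newest" ] && [ "$src_time" != "$build_time" ]; then
        echo "stale: source modified after last build"
    else
        echo "current: last build is newer than source"
    fi
}

# Hypothetical helper: does the file live under the case's SRCROOT?
# Usage: is_under_srcroot SRCROOT SRCFILE   (both as absolute paths)
is_under_srcroot() {
    case "$2" in
        "$1"/*) echo "yes" ;;
        *)      echo "no"  ;;
    esac
}
```

For example, is_under_srcroot "$(./xmlquery SRCROOT --value)" /abs/path/to/file.F90 prints "no" when the edited file belongs to a different checkout; the --value option of xmlquery is assumed to be available in your CIME version.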
The easiest way to stop CMCC-CAM is to set STOP_OPTION and STOP_N to values which stop the model after a particular time step. However, if something is causing the model to crash, it can be helpful to stop the model at a particular point in the code. This can be done by inserting a call to endrun into the code.
First, ensure that endrun is imported. Make sure this statement is in the subroutine or module where you want to call endrun:
use shr_sys_mod, only: endrun => shr_sys_abort
Then, simply insert a call to endrun in the desired location:
call endrun('Stopping CMCC-CAM')
This message will show up in the atm.log.######.<machine>.<date>-<time> file if the masterproc (MPI task 0) hits this call and in the cesm.log.######.<machine>.<date>-<time> file for most other MPI tasks (sometimes the program quits before all messages are written to the log file).
To make the endrun message more useful, create a message:
! Add this statement where the other routine variables are declared
character(len=256) :: error_msg
[ . . . other statements . . .]
! Add these statements where you want to stop the model, add formatting in place of * if desired
write(error_msg, *) 'Stopping the model because X (', x, ') is > 1234.0'
call endrun(trim(error_msg))
The endrun call can even be added conditionally:
if (x > 1234.0_r8) then
   write(error_msg, *) 'Stopping the model because X (', x, ') is > 1234.0'
   call endrun(trim(error_msg))
end if
or
if (nstep >= 12) then
   call endrun('Stopping the model at nstep >= 12')
end if
This is a common problem with short runs. This error message is a hint that there is a timing mismatch. The cause is often that, by default, the runoff model and/or the land ice model runs much less frequently than the atmosphere, and the run length is not an integer multiple of the slowest component's coupling interval.
The solution is to choose a run length that allows an integer number of component runs for every component (e.g., ./xmlchange STOP_OPTION=nsteps,STOP_N=48 when the atmosphere timestep is 30 minutes) or turn up the run frequency of the other components (./xmlchange ROF_NCPL=48,GLC_NCPL=48).
BTW, an easy way to get a snapshot of the various run frequencies is ./xmlquery --partial NCPL.
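The arithmetic behind choosing a run length can be sketched as follows. The sketch assumes that every NCPL value counts couplings over the same base period, that one atmosphere step equals one atmosphere coupling interval, and that each component's NCPL divides ATM_NCPL evenly; the function names are hypothetical.

```shell
# Greatest common divisor (Euclid's algorithm).
gcd() {
    local a=$1 b=$2 t
    while [ "$b" -ne 0 ]; do
        t=$((a % b)); a=$b; b=$t
    done
    echo "$a"
}

# Hypothetical helper: smallest run length, in atmosphere steps, that is a
# whole number of coupling intervals for every listed component.
# Usage: min_run_steps ATM_NCPL OTHER_NCPL...
min_run_steps() {
    local atm=$1 steps=1 ncpl interval
    shift
    for ncpl in "$@"; do
        if [ $((atm % ncpl)) -ne 0 ]; then
            echo "NCPL value $ncpl does not divide ATM_NCPL=$atm" >&2
            return 1
        fi
        interval=$((atm / ncpl))   # atmosphere steps per coupling of this component
        # least common multiple of the running result and this interval
        steps=$((steps * interval / $(gcd "$steps" "$interval")))
    done
    echo "$steps"
}
```

With illustrative values ATM_NCPL=48, ROF_NCPL=8, and GLC_NCPL=1, min_run_steps 48 8 1 prints 48, so STOP_N should be a multiple of 48 when STOP_OPTION=nsteps.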
cam_snapshot is a set of routines which will write out all of the fields in state, constituents, tend, ptend, cam_in, cam_out and pbuf along with a few fields that are just local to tphysac and tphysbc. The times that these fields are written out are controlled by the "cam_snapshot_before" and "cam_snapshot_after" types of variables. "cam_snapshot_before" variables are used to capture the model variables before a particular physics parameterization is called and "cam_snapshot_after" is used to capture variables after the parameterization. cam_snapshot is controlled by four namelist variables:
- cam_snapshot_before_num - the output file number for the before snapshots (for example, setting to 6 will result in the values being written to the h5 file)
- cam_snapshot_after_num - the output file number for the after snapshots (for example, setting to 7 will result in the values being written to the h6 file)
- cam_take_snapshot_before - the name of the parameterization before which all fields will be output
- cam_take_snapshot_after - the name of the parameterization after which all fields will be output
In addition, a user will almost always want the information written out on every time step, so the corresponding elements in nhtfrq should be set to 1 in user_nl_cam. If the model is crashing, also set the corresponding elements of mfilt to 1 in user_nl_cam (one time sample per file).
If the cam_take_snapshot_before and cam_take_snapshot_after are set to the same parameterization, then the changes made by that particular parameterization are isolated. If they are set to different parameterizations, then the values will be output before the parameterization specified by cam_take_snapshot_before is called and after the cam_take_snapshot_after parameterization completes.
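Put together, the settings described above might be appended to user_nl_cam as in the sketch below. The function name is hypothetical and the parameterization name is supplied by the caller. Writing nhtfrq and mfilt as full arrays is an assumption: only elements 6 and 7 matter here, and entries 1 to 5 are placeholders that should match whatever your case already uses.

```shell
# Hypothetical helper: append cam_snapshot settings that isolate one
# parameterization, using output files 6 and 7 (h5 and h6).
# Usage: add_snapshot_settings USER_NL_CAM PARAM_NAME
add_snapshot_settings() {
    local nl_file=$1 param=$2
    cat >> "$nl_file" <<EOF
cam_snapshot_before_num  = 6
cam_snapshot_after_num   = 7
cam_take_snapshot_before = '$param'
cam_take_snapshot_after  = '$param'
nhtfrq = 0, 0, 0, 0, 0, 1, 1
mfilt  = 1, 1, 1, 1, 1, 1, 1
EOF
}
```

Because both snapshot variables name the same parameterization, differencing the h6 and h5 files shows the changes made by that parameterization alone.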
cam_pio_dump_field is a function which immediately writes a NetCDF file with information from a field. For example:
call cam_pio_dump_field('CLD', 1, pcols, 1, pver, cld)
will write the field, cld, to a file called CLD_dump_<##>.nc where <##> is a number starting at one and increasing as this call is repeated. The file simply contains the contents as a 3-dimensional array where the first two dimensions are given by the bounds (1:pcols and 1:pver) and the third dimension is the MPI task number (1:npes).
cam_pio_dump_field can also handle 3, 4, and 6-dimensional fields; just call the function with the appropriate number of bounds for the field.
Note that by default, cam_pio_dump_field collects the bounds from all MPI tasks and uses the largest range for the NetCDF file. To skip this step, set the optional variable, compute_maxdim_in, to .false.
pbuf_dump_pbuf is similar to cam_pio_dump_field in that it immediately writes NetCDF files. The main difference is that it cannot be called from a threaded region and requires access to the full pbuf (aka the pbuf2d variable). The call is:
pbuf_dump_pbuf(pbuf2d, name, num)
where pbuf2d is the full pbuf, name is an optional name to be added to each filename, and num is an optional integer to be added to each filename.
pbuf_dump_pbuf then writes a NetCDF file for each field in the pbuf for this run. The file format is the same as for cam_pio_dump_field (see above).