Optimizing History with the PFIO server

Using the PFIO server

Applications using the MAPL_Cap, such as the GEOSgcm, GEOSctm, GEOSldas, GCHP, etc., use MAPL_History for diagnostic output. All output is routed through the PFIO server. Extra resources, beyond what the model itself needs in an MPI sense, can be allotted to the PFIO server to exploit asynchronous output and decrease your application's wall time. By default the user does not have to do this, and History will work just fine if you don't, although perhaps not optimally. This page explains how and when to configure the PFIO server to run on separate resources. While we can give general recommendations for configuring the PFIO server, we cannot emphasize enough that you should run your application multiple times and tune it to find the optimal configuration for your use case; what is appropriate in one use case may not be in another.

Simple Server

As stated in the introduction, the MAPL_History component always writes its output via the PFIO server. However, in the default case (i.e. the user does nothing other than start the application on the number of MPI tasks needed for the model), the PFIO server runs on the same MPI resources as the application. Each time the run method of History executes, it does not return until all the files to be written in that step have been completed. All the data aggregation and writing is done on the same MPI tasks as the rest of the application, so the application cannot proceed until all output for that step is complete. There is no asynchronicity or overlap between compute and write in this case.

At low model resolutions, or in cases with little History output, this is sufficient. For example, if you are running GEOSgcm.x at c24/c48/c90 for development purposes with a modest History output on 2 or 3 nodes, there is no sense in dedicating any extra resources to the PFIO server.
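For example, a default launch with no extra server resources (the rank count here is purely illustrative, assuming 3 nodes with 45 cores each) would simply be

mpirun -np 135 ./GEOSgcm.x

In this mode the PFIO server shares those 135 ranks with the model, so History blocks the model while each write completes.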

Multigroup Server

To exploit asynchronous output with History, we recommend using the multigroup server option of the PFIO server. With the PFIO server, the model (or application) does not write the data to disk directly. Instead, the user launches the application on more MPI tasks than are needed for the model itself; the extra MPI tasks are dedicated to running the PFIO server. When the user chooses the "multigroup" option, the server is itself split into a "frontend" and a "backend". Only the "backend" actually writes to disk.

The "frontend" of the server functions as a memory buffer. When History decides it is time to write, the data is processed if necessary (regridding for example) to the final form. Then the data is forwarded from the application MPI ranks to the "front end" of the server which is on a different set of MPI ranks. As soon as the data is forwarded the model continues.

Once all the data has been received by the "frontend" of the server, it is forwarded to the "backend", on yet another set of MPI ranks. In the current implementation, each collection to be written is forwarded to a single backend processor, chosen from whichever are available. Some backend ranks may still be writing from the previous write request; that is fine as long as some backend resources remain available, although it does mean the number of collections being written concurrently is limited by the number of backend ranks. Also note that this implies each collection must fit in the memory of a single node.

The user specifies the total number of nodes to use for the PFIO server and the number of MPI ranks per node that will be allocated to the "backend". Note that this also determines the size of the "frontend". These are specified on the command line when running GEOSgcm.x or any other program that uses the MAPL_Cap. As an example, suppose we have 16 nodes with 45 cores per node. We want to use 540 cores for the model (so NX*NY=540), which leaves 180 cores, or 4 nodes, for the server, and we choose 20 backend MPI ranks per node. This results in 80 cores on the "backend" and 100 on the "frontend". You would launch GEOSgcm.x like so:

mpirun -np 720 ./GEOSgcm.x --npes_model 540 --nodes_output_server 4 --oserver_type multigroup --npes_backend_pernode 20

Note that we have to explicitly tell the application how many cores are assigned to the model.
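To make the arithmetic in that example explicit:

total ranks = 16 nodes * 45 cores = 720

model ranks = 540 (12 nodes)

server ranks = 720 - 540 = 180 (4 nodes)

backend ranks = 4 nodes * 20 per node = 80

frontend ranks = 180 - 80 = 100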

Recommendations

For the best performance, users should try different configurations of PFIO for a specific run. They will generally find that after a certain point they hit a limit where the wall time no longer decreases despite adding more resources. In general, there is a "reasonable" estimated configuration for users to start with. If your model requires NUM_MODEL_PES cores, each node has NUM_CORES_PER_NODE cores, and the total number of History collections is NUM_HIST_COLLECTION, then

MODEL_NODE = NUM_MODEL_PES / NUM_CORES_PER_NODE

O_NODES = (NUM_HIST_COLLECTION + 0.1*NUM_MODEL_PES) / NUM_CORES_PER_NODE

NPES_BACKEND = NUM_HIST_COLLECTION / O_NODES

TOTAL_PES = (MODEL_NODE + O_NODES) * NUM_CORES_PER_NODE

All of the above numbers should be rounded up to integers.

The run command line would look like

mpirun -np TOTAL_PES ./GEOSgcm.x --npes_model NUM_MODEL_PES --nodes_output_server O_NODES --oserver_type multigroup --npes_backend_pernode NPES_BACKEND
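As a concrete, purely illustrative example, suppose NUM_MODEL_PES = 540, NUM_CORES_PER_NODE = 45, and NUM_HIST_COLLECTION = 30 (these numbers are hypothetical; substitute your own). Applying the formulas above and rounding up:

MODEL_NODE = 540/45 = 12

O_NODES = (30 + 0.1*540)/45 = 84/45, rounded up to 2

NPES_BACKEND = 30/2 = 15

TOTAL_PES = (12 + 2)*45 = 630

so the launch command would be

mpirun -np 630 ./GEOSgcm.x --npes_model 540 --nodes_output_server 2 --oserver_type multigroup --npes_backend_pernode 15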

Dataflow

The following figure represents the data flow when writing a single collection. Note that the blue MPI ranks are free to move on as soon as they have sent their data to the red MPI ranks.

(Figure: multigroup server data flow)