Output - cholla-hydro/cholla GitHub Wiki

This page describes Cholla data and outputs.

Basic Background

Cholla simulations run on a spatial domain corresponding to a rectangular prism. Cholla primarily deals with 2 types of data:

  • grid-data: Regularly spaced data composed of a large grid/mesh of equal-sized cells. During simulations, we commonly call this "field" data. The canonical example of this datatype is the collection of finite-volume fields used for hydrodynamics. These are tracked at cell-centers. There are also other kinds of fields (e.g. cell-face-centered magnetic fields used in Constrained Transport).

  • particle-data: Cholla can also track a collection of "particles," where each particle has unique set of properties. These properties always include an id, position, and velocity. They can also have other properties (e.g. mass). These are commonly used to model dark matter or star-clusters.

Domain Partitioning

The simulation domain is partitioned into 1 or more equal volume (non-overlapping) "blocks." The blocks are conceptually organized on a 3D "block-location" grid, whose shape we denote as (BLx, BLy, BLz). When a simulation starts, this shape is always chosen to match (n_proc_x, n_proc_y, n_proc_z). Each process in a simulation evolves the contents for a separate block.

How decomposition relates to the data-types:

  • grid-data: conceptually all fields live on a global Domain-grid composed of (nDx,nDy,nDz) cells. In practice, the underlying field data is partitioned among the blocks. A given block is responsible for tracking data on (nBx,nBy,nBz) cells, where (nDx,nDy,nDz) = (BLx*nBx, BLy*nBy, BLz*nBz).
  • particle-data: each block tracks a separate list of particle properties
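For concreteness, the relationship between the Domain-grid shape and the per-block shape can be sketched in Python (a toy helper for illustration, not part of Cholla):

```python
def cells_per_block(domain_shape, block_layout):
    """Compute (nBx, nBy, nBz) given (nDx, nDy, nDz) and (BLx, BLy, BLz).

    The relation is (nDx, nDy, nDz) = (BLx*nBx, BLy*nBy, BLz*nBz), so each
    axis of the domain must divide evenly among the blocks along that axis.
    """
    for nD, BL in zip(domain_shape, block_layout):
        if nD % BL != 0:
            raise ValueError("each axis must divide evenly among blocks")
    return tuple(nD // BL for nD, BL in zip(domain_shape, block_layout))

# A 256^3 domain split across (4, 2, 2) processes -> 16 blocks of (64, 128, 128) cells
print(cells_per_block((256, 256, 256), (4, 2, 2)))  # -> (64, 128, 128)
```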

File Format Overview

Cholla can write many kinds of datasets. The standard kinds of datasets broadly fall into 2 categories:

  • snapshot-data, which is a "snapshot" of the simulation state (and is required for restarts). This consists of general field-data, gravity-data, and particle-data.

  • derived data on 2D grids (e.g. slices/projections).

Each time Cholla writes an output, it writes a separate file for each data-kind and for each block. For example, a simulation with 16 processes that writes 5 data-kinds at a given cycle will produce 5*16=80 files. We provide scripts to concatenate (or consolidate) this data into fewer files.

By default, Cholla writes files in the hdf5 format. HDF5 files store data as "attributes" (header-variables describing the data) and "datasets" which stores the data itself. HDF5 provides a lot of flexibility for organizing this information in a hierarchy of "groups" (similar to how you can organize different files in a hierarchy of directories).

Note

For certain kinds of snapshot-data Cholla stores data in memory with "ghost zones". "Ghost zones" are NEVER written to disk.

Note

This primarily pertains to 3D data:

  • while the z-axis is generally the fastest access index in the datafile, the x-axis is generally the fastest access index in the simulations
  • the gravity array has historically been stored as a 1D array. I suspect that the data was not reordered (so the memory layout is "native")

Flat (Default) Schema:

Cholla's standard "schema" (strategy for organizing data) is "flat" (i.e. there is no hierarchy). We represent this schema below:

/                                      # root group
 ├── HEADER-ATTRS  (REQUIRED)
 ├── <dataset-0>
 ├── <dataset-1>
 └── ...

At the time of writing, all files written by Cholla use this schema. Historically, all concatenation scripts also used this schema.
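As a sketch, here is how one might inspect a flat-schema file with h5py (assuming h5py is installed; the file, attribute values, and dataset built here are illustrative stand-ins, not real Cholla output):

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny mock file using the flat schema: header attributes on the
# root group plus top-level datasets, with no group hierarchy.
path = os.path.join(tempfile.mkdtemp(), "mock_flat.h5")
with h5py.File(path, "w") as f:
    f.attrs["gamma"] = 5.0 / 3.0
    f.attrs["t"] = 0.0
    f.attrs["dims"] = np.array([16, 16, 16])
    f.create_dataset("density", data=np.ones((16, 16, 16)))

# Read it back the same way a real flat-schema output would be read.
with h5py.File(path, "r") as f:
    header = dict(f.attrs)            # HEADER-ATTRS
    dataset_names = sorted(f.keys())  # every dataset lives at the root
    density = f["density"][...]

print(dataset_names)  # ['density']
```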

Hierarchical Schema

An alternative schema, adopted when we repack snapshot data, is shown below. Presently, this is the schema produced by scripts that concatenate 3D field-data or repack previously concatenated field data.

/                                  # root group
 ├── HEADER-ATTRS  (REQUIRED)
 ├── domain/       (REQUIRED)
 │    ├── blockid_location_arr     # shape: (BLx,BLy,BLz)
 │    └── stored_blockid_list      # shape: (nBStored,)
 └── field/
      ├── <field-0>     # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
      ├── <field-1>
      └── ...

In the above diagram:

  • we follow numpy conventions for describing arrays with C-contiguous layouts. In other words, the fastest index is last axis
  • BLx,BLy,BLz refer to the number of blocks per axis.
  • nBx,nBy,nBz refer to the number of cells per block. This is the shape of a cell-centered field.
  • nBStored is the number of blocks in the file. It should nominally be 1 or BLx*BLy*BLz
  • "domain/blockid_location_arr" specifies the relative locations of the blocks
  • the data at {field/<field-0>}[i, ...] corresponds to the data of the block with blockid specified by {domain/stored_blockid_list}[i]

Note

Files in this format ALWAYS provide the "dims" attribute (in HEADER-ATTRS) and the "domain" group. Importantly, "dims" specifies (nDx,nDy,nDz), the number of cells on the conceptual global Domain-grid, and the shape of "domain/blockid_location_arr" is (BLx,BLy,BLz). Thus, you can always infer (nBx,nBy,nBz) = (nDx/BLx, nDy/BLy, nDz/BLz). Consequently, you can determine whether field/<field-0> is cell-centered, face-centered, etc. by looking at the field's shape.
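A minimal sketch of stitching hierarchical-schema field data back onto the Domain-grid, using small in-memory numpy arrays as stand-ins for the HDF5 datasets (in practice these would be read with h5py):

```python
import numpy as np

# Stand-ins for the contents of a hierarchical-schema file.
# Here the domain is split into 2 blocks along the x-axis.
dims = np.array([8, 4, 4])                       # HEADER-ATTRS "dims": (nDx,nDy,nDz)
blockid_location_arr = np.array([[[0]], [[1]]])  # shape (BLx,BLy,BLz) = (2,1,1)
stored_blockid_list = np.array([0, 1])           # shape (nBStored,)

# Infer the per-block cell counts: (nBx,nBy,nBz) = (nDx/BLx, nDy/BLy, nDz/BLz)
BL = blockid_location_arr.shape
nB = tuple(int(d) // b for d, b in zip(dims, BL))  # -> (4, 4, 4)

# A fake cell-centered field: block i is filled with the value i.
field = np.stack([np.full(nB, float(i)) for i in stored_blockid_list])

# Map each stored block back to its location on the full Domain-grid.
full = np.empty(tuple(dims), dtype=field.dtype)
for i, blockid in enumerate(stored_blockid_list):
    # find the (x,y,z) block-location whose entry equals this blockid
    loc = tuple(int(ax[0]) for ax in np.nonzero(blockid_location_arr == blockid))
    slc = tuple(slice(l * n, (l + 1) * n) for l, n in zip(loc, nB))
    full[slc] = field[i]

print(full.shape)  # (8, 4, 4): block 0 fills x=0:4, block 1 fills x=4:8
```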

The above format is intended to be forward-compatible with a schema that also stores particle-data and gravity-data in the same file. We illustrate what this could look like below.

ASIDE: Preview of more general schema

Here we sketch a more-general hypothetical schema that can also include particle and gravity data:

/                                      # root group
 ├── HEADER-ATTRS  (REQUIRED)
 ├── domain/       (REQUIRED)
 │    ├── blockid_location_arr        # shape: (BLx,BLy,BLz)
 │    └── stored_blockid_list         # shape: (nBStored,)
 ├── field/
 │    ├── <field-0>        # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
 │    ├── <field-1>        # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
 │    └── ...              # 4D shape: (nBStored,nBx,nBy,nBz) or (nBStored, ...)
 ├── particle/
 │    ├── ATTR:total_particle_count    # i64
 │    ├── stop_particle_idx            # 1D shape: (nBStored,)
 │    ├── <particle-prop-0>            # 1D shape: (stop_particle_idx[-1],)
 │    ├── <particle-prop-1>            # 1D shape: (stop_particle_idx[-1],)
 │    └── ...                          # 1D shape: (stop_particle_idx[-1],)
 └── gravity/
      └── gravity          # 4D shape: (nBStored,nBx,nBy,nBz)

A few notes about particle group in this hypothetical extension:

  • particle/ATTR:total_particle_count specifies the total number of particles in the ENTIRE simulation.
  • particle/stop_particle_idx holds monotonically non-decreasing values.
    • when nBStored == BLx*BLy*BLz, then {particle/stop_particle_idx}[-1] == {particle/ATTR:total_particle_count}
    • in other cases, {particle/stop_particle_idx}[-1] <= {particle/ATTR:total_particle_count}
  • The values that describe particles for the blockid specified by {domain/stored_blockid_list}[i] are given by {particle/<particle-prop-0>}[slc], where slc is:
    • 0:{particle/stop_particle_idx}[0], when i is 0
    • {particle/stop_particle_idx}[i-1]:{particle/stop_particle_idx}[i], in all other cases
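The slicing logic above can be sketched as follows (the arrays are fabricated stand-ins for the datasets in the particle group; in practice they would be read with h5py):

```python
import numpy as np

# Stand-ins for the particle group: 3 stored blocks holding 2, 0, and 3
# particles respectively. stop_particle_idx is monotonically non-decreasing.
stop_particle_idx = np.array([2, 2, 5])
particle_prop = np.array([10.0, 11.0, 20.0, 21.0, 22.0])  # some per-particle property

def block_slice(i, stop_idx):
    """Slice selecting the particles belonging to the i-th stored block."""
    start = 0 if i == 0 else int(stop_idx[i - 1])
    return slice(start, int(stop_idx[i]))

for i in range(len(stop_particle_idx)):
    print(i, particle_prop[block_slice(i, stop_particle_idx)])
# block 0 -> [10. 11.], block 1 -> [] (no particles), block 2 -> [20. 21. 22.]
```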

Header Attributes

The following attributes are attached to all Cholla outputs:

  • 'gamma', the ratio of specific heats that the simulation was run with
  • 't', the time of the snapshot, in code units (usually kyr)
  • 'dims', a 3-element attribute that gives the number of cells per axis in the x, y, and z directions for the conceptual grid that spans the entire domain
  • 'n_step', the simulation step when the data was output

Additional attributes that are attached to newer Cholla outputs include:

  • 'dx', a 3-element attribute that gives the x, y, and z dimensions of a cell, in code units,

and a series of "unit" attributes, that provide the conversion between whatever units the code was run in and cgs:

  • 'length_unit'
  • 'time_unit'
  • 'mass_unit'
  • 'density_unit'
  • 'velocity_unit'
  • 'energy_unit'

So, for example, if the code was run with a mass unit of one solar mass and a length unit of one kpc, and you have read the density into an array called 'd', multiplying d by the density unit would convert the density array to g/cm^3.
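That conversion can be sketched as follows (the unit value here is computed from standard constants for illustration; in practice you would read 'density_unit' from the header attributes):

```python
import numpy as np

# Conversion factor from code units (M_sun / kpc^3) to cgs (g / cm^3).
MSUN_G = 1.98847e33               # grams per solar mass
KPC_CM = 3.0857e21                # centimeters per kpc
density_unit = MSUN_G / KPC_CM**3 # g/cm^3 per (M_sun/kpc^3), ~6.8e-32

d = np.full((4, 4, 4), 1.0)       # density in code units (fabricated array)
d_cgs = d * density_unit          # density in g/cm^3
```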

Field Data

Tip

In files using the hierarchical schema, this data is all stored within the "field" HDF5 group.

Datasets contain the different fields that are evolved by the simulation. They are 1, 2, or 3 dimensional arrays, corresponding to the dimensionality of the simulation (as specified by the 'dims' attribute, described above). The following conserved variable fields are always output for a hydrodynamic simulation (all in code units):

  • 'density', the mass density in each cell (i.e. M_sun / kpc^3)
  • 'momentum_x', the x-momentum density
  • 'momentum_y', the y-momentum density
  • 'momentum_z', the z-momentum density
  • 'Energy', the total energy density

If Cholla is run with the dual energy flag ('DE'), the thermal energy field will also be present:

  • 'GasEnergy', the thermal energy density, equivalent to the total energy density minus the kinetic energy density.

If Cholla is run with the passive scalar flag ('SCALAR'), a number of scalar fields may also be present, e.g.:

  • 'scalar0', the value of the first passive scalar.

If Cholla is run with 'DUST':

  • 'dust_density', the dust density in code units (M_sun / kpc^3)
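As an illustration, primitive quantities can be derived from the conserved fields listed above (a sketch with small fabricated arrays; in practice the arrays and gamma come from the snapshot file):

```python
import numpy as np

gamma = 5.0 / 3.0                    # ratio of specific heats (header attribute)
density = np.full((4, 4, 4), 2.0)    # fabricated stand-ins for the datasets
momentum_x = np.full((4, 4, 4), 4.0)
momentum_y = np.zeros((4, 4, 4))
momentum_z = np.zeros((4, 4, 4))
Energy = np.full((4, 4, 4), 10.0)

velocity_x = momentum_x / density    # v = m / rho
kinetic = 0.5 * (momentum_x**2 + momentum_y**2 + momentum_z**2) / density
pressure = (gamma - 1.0) * (Energy - kinetic)  # ideal-gas equation of state
```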

Slices, Projections, and Rotated Projections

For 3D simulations, Cholla can also be run with flags to output slices and projections of the data. (This can be useful for large simulations if saving the full dataset is too costly to achieve a high time resolution for snapshots.) The relevant make flags are 'SLICES', 'PROJECTIONS', and 'ROTATED_PROJECTIONS'. All produce hdf5 files similar to full grid outputs, with the same header attributes. Datasets present in slices include all the conserved variables, as well as the thermal energy density and scalars, if the simulation was run with them. Three slices will be output: 'xy' slices (a slice along the z-midplane), 'xz' slices (a slice along the y-midplane), and 'yz' slices (a slice along the x-midplane). Datasets in the hdf5 file are named according to which direction the slice was made. For example, datasets in a 'slice_xy' file will include:

  • 'd_xy', the mass density in cells along the z-midplane of the simulation
  • 'mx_xy'
  • 'my_xy'
  • 'mz_xy'
  • 'E_xy'
  • 'GE_xy' (if DE is on)
  • 'scalar_xy' (if SCALAR is on)

Projections are similar to slices in that they are 2 dimensional datasets, but they are integrated along the relevant direction. Currently, Cholla outputs density projections and density-weighted temperature projections (density times temperature for a given cell). Both are output in code units. So, for example, if the code was run with density units of M_sun/kpc^3, a density projection output will have units of M_sun / kpc^2 and a temperature projection will have units of M_sun K / kpc^2. The PROJECTIONS flag outputs xy and xz projections (integrated along the z and y axes, respectively). Datasets are called:

  • 'd_xy'
  • 'T_xy'
  • 'd_xz'
  • 'T_xz'
  • 'd_dust_xy', (if 'DUST' was used)
  • 'd_dust_xz', (if 'DUST' was used)

Rotated projections are similar, but are integrated along an axis specified by the input parameter file (see the relevant wiki page for details).

Particle data

TODO

Gravity data

At the time of writing, this is just used for restarts

Scripts

We provide a variety of scripts for modifying outputs in the python_scripts directory.

Concatenation scripts

The following scripts are provided (for use as command-line tools or as python modules) to help with concatenation:

  • concat_2d_data.py, for concatenating 2D datasets such as slices, projections, and rotated projections
  • concat_3d_data.py, for concatenating field data (aka 3D datasets)
  • concat_particles.py, for concatenating particle datasets

What is concatenation?

As noted above, whenever Cholla outputs data, a separate file is written for each data-kind and for each block. Concatenation is the process (for a single data-kind) of consolidating the contents of the files for all of the blocks into a single file. Historically, this procedure produced an HDF5 file with the flat schema, where the data is stitched together in a way that roughly approximates the file that would be produced by a simulation run with a single block spanning the entire domain. This is still the behavior of concat_2d_data.py and concat_particles.py.

However, more recent versions of Cholla include versions of concat_3d_data.py that produce HDF5 files using the hierarchical schema.

CLI Usage

The CLI for all the scripts is similar, and details can be found by passing the --help option to the script. In general you need to tell the script which directory to read files from (the -s/--source-directory flag), where to write the concatenated files (the -o/--output-directory flag), how many ranks were used (the -n/--num-processes flag), and which outputs to concatenate (the --snaps flag). The --snaps flag accepts a few input formats: a single number (e.g. 8), a range (e.g. 2-9), or a list (e.g. [1,2,3]); ranges are inclusive.

Example

./concat_3d_data.py -s /PATH/TO/SOURCE/DIRECTORY/ -o /PATH/TO/DESTINATION/DIRECTORY/ -n 8  --snaps 0-10
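To illustrate the accepted --snaps formats, here is a sketch of how such a spec could be parsed (this is not the scripts' actual implementation):

```python
def parse_snaps(spec):
    """Parse a --snaps style spec: '8', '2-9' (inclusive), or '[1,2,3]'."""
    spec = spec.strip()
    if spec.startswith('[') and spec.endswith(']'):
        return [int(s) for s in spec[1:-1].split(',')]
    if '-' in spec:
        lo, hi = spec.split('-')
        return list(range(int(lo), int(hi) + 1))  # ranges are inclusive
    return [int(spec)]

print(parse_snaps("8"))        # [8]
print(parse_snaps("2-9"))      # [2, 3, 4, 5, 6, 7, 8, 9]
print(parse_snaps("[1,2,3]"))  # [1, 2, 3]
```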

Import Usage

The scripts above contain three public functions, concat_2d_dataset, concat_3d_dataset, and concat_particles_dataset. These functions will each concatenate a single output time of a 2D, 3D or particle dataset respectively and can be imported into another python program assuming the scripts are in your python path. Generally the easiest way to import this script is to add the python_scripts directory to your python path in your script like this:

import sys
sys.path.append('/PATH/TO/CHOLLA/python_scripts')
import concat_3d_data

Repack

Next, we turn our attention to the script called snaprepack.py. This file is intended to be used to repack a previously concatenated snapshot file. The output file will use the Hierarchical Format.

Ideally, you can use this script by invoking:

./snaprepack.py -s PATH/TO/SOURCE/FILE.h5 -o /PATH/TO/OUTPUT/DIRECTORY

The above will produce an error if the file is missing the "nprocs" attribute from the header. In that case, you can supply it via the --missing-nprocs-triple argument. For example, if "nprocs" should hold (4,2,2), then you could pass

./snaprepack.py -s PATH/TO/SOURCE/FILE.h5 -o /PATH/TO/OUTPUT/DIRECTORY --missing-nprocs-triple 4 2 2

NOTE: we currently do NOT support overriding the value of "nprocs".

For more details about options, invoke snaprepack.py --help
