Proposed MAPL3 History Format - GEOS-ESM/MAPL GitHub Wiki

Topics that need discussion or are yet to be pinned down syntactically

  • field splitting: define behavior and syntax (sidebar: do we allow crazy things like arithmetic on split fields? someone asked...). Could we make field splitting just a use case of subsetting?
  • File duration, time definition, and related time issues
    • Do the template and the writing time, and only those 2 things, define which file is written to when it is time to write (in other words, no separate duration keyword)? Which time index in that file is another matter; that depends on the frequency and reference time...
    • Is time unlimited or fixed?
    • Do we allow appending to files from previous executions if indicated by the template and file existence? Maybe this depends on whether time is unlimited or fixed? This has many possibilities for problems if we are not careful...
    • Do we make the time index consistent for "missing" times that weren't written, based on the template and the application start time? (Current History does not.) For example, suppose the template is something_%y4%m2%d2.nc4 with instantaneous output, the frequency is 6 hours relative to 21z, the application starts at 18z, and the first time written is 21z. Do we write to time index 4 or 1, and is the start time of the file 3z or 21z? There are "naturally" 4 times per file starting at 3z, but because the file did not exist before, this really is the first time we write to it. The old History would write to time index 1, with a start time of 21z. The question is whether we want to enforce consistency across ALL files, even if it means some files have time slices that are all missing.
    • ForceZeroOffset from old History (aka don't timestamp time-averaged files at the midpoint of the averaging period, which is the default). Related question: for time-averaged collections, do we want metadata that says the variable is time averaged, with the range for example?
    • Do the answers to any of these questions above require a syntactical decision now before presenting outside of SI team?
  • chunking, another per-collection with override power in each variable?
  • for that matter, deflate, bit shaving, global to collection, overridable per variable?
  • Can we eliminate special monthly keyword?
  • make a decision where to put the start/stop collection time
  • can each field override the time "mode" (instantaneous vs mean vs min vs max)? MAPL2 History let the user override time-averaging to min or max; if the collection was instantaneous this had no effect
  • vertical grid and vertical regridding specification
  • regex
  • level selection, is that just a variant of "vertical" regridding? Does this belong elsewhere? Because you could do this on a dimension that has nothing to do with the special "vertical" ungridded dimension, it could be just subsetting ANY ungridded dimension, or some sort of generic "subsetting" syntax?
  • Variable sets, do we even need this?
  • do we expand output limitations (mixed center/edge, 4-D variables, i.e. vertical + ungridded dimension)? Not a syntax question per se, but something we should decide on for the initial implementation
  • tile regridding (assume ESMF will have this done so don't have to worry about it like we do now in History so nothing to actually do syntax wise?)

Proposed Format

For reference, all keywords in old history collection can be found here.

Global Metadata

version: 2
allow_overwrite: false
experiment:
  id: MAPL-v3
  source: GEOSgcm-v10.22.0
  description: >
    long string across
    many lines

Turn collections on or off

  • Note the DAS needs the ability to turn a collection off mid-run; see end_datetime in the time handling section. But maybe it should go here, i.e. whether a collection is active carries its time constraints here?
active_collections:
  - geosgcm_prog
  - geosgcm_surf

If the stop time (or the start time, if you want a collection to turn on later in the run) were embedded here, what would that even look like? The values of the list could be either a scalar or a map:

active_collections:
  # this one has no constraints, on all the time
  - geosgcm_tend
  # as a time interval, give 2 ISO times, but what if we want this to be open-ended?
  - geosgcm_prog: [2004-01-10T09:00:00, 2004-01-11T03:00:00]
  # as separate start/end, if one or the other is not provided assume open
  - geosgcm_surf: {end_time: 2004-01-10T09:00:00}
  - geosgcm_turb: {start_time: 2004-01-01T09:00:00}
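
To make the scalar-or-map question concrete, here is a minimal Python sketch of how a reader of this list might normalize the three entry shapes above into (name, start, end). It operates on already-parsed YAML values with timestamps given as ISO strings. The function names `parse_entry` and `is_active`, the use of `None` for an open end, and the half-open interval (start inclusive, end exclusive) are all assumptions made for illustration, not decisions made on this page.

```python
from datetime import datetime


def _to_datetime(value):
    """Accept None (open-ended), a datetime, or an ISO 8601 string."""
    if value is None or isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)


def parse_entry(entry):
    """Normalize one active_collections entry to (name, start, end)."""
    if isinstance(entry, str):
        # Bare scalar: no constraints, active for the whole run.
        return entry, None, None
    if isinstance(entry, dict) and len(entry) == 1:
        (name, spec), = entry.items()
        if isinstance(spec, (list, tuple)):
            # Interval form: [start, end]; None in either slot means open.
            if len(spec) != 2:
                raise ValueError(f"{name}: interval needs exactly 2 times")
            start, end = spec
        elif isinstance(spec, dict):
            # Map form: either key may be omitted, meaning open-ended.
            unknown = set(spec) - {"start_time", "end_time"}
            if unknown:
                raise ValueError(f"{name}: unknown keys {sorted(unknown)}")
            start, end = spec.get("start_time"), spec.get("end_time")
        else:
            raise ValueError(f"{name}: expected a list or a map")
        return name, _to_datetime(start), _to_datetime(end)
    raise ValueError(f"cannot interpret entry: {entry!r}")


def is_active(entry, now):
    """Assumed convention: start is inclusive, end is exclusive."""
    _, start, end = parse_entry(entry)
    return (start is None or now >= start) and (end is None or now < end)
```

Under these assumptions the interval form could express an open end as `[2004-01-10T09:00:00, null]`, which would make the map form optional rather than required.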

Horizontal and Vertical Grid Definitions

  • What about things like selecting certain levels for output?
geoms:
  geom_1: &geom_1
    class: latlon
    im: 360
    jm: 180
    pole: PE
    dateline: DE
  geom_2: &geom_2
    class: swath
  geom_3: &geom_3
    class: trajectory
  geom_4: &geom_4
    class: station
  geom_5: &geom_5 
    class: cubed-sphere

# This is just copying what is in the old collection...
vertical_grids:
  pressure_levels: &pressure_levels
    ref_var: DYN.PLE
    function: log
    levels: [1000, 975, 950, 925, 900, 875, 850, 825, 800, 775, 750, 725, 700, 650, 600, 550, 500, 450, 400, 350, 300, 250, 200, 150, 100, 70, 50, 40, 30, 20, 10, 7, 5, 4, 3, 2, 1, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1, 0.07, 0.05, 0.04, 0.03, 0.02]
    unit: hPa # What if this doesn't match variable, MAPL2 had a conversion factor you could add

synoptic_start: &synoptic_start 2000-04-14T21:00:00
synoptic_end: &synoptic_end 2000-04-15T03:00:00

Time Handling specification

time_specs:
  # all times in ISO times
  # all frequencies in ISO durations
  # instantaneous relative to reference time
  six_hourly: &six_hourly
    mode: instantaneous
    frequency: PT6H # ISO duration
    ref_time: T21:00:00  # ISO time with no date
    start_datetime: *synoptic_start # optional, default is start of calendar
    end_datetime: *synoptic_end # optional, default is end of universe
  # instantaneous example on heartbeat
  heartbeat: &heartbeat
    # if frequency heartbeat, ref time is disallowed
    mode: instantaneous # instantaneous (default), time-averaged, min, max
    frequency: heartbeat # not default! dt of clock passed in...
  # time averaged output every 6 hours
  sixh_avg21: &sixh_avg21
    mode: time-averaged
    frequency: PT6H # default is dt of clock passed in (heartbeat)
    ref_time: T21:00:00 # if frequency not heartbeat, must specify reference time
  # ref_time disallowed because frequency is a non-constant duration
  # natural ref_time is clearly beginning of month
  monthly: &monthly
    mode: time-averaged 
    frequency: P1M
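
The ref_time rules scattered through the comments above (disallowed for heartbeat, disallowed for non-constant durations such as months, required otherwise) can be collected into one check. The following Python sketch is only an illustration of those rules as written here: the function name `check_time_spec` is hypothetical, frequency is treated as required because its default is still under discussion, and durations are assumed to be strict ISO 8601 (so six hours is `PT6H`).

```python
import re

VALID_MODES = {"instantaneous", "time-averaged", "min", "max"}

# Strict ISO 8601 duration without weeks or fractions, e.g. PT6H, P1D, P1M.
_DURATION = re.compile(
    r"^P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)


def check_time_spec(spec):
    """Raise ValueError if a time_specs entry breaks the rules above."""
    mode = spec.get("mode", "instantaneous")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")

    frequency = spec["frequency"]
    has_ref = "ref_time" in spec

    if frequency == "heartbeat":
        if has_ref:
            raise ValueError("ref_time is disallowed with heartbeat")
        return

    match = _DURATION.match(frequency)
    if match is None or not any(match.groups()):
        raise ValueError(f"not an ISO duration: {frequency}")

    years, months = match.group(1), match.group(2)
    if years or months:
        # Calendar durations vary in length, so ref_time is not allowed.
        if has_ref:
            raise ValueError("ref_time is disallowed for non-constant durations")
    elif not has_ref:
        raise ValueError("a constant frequency requires ref_time")
```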

Variable sets

variable_sets:
  dyn:
    ...
  rad:

Collection Definition

collections:
  geosgcm_prog:
    geom: *geom_1
    vertical_grid: *pressure_levels
    time_spec: *sixh_avg21
    template: '%e.%c.%y4%m2%d2_%h2%n2z.nc4'
    # anything after this would have sensible defaults
    archive: '%c/Y%y4' # do we need this?
    file_format: netcdf # default, will we even support another?
    # the following can be overridden per-entry in the fields
    compression_level: 1 # default 0
    bit_shave: 12 # default no bit shaving, all kept
    regrid_method: conservative # default bilinear
    chunking: [180,90,1,1]
    # The idea here is that the delimiter is how you separate the component/field
    delimiter: '.'
    var_list:
      # basic single field output
      PHIS: {expr: AGCM.PHIS, regrid_method: vote,  vertical_method: ..., time_regrid: min/max/mean, units: 'ft', chunking: [90, 45, 1, 1] }
      PHIS: {expr: AGCM.PHIS}   #  Gocart has . in name..., sigh

      # debate if we should allow both or only 1 or the other if no alias desired
      - AGCM.PHIS
      - [AGCM.PHIS]

      # two different ways to express the item and alias
      - [AGCM.PHIS, phis]
      - {name: AGCM.PHIS, alias: phis} 

      # you may want to import a field into the component grid comp
      # for use in expression later, but not actually write to file
      - {name: DU.x, exclude: true, units: feet}

      # vector, then vector with alias
      - DYN.agrid_wind
      - [DYN.agrid_wind, [u, v]]
      - {name: DYN.agrid_wind, alias: [u, v]}

      # expressions
      - {expr: DU.x + SU.y, alias: weird}
      # if items in expression are a vector, must specify which component
      - {expr: "sqrt(DYN.agrid_wind[1]**2+DYN.agrid_wind[2]**2)", alias: wind_speed}
      # or "dive" into vector like any other container?
      - {expr: "sqrt(DYN.agrid_wind.u**2+DYN.agrid_wind.v**2)", alias: wind_speed}

      
      # bundles, make the delimiter a general "diving"
      # From PHYSICS component, get MTRI bundle, from MTRI bundle get NI::NO3an1M
      - PHYSICS.MTRI.NI::NO3an1M
      - [PHYSICS.MTRI.NI::NO3an1IM, NISV]
      - {name: PHYSICS.NI::NO3an1IM, alias: NISV}

      # example of override collection defaults for an entry
      - {name: AGCM.PHIS, alias: phis, chunking: [90,90,1], compression_level: 2, bit_shave: 14, regrid_method: bilinear} 
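
Whichever of the scalar, list, and map forms survive the debate, a reader of var_list will probably want to reduce them to one internal shape. The Python sketch below shows one way that could look for already-parsed YAML values. The name `normalize_entry`, the choice of `None` for "no alias given", and the rule that a map must carry exactly one of `name` or `expr` are assumptions for illustration only.

```python
def normalize_entry(entry):
    """Reduce one var_list entry to a dict with at least name/expr and alias."""
    if isinstance(entry, str):
        # - AGCM.PHIS
        return {"name": entry, "alias": None}
    if isinstance(entry, list):
        # - [AGCM.PHIS]  or  - [AGCM.PHIS, phis]  or  - [DYN.agrid_wind, [u, v]]
        if len(entry) == 1:
            return {"name": entry[0], "alias": None}
        if len(entry) == 2:
            return {"name": entry[0], "alias": entry[1]}
        raise ValueError(f"list form takes 1 or 2 items: {entry!r}")
    if isinstance(entry, dict):
        # - {name: ..., alias: ..., <per-entry overrides>}
        if ("name" in entry) == ("expr" in entry):
            raise ValueError(f"need exactly one of name or expr: {entry!r}")
        # Keys such as chunking or regrid_method pass through untouched.
        return {"alias": None, **entry}
    raise TypeError(f"unsupported entry type: {type(entry).__name__}")
```

With this shape the three spellings of an aliased field all normalize to the same dict, which is one argument for not needing to pick only one of them.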

       

File Duration Behavior

Old History

Old History has a "duration" keyword for collections, but it is very problematic. New options are explored below.

New file "duration" behavior

Note that all options start from the same premise: the time to be written and the template determine which file you write to when it is time to write, NOTHING ELSE. All the variations affect WHICH TIME INDEX you write to WITHIN A FILE.
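
As a small illustration of that premise, the Python sketch below expands a template for a given write time. The date tokens (%y4, %m2, %d2, %h2, %n2) follow the templates used on this page; reading %e as the experiment id and %c as the collection name is an assumption, and `expand_template` is a hypothetical helper, not an existing MAPL routine.

```python
import re
from datetime import datetime


def expand_template(template, when, expid, collection):
    """Substitute tokens in a file template for one write time."""
    values = {
        "%y4": f"{when.year:04d}",
        "%m2": f"{when.month:02d}",
        "%d2": f"{when.day:02d}",
        "%h2": f"{when.hour:02d}",
        "%n2": f"{when.minute:02d}",
        "%e": expid,        # assumed: experiment id
        "%c": collection,   # assumed: collection name
    }
    pattern = re.compile("|".join(re.escape(token) for token in values))
    return pattern.sub(lambda m: values[m.group(0)], template)


# Two writes on the same day land in the same file for a daily template;
# the first write of the next day selects a new file.
daily = "%c.%y4%m2%d2.nc4"
first = expand_template(daily, datetime(2004, 1, 10, 3), "MAPL-v3", "geosgcm_prog")
later = expand_template(daily, datetime(2004, 1, 10, 21), "MAPL-v3", "geosgcm_prog")
next_day = expand_template(daily, datetime(2004, 1, 11, 3), "MAPL-v3", "geosgcm_prog")
```

No duration keyword appears anywhere in this sketch: the file boundary falls out of which tokens the template contains.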

Here are 4 variations

If time is unlimited, we allow appending, no missing time slices

  1. Each time History decides to write, it evaluates the template. This IS the file that will be written to, no more, no less. The evaluated template is based on "some time" provided by History (which time that is depends on things like whether the collection is time-averaged).
  2. It will write the next time index in the file, with a time value of the "some time" from point 1.
  3. If the file from the evaluated template has not been written to during this execution, we have the following options:
    • If the file does not exist, create it. By definition the "next" time index is 1; if time is unlimited there is no need to determine anything else. Store this so we know the time index for the next write.
    • If the file does exist (presumably from a previous segment, but what if the file just happens to have the same name? We would need good checking here that History really did write it, etc.), determine how many times have been written to that file and keep appending to it with the appropriate next time index.
  4. If the file has already been written to during this execution, the next time index is known, so write to that time index.
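
The bookkeeping in the steps above is small enough to sketch. In the Python below, `AppendBookkeeper` is a hypothetical class and `count_existing` is a caller-supplied function that returns how many time slices a file already holds (0 if the file does not exist); in a real implementation it would read the length of the unlimited time dimension. Verifying that History itself wrote an existing file is deliberately left out.

```python
class AppendBookkeeper:
    """Tracks the next 1-based time index for every file touched this run."""

    def __init__(self, count_existing):
        # count_existing(filename) -> number of time slices already on disk
        self._count_existing = count_existing
        self._next_index = {}

    def index_for_write(self, filename):
        if filename not in self._next_index:
            # First write to this file during this execution (step 3):
            # a new file starts at 1, an existing one continues after its
            # last written slice.
            self._next_index[filename] = self._count_existing(filename) + 1
        # Step 4: the file is known, so hand out the stored index.
        index = self._next_index[filename]
        self._next_index[filename] = index + 1
        return index


# Usage with a fake "disk": one file already holds 2 slices from a
# previous segment, the other does not exist yet.
on_disk = {"prog.20040110.nc4": 2}
keeper = AppendBookkeeper(lambda name: on_disk.get(name, 0))
indices = [
    keeper.index_for_write("prog.20040110.nc4"),
    keeper.index_for_write("prog.20040110.nc4"),
    keeper.index_for_write("prog.20040111.nc4"),
]
```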

If time is a fixed dimension or there is no appending to an existing file, no missing time slices in the file

  1. Each time History decides to write, it evaluates the template. This IS the file that will be written to, no more, no less. The evaluated template is based on "some time" provided by History (which time that is depends on things like whether the collection is time-averaged).
  2. It will write the next time index in the file, with a time value of the "some time" from point 1.
  3. If the file has not been written to during this execution, check whether the file we want to write to already exists; if so, die. If not, use the frequency and ref_time to determine how many time slices are left to go into this file before we hit a new file, create said file, and start the time index at 1 for bookkeeping purposes. Write time index 1.
  4. If the file has already been written to, write to the "next" time index.

Tom's idea, no appending to old file

  1. Each time History decides to write, it evaluates the template. This IS the file that will be written to, no more, no less. The evaluated template is based on "some time" provided by History (which time that is depends on things like whether the collection is time-averaged).
  2. If the file has not been written to during this execution, check whether the file we want to write to already exists; if so, die. If not, determine the TOTAL NUMBER OF TIME INDICES that go into this file based on the frequency, ref time, and template. Create said file.
  3. Write to the appropriate index based on the time, given the frequency and reference time. Note this may mean some time slices will never be written to, so some variables will be undefined for a given time range. Better compress these...
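
A sketch of how the total number of time indices and the "appropriate index" might be computed, in Python. The file's time span is passed in explicitly here; deriving it from the template is not shown. The names `slot_times` and `slot_index` are hypothetical, the reference is given as a full datetime rather than a time of day, and treating the file span as half-open (start inclusive, end exclusive) is an assumption.

```python
from datetime import datetime, timedelta


def slot_times(file_start, file_end, frequency, ref):
    """All write times ref + k*frequency with file_start <= t < file_end."""
    # Smallest k such that ref + k*frequency >= file_start (ceiling division).
    steps = -((ref - file_start) // frequency)
    t = ref + steps * frequency
    times = []
    while t < file_end:
        times.append(t)
        t += frequency
    return times


def slot_index(write_time, file_start, file_end, frequency, ref):
    """Return (1-based index of write_time, total slots in the file)."""
    times = slot_times(file_start, file_end, frequency, ref)
    # list.index raises ValueError if write_time is not a natural slot.
    return times.index(write_time) + 1, len(times)


# Daily file, 6-hour frequency relative to 21z: the natural slots are
# 03z, 09z, 15z and 21z, so a first write at 21z goes to index 4 of 4.
index, total = slot_index(
    write_time=datetime(2004, 1, 10, 21),
    file_start=datetime(2004, 1, 10),
    file_end=datetime(2004, 1, 11),
    frequency=timedelta(hours=6),
    ref=datetime(2000, 1, 1, 21),
)
```

This only covers constant frequencies; monthly output would need calendar arithmetic instead of a fixed timedelta.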

Tom's idea, appending to old file from previous run

Like the above, but rather than dying if the file exists, just use it (rather than creating a file), presuming a previous execution created it "correctly". Just write to the appropriate index based on time; if the previous execution did its job right, that will just work.

Still could end up with files at beginning or end of a long run with missing data.