Pooling geographic features - NOAA-OWP/wres GitHub Wiki

Why might I want to pool geographic features?
How does the WRES combine information from different geographic features?
How can I declare that pools should be composed of several geographic features?
Can I declare features from a much larger group or region without listing each feature separately? Isn’t there a simpler way to do this?
Declaring a feature correlation or “tuple” of names is a bit cumbersome. Is there a simpler way to do this?
How are thresholds handled when pooling time-series data from several features?
What does a complete declaration that contains a feature group look like?

Why might I want to pool geographic features?

An evaluation answers a question (e.g., about forecast quality). When that question is concerned with a geographic area or region, it may be appropriate to gather and pool together data from several geographic features. One advantage of pooling data from several geographic features is an increased sample size, which in turn leads to reduced sampling uncertainty or increased confidence that the evaluation statistics are “meaningful”.

Of course, when the answer to a question depends heavily on a particular geographic feature, it may not be appropriate or useful to pool several features. For example, land surface variables may vary over short distances or depend on site-specific conditions, while atmospheric variables often vary more smoothly and may be comparable over larger regions (e.g., climate regions). Nevertheless, transformations of hydrologic variables, such as a probability of flooding (a categorical variable), may be comparable across several geographic features and hence amenable to pooling.

In summary, pooling across geographic features requires careful thought about the evaluation question posed and whether the advantages (e.g., increased sample size) outweigh the disadvantages (e.g., potential to conflate different evaluation behaviors, such as biases operating in different directions).

How does the WRES combine information from different geographic features?

It is possible to nominate a group of features whose evaluation pairs will be pooled together and used to compute a single set of statistics. Pooling time-series data should not be confused with pooling statistics, which is an alternative approach to combining information from several geographic features, and is described in Evaluation Summary Statistics. These two approaches have different strengths and weaknesses, a discussion of which is beyond the scope of this wiki. To reiterate, the approach described here pools the time-series data, not the statistics. For example, if geographic feature X is associated with a set of verification pairs, A, and geographic feature Y is associated with a set of verification pairs, B, then combining these pairs into an overall pool involves nothing more than collecting them together or forming the union, A∪B.
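The distinction between pooling data and pooling statistics can be illustrated with a minimal Python sketch. It is not the WRES implementation: the pair values, the `pool_pairs` and `mean_error` helpers, and the simple averaging used for contrast are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# A verification pair: (observed value, predicted value).
Pair = Tuple[float, float]


def pool_pairs(pairs_by_feature: Dict[str, List[Pair]]) -> List[Pair]:
    """Collect the pairs from every feature into a single pool."""
    pooled: List[Pair] = []
    for feature_name in sorted(pairs_by_feature):
        # Every pair is kept, even if two features happen to hold equal values.
        pooled.extend(pairs_by_feature[feature_name])
    return pooled


def mean_error(pairs: List[Pair]) -> float:
    """Mean of (predicted - observed) across the supplied pairs."""
    return sum(predicted - observed for observed, predicted in pairs) / len(pairs)


# Hypothetical pairs for two features, X and Y.
pairs_by_feature = {
    "X": [(1.0, 1.5), (2.0, 2.5)],
    "Y": [(10.0, 9.0), (12.0, 11.0), (14.0, 13.0), (16.0, 15.0)],
}

# Pooling the data: one statistic computed from all six pairs.
pooled = pool_pairs(pairs_by_feature)
pooled_mean_error = mean_error(pooled)

# Pooling the statistics (for contrast only): average the per-feature values.
per_feature = [mean_error(pairs) for pairs in pairs_by_feature.values()]
averaged_mean_error = sum(per_feature) / len(per_feature)

print(pooled_mean_error, averaged_mean_error)
```

With these made-up numbers, the pooled mean error is -0.5, whereas the average of the two per-feature mean errors is -0.25. The pooled value is weighted towards Y because Y contributes more pairs, which is one reason the two approaches can give different answers.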

How can I declare that pools should be composed of several geographic features?

First, consider the following example declaration of a single geographic feature, which is supported in WRES 6.14 and later:

features:
  - observed: '09165000'
    predicted: DRRC2

This declaration asserts that “the feature whose name in the observed data sources is 09165000 should be paired together with the feature whose name in the predicted data sources is DRRC2”. This is also known as a feature correlation. Each name has a unique meaning in its context. Here, the context is the side of data being evaluated, which in turn corresponds to a data source. The two feature names within this feature correlation represent a single geographic entity in the real world, namely the location of a USGS streamflow gage on the Dolores River, near Rico, Colorado. Note that, for a name that corresponds to a number, the name should be single- or double-quoted to distinguish the data type as a string, rather than a number.

In this hypothetical evaluation, the observed data originates from the USGS National Water Information System (NWIS, not shown) and the observed name of 09165000 was assigned by the USGS feature naming authority. The predicted data originates from the NWS Advanced Hydrologic Prediction Service (AHPS, not shown) and the predicted name of DRRC2 was assigned by the National Weather Service feature naming authority.

Next, consider the following declaration:

feature_groups:
  - features:
      - observed: '09165000'
        predicted: DRRC2

Here, the declaration of the features is unchanged from the earlier example and is simply wrapped in a new context, feature_groups, which makes explicit that the declaration is a group of (one) geographic feature. In other words, the earlier features declaration is shorthand for the above, explicit declaration of a singleton feature group.

Unlike an entry in features, a group declared within feature_groups can be named. For example, this feature group might be called “Dolores CO”:

feature_groups:
  - name: Dolores CO
    features:
      - observed: '09165000'
        predicted: DRRC2

Crucially, additional features can be added to this feature group. For example, the location immediately downstream of DRRC2 is DOLC2, which corresponds to a USGS streamflow gage near Dolores, CO (09166500).

feature_groups:
  - name: Dolores CO
    features:
      - observed: '09165000'
        predicted: DRRC2
      - observed: '09166500'
        predicted: DOLC2

Can I declare features from a much larger group or region without listing each feature separately? Isn’t there a simpler way to do this?

Yes, if you have access to the Water Resources Data Service (WRDS) geographic feature service, which is a web service hosted by the National Water Center, AL. This service provides an interface (or HTTP API) that allows other software, such as the WRES, to form requests about geographic features, including how to correlate features that are named by different feature authorities, such as the USGS and NWS. In general, a feature correlation may contain any pair of geographic features. For example, it is not necessary for the two features to be collocated or correspond to the same geographic entity in the real world. However, the WRDS feature service focuses on geographically related features. These feature correlations generally correspond to the same geographic entity (e.g., a streamflow gage) in two different hydrofabrics, rather than arbitrary feature correlations.

Users of the WRES who have access to the WRDS feature service, which includes those on the National Water Center, AL, network or using the COWRES, can benefit from its capabilities. These include obtaining feature correlations for named features and for geographic regions, such as the operating area of the California Nevada River Forecast Center or the U.S. State of Mississippi. The WRDS hostname is omitted in examples below; if you need the hostname, refer to the COWRES user support wiki or contact the WRES team.

By way of example, the following WRES declaration evaluates all geographic features located within the U.S. State of Alabama:

feature_service:
  uri: https://[WRDS]/api/location/v3.0/metadata/
  group: state
  value: AL

Here (and elsewhere in this wiki), the [WRDS] host is a placeholder for the host of a WRDS feature service, which is currently only available for NWS users. If you are unsure whether you can (or how to) access this service, please contact the WRES team.

In this example, each feature is evaluated separately, i.e., separate statistics are generated for every feature that is located within AL. In order to pool all of the features in AL into a single feature group with a name of AL, the pool parameter can be used:

feature_service:
  uri: https://[WRDS]/api/location/v3.0/metadata/
  group: state
  value: AL
  pool: true
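
The effect of the pool parameter can be pictured with a short Python sketch. This is an illustration under stated assumptions, not the WRES code: the `group_features` helper and the feature names (`obs-1`, `pred-1`, and so on) are hypothetical placeholders for whatever the feature service returns for the region.

```python
from typing import Dict, List

Feature = Dict[str, str]


def group_features(features: List[Feature], value: str, pool: bool) -> List[dict]:
    """Arrange the features returned for a region into feature groups."""
    if pool:
        # One group that contains every feature, named after the region value.
        return [{"name": value, "features": list(features)}]
    # Otherwise, one singleton group per feature, each evaluated separately.
    return [{"features": [feature]} for feature in features]


# Hypothetical features standing in for a feature service response.
region_features = [
    {"observed": "obs-1", "predicted": "pred-1"},
    {"observed": "obs-2", "predicted": "pred-2"},
    {"observed": "obs-3", "predicted": "pred-3"},
]

separate = group_features(region_features, "AL", pool=False)
pooled = group_features(region_features, "AL", pool=True)

print(len(separate), "groups without pooling;", len(pooled), "group with pooling")
```

Without pooling, the three placeholder features produce three sets of statistics; with pooling, they produce one set for the group named AL.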

Declaring a feature correlation or “tuple” of names is a bit cumbersome. Is there a simpler way to do this?

Yes, if you have access to the WRDS feature service. In this case, feature correlations can be determined by the WRDS feature service using partial information about each feature. For example, in order to correlate the USGS feature 09165000 with the NWS feature DRRC2, the following declaration is supported:

features:
  - predicted: DRRC2
feature_service: https://[WRDS]/api/location/v3.0/metadata/

In this case, the feature is declared with only the predicted feature name of DRRC2 and the WRDS feature service is used to look up the corresponding observed feature name of 09165000. The same style of declaration can be used in a feature_groups context. For example, the following declaration is supported:

feature_groups:
  - name: Dolores CO
    features:
      - predicted: DRRC2
      - predicted: DOLC2
feature_service: https://[WRDS]/api/location/v3.0/metadata/
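
Conceptually, completing a partial declaration amounts to a lookup in a table of correlated names. The Python sketch below illustrates the idea only. The `CROSSWALK` table is a hypothetical, in-memory stand-in for the feature service (it does not reflect the service's actual request or response format), and `complete_feature` is an illustrative helper, not a WRES function.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the feature service: (observed name, predicted name).
# The two correlations are the ones used in the examples above.
CROSSWALK: List[Tuple[str, str]] = [
    ("09165000", "DRRC2"),
    ("09166500", "DOLC2"),
]


def complete_feature(feature: Dict[str, str],
                     crosswalk: List[Tuple[str, str]]) -> Dict[str, str]:
    """Fill in whichever of the 'observed' or 'predicted' names is missing."""
    for observed, predicted in crosswalk:
        if feature.get("observed") == observed or feature.get("predicted") == predicted:
            return {"observed": observed, "predicted": predicted}
    raise LookupError(f"No feature correlation found for {feature}")


# The partial declaration from the feature group above.
declared = [{"predicted": "DRRC2"}, {"predicted": "DOLC2"}]
completed = [complete_feature(feature, CROSSWALK) for feature in declared]

for feature in completed:
    print(feature)
```

Raising an error when no correlation is found is a design choice of this sketch; how the WRES responds to an unknown name is not covered here.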

How are thresholds handled when pooling time-series data from several features?

When pooling pairs from several features, an attempt is made to correlate thresholds in order to compute statistics from comparable thresholds. The following attributes of thresholds are considered when forming these correlations (any one is sufficient to create a correlation, and they are considered in this order):

1. Thresholds that have the same name across different features, such as FLOOD. This applies to thresholds supplied in CSV format or from the WRDS feature service (i.e., the WRDS feature service which serves different types of thresholds or key-value sets), both of which support threshold naming;
2. For probability thresholds (e.g., quantiles), the threshold probabilities. For example, a threshold of > Pr=0.9 for N separate features will be considered the same threshold; and
3. For value thresholds that do not have probabilities associated with them (i.e., are not quantiles), the threshold values. For example, a threshold of > 5 millimeters for N separate features will be considered the same threshold.
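
The ordering above can be sketched in Python as a key function that tries the name first, then the probability, then the value. This is a minimal illustration, not the WRES implementation. The `Threshold` type, the `correlation_key` and `correlate` helpers, and the threshold values are hypothetical, and including the operator and unit in the comparison is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class Threshold(NamedTuple):
    """A hypothetical, simplified threshold attached to one feature."""
    feature: str
    operator: str
    name: Optional[str] = None
    probability: Optional[float] = None
    value: Optional[float] = None
    unit: Optional[str] = None


def correlation_key(threshold: Threshold) -> Tuple:
    """Return the first available attribute, in the order described above."""
    if threshold.name is not None:
        return ("name", threshold.name)
    if threshold.probability is not None:
        return ("probability", threshold.operator, threshold.probability)
    return ("value", threshold.operator, threshold.value, threshold.unit)


def correlate(thresholds: List[Threshold]) -> Dict[Tuple, List[Threshold]]:
    """Group thresholds from different features that share a correlation key."""
    groups: Dict[Tuple, List[Threshold]] = defaultdict(list)
    for threshold in thresholds:
        groups[correlation_key(threshold)].append(threshold)
    return dict(groups)


# Hypothetical thresholds for two features, X and Y.
thresholds = [
    Threshold("X", ">", name="FLOOD", value=100.0, unit="CMS"),
    Threshold("Y", ">", name="FLOOD", value=250.0, unit="CMS"),
    Threshold("X", ">", probability=0.9),
    Threshold("Y", ">", probability=0.9),
    Threshold("X", ">", value=5.0, unit="mm"),
    Threshold("Y", ">", value=5.0, unit="mm"),
]

correlated = correlate(thresholds)
for key, members in correlated.items():
    print(key, [member.feature for member in members])
```

In this sketch, the two FLOOD thresholds are correlated by name even though their values differ, which is why the name is considered before the value.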

What does a complete declaration that contains a feature group look like?

The declaration below is valid in WRES 6.14 and later. This declaration evaluates single-valued streamflow forecasts from the National Water Model (NWM) against streamflow observations from the USGS NWIS for a feature group of 11 features from the U.S. state of New Mexico (labelled NM). The observed feature names are USGS gage identifiers and the predicted feature names are NWM feature identifiers.

observed:
  label: USGS
  sources: https://nwis.waterservices.usgs.gov/nwis/iv
  variable:
    name: "00060"
    label: streamflow
predicted:
  label: NWM Short Range
  sources:
    uri: /data/nwmVector/
    interface: nwm short range channel rt conus
  variable: streamflow
feature_groups:
  - name: NM
    features:
      - {observed: "07207000", predicted: "20059116"}
      - {observed: "07215500", predicted: "20044778"}
      - {observed: "07216500", predicted: "20044782"}
      - {observed: "07211500", predicted: "20050371"}
      - {observed: "07227100", predicted: "20017399"}
      - {observed: "07206000", predicted: "20058478"}
      - {observed: "07207500", predicted: "20058852"}
      - {observed: "07221500", predicted: "20052023"}
      - {observed: "07203000", predicted: "20066137"}
      - {observed: "07227000", predicted: "20031587"}
      - {observed: "07208500", predicted: "20060156"}
reference_dates:
  minimum: 2017-08-07T23:00:00Z
  maximum: 2017-08-08T23:00:00Z
reference_date_pools:
  period: 1
  frequency: 1
  unit: hours
valid_dates:
  minimum: 2017-08-07T23:00:00Z
  maximum: 2017-08-09T17:00:00Z
lead_times:
  minimum: 0
  maximum: 18
  unit: hours
lead_time_pools:
  period: 18
  frequency: 18
  unit: hours