1. What is the simplest possible evaluation I can declare?
2. What is the format of the evaluation language?
3. How do I declare the datasets to evaluate?
4. How do I declare geographic features to evaluate?
5. How do I filter the time-series data to consider only specific dates or times?
6. How do I declare the measurement units for the evaluation?
- 6.1. What units does WRES accept and how do I declare them?
- 6.2. What happens if WRES does not understand a unit supplied in a data source?
7. How do I filter the time-series data to consider only values within a range?
8. How do I declare thresholds to use in an evaluation?
9. How do I declare the pools of data that should be evaluated separately?
10. How do I declare the desired timescale (e.g., accumulation period)?
- 10.1. How do I declare a fixed timescale?
- 10.2. Can I declare a timescale that spans certain dates?
11. How do I declare the metrics to evaluate?
12. How do I declare summary statistics?
13. How do I ask for sampling uncertainties to be estimated?
14. How do I declare output formats to write?
15. Are there any other options?
16. Do you have some examples of complete declarations?
17. Does the declaration language use a schema?
18. What does this error really mean?

The declaration language refers to the language employed by the WRES to declare the contents of an evaluation.

1. What is the simplest possible evaluation I can declare?

The simplest possible evaluation contains the paths to each of two datasets whose values will be compared or evaluated:

observed: observations.csv
predicted: predictions.csv

In this example, the two datasets contain time-series values in CSV format and are located in the user’s home directory. If the datasets are located elsewhere, absolute paths must be declared. The WRES will automatically detect the format of the supplied datasets. The format requirements for CSV files are described here: Format Requirements for CSV Files.

In this example, the WRES will make some reasonable choices about other aspects of the evaluation, such as the metrics to compute (depending on the data it sees) and the statistics formats to write.

The language of “observed” and “predicted” is simply intended to clarify the majority use case of the WRES, which is to compare predictions and observations. When computing error values, the order of calculation is to subtract the observed values from the predicted values. Thus, a negative error means that the predictions are too low and a positive error means they are too high. Beyond this, the WRES is agnostic about the content or origin of these datasets and simply views them as two time-series datasets. For example, observed or measured values could be used in both the observed and predicted slots, if desired.
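
The sign convention described above can be illustrated with a short Python sketch. This is not WRES code; the values are made up for the example and the variable names are hypothetical.

```python
# Illustrative only: made-up values, not produced by WRES.
observed = [10.0, 12.0, 15.0]
predicted = [9.0, 12.5, 18.0]

# Error = predicted - observed, i.e., the observed value is subtracted
# from the predicted value.
errors = [p - o for p, o in zip(predicted, observed)]

for p, o, e in zip(predicted, observed, errors):
    if e < 0:
        verdict = "prediction too low"
    elif e > 0:
        verdict = "prediction too high"
    else:
        verdict = "exact"
    print(f"predicted={p}, observed={o}, error={e:+.1f} ({verdict})")
```

The first pair yields a negative error because the prediction (9.0) is lower than the observation (10.0).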

2. What is the format of the evaluation language?

An evaluation is declared to WRES in a prescribed format and with a prescribed grammar. The format or “serialization format” is YAML, which is a recursive acronym, “YAML Ain’t Markup Language”. The evaluation language itself builds on this serialization format and contains the grammar understood by the WRES software. For example, datasets can be declared, together with any optional filters, metrics or statistics formats to create.

It may be interesting to know that YAML is a superset of JSON, which means that any evaluation declared to WRES using YAML has an equivalent declaration in JSON, which the WRES will accept. For example, the equivalent minimal evaluation in JSON is:

{
  "observed": "observations.csv",
  "predicted": "predictions.csv"
}

As you can see, YAML tends to be cleaner and more human readable than JSON, but JSON is perfectly acceptable if you are familiar with it and prefer to use it.
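
If you would like to check a JSON declaration before submitting it, any JSON parser will do. The following is a minimal sketch using the Python standard library; it only confirms that the text is well-formed JSON and says nothing about whether the WRES will accept the content.

```python
import json

# The minimal declaration from above, expressed as JSON text.
declaration_json = """
{
  "observed": "observations.csv",
  "predicted": "predictions.csv"
}
"""

# json.loads raises an error if the text is not well-formed JSON.
declaration = json.loads(declaration_json)

# Printing each entry as "key: value" reproduces the YAML form
# of the same minimal declaration.
for key, value in declaration.items():
    print(f"{key}: {value}")
```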

If you are curious, the following resources provide some more information about YAML:

https://en.wikipedia.org/wiki/YAML [comprehensive description and examples]
https://www.yamllint.com/ [this will tell you whether your declaration is valid YAML]

3. How do I declare the datasets to evaluate?

3.1. How do I declare a baseline dataset?

As indicated above, the basic datasets to compare are observed and predicted:

observed: observations.csv
predicted: predictions.csv

Additionally, a baseline dataset may be declared as a benchmark for the predicted dataset.

observed: observations.csv
predicted: predictions.csv
baseline: baseline_predictions.csv

For example, when computing a mean square error skill score, the mean square error is first computed by comparing the predicted and observed datasets and then, separately, by comparing the baseline and observed datasets and then, finally, by comparing the two mean square error scores in a skill score.
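
The three steps above can be sketched in Python as follows. The sketch assumes the conventional form of the skill score, namely one minus the ratio of the predicted error to the baseline error; the function names and values are hypothetical and this is not the WRES implementation.

```python
def mean_square_error(predictions, observations):
    """Average of the squared differences between paired values."""
    pairs = list(zip(predictions, observations))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


def mse_skill_score(predicted, baseline, observed):
    """Assumed form: 1 - MSE(predicted, observed) / MSE(baseline, observed)."""
    mse_predicted = mean_square_error(predicted, observed)
    mse_baseline = mean_square_error(baseline, observed)
    return 1.0 - mse_predicted / mse_baseline


observed = [2.0, 4.0, 6.0]
predicted = [3.0, 4.0, 5.0]  # squared errors: 1, 0, 1
baseline = [4.0, 4.0, 4.0]   # squared errors: 4, 0, 4

score = mse_skill_score(predicted, baseline, observed)
print(round(score, 2))  # 0.75
```

Under this assumed form, a positive score means that the predicted dataset has a smaller mean square error than the baseline.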

3.2. How do I declare covariate datasets?

As of v6.22, covariate datasets can be used to filter evaluation pairs. For example, precipitation forecasts may be evaluated conditionally upon observed temperatures (a covariate) being at or below freezing. Further information about covariates is available here: Using covariates as filters.

In the simplest case, involving a single covariate dataset without additional parameters, the covariate may be declared in the same way as other datasets (note the plural form, covariates, because there may be one or more):

observed: observations.csv
predicted: predictions.csv
baseline: baseline_predictions.csv
covariates: covariate_observations.csv

In this example, the evaluation pairs will include only those valid times when the covariate is also defined.

Unlike the observed, predicted and baseline datasets, more than one covariate may be declared using a list. For example:

observed: observations.csv
predicted: predictions.csv
baseline: baseline_predictions.csv
covariates:
  - sources: precipitation.csv
    variable: precipitation
  - sources: temperature.csv
    variable: temperature

In this case, the list includes two covariates, one that contains precipitation observations and one that contains temperature observations.

Covariates may be declared with a minimum and/or maximum value. This will additionally filter evaluation pairs to only those valid times when the covariate meets the filter condition(s). For example:

observed: observations.csv
predicted: predictions.csv
baseline: baseline_predictions.csv
covariates:
  - sources: precipitation.csv
    variable: precipitation
    minimum: 0.25
  - sources: temperature.csv
    variable: temperature
    maximum: 0

In this case, the evaluation pairs will include only those valid times when the temperature is at or below freezing, 0°C, and the precipitation equals or exceeds 0.25mm. The measurement units correspond to the unit in which the covariate data is defined. Currently, it is not possible to transform the measurement unit of a covariate prior to filtering. In addition, the values must be declared at the evaluation time_scale, whether or not this is declared explicitly. For example, if the evaluation is concerned with daily average streamflow, then each covariate filter should be declared as a daily value. However, the time scale function can be declared separately for each covariate using the rescale_function. For example:

observed: 
  sources: observations.csv
  variable: streamflow
predicted: 
  sources: predictions.csv
  variable: streamflow
covariates:
  - sources: precipitation.csv
    variable: precipitation
    minimum: 0.25
    rescale_function: total
  - sources: temperature.csv
    variable: temperature
    maximum: 0
    rescale_function: minimum
time_scale:
  period: 24
  unit: hours
  function: mean

In this case, the subject of the evaluation is daily mean streamflow and the streamflow pairs will include only those valid times when the daily total precipitation equals or exceeds 0.25mm and the minimum daily temperature is at or below freezing.

Otherwise, all of the parameters that can be used to clarify an observed or predicted dataset can be used to clarify a covariate dataset (see How do I clarify the datasets to evaluate, such as the variable to use?).
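
The effect of the covariate filters in the last example can be sketched in Python. The sketch assumes that the covariates have already been rescaled to the 24-hour evaluation time scale; the daily values are made up and the names are hypothetical.

```python
# Hypothetical daily values at the evaluation time scale:
# (valid time, observed streamflow, predicted streamflow,
#  daily total precipitation in mm, daily minimum temperature in degrees C)
daily_records = [
    ("2023-01-01T00:00:00Z", 10.0, 11.0, 0.00, -3.0),
    ("2023-01-02T00:00:00Z", 12.0, 12.5, 0.25, 0.0),
    ("2023-01-03T00:00:00Z", 15.0, 14.0, 1.50, 2.0),
    ("2023-01-04T00:00:00Z", 14.0, 13.0, 0.80, -1.5),
]

PRECIPITATION_MINIMUM = 0.25  # the declared "minimum" for precipitation
TEMPERATURE_MAXIMUM = 0.0     # the declared "maximum" for temperature

# A pair is retained only when every covariate meets its condition.
# Both bounds are inclusive ("at or above" and "at or below").
retained_pairs = [
    (valid_time, obs, pred)
    for valid_time, obs, pred, precipitation, temperature in daily_records
    if precipitation >= PRECIPITATION_MINIMUM
    and temperature <= TEMPERATURE_MAXIMUM
]

for valid_time, obs, pred in retained_pairs:
    print(valid_time, obs, pred)
```

Only the second and fourth days are retained: the first day is too dry and the third day is too warm.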

3.3. How do I declare a dataset that composes multiple data sources or URIs?

You can declare multiple data sources for a dataset by listing them. In this regard, YAML has two styles for collections, such as arrays, lists and maps. The ordinary or “block” style includes one item on each line. For example, if the predicted dataset contains several URIs, they may be declared as follows:

observed: observed.csv
predicted: 
  - predictions.csv
  - more_predictions.csv
  - yet_more_predictions.csv

In this context, the dashes and indentations are important to preserve. You should use two spaces for each new level of indentation, as in the example above.

Alternatively, you may use the “flow” style, which places all items in a continuous list or array and uses square brackets to begin and end the list:

observed: observed.csv
predicted: [predictions.csv, more_predictions.csv, yet_more_predictions.csv]

3.4. How do I clarify the datasets to evaluate, such as the variable to use?

In some cases, it may be necessary to clarify the datasets to evaluate. For example, if a URI references a dataset that contains multiple variables, it may be necessary to clarify the variable to evaluate. In other cases, it may be necessary to clarify the time zone offset associated with the time-series or to apply additional parameters that filter data from a web service request.

When clarifying a property of a dataset, it is necessary to distinguish it from the other properties. For example, if a URI refers to a dataset that contains some missing values and the missing value identifier is not clarified by the source format itself, then it may be necessary to clarify this within the declaration:

observed:
  - uri: some_observations.csv
    missing_value: -999.0
  - more_observations.csv
predicted: some_predictions.csv

Here, some_observations.csv has acquired a uri property in order to distinguish it from the missing_value.

Likewise, it may be necessary to clarify some attribute of a dataset as a whole, such as the variable to evaluate (which applies to all sources of data within the dataset). In that case, it would be further necessary to distinguish the data sources from the variable:

observed:
  sources:
    - uri: some_observations.csv
      missing_value: -999.0
    - more_observations.csv
  variable: streamflow
predicted: some_predictions.csv

The following table contains the options that may be used to clarify either an observed or predicted dataset as of v6.14, with examples in context. You can also examine the schema (see Does the declaration language use a schema?), which defines the superset of all possible evaluations supported by WRES.

Option	Purpose	Examples in context
`sources`	To clarify the list of sources to evaluate when other options are present for the dataset as a whole.	observed: sources: some_observations.csv variable: streamflow predicted: some_predictions.csv
`uri`	To clarify the URI associated with a dataset when other options are present for the dataset associated with that URI.	observed: - uri: some_observations.csv missing_value: -999.0 predicted: some_predictions.csv
`variable`	To clarify the variable to evaluate when a data source contains multiple variables. Optionally, one or more variable `aliases` may be included, which will be treated as equivalent to the named variable.	observed: sources: some_observations.csv variable: streamflow predicted: some_predictions.csv observed: sources: some_observations.csv variable: name: HG aliases: [HT,HP] label: streamflow predicted: some_predictions.csv
`feature_authority`	To clarify the feature authority used to name features. This may be required when correlating feature names across datasets. For example, to correlate a USGS Site Code of `06893000` with a National Weather Service "Handbook 5" feature name of `KCDM7`, it is either necessary to explicitly correlate these two names in the declaration or it is necessary to use one of the names and to resolve the correlated feature with a feature service request. For this request to succeed, the feature service will need to know that `06893000` is a `usgs site code` or, equivalently, that `KCDM7` is an `nws lid`. The supported values for the `feature_authority` are `nws lid`, `usgs site code`, `nwm feature id` and `custom`, which is the default.	observed: sources: some_observations.csv feature_authority: usgs site code predicted: some_predictions.csv
`type`	In rare cases, it may be necessary to clarify the `type` of dataset. For example, when requesting time-series datasets from web services that support multiple types of data, it may be necessary to clarify the type of data required. The supported values for the `type` are `ensemble forecasts`, `single valued forecasts`, `observations`, `simulations` and `analyses`.	observed: some_observations.csv predicted: sources: some_predictions.csv type: single valued forecasts
`label`	A user-friendly label for the dataset, which will appear in the statistics formats, where appropriate.	observed: some_observations.csv predicted: label: a very special dataset sources: some_predictions.csv
`ensemble_filter`	A filter that selects a subset of the ensemble forecasts to include in the evaluation or exclude from the evaluation. Only applies to datasets that contain ensemble forecasts. By default, the named members are included.	observed: some_observations.csv predicted: sources: some_predictions.csv ensemble_filter: 2009 observed: some_observations.csv predicted: sources: some_predictions.csv ensemble_filter: members: - 2009 - 2010 exclude: false
`time_shift`	A time shift that is applied to the valid times associated with all time-series values. This may be used to help pair values whose times are not exactly coincident.	observed: sources: some_observations.csv time_shift: period: -2 unit: hours predicted: some_predictions.csv
`time_scale`	The timescale associated with the time-series values. This may be necessary when the timescale is not explicitly included in the source format. In general, a time-scale is only required when the time-series values must be rescaled in order to form pairs. For example, if the `observed` dataset contains instantaneous values and the `predicted` dataset contains values that represent a 6-hour average, then the `observed` time-series values must be "upscaled" to 6-hourly averages before they can be paired with their corresponding `predicted` values. Upscaling to a desired time scale is only possible if the existing timescale is known/declared.	observed: sources: some_observations.csv time_scale: function: mean period: 24 unit: hours predicted: some_predictions.csv
`time_zone_offset`	The time zone offset associated with the dataset. This is only necessary when the source format does not explicitly identify the time zone in which the timestamps are recorded. Accepts either a quantitative time zone offset or, less precisely, a time zone shorthand, such as `CST` (Central Standard Time). When using a numeric offset, the value must be enclosed within single or double quotes to clarify that it should be treated as a time zone offset and not a number.	observed: sources: some_observations.csv time_zone_offset: '-0600' predicted: some_predictions.csv
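
To illustrate the upscaling mentioned under `time_scale`, the following Python sketch averages hourly instantaneous values into 6-hour means. It assumes that each upscaled value summarizes the window that ends at its valid time and excludes the start of the window; that convention, the function name and the data are assumptions made for illustration and are not taken from the WRES.

```python
from datetime import datetime, timedelta, timezone


def upscale_mean(series, period, end_times):
    """Mean of the values in the window (end - period, end] for each end time."""
    upscaled = {}
    for end in end_times:
        window = [value for time, value in series if end - period < time <= end]
        if window:
            upscaled[end] = sum(window) / len(window)
    return upscaled


start = datetime(2023, 3, 25, 0, tzinfo=timezone.utc)

# Hypothetical hourly instantaneous observations at 01Z to 12Z with values 1 to 12.
hourly = [(start + timedelta(hours=h), float(h)) for h in range(1, 13)]

# Desired valid times for the 6-hour averages: 06Z and 12Z.
ends = [start + timedelta(hours=6), start + timedelta(hours=12)]

six_hourly = upscale_mean(hourly, timedelta(hours=6), ends)
for end, value in six_hourly.items():
    print(end.isoformat(), value)
```

The two upscaled values, 3.5 and 9.5, could then be paired with predicted values that already represent 6-hour averages at the same valid times.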

The following table contains the additional options that may be used to clarify a baseline dataset as of v6.14, with examples in context. For the avoidance of doubt, these options extend the options available for an observed or predicted dataset.

Option	Purpose	Examples in context
`persistence`	Allows for the declaration of a persistence baseline from a prescribed data source. The persistence time-series will be generated using the specified `order` or "lag", which corresponds to the value before the current time that will be persisted forwards into the future. For example, "1" means that the value from the persistence source that occurs one timestep prior to the current time will be persisted forwards. In this context, "current time" means the valid time of a non-forecast source or the reference time of a forecast source. The default value for the `order` is 1.	observed: some_observations.csv predicted: some_predictions.csv baseline: sources: some_observations.csv method: persistence observed: some_observations.csv predicted: some_predictions.csv baseline: sources: some_observations.csv method: name: persistence order: 1
`climatology`	Allows for the declaration of a climatology baseline from a prescribed data source. For a given valid time, the climatology will contain the value from the prescribed data source at the corresponding valid time in each historical year of record, other than the year associated with the valid time (which is typically the "verifying observation"). The period associated with the climatology may be further constrained by a `minimum_date` and/or a `maximum_date`. Optionally, the climatology may be converted to a single-valued dataset by prescribing an `average`. The supported values for the `average` are `mean` and `median`. The default value for the `average` is `mean`.	observed: some_observations.csv predicted: some_predictions.csv baseline: sources: some_observations.csv method: climatology observed: some_observations.csv predicted: some_predictions.csv baseline: sources: some_observations.csv method: name: climatology minimum_date: 1980-01-01T00:00:00Z maximum_date: 2020-12-31T23:59:59Z average: median
`separate_metrics`	A flag (`true` or `false`) that indicates whether the same metrics computed for the `predicted` dataset should also be computed for the `baseline` dataset. When `true`, all metrics will be computed for the `baseline` dataset, otherwise the `baseline` will only appear in skill calculations for the `predicted` dataset.	observed: some_observations.csv predicted: some_predictions.csv baseline: sources: some_other_predictions.csv separate_metrics: true
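
The `climatology` description above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the WRES implementation: it matches on month, day, hour and minute, it ignores leap days and the optional date constraints, and the function names and data are hypothetical.

```python
from datetime import datetime, timezone
from statistics import mean, median


def same_time_of_year(a, b):
    """True when two datetimes share the month, day, hour and minute."""
    return (a.month, a.day, a.hour, a.minute) == (b.month, b.day, b.hour, b.minute)


def climatology(source, valid_time, average=None):
    """Values at the corresponding time in every year other than the valid year."""
    members = [
        value
        for time, value in source
        if same_time_of_year(time, valid_time) and time.year != valid_time.year
    ]
    if average == "mean":
        return mean(members)
    if average == "median":
        return median(members)
    return members  # no averaging: one member per historical year


# Hypothetical record: one value per year at 12Z on 1 June.
record = [
    (datetime(year, 6, 1, 12, tzinfo=timezone.utc), value)
    for year, value in [(2018, 4.0), (2019, 10.0), (2020, 6.0), (2021, 8.0)]
]

target = datetime(2020, 6, 1, 12, tzinfo=timezone.utc)
members = climatology(record, target)
print(members)                                # [4.0, 10.0, 8.0]
print(climatology(record, target, "median"))  # 8.0
```

The value for 2020 is excluded because it belongs to the same year as the valid time, leaving one member for each remaining year of record.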

The following table contains the additional options that may be used to clarify covariates as of v6.23, with examples in context. For the avoidance of doubt, these options extend the options available for an observed or predicted dataset.

Option	Purpose	Examples in context
`minimum`	Allows for the declaration of a `minimum` value the covariate should take. Only those evaluation pairs for which the covariate value is at or above the `minimum` value at the same valid time will be considered. The measurement unit is the unit in which the covariate dataset is supplied. The time scale is the evaluation time scale.	observed: some_observations.csv predicted: some_predictions.csv covariates: - sources: some_covariate_observations.csv minimum: 14
`maximum`	Allows for the declaration of a `maximum` value the covariate should take. Only those evaluation pairs for which the covariate value is at or below the `maximum` value at the same valid time will be considered. The measurement unit is the unit in which the covariate dataset is supplied. The time scale is the evaluation time scale.	observed: some_observations.csv predicted: some_predictions.csv covariates: - sources: some_covariate_observations.csv maximum: 28
`rescale_function`	A function to use when rescaling the covariate dataset to the evaluation time scale.	observed: some_observations.csv predicted: some_predictions.csv covariates: - sources: some_covariate_observations.csv maximum: 28 rescale_function: total

4. How do I declare geographic features to evaluate?

4.1. When do I need to declare geographic features?

Geographic features may be declared explicitly by listing each feature to evaluate (How do I declare a list of geographic features to evaluate?). Alternatively, they may be declared implicitly, either by declaring a named region to evaluate (How do I declare a region to evaluate without listing all of the features within it?) or by declaring a geospatial mask (How do I declare a spatial mask to evaluate only a subset of features?).

There are three scenarios in which you should declare the geographic features to evaluate, namely:

When the declared datasets contain more features than you would like to evaluate, i.e., when you would like to evaluate a subset of the features for which data is available;
When you are reading data from a web-service, otherwise the evaluation would request a potentially unlimited amount of data; or
When there are multiple geographic features present within the declared datasets and two or more of the datasets use a different feature naming authority. In these circumstances, it is necessary to declare how the features are correlated with each other.

If you fail to declare the geographic features in any of these scenarios, you can expect an error message.

Conversely, it is unnecessary to declare the geographic features to evaluate when:

There is a single geographic feature in each dataset; or
There are multiple geographic features and:
- All of the datasets use a consistent feature naming authority; and
- The evaluation should include all of the features discovered.

4.2. How do I declare a list of geographic features to evaluate?

Different datasets may name geographic features differently. Formally speaking, they may use different “feature authorities”. For example, time-series data from the USGS National Water Information System (NWIS) uses a USGS Site Code, whereas time-series data from the National Water Model uses a National Water Model feature ID.

As such, the software allows for as many feature names as sides of data, i.e., three (observed, predicted and baseline). This is referred to as a “feature tuple”.

When all sides of data have the same feature authority, and this can be established from the other declaration present, it is sufficient to declare the name for only one side of data. Otherwise, the fully qualified feature tuple must be declared or a feature service used to establish the missing names.

In the simplest case, where each side of data has the same feature authority and the aim is to pair corresponding feature names, the features may be declared as follows:

features:
  - DRRC2
  - DOLC2

Where DRRC2 and DOLC2 are the names of two geographic features in the National Weather Service “Handbook 5” feature authority. In this example, the evaluation will produce statistics separately for each of DRRC2 and DOLC2.

In the more complex case, where each side of data has a separate feature authority or the feature authority cannot be determined from the other information present, then the features must be declared separately for each side of data, as follows:

features:
  - {observed: '07140900', predicted: '21215289'}
  - {observed: '07141900', predicted: '941030274'}

In this example, the feature authority for the observed data is a USGS Site Code and the feature authority for the predicted data is a National Water Model feature ID. The quotes around the names indicate that the values should be treated as characters, rather than numbers.
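
To see why the quotes matter, consider what happens when a USGS Site Code is treated as a number. The following Python sketch is illustrative only; it does not reproduce how the WRES or any particular YAML parser reads an unquoted value.

```python
site_code = "07140900"  # a feature name, kept as characters

as_number = int(site_code)   # numeric interpretation of the same digits
round_trip = str(as_number)  # converted back to characters

print(as_number)                # 7140900
print(round_trip == site_code)  # False: the leading zero was lost
```

Because the leading zero is part of the feature name, a numeric interpretation would no longer match the name used by the data source.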

4.3. Can I declare a geographic feature authority explicitly?

Yes. Often, this is unnecessary because the software can determine the feature authority from the other information present. For example, consider the following declaration:

observed:
  sources:
    - uri: https://nwis.waterservices.usgs.gov/nwis/iv
      interface: usgs nwis
  variable:
    name: '00060'
predicted:
  sources:
    - uri: data/nwmVector/
      interface: nwm short range channel rt conus
  variable: streamflow

In this case, it is unambiguous that the observed data uses a USGS Site Code because the source interface is usgs nwis. Likewise, the predicted data uses a National Water Model feature ID because the source interface is a National Water Model type, nwm short range channel rt conus. In short, if the source interface is declared, it should be unnecessary to define the geographic feature authority.

In other cases, time-series data may be obtained from a file source whose metadata is unclear about the feature authority. In fact, none of the time-series data formats currently supported by WRES include information about the feature authority. In this case, the feature authority may be declared explicitly:

observed:
  sources: data/DRRC2QINE.xml
  feature_authority: nws lid
predicted:
  sources: data/drrc2ForecastsOneMonth/
  feature_authority: nws lid

The above makes the following a valid declaration:

observed:
  sources: data/DRRC2QINE.xml
  feature_authority: nws lid
predicted:
  sources: data/drrc2ForecastsOneMonth/
  feature_authority: nws lid
features:
  - DRRC2

Conversely, in the absence of the declared feature_authority for each side of data, this would be required:

observed:
  sources: data/DRRC2QINE.xml
predicted:
  sources: data/drrc2ForecastsOneMonth/
features:
  - {observed: DRRC2, predicted: DRRC2}

4.4. What if I don’t know the relationship between different geographic features?

If you are using datasets with different feature authorities and are either unaware of how features relate to each other across the different feature authorities or prefer not to declare them manually, then you can use the Water Resources Data Service (WRDS) feature service to establish feature correlations. The WRDS is available to those with access to web services hosted at the National Water Center (NWC) in Alabama. The WRDS hostname is omitted below; if you need the hostname, refer to the COWRES user support wiki or contact the WRES team.

A feature service may be declared as follows:

feature_service: https://[WRDS]/api/location/v3.0/metadata

Where [WRDS] is the host name of the WRDS feature service.

The WRES can ask the WRDS feature service to resolve feature correlations, providing it knows how to pose the question correctly. To pose the question correctly, it must know the feature authority associated with each of the feature names that need to be correlated.

For example, consider the following declaration:

observed:
  sources:
    - uri: https://nwis.waterservices.usgs.gov/nwis/iv
      interface: usgs nwis
  variable:
    name: '00060'
predicted:
  sources:
    - uri: data/nwmVector/
      interface: nwm short range channel rt conus
  variable: streamflow
feature_service: https://[WRDS]/api/location/v3.0/metadata
features:
  - observed: '07140900'
  - observed: '07141900'

In this case, the feature authority of the observed data is a USGS Site Code (the interface is usgs nwis) and the feature authority of the predicted data is a National Water Model feature ID. This allows the WRES to pose a valid question to the WRDS feature service, namely “what are the National Water Model feature IDs that correspond to USGS Site Codes ‘07140900’ and ‘07141900’?”. It is important to note that each feature must be qualified as observed because the feature names are expressed as USGS Site Codes and the observed data uses this feature authority.

4.5. How do I declare a region to evaluate without listing all of the features within it?

You may use the WRDS Feature Service to acquire a list of features for a named geographic region, such as a River Forecast Center (RFC). The WRDS is available to those with access to web services hosted at the National Water Center (NWC) in Alabama. The WRDS hostname is omitted below; if you need the hostname, refer to the COWRES user support wiki or contact the WRES team.

For example, consider the following declaration, which requests all named features within the Arkansas-Red Basin RFC:

feature_service:
  uri: https://[WRDS]/api/location/v3.0/metadata
  group: RFC
  value: ABRFC

Where [WRDS] is the host name of the WRDS feature service. Here, the name of the geographic group understood by WRDS is RFC and the chosen value is ABRFC.

In this example, each of the geographic features contained within ABRFC, as understood by WRDS, would be included in the evaluation. To include features from multiple regions, simply list the individual regions. For example, to additionally include features from the California Nevada RFC:

feature_service:
  uri: https://[WRDS]/api/location/v3.0/metadata
  groups:
    - group: RFC
      value: ABRFC
    - group: RFC
      value: CNRFC

By default, each geographic feature is evaluated separately. However, to pool all of the geographic features together and produce a single set of statistics for the overall group, the pool attribute may be declared:

feature_service:
  uri: https://[WRDS]/api/location/v3.0/metadata
  groups:
    - group: RFC
      value: ABRFC
      pool: true

4.6. How do I declare a spatial mask to evaluate only a subset of features?

You can declare a spatial mask that defines the geospatial boundaries for an evaluation. This requires a Well Known Text (WKT) string. For example:

spatial_mask: 'POLYGON ((-76.825 39.225, -76.825 39.275, -76.775 39.275, -76.775 39.225, -76.825 39.225))'

In this case, the evaluation will only include (e.g., gridded) locations that fall within the boundaries of the supplied polygon.

Optionally, you may name the region and include a Spatial Reference System Identifier (SRID), which unambiguously describes the coordinate reference system for the supplied WKT (see https://en.wikipedia.org/wiki/Spatial_reference_system):

spatial_mask: 
  name: Region south of Ellicott City, MD
  wkt: 'POLYGON ((-76.825 39.225, -76.825 39.275, -76.775 39.275, -76.775 39.225, -76.825 39.225))'
  srid: 4326
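
To get a feel for what the mask selects, the following Python sketch tests whether a location falls inside the example polygon. It is a simplified, planar illustration under stated assumptions: it handles only a single-ring POLYGON, ignores the SRID, and uses a ray-casting test whose treatment of points exactly on the boundary may differ from the WRES. The function names and test locations are hypothetical.

```python
def parse_wkt_polygon(wkt):
    """Parse the outer ring of a simple 'POLYGON ((x y, x y, ...))' string."""
    inner = wkt.strip()[len("POLYGON"):].strip().strip("()")
    return [tuple(float(c) for c in pair.split()) for pair in inner.split(",")]


def contains(ring, x, y):
    """Ray-casting test: count the ring edges crossed by a ray running east."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # the edge straddles the latitude of the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


wkt = ("POLYGON ((-76.825 39.225, -76.825 39.275, -76.775 39.275, "
       "-76.775 39.225, -76.825 39.225))")
ring = parse_wkt_polygon(wkt)

print(contains(ring, -76.80, 39.25))  # True: inside the mask
print(contains(ring, -76.70, 39.25))  # False: east of the mask
```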

4.7. How do I declare a datum offset for each feature?

When evaluating an elevation variable, such as river stage, one or more of the declared datasets may be referenced to a different elevation datum than the remaining datasets. For example, the observed river stage may be referenced to a local gauge datum and the predicted river stage may be referenced to mean sea level. To reconcile these measurements to a common datum for comparison and evaluation, a datum offset may be declared for the relevant dataset associated with each geographic feature. The datum offset is then added to the existing elevation before pairs and statistics are computed. The datum offset is declared in evaluation units. In the following example, a datum offset of -975 feet is added to the observed elevation data associated with feature 15478000 (i.e., 975 feet is subtracted), while no offset is applied to the corresponding predicted feature, BGDA2:

unit: '[ft_i]'
features:
  - observed:
      name: '15478000'
      offset: -975
    predicted: BGDA2
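
The arithmetic implied by the declared offset is simple addition, as the following Python sketch shows. The stage values are made up for illustration.

```python
DATUM_OFFSET_FT = -975.0  # the offset declared for feature '15478000'

# Hypothetical observed stages in feet, referenced to the original datum.
observed_stage_ft = [980.0, 981.5, 979.25]

# The offset is added to each value before pairs are formed.
adjusted_stage_ft = [stage + DATUM_OFFSET_FT for stage in observed_stage_ft]

print(adjusted_stage_ft)  # [5.0, 6.5, 4.25]
```

The adjusted observed values can then be compared with the predicted values for BGDA2, to which no offset is applied.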

4.8. How do I calculate statistics for a group of features?

Calculating statistics for a group of features is described in Pooling geographic features.

5. How do I filter the time-series data to consider only specific dates or times?

5.1. When should I filter the time-series data?

You should filter the time-series data in either of these scenarios:

When the goal is to evaluate only a subset of the available time-series data; or
When reading data from a web service, otherwise the evaluation would request a potentially unlimited amount of data.

5.2. What timelines are understood by WRES and how do I constrain them?

An evaluation may be composed of up to three timelines, depending on the type of data to evaluate:

Valid times. These are the ordinary datetimes at which values are recorded. For example, if streamflow is observed at 2023-03-25T12:00:00Z, then its “valid time” is 2023-03-25T12:00:00Z.
Reference times. These are the times to which forecasts are referenced. In practice, there are different flavors of forecast reference times, such as forecast “issued times”, which may correspond to the times at which forecast products are released to the public, or “T0s”, which may correspond to the times at which a forecast model begins forward integration. However, as of v6.14, all reference times are treated as equivalent.
Lead times. These are durations rather than datetimes and refer to the period elapsed between a forecast reference time and a forecast valid time. For example, if a forecast is issued at 2023-03-25T12:00:00Z and valid at 2023-03-25T13:00:00Z, then its lead time is “1 hour”.

The last two timelines only apply to forecast datasets.
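
The relationship between the three timelines can be sketched in Python: a lead time is the valid time minus the reference time. The helper name below is hypothetical.

```python
from datetime import datetime


def parse_utc(text):
    """Parse an ISO 8601 datetime string that ends in 'Z' (UTC)."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


reference_time = parse_utc("2023-03-25T12:00:00Z")  # when the forecast was issued
valid_time = parse_utc("2023-03-25T13:00:00Z")      # when the value applies

lead_time = valid_time - reference_time
print(lead_time)  # 1:00:00
```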

Datetimes are always declared using an ISO8601 datetime string in Coordinated Universal Time (UTC), aka Zulu (Z) time. Further information about ISO8601 can be found here: https://en.wikipedia.org/wiki/ISO_8601

Each of these timelines can be constrained or filtered so that the evaluation only considers data between the prescribed datetimes. These bounds always form a closed interval, meaning that times that fall exactly on either boundary are included.

Consider the following declaration of a valid time interval:

valid_dates:
  minimum: 2017-08-07T23:00:00Z
  maximum: 2017-08-09T17:00:00Z

In this case, the evaluation will consider all time-series values whose valid times are between 2017-08-07T23:00:00Z and 2017-08-09T17:00:00Z, inclusive.

The following is also accepted:

valid_dates:
  minimum: 2017-08-07T23:00:00Z

In this case, there is a lower bound or minimum date, but no upper bound, so the evaluation will consider time-series values whose valid times occur on or after 2017-08-07T23:00:00Z.
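
The inclusive bounds and the optional upper bound can be illustrated with a short Python sketch. This is not how the WRES implements the filter; the function names parse_utc and within_valid_dates are hypothetical and exist only for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_utc(text: str) -> datetime:
    """Parse an ISO8601 datetime string that ends in 'Z' (UTC)."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def within_valid_dates(valid_time: datetime,
                       minimum: Optional[datetime] = None,
                       maximum: Optional[datetime] = None) -> bool:
    """Return True if valid_time is inside the bounds; a missing bound is unbounded."""
    if minimum is not None and valid_time < minimum:
        return False
    if maximum is not None and valid_time > maximum:
        return False
    return True


minimum = parse_utc("2017-08-07T23:00:00Z")
maximum = parse_utc("2017-08-09T17:00:00Z")

# A time exactly on the lower boundary is included.
on_lower_bound = within_valid_dates(parse_utc("2017-08-07T23:00:00Z"), minimum, maximum)
# A time after the upper boundary is excluded when a maximum is declared...
after_upper_bound = within_valid_dates(parse_utc("2017-08-09T18:00:00Z"), minimum, maximum)
# ...but included when only a minimum is declared.
no_upper_bound = within_valid_dates(parse_utc("2017-08-09T18:00:00Z"), minimum)
```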

A reference time interval may be declared in a similar way:

reference_dates:
  minimum: 2017-08-07T23:00:00Z
  maximum: 2017-08-08T23:00:00Z

Finally, lead times may be constrained like this:

lead_times:
  minimum: 0
  maximum: 18
  unit: hours

In this example, the evaluation will only consider forecast values whose lead times are between 0 hours and 18 hours, inclusive.
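
As a rough illustration, a lead time is the difference between a valid time and a reference time, and the lead time filter is an inclusive comparison of that duration. The following Python sketch uses hypothetical helper names (lead_time, within_lead_times) and assumes that all datetimes are in UTC.

```python
from datetime import datetime, timedelta


def lead_time(reference_time: datetime, valid_time: datetime) -> timedelta:
    """The period elapsed between a forecast reference time and a valid time."""
    return valid_time - reference_time


def within_lead_times(lead: timedelta, minimum_hours: float, maximum_hours: float) -> bool:
    """Return True if the lead time is within the inclusive bounds, in hours."""
    return timedelta(hours=minimum_hours) <= lead <= timedelta(hours=maximum_hours)


reference = datetime(2023, 3, 25, 12)  # forecast issued at 2023-03-25T12:00:00Z

one_hour = lead_time(reference, datetime(2023, 3, 25, 13))      # valid one hour later
nineteen_hours = lead_time(reference, datetime(2023, 3, 26, 7))  # valid 19 hours later

kept = within_lead_times(one_hour, 0, 18)           # inside 0 to 18 hours
dropped = within_lead_times(nineteen_hours, 0, 18)  # beyond 18 hours
```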

5.3. How do I constrain analysis times?

When using model analyses in an evaluation, these analyses are sometimes referenced to the model initialization time, which is a particular flavor of reference time. For example, the National Water Model can cycle for hourly periods prior to the forecast initialization time and produce an “analysis” for each period. These analysis durations may be constrained in WRES.

For example, consider the following declaration:

analysis_times:
  minimum: -2
  maximum: 0
  unit: hours

In this case, the evaluation will consider analysis cycles that are less than 2 hours before the model initialization time, up to the initialization time of 0 hours.

5.4. How do I declare a seasonal evaluation?

The WRES allows for a seasonal evaluation to be declared through a season filter. The season filter will apply to the valid times associated with the pairs when both sides of the pairing contain non-forecast sources (i.e., there are no reference times present); otherwise it will apply to the reference times (i.e., when one or both sides of the pairing contain forecasts).

A seasonal evaluation is declared with a minimum day and month and a maximum day and month. For example:

season:
  minimum_day: 1
  minimum_month: 4
  maximum_day: 31
  maximum_month: 7

In this example, the evaluation will consider only those pairs whose valid times (non-forecast sources) or reference times (forecast sources) fall between 0Z on 1 April and an instant before 0Z on 1 August (i.e., the very last time on 31 July).
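
The season test described above can be sketched in Python as a comparison of (month, day) pairs. The function name in_season is hypothetical, and the sketch assumes a season that does not wrap around the end of the calendar year, which is the only case described here.

```python
from datetime import datetime


def in_season(time: datetime,
              minimum_day: int, minimum_month: int,
              maximum_day: int, maximum_month: int) -> bool:
    """Return True if the (month, day) of the time lies within the season, inclusive.

    Assumes the season does not wrap around the end of the calendar year.
    """
    lower = (minimum_month, minimum_day)
    upper = (maximum_month, maximum_day)
    return lower <= (time.month, time.day) <= upper


# 1 April to 31 July, as in the declaration above
first_instant = in_season(datetime(2020, 4, 1, 0, 0), 1, 4, 31, 7)
last_day = in_season(datetime(2020, 7, 31, 23, 59), 1, 4, 31, 7)
too_late = in_season(datetime(2020, 8, 1, 0, 0), 1, 4, 31, 7)
```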

5.5. How do I perform event detection?

Event detection allows for the automated detection of periods of interest or "events" within time-series datasets for subsequent evaluation. The concept and declaration of event detection are described here: Event detection.

6. How do I declare the measurement units for the evaluation?

6.1. What units does WRES accept and how do I declare them?

The desired measurement units are declared as follows:

unit: m3/s

The unit may be any valid unit from the Unified Code for Units of Measure (UCUM). In addition, the WRES will accept several informal measurement units that are widely used in hydrology, such as CFS (cubic feet per second, formal UCUM unit [ft_i]3/s), CMS (cubic meters per second, formal UCUM unit m3/s) and IN (inches, formal UCUM unit [in_i]).

Further details on units of measurement can be found in a separate wiki, Units of measurement.

6.2. What happens if WRES does not understand a unit supplied in a data source?

If a data source contains a measurement unit that is unrecognized by WRES, you may receive an UnrecognizedUnitException indicating that a measurement unit alias should be defined. A measurement unit alias is a mapping between an unrecognized or informal measurement unit, known as an alias, and a formal UCUM unit, known as a unit. For example, consider the following declaration:

unit: K
unit_aliases:
  - alias: °F
    unit: '[degF]'
  - alias: °C
    unit: '[cel]'

In this example, °F and °C are informal measurement units whose corresponding UCUM units are [degF] and [cel], respectively. The desired measurement unit is K, or kelvin. By declaring unit_aliases, the WRES will understand that any references to °F should be interpreted as the formal unit [degF] and any references to °C should be interpreted as the formal unit [cel]. This will allow the software to convert the informal units of °F and °C, on the one hand, to the formal unit of K, on the other.

Further information about units of measurement and aliases can be found in a separate wiki, Units of measurement.

7. How do I filter the time-series data to consider only values within a range?

In some cases, it is necessary to omit values that fall outside a particular range. For example, it may be desirable to only evaluate precipitation forecasts whose values are greater than an instrument detection limit. Restricting values to a particular range is achieved by declaring the minimum and/or maximum values that the evaluation should consider, as follows:

unit: mm
values:
  minimum: 0.0
  maximum: 100.0

In this example, only those values (observed, predicted and baseline) that fall within the range 0mm to 100mm will be considered. The values are always declared in evaluation units. Mechanically speaking, any values that fall outside this range will be assigned the default missing value identifier.

Optionally, however, values that fall outside of the nominated range may be assigned another value. For example:

unit: mm
values:
  minimum: 0.25
  maximum: 100.0
  below_minimum: 0.0
  above_maximum: 100.0

In this example, values that are less than 0.25mm will be assigned a value of 0mm (the below_minimum value) and values above 100mm will be assigned a value of 100mm (the above_maximum value).
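
The behavior described in this section can be sketched in Python as follows. The function name constrain is hypothetical, and NaN is used here only as a stand-in for the default missing value identifier, whose actual value is not stated in this section.

```python
import math
from typing import Optional

MISSING = math.nan  # stand-in for the default missing value identifier (assumption)


def constrain(value: float,
              minimum: float,
              maximum: float,
              below_minimum: Optional[float] = None,
              above_maximum: Optional[float] = None) -> float:
    """Apply the declared range to one value, in evaluation units."""
    if value < minimum:
        # Out of range: use the replacement if declared, otherwise mark as missing
        return MISSING if below_minimum is None else below_minimum
    if value > maximum:
        return MISSING if above_maximum is None else above_maximum
    return value


# With replacement values declared, as in the second declaration above
trace = constrain(0.1, 0.25, 100.0, below_minimum=0.0, above_maximum=100.0)
extreme = constrain(150.0, 0.25, 100.0, below_minimum=0.0, above_maximum=100.0)
ordinary = constrain(50.0, 0.25, 100.0, below_minimum=0.0, above_maximum=100.0)

# Without replacement values, as in the first declaration above
negative = constrain(-1.0, 0.0, 100.0)
```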

8. How do I declare thresholds to use in an evaluation?

8.1. What types of thresholds are supported and how do I declare them?

There are three flavors of thresholds that may be declared:

Ordinary thresholds (thresholds), which are real-valued. If not otherwise declared, the threshold values are assumed to be in the same measurement units as the evaluation;
Probability thresholds (probability_thresholds) whose values must fall within the interval [0,1]. These are converted into real-valued thresholds by finding the corresponding quantile of the observed dataset; and
Classifier thresholds (classifier_thresholds) whose values must fall within the interval [0,1]. These are used to convert probability forecasts into dichotomous (yes/no) forecasts.

The simplest use of thresholds may look like this, in context:

observed: some_observations.csv
predicted: some_forecasts.csv
unit: ft
thresholds: 12.3

In this case, the evaluation will consider only those pairs of observed and predicted values where the observed value exceeds 12.3 FT.

There are several other attributes that may be declared alongside the threshold value(s). For example, consider this declaration:

observed: some_observations.csv
predicted: some_forecasts.csv
unit: m
thresholds:
  name: MAJOR FLOOD
  values: 
    - { value: 23.0, feature: DRRC2 }
    - { value: 27.0, feature: DOLC2 }
  operator: greater equal
  apply_to: predicted
  unit: ft

In this example, the evaluation will consider only those pairs of observed and predicted values at DRRC2 where the predicted value is greater than or equal to 23.0 FT and only those paired values at DOLC2 where the predicted value is greater than or equal to 27.0 FT. Further, for both locations, this threshold will be labelled MAJOR FLOOD. The evaluation itself will be conducted in units of m (meters), so these thresholds will be converted from ft to m prior to evaluation.

The acceptable values for the operator include:

greater;
greater equal;
less;
less equal;
between; and
equal.

When declaring thresholds that are between two values, x and y, then these two values form the left-closed interval, [x,y) or x <= value < y. If only one threshold value is declared, then the upper bound is assumed to be positive infinity. If more than two threshold values are declared, then a new interval will be formed from each new value in the sequence (after the values have been sorted in ascending order of magnitude). For example, when declaring three values, x, y and z where x < y < z, then the following two thresholds will be formed: [x,y) and [y,z).
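
The rule for forming intervals from a list of between threshold values can be sketched in Python. The function names between_intervals and is_between are hypothetical; the sketch simply follows the description above.

```python
import math
from typing import List, Sequence, Tuple


def between_intervals(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Form left-closed intervals [lower, upper) from the declared values."""
    ordered = sorted(values)
    if len(ordered) == 1:
        # A single value is bounded above by positive infinity
        return [(ordered[0], math.inf)]
    # Each consecutive pair of sorted values forms one interval
    return list(zip(ordered[:-1], ordered[1:]))


def is_between(value: float, interval: Tuple[float, float]) -> bool:
    """Return True if lower <= value < upper."""
    lower, upper = interval
    return lower <= value < upper


three_values = between_intervals([3.0, 1.0, 2.0])  # [x,y) and [y,z) after sorting
one_value = between_intervals([5.0])
```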

The acceptable values for the apply_to include:

observed: include the pair when the condition is met for the observed value;
predicted: include the pair when the condition is met for the predicted value (or baseline predicted value for baseline pairs);
observed and predicted: include the pair when the condition is met for both the observed and predicted values (or baseline predicted value for baseline pairs);
any predicted: include the pair when the condition is met for any of the predicted values within an ensemble (or baseline predicted value for baseline pairs);
observed and any predicted: include the pair when the condition is met for both the observed value and for any of the predicted values within an ensemble (or baseline predicted value for baseline pairs);
predicted mean: include the pair when the condition is met for the ensemble mean of the predicted values (or baseline predicted value for baseline pairs); and
observed and predicted mean: include the pair when the condition is met for both the observed value and the ensemble mean of the predicted values (or baseline predicted value for baseline pairs).

The apply_to is only relevant when filtering pairs for metrics that apply to continuous variables, such as the mean error (e.g., of streamflow predictions), and not when transforming pairs, such as converting continuous pairs to probabilistic or dichotomous pairs. For the latter, both sides of the pairing are always transformed, by definition.
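
To illustrate how the operator and apply_to combine when filtering pairs, consider the following Python sketch. It covers single-valued pairs only, omits the between operator and the ensemble options, and uses a hypothetical function name, keep_pair. It is an illustration of the description above, not the WRES implementation.

```python
import operator
from typing import Callable, Dict

# Map a subset of the declared operator names onto comparisons
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "greater": operator.gt,
    "greater equal": operator.ge,
    "less": operator.lt,
    "less equal": operator.le,
    "equal": operator.eq,
}


def keep_pair(observed: float, predicted: float, threshold: float,
              op: str, apply_to: str) -> bool:
    """Return True if a single-valued pair meets the threshold condition."""
    test = OPERATORS[op]
    if apply_to == "observed":
        return test(observed, threshold)
    if apply_to == "predicted":
        return test(predicted, threshold)
    if apply_to == "observed and predicted":
        return test(observed, threshold) and test(predicted, threshold)
    raise ValueError(f"Unsupported apply_to in this sketch: {apply_to}")


# (observed, predicted) pairs in the same units as the threshold
pairs = [(22.0, 24.0), (25.0, 21.0), (26.0, 23.0)]

by_predicted = [p for p in pairs if keep_pair(p[0], p[1], 23.0, "greater equal", "predicted")]
by_both = [p for p in pairs if keep_pair(p[0], p[1], 23.0, "greater equal", "observed and predicted")]
```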

The probability thresholds and classifier thresholds may be declared in a similar way. For example:

observed: some_observations.csv
predicted: some_forecasts.csv
unit: ft
probability_thresholds: [0.1,0.5,0.9]

In this example, the evaluation will consider only those pairs of observed and predicted values where the observed value is greater than each of the 10th, 50th and 90th percentiles of the observed values.

8.2. What if I want to declare different thresholds for different metrics?

All of the declaration options for thresholds that are applied to the evaluation as a whole can be applied equally to individual metrics within the evaluation, if desired. For example, consider the following declaration:

observed: some_file.csv
predicted: another_file.csv
unit: ft
metrics:
  - name: mean square error skill score
    thresholds: 23
  - name: pearson correlation coefficient
    probability_thresholds:
      values: [0.1,0.2]
      operator: greater equal

In this example, the mean square error skill score will be computed for those pairs of observed and predicted values where the observed value exceeds 23.0 FT. Meanwhile, the pearson correlation coefficient will be computed for those pairs of observed and predicted values where the observed value is greater than or equal to the 10th percentile of observed values and, separately, the 20th percentile of observed values.

8.3. Can I obtain thresholds from external data sources?

Yes. An evaluation may declare thresholds from one or both of these external sources:

The Water Resources Data Service (WRDS) threshold service; and
Comma separated values (CSV) from a file on the default filesystem.

8.4. How do I declare thresholds from the Water Resources Data Service (WRDS)?

For those users with access to the WRDS threshold service, the WRES will request thresholds from the WRDS when declared. The WRDS is available to those with access to web services hosted at the National Water Center (NWC) in Alabama. The WRDS hostname is omitted below; if you need the hostname, refer to the COWRES user support wiki or contact the WRES team. Consider the following declaration:

observed:
  sources: data/CKLN6_STG.xml
  feature_authority: nws lid
predicted: data/CKLN6_HEFS_STG_forecasts.tgz
features:
  - observed: CKLN6
threshold_sources: https://[WRDS]/api/location/v3.0/nws_threshold/

Where [WRDS] is the host name for the WRDS production service (to be inserted). Note the use of feature_authority, which is important in this context. In particular, it allows WRES to pose a complete/accurate request to WRDS, namely “please provide the streamflow thresholds associated with an NWS LID of CKLN6”. By default, the WRES will request streamflow thresholds unless otherwise declared.

Consider a more complicated declaration:

observed:
  sources:
    - uri: https://nwis.waterservices.usgs.gov/nwis/iv
      interface: usgs nwis
  variable:
    name: '00060'
predicted:
  sources:
    - uri: data/nwmVector/
      interface: nwm short range channel rt conus
  variable: streamflow
features:
  - {observed: '07140900', predicted: '21215289'}
  - {observed: '07141900', predicted: '941030274'}
threshold_sources:
  uri: https://[WRDS]/api/location/v3.0/nws_threshold/
  parameter: stage
  provider: NWS-NRLDB
  rating_provider: NRLDB
  missing_value: -999.0
  feature_name_from: predicted

In this example, the WRES will ask WRDS to provide all thresholds for the parameter of stage, the provider of NWS-NRLDB, and the rating_provider of NRLDB and for those geographic features with NWM feature IDs of 21215289 and 941030274. Furthermore, the evaluation will consider any threshold values of -999.0 to be missing values.

8.5. How do I declare thresholds to read from CSV files?

Thresholds may be read from CSV files in a similar way to thresholds from the Water Resources Data Service (WRDS). For example, consider the following declaration:

threshold_sources: data/thresholds.csv

In this example, thresholds will be read from the path data/thresholds.csv on the default filesystem. By default, they will be treated as ordinary, real-valued, thresholds in the same units as the evaluation and for the same variable.

The options available to qualify thresholds from WRDS are also available to qualify thresholds from CSV files. For example, consider the following declaration:

threshold_sources:
  - uri: data/thresholds.csv
    missing_value: -999.0
    feature_name_from: observed
  - uri: data/more_thresholds.csv
    missing_value: -999.0
    feature_name_from: predicted
    type: probability

In this example, thresholds will be read from two separate paths on the default filesystem, namely data/thresholds.csv and data/more_thresholds.csv. The thresholds from data/thresholds.csv will be treated as ordinary, real-valued, thresholds whose feature names correspond to the observed dataset. Conversely, the thresholds from data/more_thresholds.csv will be treated as probability thresholds whose feature names correspond to the predicted dataset. In both cases, values of -999.0 are considered to be missing values.

By way of example, the CSV format should contain a location or geographic feature identifier in the first column, labelled locationId, and one conceptual threshold per column in the remaining columns, with each column header containing the name of that threshold, if appropriate (otherwise blank), and each row containing a separate location:

locationId, ACTION, MINOR FLOOD
CKLN6, 10, 12
WALN6, 7.5, 9.5

9. How do I declare the pools of data that should be evaluated separately?

A “pool” is the atomic unit of paired data from which a statistic is computed. Typically, there are many pools of pairs in each evaluation. For example, consider pooling over time, or temporal pooling: if the goal is to evaluate a collection of forecasts at each forecast lead time, separately, and all of the forecasts contain 3-hourly lead times for 2 days, then there are 24/3*2=16 lead times and hence 16 pools of data to evaluate.

Pooling can be done temporally (over time) or spatially (over features), both of which are described here.

9.1. How do I declare a regular sequence of temporal pools?

In general, an evaluation will require a regular sequence of pools along one or more of the timelines described in What timelines are understood by WRES and how do I constrain them?, namely:

Valid times;
Reference times (of forecasts); and
Lead times (of forecasts).

There is a consistent grammar for declaring a regular sequence of pools along each of these timelines. In each case, the sequence begins at the minimum value and ends at the maximum value associated with the corresponding timeline described in What timelines are understood by WRES and how do I constrain them?. For the same reason, a sequence of pools requires both a constraint on the timeline and the pool sequence itself. For example:

reference_dates:
  minimum: 2023-03-17T00:00:00Z
  maximum: 2023-03-19T19:00:00Z
reference_date_pools:
  period: 13
  unit: hours

In this example, there is a regular sequence of reference time pools. The sequence begins at 2023-03-17T00:00:00Z and ends at 2023-03-19T19:00:00Z, inclusive. Each pool is 13 hours wide and a new pool begins every 13 hours. In other words, the pools are not overlapping, by default. Using interval notation, the above declaration would produce the following sequence of pools where ( means that the lower boundary is excluded and ] means that the upper boundary is included:

Pool rp1: (2023-03-17T00:00:00Z, 2023-03-17T13:00:00Z]
Pool rp2: (2023-03-17T13:00:00Z, 2023-03-18T02:00:00Z]
Pool rp3: (2023-03-18T02:00:00Z, 2023-03-18T15:00:00Z]
Pool rp4: (2023-03-18T15:00:00Z, 2023-03-19T04:00:00Z]
Pool rp5: (2023-03-19T04:00:00Z, 2023-03-19T17:00:00Z]

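Note that there is no pool rp6 because a pool cannot partially overlap the minimum or maximum dates on the timeline: the next pool in the sequence, (2023-03-19T17:00:00Z, 2023-03-20T06:00:00Z], would extend beyond the maximum of 2023-03-19T19:00:00Z.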
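
The construction of this sequence can be sketched in Python. The function name regular_pools is hypothetical and the sketch only mirrors the rules stated above (pools abut, and a pool that would extend beyond the maximum is dropped); it is not the WRES implementation.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def regular_pools(minimum: datetime, maximum: datetime,
                  period: timedelta) -> List[Tuple[datetime, datetime]]:
    """Return (lower, upper) pool boundaries; lower is exclusive, upper is inclusive."""
    pools = []
    lower = minimum
    # Stop as soon as the next pool would extend beyond the maximum
    while lower + period <= maximum:
        pools.append((lower, lower + period))
        lower += period  # the next pool begins where this one ends
    return pools


pools = regular_pools(datetime(2023, 3, 17, 0),
                      datetime(2023, 3, 19, 19),
                      timedelta(hours=13))

for number, (lower, upper) in enumerate(pools, start=1):
    print(f"Pool rp{number}: ({lower:%Y-%m-%dT%H:%M:%SZ}, {upper:%Y-%m-%dT%H:%M:%SZ}]")
```

Running the sketch prints the five pools listed above, rp1 through rp5.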

If we assume that four separate forecasts were issued, beginning at 2023-03-17T12:00:00Z and repeating every 12 hours, then the timeline may be visualized as follows, where fc is a forecast whose reference time is denoted 0 and rp is a reference date pool:

                        fc1: 0  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v

                                    fc2: 0  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v

                                                fc3: 0  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v

                                                            fc4: 0  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v

    time: ─┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼───
          16th  17th  17th  17th  17th  18th  18th  18th  18th  19th  19th  19th  19th  20th  20th  20th  20th  21st
          18Z   00Z   06Z   12Z   18Z   00Z   06Z   12Z   18Z   00Z   06Z   12Z   18Z   00Z   06Z   12Z   18Z   00Z

    boundaries:  ├                                                                  ┤

            rp1: └────────────┘       rp3: └────────────┘       rp5: └────────────┘

                         rp2: └────────────┘       rp4: └────────────┘

In this example, fc1 would fall in pool rp1, fc2 would fall in pool rp2, and so on. Pool rp5 would contain no data because there are no reference times that fall within it.

A regular sequence of valid time pools or lead time pools may be declared in a similar way. For example, the equivalent pools by valid time are:

valid_dates:
  minimum: 2023-03-17T00:00:00Z
  maximum: 2023-03-19T19:00:00Z
valid_date_pools:
  period: 13
  unit: hours

A similar sequence of lead time pools may be declared as follows:

lead_times:
  minimum: 0
  maximum: 44
  unit: hours
lead_time_pools:
  period: 13
  unit: hours

When evaluating a recent historical period, it may be simpler to visualize a sequence that moves backwards in time from the maximum (e.g., "today"), rather than counting forwards in time from the minimum. This is achieved with the reverse: true flag, as follows:

reference_dates:
  minimum: 2024-03-01T00:00:00Z
  maximum: 2025-03-31T00:00:00Z
reference_date_pools:
  period: 60
  unit: days
  reverse: true

In this case, the first pool will end on 2025-03-31T00:00:00Z and will span a period of 60 days prior to this datetime, with the remaining pools counting backwards in time every 60 days.
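
A Python sketch of the reversed sequence follows. The function name reverse_pools is hypothetical, and the sketch assumes that a pool extending before the minimum is dropped, by analogy with the forward sequence, where a pool cannot partially overlap the timeline boundaries.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def reverse_pools(minimum: datetime, maximum: datetime,
                  period: timedelta) -> List[Tuple[datetime, datetime]]:
    """Return (lower, upper) pool boundaries, counting backwards from the maximum."""
    pools = []
    upper = maximum
    # Assumption: stop when the next pool would begin before the minimum
    while upper - period >= minimum:
        pools.append((upper - period, upper))
        upper -= period
    return pools


pools = reverse_pools(datetime(2024, 3, 1),
                      datetime(2025, 3, 31),
                      timedelta(days=60))

first_pool = pools[0]      # ends on the maximum, 2025-03-31T00:00:00Z
earliest_pool = pools[-1]  # the pool furthest back in time
```

Under these assumptions, the declaration above would produce six pools, the first spanning 2025-01-30T00:00:00Z to 2025-03-31T00:00:00Z.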

9.2. Can I declare a regular sequence of temporal pools that overlap each other?

Yes, pools may overlap or underlap each other; in other words, the pool boundaries may not abut perfectly. This is achieved by declaring a frequency, which operates alongside the period. For example:

reference_dates:
  minimum: 2023-03-17T00:00:00Z
  maximum: 2023-03-19T19:00:00Z
reference_date_pools:
  period: 13
  frequency: 7
  unit: hours

In this case, a new reference time pool will begin every 7 hours and each pool will be 13 hours wide. To continue the above example and visualization:

                        fc1: 0  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v

                                    fc2: 0  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v

                                                fc3: 0  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v

                                                            fc4: 0  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v  v

    time: ─┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼───
          16th  17th  17th  17th  17th  18th  18th  18th  18th  19th  19th  19th  19th  20th  20th  20th  20th  21st
          18Z   00Z   06Z   12Z   18Z   00Z   06Z   12Z   18Z   00Z   06Z   12Z   18Z   00Z   06Z   12Z   18Z   00Z

    boundaries:  ├                                                                  ┤

            rp1: └────────────┘  rp4: └────────────┘  rp7: └────────────┘      

                   rp2: └────────────┘  rp5: └────────────┘  rp8: └────────────┘        

                          rp3: └────────────┘  rp6: └────────────┘

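Here, pools rp1 through rp7 each contain at least one forecast and pool rp8 contains no forecasts. Pool rp6, which spans (2023-03-18T11:00:00Z, 2023-03-19T00:00:00Z], contains both fc3 and fc4 because the reference time of fc4 falls exactly on its upper boundary, which is included.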
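
The assignment of forecasts to overlapping pools can be checked with a short Python sketch. The function name overlapping_pools is hypothetical, and the forecast reference times are read from the visualization above (12Z on the 17th, then every 12 hours). A reference time belongs to a pool when it is later than the lower boundary and no later than the upper boundary.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def overlapping_pools(minimum: datetime, maximum: datetime,
                      period: timedelta,
                      frequency: timedelta) -> List[Tuple[datetime, datetime]]:
    """Return pools that are one period wide and begin once per frequency."""
    pools = []
    lower = minimum
    while lower + period <= maximum:
        pools.append((lower, lower + period))
        lower += frequency  # the next pool begins one frequency later
    return pools


# Reference times as drawn in the visualization above
forecasts = {
    "fc1": datetime(2023, 3, 17, 12),
    "fc2": datetime(2023, 3, 18, 0),
    "fc3": datetime(2023, 3, 18, 12),
    "fc4": datetime(2023, 3, 19, 0),
}

pools = overlapping_pools(datetime(2023, 3, 17, 0), datetime(2023, 3, 19, 19),
                          timedelta(hours=13), timedelta(hours=7))

# Lower boundary exclusive, upper boundary inclusive
members: Dict[str, List[str]] = {
    f"rp{number}": [name for name, time in forecasts.items() if lower < time <= upper]
    for number, (lower, upper) in enumerate(pools, start=1)
}
```

Because the upper boundary is inclusive, a reference time that falls exactly on it (such as fc4 at the end of rp6) is counted in that pool as well as in any later pool that spans it.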

9.3. Can I declare an irregular sequence of temporal pools?

Yes, an explicit or irregular sequence of pools may be declared using time_pools. These pools can be declared instead of, or in addition to, a regular sequence. For example, the following declaration contains a regular sequence of lead_time_pools that span 0 to 120 hours, every 6 hours, as well as an explicit pool that spans 0 to 5 days.

lead_time_pools:
  period: 6
  unit: hours
lead_times:
  minimum: 0
  maximum: 120
  unit: hours
time_pools:
  - lead_times:
      minimum: 0
      maximum: 5
      unit: days

When declaring explicit time_pools, any of the lead_times, valid_dates or reference_dates may be declared. In the following example, there are two explicit time pools. The first pool considers valid_dates between 1995-03-18T00:00:00Z and 1995-03-21T00:00:00Z and the second pool considers valid_dates between 1995-03-21T00:00:00Z and 1995-03-27T00:00:00Z, as well as reference_dates between 1995-03-21T06:00:00Z and 1995-03-29T06:00:00Z. Each list item, denoted by a -, begins a new pool and each pool may contain up to three time dimensions, as noted above.

time_pools:
  - valid_dates:
      minimum: 1995-03-18T00:00:00Z
      maximum: 1995-03-21T00:00:00Z
  - valid_dates:
      minimum: 1995-03-21T00:00:00Z
      maximum: 1995-03-27T00:00:00Z
    reference_dates:
      minimum: 1995-03-21T06:00:00Z
      maximum: 1995-03-29T06:00:00Z

For consistency with pools that are declared in a regular sequence, the minimum is always exclusive, whereas the maximum is inclusive. This allows for pools to overlap on a common boundary, which is more intuitive to declare, without the same time-series events falling into two separate pools, which is generally unintended/undesirable. However, in all contexts other than time_pools, the minimum and maximum values are inclusive.

9.4. Can I declare more than one sequence of pools in an evaluation?

Yes, more than one sequence of pools may be declared as a list. For example, consider the following:

reference_dates:
  minimum: 2024-03-01T00:00:00Z
  maximum: 2025-03-31T00:00:00Z
reference_date_pools:
  - period: 30
    unit: days
  - period: 60
    unit: days
  - period: 90
    unit: days

In this case, the evaluation will include three separate sequences of reference_date_pools. The first sequence will begin at a reference datetime of 2024-03-01T00:00:00Z and will span a period of 30 days, repeating every 30 days. The second sequence will begin on 2024-03-01T00:00:00Z and will span a period of 60 days, repeating every 60 days. Finally, the third sequence will begin on 2024-03-01T00:00:00Z and will span a period of 90 days, repeating every 90 days.

9.5. How can I pool spatially, over the geographic features of an evaluation?

An evaluation answers a question (e.g., about forecast quality). When that question is concerned with a geographic area or region, it may be appropriate to gather and pool together data from several geographic features. However, there will be cases where pooling over geographic features is inappropriate, such as when evaluating land surface variables, where evaluation results may vary significantly between features.

The why and how of pooling over geographic features is described in Pooling geographic features.

10. How do I declare the desired timescale (e.g., accumulation period)?

The desired timescale associated with the evaluation is declarative, which means that it may be different than the timescale of the existing datasets. However, the WRES currently only supports limited forms of “upscaling” (increasing the timescale of existing datasets) and does not support “downscaling” (reducing the timescale of existing datasets). More information about the timescale and rescaling can be found here: Time Scale and Rescaling Time Series.

10.1. How do I declare a fixed timescale?

A fixed timescale contains three elements, namely:

The period, which is the number of time units to which the value applies;
The time unit associated with the period. Supported values include:
- seconds;
- minutes;
- hours; and
- days; and
The function, which describes how the value is distributed over the period. Supported values include:
- mean;
- minimum;
- maximum; and
- total.

For example, to declare a desired timescale that represents a mean value over a 6-hour period, use the following:

time_scale:
  function: mean
  period: 6
  unit: hours

10.2. Can I declare a timescale that spans certain dates?

Yes. The desired timescale can span an explicit period that begins or ends on a particular date or an implicit (and potentially varying) period that begins and ends on nominated dates. For example, to declare a timescale that represents a maximum value that occurs between 0Z on 1 April and the instant before 0Z on 1 August (i.e., the end of 31 July), declare the following:

time_scale:
  function: maximum
  minimum_day: 1
  minimum_month: 4
  maximum_day: 31 
  maximum_month: 7

More information and examples can be found here: Time Scale and Rescaling Time Series.

11. How do I declare the metrics to evaluate?

11.1. Why do I need to declare the metrics to evaluate?

In principle, you don’t. Recall the simplest possible evaluation described in What is the simplest possible evaluation I can declare?:

observed: observations.csv
predicted: predictions.csv

When no metrics are declared explicitly, the software will read the time-series data and evaluate all metrics that are appropriate for the types of data discovered. For example, if one of the data sources contains ensemble forecasts, then the software will include all metrics that are appropriate for ensemble forecasts.

11.2. I only want to calculate a few metrics. How can I do that?

While the metrics can be chosen by the software, it is often desirable to calculate only a subset of the metrics that are technically valid for a given type of data. A list of metrics may be declared as follows:

metrics:
  - sample size
  - mean error
  - mean square error

The list of supported metrics is provided here: List of metrics available.

11.3. Do any of the metrics have parameters that I can declare?

In rare cases, it may be necessary to declare parameter values for some metrics. For example, if graphics formats are required for some metrics and not others, you can indicate that specific graphics formats should be omitted for some metrics:

metrics:
  - sample size
  - mean error
  - name: ensemble quantile quantile diagram
    png: false
    svg: false
  - mean square error

In this example, the png and svg graphics formats would be omitted for the ensemble quantile quantile diagram. Note that, in order to distinguish the metric name from the parameter values, the name key is now declared explicitly for the ensemble quantile quantile diagram, but is not required for the other metrics, as they do not have parameters.

The list of currently supported parameter values is tabulated below.

Parameter	Applicable metrics	Purpose	Example in context
`png`	All.	A flag that allows for Portable Network Graphics (PNG) to be turned on (`true`) or off (`false`).	metrics: [{name: ensemble quantile quantile diagram, png: false}]
`svg`	All.	A flag that allows for Scalable Vector Graphics (SVG) to be turned on (`true`) or off (`false`).	metrics: [{name: ensemble quantile quantile diagram, svg: false}]
`thresholds`	All.	Allows `thresholds` to be declared for a specific metric (rather than all metrics). To ensure that the metric is computed for the superset of pairs or "all data" only, and not for any other declared thresholds, you may use `thresholds: all data`.	metrics: [{name: mean error, thresholds: [10.0, 20.0, 30.0]}]
`probability_thresholds`	All.	Allows `probability_thresholds` to be declared for a specific metric (rather than all metrics).	metrics: [{name: mean error, probability_thresholds: [0.1, 0.2, 0.3]}]
`classifier_thresholds`	All dichotomous metrics (e.g., `probability of detection`).	Allows `classifier_thresholds` to be declared for a specific, dichotomous metric (rather than all dichotomous metrics).	metrics: [{name: probability of detection, classifier_thresholds: [0.1, 0.2, 0.3]}]
`ensemble_average`	All single-valued metrics as they relate to ensemble forecasts (e.g., `mean error`).	A function to use when deriving a single value from an ensemble of values. For example, to calculate the ensemble mean, the `ensemble_average` should be `mean`. The supported values are `mean` and `median`.	metrics: [{name: mean error, ensemble_average: mean}]
`summary_statistics`	All time-series metrics (e.g., `time to peak error`).	A collection of summary statistics to calculate from the distribution of time-series errors. For example, when calculating the `time to peak error`, there is one error value for each forecast and hence a distribution of errors across all forecasts. When declaring the `median` in this context, the median time to peak error will be reported alongside the distribution of errors. The supported values are `mean`, `median`, `minimum`, `maximum`, `mean absolute` and `standard deviation`.	metrics: [{name: time to peak error, summary_statistics: [median, minimum, maximum, mean absolute, mean, standard deviation]}]

12. How do I declare summary statistics?

Summary statistics can be used to describe or summarize a broader collection of evaluation statistics, such as the statistics associated with all geographic features in an evaluation. Further information about summary statistics is available here: Evaluation summary statistics.

Summary statistics are declared as a list of summary_statistics. For example:

summary_statistics:
  - mean
  - standard deviation

By default, summary statistics are calculated across all geographic features. Optionally, the dimensions to summarize may be declared explicitly. For example:

summary_statistics:
  statistics: 
    - mean
    - standard deviation
  dimensions:
    - features
    - feature groups

In this example, the features option indicates that summary statistics should be calculated for all geographic features within the evaluation. These features may be declared explicitly, either as features or using a feature_service with one or more group whose pool option is set to “false”, or they may be declared implicitly, with sources that contain time-series data for named features. In addition, the feature groups option indicates that summary statistics should be calculated for each geographic feature group separately. These feature groups may be declared as feature_groups or using a feature_service with one or more group whose pool option is set to “true”. When declaring summary statistics for feature groups, one or more feature groups must also be declared.

A few of the summary statistics support additional parameters, notably the quantiles and the histogram. In that case, the statistic name must be qualified separately from the parameters. For example:

summary_statistics:
  statistics:
    - mean
    - median
    - minimum
    - maximum
    - standard deviation
    - mean absolute
    - name: quantiles
      probabilities: [0.05,0.5,0.95]
    - name: histogram
      bins: 5
    - box plot

The default probabilities associated with the quantiles are 0.1, 0.5, and 0.9. The default number of bins in the histogram is 10.
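
Because defaults exist for both parameters, it seems reasonable to assume that the quantiles and histogram can also be declared by name alone, in the same way as the other statistics. The following sketch relies on that assumption, which is not confirmed here, and should be checked against the schema before use:

```yaml
summary_statistics:
  statistics:
    # Assumed to use the default probabilities: 0.1, 0.5 and 0.9
    - quantiles
    # Assumed to use the default number of bins: 10
    - histogram
  dimensions:
    - features
```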

13. How do I ask for sampling uncertainties to be estimated?

The sampling uncertainties may be estimated using a resampling technique, known as the “stationary bootstrap”. The declaration requires a sample_size and a list of quantiles to estimate. For example:

sampling_uncertainty:
  sample_size: 1000
  quantiles: [0.05,0.95]

Care should be taken in choosing the sample_size because each additional sample requires that the pairs are resampled for every pool and the statistics recalculated each time, which is computationally expensive.

See Sampling uncertainty assessment for more details.
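
As a hedged illustration of the cost trade-off, the sketch below places the option in the context of a minimal declaration and uses a smaller sample_size than the example above, which might suit a trial run. The file names and the value of 100 are illustrative assumptions, not recommendations:

```yaml
observed: some_observations.csv
predicted: some_predictions.csv
sampling_uncertainty:
  # Fewer resamples run faster, but estimate the quantiles less precisely
  sample_size: 100
  # The 0.05 and 0.95 quantiles bound a central 90% interval
  quantiles: [0.05,0.95]
```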

14. How do I declare output formats to write?

The statistics output formats are declared by listing them. For example:

output_formats:
  - csv2
  - pairs
  - png

When no output_formats are declared, the software will write the csv2 format, by default. For example, when considering the simplest possible evaluation described in What is the simplest possible evaluation I can declare?, no output_formats are declared and csv2 will be written.

The supported statistics formats include:

png: Portable Network Graphics (PNG);
svg: Scalable Vector Graphics (SVG);
csv2: Comma separated values with a single file per evaluation (see Output Format Description for CSV2 for more information);
netcdf2: Network Common Data Form (NetCDF);
protobuf: Protocol buffers. An efficient binary format that produces one file per evaluation.

The following statistics formats are supported (for now), but are deprecated for removal and should be avoided:

csv: comma separated values; and
netcdf: Network Common Data Form (NetCDF).

In addition, to help with tracing statistics to the paired values that produced them, the following is supported:

pairs: Comma separated values of the paired time-series data from which statistics were produced (which are gzipped, by default).

Some of these formats support additional parameters, as follows:

Parameter	Applicable formats	Purpose	Example in context
`width`	All graphics formats (e.g., `png`).	An integer value (greater than 0) that prescribes the width of the graphics to produce.	output_formats: - format: png width: 800
`height`	All graphics formats (e.g., `png`).	An integer value (greater than 0) that prescribes the height of the graphics to produce.	output_formats: - format: png height: 600
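
The two examples in the table are flattened onto a single line. A minimal sketch of how they might be laid out and combined is shown below. It assumes that width and height can be declared together for one format and that plain format names can be mixed with parameterized entries in the same list; neither assumption is confirmed by the table:

```yaml
output_formats:
  - csv2
  # Parameterized entry: the format name is qualified with the format key
  - format: png
    width: 800
    height: 600
```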

15. Are there any other options?

Yes, there are several other options for filtering or transforming data or otherwise refining the evaluation. These are listed below:

Option	Purpose	Example in context
`pair_frequency`	By default, all paired values are included. However, this option allows for paired values to be included only at a prescribed frequency, such as every 12 hours.	observed: some_observations.csv predicted: some_predictions.csv pair_frequency: period: 12 unit: hours
`cross_pair`	When calculating skill scores, all paired values are used by default. This can be misleading when the (`observed`, `predicted`) pairs contain many more or fewer pairs than the (`observed`, `baseline`) pairs. In order to mitigate this, cross pairing is supported. When using cross-pairing, only those pairs whose valid times appear in both sets of pairs will be included. In addition, the treatment of forecast reference times is prescribed by an option. The available options are: - `exact`: Only admit those pairs whose forecast reference times appear in both sets of pairs; and - `fuzzy`: Choose the nearest forecast reference times in both sets of pairs and discard any others. In all cases, the resulting skill score statistics will always use the same number of (`observed`, `predicted`) pairs and (`observed`, `baseline`) pairs. In addition, when using `exact` cross-pairing, the valid times and reference times are both guaranteed to match exactly.	observed: some_observations.csv predicted: some_predictions.csv baseline: some_more_predictions.csv cross_pair: exact
`minimum_sample_size`	An integer greater than zero that identifies the minimum sample size for which a statistic will be included. For continuous measures, this is the number of pairs. For dichotomous measures, it is the smaller of the number of occurrences and non-occurrences of the dichotomous event. If a statistic was computed from a smaller sample size than the `minimum_sample_size`, it will be discarded.	observed: some_observations.csv predicted: some_predictions.csv minimum_sample_size: 30
`decimal_format`	The decimal format to use when writing statistics to numeric formats. It also controls the format of tick labels for time-based domain axes in generated graphics.	observed: some_observations.csv predicted: some_predictions.csv decimal_format: '#0.000000'
`duration_format`	The duration format to use when writing statistics to numeric formats. It also controls the units of time-based domain axes in generated graphics. The supported values include: - `seconds` - `minutes` - `hours` - `days`	observed: some_observations.csv predicted: some_predictions.csv duration_format: hours
`combined_graphics`	A boolean value (`true` or `false`) that controls whether the statistics for the `predicted` and `baseline` scenarios are plotted together (`true`) or separately (`false`). To plot the `predicted` and `baseline` statistics together, the evaluation must include separate statistics for the baseline, which requires a `baseline` dataset with `separate_metrics: true`, and a valid graphics format among the `output_formats` (e.g., `png`). Combined graphics will not be generated for any box plots. Also, when the evaluation includes skill scores that support a default reference prediction, such as climatology (e.g., the `brier skill score`), combined graphics will not be generated for these skill scores as the `baseline` statistics will use the default reference, which is not readily comparable to the skill score for the `predicted` dataset (which uses the `baseline` as a reference).	observed: some_observations.csv predicted: some_predictions.csv baseline: sources: some_baseline_predictions.csv separate_metrics: true combined_graphics: true output_formats: - png
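
The examples in the table are flattened onto a single line. As a sketch, the combined_graphics and pair_frequency examples might be laid out as follows. The indentation is reconstructed on the assumption that sources and separate_metrics are nested under baseline and that period and unit are nested under pair_frequency; the two options are combined in one declaration only for illustration:

```yaml
observed: some_observations.csv
predicted: some_predictions.csv
baseline:
  sources: some_baseline_predictions.csv
  # Required so that separate statistics are computed for the baseline
  separate_metrics: true
# Only include paired values every 12 hours
pair_frequency:
  period: 12
  unit: hours
combined_graphics: true
output_formats:
  # Combined graphics need at least one graphics format
  - png
```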

16. Do you have some examples of complete declarations?

Yes, examples of complete declarations can be found on a separate wiki page, Complete Examples of Evaluation Declarations.

17. Does the declaration language use a schema?

Yes, the declaration language uses a schema, which defines the superset of declarations that the WRES could accept. The schema uses the JSON schema language:

https://json-schema.org/

The latest version of the schema is available in the code repository:

https://github.com/NOAA-OWP/wres/blob/master/wres-config/nonsrc/schema.yml

However, the schema is relatively permissive. In other words, there are some evaluations that are permitted by the schema that are not permitted by the WRES software itself. Indeed, a schema is best suited for simple validation. More comprehensive validation is performed by the software itself, once the declaration has been validated against the schema.

In practice, you may notice this when reading feedback from the software about validation failures. The earliest failures will occur when the declaration is inconsistent with the schema. The feedback that results from these failures can be more abstract or less human-readable because it may list a cascade of failures, in which case you should generally look for the simplest or most understandable among them. In other cases, the failure will be straightforward. For example, a declaration like this:

observed: some_observations.csv
predicted: some_forecasts.csv
foo: bar.csv

will produce an error like this, because the foo key is not part of the schema and the schema does not permit additional properties:

wres.config.yaml.DeclarationException: When comparing the declared evaluation to the schema, encountered 1 errors, which must be fixed. Hint: some of these errors may have the same origin, so look for the most precise/informative error(s) among them. The errors are:
    - $.foo: is not defined in the schema and the schema does not allow additional properties

18. What does this error really mean?

You will sometimes encounter warnings or errors that relate to your declaration. For example, if an error is wrapped in a DeclarationException, the problem will originate from your declaration. These errors arise because the declaration is invalid for some reason. There are three main reasons why a declaration could be invalid:

- The declaration is not a valid YAML document. You can test whether your declaration is a valid YAML document using an online tool, such as: https://www.yamllint.com/
- The declaration contains options that are not understood or allowed by WRES (specifically, they are not consistent with the declaration schema, as described in Does the declaration language use a schema?). For example, if you include options that are misspelled or options that fall outside valid bounds, such as probabilities that fall outside [0,1], you can expect an error; or
- The declaration contains options that are disallowed by WRES in combination with other options. For example, if you add an ensemble-like metric and declare that none of the data types are ensemble-like, then you can expect an error.

In general, any warning or error messages should be straightforward and intuitive, indicating what you should do to fix them (or, in the case of warnings, what you should consider about the options you chose). Furthermore, if there are multiple warnings or errors, they should all be listed at once. For example, consider the following invalid declaration:

observed: some_observations.csv
predicted: some_predictions.csv
lead_time_pools:
  period: 13
  unit: hours
metrics:
  - probability of detection

This declaration produces the following errors:

wres.config.yaml.DeclarationException: Encountered 2 error(s) in the declared evaluation, which must be fixed:
    - The declaration included 'lead_time_pools', which requires the 'lead_times' to be fully declared. Please remove the 'lead_time_pools' or fully declare the 'lead_times' and try again.
    - The declaration includes metrics that require either 'thresholds' or 'probability_thresholds' but none were found. Please remove the following metrics or add the required thresholds and try again: [PROBABILITY OF DETECTION].
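
One way to resolve both errors is sketched below. The sub-keys of lead_times and the form of thresholds are assumptions based on the options named in the error messages, and the values are purely illustrative, so the sections on filtering by time and on declaring thresholds should be consulted for the authoritative syntax:

```yaml
observed: some_observations.csv
predicted: some_predictions.csv
# Fully declare the lead times so that the pools can be generated
# (assumed keys: minimum, maximum, unit)
lead_times:
  minimum: 0
  maximum: 39
  unit: hours
lead_time_pools:
  period: 13
  unit: hours
# Supply the thresholds that the dichotomous metric requires
# (illustrative value, in the units of the evaluation)
thresholds: [0.5]
metrics:
  - probability of detection
```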

If the errors are not intuitive, you should create a ticket asking for more clarity and we will explain the failure and improve the error message. However, errors that fall within the first two categories are delegated to other tools and are not, therefore, fully within our control. For instance, when your declaration fails validation against the schema, you may be presented with a cascade of errors that are not immediately intuitive. Consider the following invalid declaration:

observed: some_observations.csv
predicted: some_predictions.csv
metrics:
  - some metric

Since some metric is not an expected metric, this declaration will produce an error. However, the evaluation actually produces a cascade of errors, which occur because the metrics declaration is invalid against any known (sub)schema within the overall schema:

wres.config.yaml.DeclarationException: When comparing the declared evaluation to the schema, encountered 5 errors, which must be fixed. Hint: some of these errors may have the same origin, so look for the most precise/informative error(s) among them. The errors are:
    - $.metrics[0]: does not have a value in the enumeration [box plot of errors by observed value, box plot of errors by forecast value, brier score, brier skill score, contingency table, continuous ranked probability score, continuous ranked probability skill score, ensemble quantile quantile diagram, maximum, mean, minimum, rank histogram, relative operating characteristic diagram, relative operating characteristic score, reliability diagram, sample size, standard deviation]
    - $.metrics[0]: does not have a value in the enumeration [bias fraction, box plot of errors, box plot of percentage errors, coefficient of determination, pearson correlation coefficient, index of agreement, kling gupta efficiency, mean absolute error, mean error, mean square error, mean square error skill score, mean square error skill score normalized, median error, quantile quantile diagram, root mean square error, root mean square error normalized, sample size, sum of square error, volumetric efficiency, mean absolute error skill score]
    - $.metrics[0]: does not have a value in the enumeration [contingency table, threat score, equitable threat score, frequency bias, probability of detection, probability of false detection, false alarm ratio, peirce skill score]
    - $.metrics[0]: string found, object expected
    - $.metrics[0]: does not have a value in the enumeration [time to peak relative error, time to peak error]

This cascade of errors is somewhat unintuitive but, at the time of writing, it cannot be improved easily. As suggested in the Hint, you should look for the most precise and informative error among the cascade. In this case, it should be reasonably clear that the metric in position “[0]” (meaning the first metric) is not a name that occurs within any known enumeration. As the schema includes several metric groups, each with a separate enumeration, this error is reported with respect to each group.
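
As a minimal sketch of the fix, replacing the unrecognized name with one that appears in the enumerations listed in the errors, such as mean error, should allow the declaration to pass this part of the schema validation (the choice of metric is illustrative):

```yaml
observed: some_observations.csv
predicted: some_predictions.csv
metrics:
  # 'mean error' appears in the second enumeration reported above
  - mean error
```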