Model Credibility in the Wild
Reading List from Joshua
Understanding Ensemble models and COVID-19
- Ensemble Forecasts of Coronavirus Disease 2019
- For a location (at the level of state or all of the USA) and 4 prediction horizons (1, 2, 3, 4 week predictions):
- Each team/model submitted a weekly median predicted cumulative death count and 11 prediction intervals, at levels ranging from 10% to 98%
- Ensemble achieved by averaging the prediction interval endpoints for each prediction level and location (see the sketch after this list)
- I assume they also averaged the median predicted deaths
- Variable number of models per location, meaning that a prediction for a particular location may be based on as few as 2 models
- Variable number of models over the course of the study (6-20, depending on the week)
- No evaluation at the level of the individual models
- They didn't consider the number of individual models in their evaluation (other than mean absolute error, which is divided by the number of models)
- For example, what if the ensemble accuracy came from a single model (only one of its components)? What if some of the individual models were only good at predicting a particular location?
- Each model used whatever approach or dataset they deemed appropriate
- I would think that this would lead to different model strengths
- The acceptance criteria for a model's inclusion in their ensemble are very basic (must include 1-4 week predictions and deaths can't be negative):
- A forecast had to include all four week-ahead horizons,
- The one week ahead forecast for cumulative deaths should not assign probability more than 0.1 to a reduction in cumulative deaths relative to already reported deaths, and
- At each quantile level, predictions should be non-decreasing over the four prediction horizons
- Note on maximum number of new deaths reported per week
- I think what they mean is max(number of new deaths reported per week), taken up through the week ending July 25
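A minimal sketch of the ensemble construction and the monotonicity criterion as I read them; the function names, array layout, and use of NumPy are my own assumptions for illustration, not code from the paper.

```python
import numpy as np

def ensemble_quantiles(model_quantiles):
    """Equally weighted ensemble: average each submitted quantile (interval
    endpoints and median) across models.

    model_quantiles: shape (n_models, n_quantile_levels), for a single
    location and prediction horizon; returns shape (n_quantile_levels,).
    """
    return model_quantiles.mean(axis=0)

def non_decreasing_over_horizons(quantiles_by_horizon):
    """Check the third inclusion criterion: at each quantile level, the
    cumulative-death predictions must be non-decreasing over the four
    prediction horizons.

    quantiles_by_horizon: shape (n_horizons, n_quantile_levels).
    """
    return bool(np.all(np.diff(quantiles_by_horizon, axis=0) >= 0))
```

For example, if 3 models each submit a median plus the endpoints of the 11 intervals (23 quantiles in all) for one location and horizon, calling `ensemble_quantiles` on the resulting 3x23 array gives the 23 ensemble quantiles.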
- Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US
- The 2nd ensemble model paper
Metrics for measuring model skill
Given a set of models and different prediction targets (e.g., hospitalization and mortality rates), how can we measure and compare model skill?
- IS and WIS are Interval Scores and Weighted Interval Scores respectively
- The weighted interval score (WIS) is a proper score that summarizes accuracy across the entire predictive distribution: a particular weighted linear combination of K interval scores (one per prediction interval, with the intervals given by quantiles of the predictive distribution) plus the absolute error of the median (sketched below)
- Metric explanation slide and Python implementation
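A hedged Python sketch of IS and WIS, following the standard definitions (interval score with a 2/alpha penalty for observations outside the interval; WIS as the normalized weighted sum of interval scores plus the median's absolute error). Function names and the calling convention here are illustrative assumptions, not the linked implementation.

```python
def interval_score(y, lower, upper, alpha):
    """Interval score of observation y for a central (1 - alpha) prediction
    interval [lower, upper]: interval width plus out-of-interval penalties
    scaled by 2/alpha."""
    width = upper - lower
    penalty_below = (2.0 / alpha) * max(lower - y, 0.0)  # y fell below the interval
    penalty_above = (2.0 / alpha) * max(y - upper, 0.0)  # y fell above the interval
    return width + penalty_below + penalty_above

def weighted_interval_score(y, median, lowers, uppers, alphas):
    """WIS: weighted linear combination of K interval scores plus the absolute
    error of the median, normalized by K + 1/2.

    lowers, uppers, alphas are length-K sequences, one entry per prediction
    interval; e.g. a 90% interval has alpha = 0.1."""
    K = len(alphas)
    total = 0.5 * abs(y - median)  # w0 = 1/2 weight on the median term
    for lo, up, a in zip(lowers, uppers, alphas):
        total += (a / 2.0) * interval_score(y, lo, up, a)  # w_k = alpha_k / 2
    return total / (K + 0.5)
```

With K = 0 (no intervals), WIS reduces to the absolute error of the median; as K grows it approximates the continuous ranked probability score (CRPS), which is why it works as a single-number summary of a quantile forecast.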
Data from the 2nd paper:
MechBayes model
- UMass-MechBayes model paper
- One of the top 5 individual models in the paper
- Repository
- Model implementation
- Model description