Model Credibility in the Wild (GaloisInc/ASKEE GitHub Wiki)
Reading List from Joshua
Understanding Ensemble Models and COVID-19

Ensemble Forecasts of Coronavirus Disease 2019
 For a location (at the level of a state or all of the USA) and 4 prediction horizons (1-, 2-, 3-, and 4-week predictions):
 Each team/model submitted, weekly, a median predicted cumulative death count and 11 prediction intervals at levels ranging from 10% to 98%
 Ensemble achieved by averaging the prediction-interval end points for each prediction level and location
 I assume they also averaged the median predicted deaths
 Variable number of models per location, meaning that a prediction for a particular location may be based on only 2 models
 Variable number of models in the course of the paper (6–20, depending on when)
 No evaluation at the level of the individual models
 They didn't account for the number of individual models in their evaluation (other than the mean absolute error, which is divided by the number of models)
 For example, what if the ensemble accuracy came from a single model (only one of its components)? What if some of the individual models were only good at predicting a particular location?
 Each model used whatever approach or dataset they deemed appropriate
 I would think that this would lead to different model strengths
 The acceptance criteria for models in their ensemble are very basic (must include 1–4-week predictions, and cumulative deaths can't decrease):
 A forecast had to include all four week-ahead horizons,
 The one-week-ahead forecast for cumulative deaths should not assign probability of more than 0.1 to a reduction in cumulative deaths relative to already reported deaths, and
 At each quantile level, predictions should be nondecreasing over the four prediction horizons
 Note on maximum number of new deaths reported per week
 I think what they mean is max(number of new deaths reported in any week up through the week ending July 25)
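The averaging and screening steps above can be sketched in Python. The function names and the toy quantile grids are illustrative assumptions, not the forecast hub's actual schema:

```python
import numpy as np

def ensemble_quantiles(model_quantiles):
    """Average each quantile (interval end point) across models, level by
    level, to form the ensemble forecast for one location and horizon.

    model_quantiles: dict of model name -> sequence of quantile values,
    all models reporting on the same shared grid of quantile levels.
    """
    stacked = np.stack([np.asarray(q, dtype=float)
                        for q in model_quantiles.values()])  # (n_models, n_levels)
    return stacked.mean(axis=0)

def monotone_across_horizons(quantiles_by_horizon):
    """One of the acceptance checks: at each quantile level, predicted
    cumulative deaths must be nondecreasing over the prediction horizons."""
    q = np.asarray(quantiles_by_horizon, dtype=float)  # (n_horizons, n_levels)
    return bool(np.all(np.diff(q, axis=0) >= 0))
```

For example, two models predicting quantiles `[1, 2, 3]` and `[3, 4, 5]` for the same location/horizon yield the ensemble `[2, 3, 4]`, and a forecast whose 1-week-ahead quantiles exceed its 2-week-ahead quantiles fails the monotonicity screen.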

Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US
 The 2nd ensemble model paper
Metrics for measuring model skill
Given a set of models and different predictions (e.g. hospitalization and mortality rates), how can we measure and compare different models?
 IS and WIS are the Interval Score and Weighted Interval Score, respectively
 WIS summarizes accuracy across the entire predictive distribution as a particular linear combination of K interval scores. It is a proper score that combines a set of interval scores for probabilistic forecasts expressed as quantiles of the predictive distribution
 Metric explanation slide and python implementation
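A minimal Python sketch of these two metrics, following the standard WIS definition (weights w_k = alpha_k / 2 for each interval and w_0 = 1/2 for the median term); the function names and the example values are my own:

```python
import numpy as np

def interval_score(y, lower, upper, alpha):
    """IS_alpha for a central (1 - alpha) * 100% prediction interval
    [lower, upper] and observation y: interval width plus a 2/alpha
    penalty for missing on either side."""
    width = upper - lower
    below = (2.0 / alpha) * np.maximum(lower - y, 0.0)
    above = (2.0 / alpha) * np.maximum(y - upper, 0.0)
    return width + below + above

def weighted_interval_score(y, median, lowers, uppers, alphas):
    """WIS over K intervals: weighted sum of the K interval scores and the
    absolute error of the median, normalized by K + 1/2."""
    K = len(alphas)
    total = 0.5 * abs(y - median)
    for lo, up, a in zip(lowers, uppers, alphas):
        total += (a / 2.0) * interval_score(y, lo, up, a)
    return total / (K + 0.5)
```

For instance, with a single 80% interval (alpha = 0.2) of [8, 12] and observation y = 10, the interval score is just the width, 4; an observation outside the interval adds the 2/alpha miss penalty on top of that.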
Data from the 2nd paper:
MechBayes model
 UMassMechBayes model paper
 One of the top 5 individual models in the paper
 Repository
 Model implementation
 Model description