Model Credibility in the Wild
Reading List from Joshua
Understanding Ensemble models and COVID-19
- Ensemble Forecasts of Coronavirus Disease 2019
- For a location (at the level of state or all of the USA) and 4 prediction horizons (1, 2, 3, 4 week predictions):
- Each team/model submitted a weekly median predicted cumulative death count and 11 prediction intervals, at levels ranging from 10% to 98%
- Ensemble achieved by averaging the prediction interval endpoints for each prediction level and location (see the sketch after this list)
- I assume they also averaged the median predicted deaths
- Variable number of models per location, meaning that a prediction for a particular location may be based on as few as 2 models
- Variable number of models over the course of the study (6-20, depending on the week)
- No evaluation at the level of the individual models
- They didn't consider the number of individual models in their evaluation (other than mean absolute error, which is divided by the number of models)
- For example, what if the ensemble accuracy came from a single model (only one of its components)? What if some of the individual models were only good at predicting a particular location?
- Each model used whatever approach or dataset they deemed appropriate
- I would think that this would lead to different model strengths
- The acceptance criteria for a model's inclusion in their ensemble are very basic (must include 1-4 week predictions and deaths can't be negative):
- A forecast had to include all four week-ahead horizons,
- The one week ahead forecast for cumulative deaths should not assign probability more than 0.1 to a reduction in cumulative deaths relative to already reported deaths, and
- At each quantile level, predictions should be non-decreasing over the four prediction horizons
- Note on maximum number of new deaths reported per week
- I think what they mean is max(number of new deaths reported per week), taken up through the week ending July 25
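A minimal sketch of the ensemble construction and the monotonicity criterion as I read them; the function names, array layout, and use of NumPy are my own assumptions for illustration, not code from the paper.

```python
import numpy as np

def ensemble_quantiles(model_quantiles):
    """Equally weighted ensemble: average each submitted quantile (interval
    endpoints and median) across models.

    model_quantiles: shape (n_models, n_quantile_levels), for a single
    location and prediction horizon; returns shape (n_quantile_levels,).
    """
    return model_quantiles.mean(axis=0)

def non_decreasing_over_horizons(quantiles_by_horizon):
    """Check the third inclusion criterion: at each quantile level, the
    cumulative-death predictions must be non-decreasing over the four
    prediction horizons.

    quantiles_by_horizon: shape (n_horizons, n_quantile_levels).
    """
    return bool(np.all(np.diff(quantiles_by_horizon, axis=0) >= 0))
```

For example, if 3 models each submit a median plus the endpoints of the 11 intervals (23 quantiles in all) for one location and horizon, calling `ensemble_quantiles` on the resulting 3x23 array gives the 23 ensemble quantiles.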
- Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US
- The 2nd ensemble model paper
Metrics for measuring model skill
Given a set of models and different prediction targets (e.g., hospitalization and mortality rates), how can we measure and compare model skill?
- IS and WIS are Interval Scores and Weighted Interval Scores respectively
- The weighted interval score (WIS) is a proper score that summarizes accuracy across the entire predictive distribution: a particular weighted linear combination of K interval scores (one per prediction interval, with the intervals given by quantiles of the predictive distribution) plus the absolute error of the median (sketched below)
- Metric explanation slide and Python implementation
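A hedged Python sketch of IS and WIS, following the standard definitions (interval score with a 2/alpha penalty for observations outside the interval; WIS as the normalized weighted sum of interval scores plus the median's absolute error). Function names and the calling convention here are illustrative assumptions, not the linked implementation.

```python
def interval_score(y, lower, upper, alpha):
    """Interval score of observation y for a central (1 - alpha) prediction
    interval [lower, upper]: interval width plus out-of-interval penalties
    scaled by 2/alpha."""
    width = upper - lower
    penalty_below = (2.0 / alpha) * max(lower - y, 0.0)  # y fell below the interval
    penalty_above = (2.0 / alpha) * max(y - upper, 0.0)  # y fell above the interval
    return width + penalty_below + penalty_above

def weighted_interval_score(y, median, lowers, uppers, alphas):
    """WIS: weighted linear combination of K interval scores plus the absolute
    error of the median, normalized by K + 1/2.

    lowers, uppers, alphas are length-K sequences, one entry per prediction
    interval; e.g. a 90% interval has alpha = 0.1."""
    K = len(alphas)
    total = 0.5 * abs(y - median)  # w0 = 1/2 weight on the median term
    for lo, up, a in zip(lowers, uppers, alphas):
        total += (a / 2.0) * interval_score(y, lo, up, a)  # w_k = alpha_k / 2
    return total / (K + 0.5)
```

With K = 0 (no intervals), WIS reduces to the absolute error of the median; as K grows it approximates the continuous ranked probability score (CRPS), which is why it works as a single-number summary of a quantile forecast.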
Data from the 2nd paper:
MechBayes model
- UMass-MechBayes model paper
- One of the top 5 individual models in the paper
- Repository
- Model implementation
- Model description