Model Credibility in the Wild (GaloisInc/ASKEE GitHub Wiki)
Reading List from Joshua
Understanding Ensemble Models and COVID-19

Ensemble Forecasts of Coronavirus Disease 2019
 For a location (at the level of a state or all of the USA) and 4 prediction horizons (1-, 2-, 3-, and 4-week predictions):
 Each team/model submitted, weekly, a median predicted cumulative death count and 11 prediction intervals at levels ranging from 10% to 98%
 Ensemble achieved by averaging the prediction-interval end points for each prediction level and location
 I assume they also averaged the median predicted deaths
 Variable number of models per location, meaning that a prediction for a particular location may be based on only 2 models
 Variable number of models in the course of the paper (6–20, depending on when)
 No evaluation at the level of the individual models
 They didn't account for the number of individual models in their evaluation (other than the mean absolute error, which is divided by the number of models)
 For example, what if the ensemble accuracy came from a single model (only one of its components)? What if some of the individual models were only good at predicting a particular location?
 Each model used whatever approach or dataset they deemed appropriate
 I would think that this would lead to different model strengths
 The acceptance criteria for models in their ensemble are very basic (must include 1–4-week predictions, and cumulative deaths can't decrease):
 A forecast had to include all four week-ahead horizons,
 The one-week-ahead forecast for cumulative deaths should not assign probability of more than 0.1 to a reduction in cumulative deaths relative to already reported deaths, and
 At each quantile level, predictions should be nondecreasing over the four prediction horizons
 Note on maximum number of new deaths reported per week
 I think what they mean is max(number of new deaths reported in any week up through the week ending July 25)
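The averaging and screening steps above can be sketched in Python. The function names and the toy quantile grids are illustrative assumptions, not the forecast hub's actual schema:

```python
import numpy as np

def ensemble_quantiles(model_quantiles):
    """Average each quantile (interval end point) across models, level by
    level, to form the ensemble forecast for one location and horizon.

    model_quantiles: dict of model name -> sequence of quantile values,
    all models reporting on the same shared grid of quantile levels.
    """
    stacked = np.stack([np.asarray(q, dtype=float)
                        for q in model_quantiles.values()])  # (n_models, n_levels)
    return stacked.mean(axis=0)

def monotone_across_horizons(quantiles_by_horizon):
    """One of the acceptance checks: at each quantile level, predicted
    cumulative deaths must be nondecreasing over the prediction horizons."""
    q = np.asarray(quantiles_by_horizon, dtype=float)  # (n_horizons, n_levels)
    return bool(np.all(np.diff(q, axis=0) >= 0))
```

For example, two models predicting quantiles `[1, 2, 3]` and `[3, 4, 5]` for the same location/horizon yield the ensemble `[2, 3, 4]`, and a forecast whose 1-week-ahead quantiles exceed its 2-week-ahead quantiles fails the monotonicity screen.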

Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US
 The 2nd ensemble model paper
Metrics for measuring model skill
Given a set of models and different predictions (e.g. hospitalization and mortality rates), how can we measure and compare different models?
 IS and WIS are the Interval Score and Weighted Interval Score, respectively
 WIS summarizes accuracy across the entire predictive distribution as a particular linear combination of K interval scores. It is a proper score that combines a set of interval scores for probabilistic forecasts expressed as quantiles of the predictive distribution
 Metric explanation slide and python implementation
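A minimal Python sketch of these two metrics, following the standard WIS definition (weights w_k = alpha_k / 2 for each interval and w_0 = 1/2 for the median term); the function names and the example values are my own:

```python
import numpy as np

def interval_score(y, lower, upper, alpha):
    """IS_alpha for a central (1 - alpha) * 100% prediction interval
    [lower, upper] and observation y: interval width plus a 2/alpha
    penalty for missing on either side."""
    width = upper - lower
    below = (2.0 / alpha) * np.maximum(lower - y, 0.0)
    above = (2.0 / alpha) * np.maximum(y - upper, 0.0)
    return width + below + above

def weighted_interval_score(y, median, lowers, uppers, alphas):
    """WIS over K intervals: weighted sum of the K interval scores and the
    absolute error of the median, normalized by K + 1/2."""
    K = len(alphas)
    total = 0.5 * abs(y - median)
    for lo, up, a in zip(lowers, uppers, alphas):
        total += (a / 2.0) * interval_score(y, lo, up, a)
    return total / (K + 0.5)
```

For instance, with a single 80% interval (alpha = 0.2) of [8, 12] and observation y = 10, the interval score is just the width, 4; an observation outside the interval adds the 2/alpha miss penalty on top of that.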
Data from the 2nd paper:
MechBayes model
 UMassMechBayes model paper
 One of the top 5 individual models in the paper
 Repository
 Model implementation
 Model description