# Model Credibility in the Wild

Understanding ensemble models and COVID-19

- COVID Data Tracker (CDC)

- Ensemble Forecasts of Coronavirus Disease 2019

  - For a location (at the level of a state or the entire USA) and four prediction horizons (1-, 2-, 3-, and 4-week-ahead predictions):
    - Each week, every team/model submitted a median predicted cumulative death count and 11 prediction intervals, at coverage levels ranging from 10% to 98%
    - The ensemble is built by averaging the prediction-interval endpoints for each prediction level and location (see the first sketch after this list)
    - I assume they also averaged the median predicted deaths
  - Variable number of models per location: a prediction for a particular location may be based on as few as 2 models
  - Variable number of models over the course of the paper (6-20, depending on the date)
  - No evaluation at the level of the individual models
    - They didn't consider the number of individual models in their evaluation (other than the mean absolute error, which is divided by the number of models)
    - For example, what if the ensemble's accuracy came from only one of its component models? What if some of the individual models were only good at predicting a particular location?
  - Each model used whatever approach or dataset its team deemed appropriate
    - I would expect this to lead to different model strengths
  - The acceptance criteria for a model in their ensemble are very basic (it must include 1-4 week predictions, and cumulative deaths can't decrease; see the second sketch after this list):
    - The one-week-ahead forecast for cumulative deaths should not assign probability greater than 0.1 to a reduction in cumulative deaths relative to already-reported deaths, and
    - At each quantile level, predictions should be non-decreasing over the four prediction horizons
  - Note on the maximum number of new deaths reported per week
    - I think what they mean is max(number of deaths reported per week, up through the week ending July 25)
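
To make the averaging step concrete, here is a minimal sketch of quantile-level ensemble averaging as described above. The data layout (one dict per model, mapping quantile levels to predicted cumulative deaths) and all names are hypothetical, not taken from the paper:

```python
import numpy as np

def quantile_average_ensemble(model_forecasts):
    """Average the models' predictions at each quantile level.

    model_forecasts: one dict per model, mapping a quantile level
    (e.g. 0.1, 0.5, 0.9) to that model's predicted cumulative deaths
    for a single location and prediction horizon.
    """
    levels = sorted(model_forecasts[0])
    return {q: float(np.mean([f[q] for f in model_forecasts])) for q in levels}

# Three hypothetical models forecasting the same location and horizon.
forecasts = [
    {0.1: 900, 0.5: 1000, 0.9: 1100},
    {0.1: 950, 0.5: 1050, 0.9: 1200},
    {0.1: 800, 0.5: 980,  0.9: 1150},
]
print(quantile_average_ensemble(forecasts))
# {0.1: 883.3..., 0.5: 1010.0, 0.9: 1150.0}
```

Averaging interval endpoints level by level is the same as averaging the submitted quantiles, which is why a single dict of quantiles per model suffices here.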
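And a sketch of the two screening criteria, again with a hypothetical data layout (quantiles keyed by horizon). It assumes the submitted quantile levels include 0.1, so that "no more than 0.1 probability of a reduction" can be checked against the 0.1 quantile:

```python
def passes_screening(quantiles_by_horizon, reported_deaths):
    """Apply the two inclusion criteria to one location's submission.

    quantiles_by_horizon: {horizon (1..4): {quantile level: value}}
    reported_deaths: cumulative deaths already reported for the location
    """
    # Criterion 1: the 1-week-ahead forecast may assign at most 0.1
    # probability to a decrease, so its 0.1 quantile must not fall
    # below the already-reported cumulative deaths.
    if quantiles_by_horizon[1][0.1] < reported_deaths:
        return False
    # Criterion 2: at each quantile level, predictions must be
    # non-decreasing across the four horizons.
    for q in quantiles_by_horizon[1]:
        values = [quantiles_by_horizon[h][q] for h in (1, 2, 3, 4)]
        if any(later < earlier for earlier, later in zip(values, values[1:])):
            return False
    return True
```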
- COVID-19 Forecast Hub - Ensemble Model

- Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US
  - The 2nd ensemble model paper
- Paper on the interval metric used in the ensemble paper

### Metrics for measuring model skill

Given a set of models producing different kinds of predictions (e.g. hospitalization and mortality rates), how can we measure and compare their skill?

- IS and WIS stand for Interval Score and Weighted Interval Score, respectively
- WIS summarizes accuracy across the entire predictive distribution: it is a proper score computed as a weighted linear combination of K interval scores (plus the absolute error of the median) for probabilistic forecasts that provide quantiles of the predictive distribution (see the sketch below)
- Metric explanation slide and Python implementation
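
For reference, the interval score for a central $(1-\alpha)$ prediction interval $[l, u]$ and observation $y$ is

$$\mathrm{IS}_\alpha(F, y) = (u - l) + \tfrac{2}{\alpha}(l - y)\,\mathbf{1}(y < l) + \tfrac{2}{\alpha}(y - u)\,\mathbf{1}(y > u),$$

and WIS combines $K$ interval scores with the predictive median $m$ using weights $w_k = \alpha_k/2$ and $w_0 = 1/2$:

$$\mathrm{WIS}(F, y) = \frac{1}{K + 1/2}\Big(\tfrac{1}{2}\,|y - m| + \sum_{k=1}^{K} \tfrac{\alpha_k}{2}\,\mathrm{IS}_{\alpha_k}(F, y)\Big)$$

A minimal Python sketch of these standard definitions (function names and example values are mine, not from the linked implementation):

```python
def interval_score(alpha, lower, upper, y):
    """IS_alpha: interval width plus penalties when y falls outside."""
    score = upper - lower
    if y < lower:
        score += (2 / alpha) * (lower - y)
    if y > upper:
        score += (2 / alpha) * (y - upper)
    return score

def weighted_interval_score(alphas, lowers, uppers, median, y):
    """WIS over K central intervals plus the predictive median."""
    total = 0.5 * abs(y - median)
    for alpha, lower, upper in zip(alphas, lowers, uppers):
        total += (alpha / 2) * interval_score(alpha, lower, upper, y)
    return total / (len(alphas) + 0.5)

# One 80% interval (alpha = 0.2) plus the median; observed y = 1150.
print(weighted_interval_score([0.2], [900], [1100], median=1000, y=1150))
# (0.5 * 150 + 0.1 * 700) / 1.5 ≈ 96.67
```

Note how the penalty terms scale with 2/alpha: missing a wide (small-alpha) interval is punished much more heavily than missing a narrow one, which is what makes the combined score sensitive to the whole predictive distribution.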