Improving ETA Prediction Accuracy for Long-tail Events - miaow27/ds_learning_notes GitHub Wiki
High-level Summary
- DoorDash is trying to predict estimated arrival time (ETA) for customer food delivery.
- ETA is a heavily right-skewed quantity, yet accurate prediction for those tail events is important for customer satisfaction and brand trust.
- Tail Event
- a tail event is different from an outlier: outliers are much rarer (~1%) and usually impossible to predict beforehand
- tail events are much more frequent (5-10%) and thus worth predicting
- tail events are hard to predict due to a lack of sufficient data and a lack of leading indicators
- How does DoorDash handle this?
- added real-time features
- utilized historical features that were more effective at helping the algorithm learn the sparse patterns around tail events
- used a custom loss function to optimize for prediction accuracy when large deviations occur
Key Take-away
Feature Engineering
feature engineering requires both of the following:
- a deep understanding of the business domain to identify signals that are predictive of the tail events.
- a technical grasp of how to represent this signal in the best way to help the model learn
example
- bucket the marketplace health metric (a continuous variable, 0 - 100) into bins and replace each bin with the average historical delivery time in that bin, to highlight the importance of extremely low marketplace health. (This directly helped the model learn that very supply-constrained market conditions are correlated with very high delivery times — rather than relying on the model to learn those patterns from the relatively sparse data available.)
- average delivery durations over the past 20 minutes at the store level and sub-region level (a real-time feature)
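The bucket-and-encode idea above can be sketched in pandas. The column names, bin edges, and toy data here are hypothetical; the technique is the one described: cut the continuous health score into bins, then replace each bin with the mean historical delivery time observed in it.

```python
import pandas as pd

# Hypothetical training data: a continuous marketplace-health score (0-100)
# and the observed delivery duration in minutes.
df = pd.DataFrame({
    "market_health": [5, 12, 35, 48, 62, 77, 88, 95],
    "delivery_minutes": [74, 61, 42, 39, 33, 30, 28, 27],
})

# Bucket the continuous score into coarse bins (edges chosen for illustration),
# then encode each row with its bin's historical mean delivery time.
df["health_bucket"] = pd.cut(df["market_health"], bins=[0, 25, 50, 75, 100])
df["health_bucket_avg"] = (
    df.groupby("health_bucket", observed=True)["delivery_minutes"]
      .transform("mean")
)
```

In production the bin means would be computed on a historical window and joined at serving time, rather than recomputed from the training frame as done here for brevity.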
Custom Loss Function
- do not use a linear penalty function; instead use a quadratic loss function to penalize tail events more heavily.
- use an asymmetric MSE loss function to further emphasize cases where the predicted value is less than the actual
$$\frac{1}{n}\sum_{i=1}^{n}|\alpha - 1_{g(x_i) - \hat g(x_i)<0}|(g(x_i) - \hat g(x_i))^2$$
- when $\alpha$ = 0.2, overestimates get weight 0.8 and underestimates get weight 0.2.
- when $\alpha$ = 0.5, all cases get the same weight (identical to MSE).
- when $\alpha$ = 0.8, overestimates get weight 0.2 and underestimates get weight 0.8.
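A direct NumPy translation of the formula above (function name is mine; $g(x_i)$ is the actual value, $\hat g(x_i)$ the prediction):

```python
import numpy as np

def asymmetric_mse(actual, pred, alpha=0.8):
    """Asymmetric MSE: mean of |alpha - 1[r < 0]| * r^2, where r = actual - pred.

    Underestimates (r > 0: the delivery took longer than predicted) are
    weighted by alpha; overestimates (r < 0) by 1 - alpha.
    alpha = 0.5 recovers plain MSE up to a factor of 0.5.
    """
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    r = actual - pred
    # Indicator is 1 for overestimates, so their weight is |alpha - 1| = 1 - alpha.
    weight = np.abs(alpha - (r < 0).astype(float))
    return float(np.mean(weight * r ** 2))
```

For example, with `alpha=0.8` an underestimate of 2 minutes costs `0.8 * 4 = 3.2`, while an overestimate of the same size costs only `0.2 * 4 = 0.8`.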
In addition, the asymmetric MSE loss more accurately and intuitively represents the business trade-offs we face: it lets us explicitly state that a late delivery is X times worse than an early delivery, where X is the ratio of the two weights in the equation above.
- $X = \alpha/(1-\alpha)$, equivalently $\alpha = X/(1+X)$ — e.g. $\alpha = 0.8$ means a late delivery is 4 times worse than an early one.
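To actually train with this loss, a gradient-boosting library that accepts a custom objective needs the per-sample gradient and Hessian of $w \cdot r^2$ with respect to the prediction. A sketch, assuming the common callable-objective signature `(y_true, y_pred) -> (grad, hess)` used by e.g. LightGBM's scikit-learn API (the factory function name is mine):

```python
import numpy as np

def make_asymmetric_mse_objective(alpha=0.8):
    """Build a custom objective for loss w * (y - y_hat)^2,
    where w = alpha for underestimates and 1 - alpha for overestimates."""
    def objective(y_true, y_pred):
        r = y_true - y_pred
        w = np.where(r < 0, 1.0 - alpha, alpha)  # overestimates get 1 - alpha
        grad = -2.0 * w * r   # d/d y_pred of w * r^2
        hess = 2.0 * w        # second derivative (piecewise constant)
        return grad, hess
    return objective
```

Note the weight is piecewise constant, so the gradient has a kink at r = 0; in practice boosting libraries tolerate this, just as they do for quantile-style losses.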