Study: Practical Lessons from Predicting Clicks on Ads at Facebook
Paper - https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/
Introduction
- The system first builds a cascade of classifiers; this paper focuses on the click prediction model at the last stage of the cascade.
- A hybrid model which combines decision trees with logistic regression outperforms either of these methods on its own.
- the most important thing is to have the right features: those capturing historical information about the user or ad dominate other types of features.
My keywords
- cascade classifier
- decision trees with logistic regression
Experimental Setup
- Prepared: 1) offline training data, 2) a partition into training and testing sets, and 3) a simulation of streaming data for online training and prediction.
Normalized Entropy (NE) and Calibration
- Normalized Entropy (NE) and calibration are used as the major evaluation metrics (a computational sketch follows after this list)
- Normalized Cross-Entropy
- the average log loss per impression divided by what the average log loss per impression would be if a model predicted the background click through rate (CTR) for every impression
- the predictive log loss normalized by the entropy of the background CTR. The background CTR is the average empirical CTR of the training data set.
- Normalized Logarithmic Loss
- The lower, the better.
- the closer the background CTR is to either 0 or 1, the easier it is to achieve a better log loss.
- a component in calculating Relative Information Gain (RIG)
- RIG = 1 − NE
- Calibration
- the ratio of the average estimated CTR to the empirical CTR
- The less the calibration differs from 1, the better the model is.
- Area-Under-ROC (AUC)
- NE measures the goodness of predictions, AUC measures ranking quality without considering calibration.
- See [12] J. Yi, Y. Chen, J. Li, S. Sett, and T. W. Yan. Predictive model performance: Offline and online evaluations. In KDD, pages 1294–1302, 2013.
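A minimal sketch of how NE and calibration could be computed, assuming binary labels in {0, 1} and predicted click probabilities; function and variable names are mine, not the paper's:

```python
import numpy as np

def normalized_entropy(y, p):
    """NE = average log loss of the model / log loss of always predicting the background CTR."""
    y = np.asarray(y, dtype=float)                 # labels in {0, 1}
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    model_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    ctr = y.mean()                                 # background (empirical) CTR of the data set
    background_loss = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return model_loss / background_loss

def calibration(y, p):
    """Ratio of the average estimated CTR to the empirical CTR; closer to 1 is better."""
    return np.mean(p) / np.mean(y)

# A model that predicts the background CTR for every impression gets NE = 1.0,
# and RIG = 1 - NE = 0.
```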
My keywords
- average log loss -- Logarithmic Loss
- Normalized Logarithmic Loss -- Logarithmic Loss
- relative information gain -- Cross Entropy
- Area-Under-ROC
Prediction Model Structure
- probabilistic linear classifiers and online learning algorithms
- a hybrid model structure: the concatenation of boosted decision trees and of a probabilistic sparse linear classifier
- Stochastic Gradient Descent (SGD)
- online learning schemes - Stochastic Gradient Descent (SGD) algorithm applied to sparse linear classifiers.
- L. Bottou. Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. revised, oct 2012.
- Bayesian online learning scheme for probit regression (BOPR)
- compute the posterior p(w | y, x) over the weight vector and project it back to the closest factorizing Gaussian approximation, which becomes the new prior p(w).
- Logistic Regression
- compared with BOPR: SGD applied to the likelihood function
- compute the derivative of the log-likelihood and take a step in the direction of this gradient with a per-coordinate step size (a worked form follows below)
References - [2], [7]
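For reference, a worked form of the SGD update described above, in my own notation (not copied from the paper): with labels y ∈ {-1, +1}, weight vector w, and a per-coordinate step size η_{t,i},

```latex
p(y \mid \mathbf{x}, \mathbf{w}) = \sigma\!\left(y\, \mathbf{w}^{\top}\mathbf{x}\right),
\qquad \sigma(a) = \frac{1}{1 + e^{-a}}

% SGD step in the direction of the gradient of the log-likelihood:
w_i \leftarrow w_i + \eta_{t,i}\; y\, x_i \left(1 - \sigma\!\left(y\, \mathbf{w}^{\top}\mathbf{x}\right)\right)
```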
My keywords
- Stochastic Gradient Descent
- Bayesian online learning scheme for probit regression, BOPR
Decision tree feature transforms
- Simple ways to transform the input features of a linear classifier in order to improve its accuracy.
- For continuous features, a simple trick for learning non-linear transformations is to bin the feature and treat the bin index as a categorical feature.
- building tuple input features
- For categorical features, the brute force approach consists of taking the Cartesian product of the original features
- For continuous features, one can do joint binning, using for example a k-d tree
- boosted decision trees
- follows Gradient Boosting Machine (GBM) [5], where the classic L2-TreeBoost algorithm is used.
- boosted decision tree based transformation
- a supervised feature encoding that converts a real-valued vector into a compact binary-valued vector (a sketch follows after this list).
- a traversal from root node to a leaf node = a rule on certain features
- Fitting a linear classifier on the binary vector = learning weights for the set of rules
- Boosted decision trees are trained in a batch manner.
- significant relative improvement from tree feature transformations
- decreases Normalized Entropy by more than 3.4% relative to the Normalized Entropy of the model with no tree transforms.
- the majority of feature engineering experiments only manage to decrease Normalized Entropy by a fraction of a percentage.
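A minimal sketch of the boosted-tree feature transform described above, using scikit-learn as a stand-in for the paper's own GBM implementation (the dataset, model sizes and parameters here are purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# 1) Train boosted decision trees in a batch manner.
gbdt = GradientBoostingClassifier(n_estimators=100, max_leaf_nodes=12, random_state=0)
gbdt.fit(X, y)

# 2) Each tree maps an input to the index of the leaf it falls into;
#    the traversal from root to leaf acts as a rule on certain features.
leaf_indices = gbdt.apply(X)[:, :, 0]      # shape: (n_samples, n_trees)

# 3) One-hot encode the leaf indices -> compact binary-valued vector.
encoder = OneHotEncoder()
X_binary = encoder.fit_transform(leaf_indices)

# 4) Fitting a linear classifier on the binary vector = learning weights for the set of rules.
lr = LogisticRegression(max_iter=1000).fit(X_binary, y)
```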
My questions
- Could not understand the differences between LR + Trees, LR only, and Trees only. A typical feature engineering experiment shaves off only a couple of tenths of a percent of relative NE. Given that lower NE is better, why is 96.58% NE so important?
- A: It sounds like a typical feature engineering experiment would decrease NE by only around 0.2%, so a 3.4% relative decrease is a large improvement.
- meaning of "bin the feature"
- A: continuous feature -- transform --> categorical feature (the bin index); see the example below.
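A tiny illustration of binning a continuous feature into a categorical one (quantile bins here are just an example choice, not the paper's method):

```python
import numpy as np

ages = np.array([18, 23, 35, 47, 62, 71])          # a continuous feature
bin_edges = np.quantile(ages, [0.25, 0.5, 0.75])   # learn bin boundaries from the data
bin_index = np.digitize(ages, bin_edges)           # -> categorical feature with values 0..3
# The bin index can then be one-hot encoded, or crossed (Cartesian product)
# with another categorical feature to build tuple input features.
```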
My keywords
- Boosted decision trees - Decision Trees
- Gradient Boosting Machine
- Boosting
Data freshness
- the effect of training data freshness on predictive performance
- train a model on one particular day and test it on consecutive days.
- NE can be reduced by approximately 1% by going from training weekly to training daily.
- worth retraining on a daily basis.
- It may take more than 24 hours to build a boosting model with hundreds of trees from hundreds of millions of instances on a single-core CPU.
- In practice, the training can be done within a few hours with sufficient concurrency on a multi-core machine with a large amount of memory to hold the whole training set.
Online linear classifier
- In order to maximize data freshness, the linear classifier is trained online, directly as the labelled ad impressions arrive.
- several ways of setting learning rates for SGD-based online learning for logistic regression (a sketch of the per-coordinate scheme follows after this list)
- Per-coordinate learning rate: learning rate for feature i at iteration t
- proposed in [8]
- Per-weight square root learning rate
- Per-weight learning rate
- Global learning rate
- Constant learning rate
- SGD with per-coordinate learning rate achieves the best prediction accuracy
- with an NE almost 5% lower than when using the per-weight learning rate, which performs worst.
- The BOPR update equation for the mean is most similar to the per-coordinate learning rate version of SGD for LR.
- very similar prediction performance in terms of both NE and calibration.
- advantages of LR over BOPR
- the model size is half, given that there is only a weight associated with each sparse feature value, rather than a mean and a variance.
- In terms of computational expense at prediction time, the LR model requires one inner product, while the BOPR model needs two inner products.
- One important advantage of BOPR over LR
- Being a Bayesian formulation, it provides a full predictive distribution over the probability of click.
- can be used to compute percentiles of the predictive distribution.
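A minimal sketch of SGD for sparse logistic regression with a per-coordinate (AdaGrad-style) learning rate, as I understand the scheme; alpha, beta and all names here are illustrative, not taken from the paper:

```python
import math
from collections import defaultdict

class PerCoordinateLR:
    """Online LR: each feature i keeps its own learning rate
    eta_{t,i} = alpha / (beta + sqrt(accumulated squared gradient for i))."""

    def __init__(self, alpha=0.1, beta=1.0):
        self.alpha, self.beta = alpha, beta
        self.w = defaultdict(float)    # one weight per sparse feature value
        self.g2 = defaultdict(float)   # accumulated squared gradients per feature

    def predict(self, x):
        """x: dict {feature: value} for one impression; returns estimated click probability."""
        z = sum(self.w[i] * v for i, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """y in {0, 1}: click / no-click label."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v            # gradient of the log loss w.r.t. w_i
            self.g2[i] += g * g
            eta = self.alpha / (self.beta + math.sqrt(self.g2[i]))
            self.w[i] -= eta * g

# usage: for each labelled impression from the training stream
# model.update({"ad_id=123": 1.0, "hour_of_day=14": 1.0}, y=1)
```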
My keywords
- nothing in particular...
Online Data Joiner
- key component required for the online learning layer, the online joiner
- experimental system that generates real-time training data used to train the linear classifier via online learning.
- "online joiner"
- join labels (click/no-click) to training inputs (ad impressions) in an online manner.
- The online joiner outputs a real-time training data stream to an infrastructure called Scribe [10].
- There is no such thing as a "no click" button the user can press.
- an impression is given a negative (no-click) label if the user did not click the ad within a fixed, sufficiently long period of time after seeing it.
- The length of the waiting time window needs to be tuned carefully.
- too long: delays the real-time training data and increases the memory allocated to buffering impressions
- too short: loses some of the clicks, lowering the "click coverage" = clicks successfully joined to impressions / all clicks
- a study on the window size and efficiency can be found in [6]
- easy to reduce this bias to decimal points of a percentage with waiting window sizes that result in manageable memory requirements.
- utilizing a request ID as the primary component of the join predicate. A request ID is generated every time a user performs an action (snipped).
- schematic data and model flow
- user visits service -> request to the ranker for candidate ads
- ads -> are passed to the user's device
- ad and associated features used in ranking -> impression stream
- user clicks -> click stream
- online joiner: merges impressions and clicks using a hash queue -> training stream
- To achieve the stream-to-stream join, the system utilizes a HashQueue consisting of a First-In-First-Out queue as a buffer window and a hash map for fast random access to label impressions (a sketch follows after this list).
- protection mechanisms against anomalies that could corrupt the online learning system
- If the click stream becomes stale because of some data infrastructure issue, the online joiner will produce training data that has a very small or even zero empirical CTR.
- automatically disconnect the online trainer from the online joiner if the real-time training data distribution changes abruptly.
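A minimal sketch of the stream-to-stream join with a FIFO buffer plus hash map, assuming the request ID is the join key; all names, the window length and the data layout are illustrative, not the actual infrastructure:

```python
import time
from collections import OrderedDict

class HashQueue:
    """Buffer impressions for a waiting window; emit a positive example when a click
    joins, or a negative example when the window expires without a click."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.pending = OrderedDict()                 # request_id -> (timestamp, features)

    def on_impression(self, request_id, features):
        self.pending[request_id] = (time.time(), features)

    def on_click(self, request_id):
        entry = self.pending.pop(request_id, None)   # hash-map lookup: fast random access
        if entry is not None:
            _, features = entry
            return (features, 1)                     # joined: positive training example
        return None                                  # click arrived after the window (lost)

    def flush_expired(self):
        """Pop impressions older than the window from the FIFO front and label them no-click."""
        now, examples = time.time(), []
        while self.pending:
            request_id, (ts, features) = next(iter(self.pending.items()))
            if now - ts < self.window:
                break
            self.pending.popitem(last=False)
            examples.append((features, 0))
        return examples
```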
Containing Memory and Latency
- trade accuracy for memory and compute time
Number of boosting trees
- The number of trees is varied from 1 to 2,000, models are trained on one full day of data, and prediction performance is tested on the next day. Each tree is constrained to no more than 12 leaves.
- normalized entropy as an evaluation metric
- Almost all NE improvement comes from the first 500 trees.
Boosting feature importance
- use the statistic Boosting Feature Importance, which aims to capture the cumulative loss reduction attributable to a feature (a sketch follows after this list).
- a small number of features contributes the majority of explanatory power while the remaining features have only a marginal contribution.
- the top 10 features are responsible for about half of the total feature importance, while the last 300 features contribute less than 1% feature importance.
- usefulness of historical and contextual features
- the paper does not reveal the details of the actual features used.
- Some example contextual features can be local time of day, day of week, etc. Historical features can be the cumulative number of clicks on an ad, etc.
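A minimal sketch of this kind of analysis, using scikit-learn's impurity-based importances as a stand-in for the paper's Boosting Feature Importance statistic (a related but not identical quantity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=50, n_informative=10, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=200, max_leaf_nodes=12, random_state=0).fit(X, y)

# Sort features by their (normalized) cumulative loss-reduction contribution.
importance = np.sort(gbdt.feature_importances_)[::-1]
cumulative = np.cumsum(importance) / importance.sum()
print("share of total importance covered by the top 10 features:", cumulative[9])
```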
Historical features
- two types: contextual features and historical features
- contextual features - depend exclusively on current information regarding the context in which an ad is to be shown, such as the device the user is using or the current page that the user is on
- historical features - depend on previous interactions of the ad or the user
- the click through rate of the ad in last week, or the average click through rate of the user
- historical features provide considerably more explanatory power than contextual features
- The top 10 features ordered by importance are all historical features.
- Among the top 20 features, there are only 2 contextual features, despite historical features occupying roughly 75% of the features in this dataset.
- It should be noted that contextual features are very important for handling the cold start problem
- for new users and ads.
- test the feature dependency on data freshness
- the model with contextual features relies more heavily on data freshness than historical features.
- since historical features describe long-time accumulated user behaviour, which is much more stable than contextual features.
Coping with Massive Training Data
- tradeoff between training data volume and accuracy
- even a small fraction of a day's worth of data contains many hundreds of millions of instances.
- down sampling
- uniform subsampling and negative down sampling
Uniform subsampling
- a tempting approach
- easy to implement
- the resulting model can be used without modification on both the subsampled training data and non-subsampled test data.
- Using only 10% of the data results in only about a 1% reduction in performance (in terms of normalized entropy) relative to using the entire training data set.
My questions
- How to do a uniform subsampling actually?
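A plausible answer as a minimal sketch (my own names): keep each impression independently with the same probability, regardless of its label, so that class balance and calibration are roughly preserved:

```python
import numpy as np

def uniform_subsample(features, labels, rate, seed=0):
    """Keep each (features, label) row with probability `rate`, independent of the label."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(labels)) < rate           # features/labels: numpy arrays
    return features[keep], labels[keep]

# e.g. rate=0.1 keeps ~10% of the data; the trained model can be used without
# modification on non-subsampled test data.
```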
Negative down sampling
- the negative down sampling rate has significant effect on the performance of the trained model.
- The best performance is achieved with negative down sampling rate set to 0.025.
My questions
- How to do a negative down sampling actually?
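A plausible answer, again as a sketch: keep every positive (clicked) example and keep each negative with probability w, the negative down sampling rate:

```python
import numpy as np

def negative_downsample(features, labels, w, seed=0):
    """Keep all positives; keep each negative example with probability w."""
    rng = np.random.default_rng(seed)
    keep = (labels == 1) | (rng.random(len(labels)) < w)
    return features[keep], labels[keep]
```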
Model Re-Calibration
- Negative downsampling can speed up training and improve model performance.
- if a model is trained in a data set with negative downsampling, it also calibrates the prediction in the downsampling space.
- if the average CTR before sampling is 0.1% and we do a 0.01 negative downsampling, the empirical CTR will become roughly 10%.
- need to re-calibrate the model for live traffic experiments to get back to the 0.1% prediction (see the formula below)
My questions
- Didn't understand negative downsampling.
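As I recall the paper's re-calibration: a prediction p made in the negatively down-sampled space is mapped back to the original space using the down sampling rate w:

```latex
q = \frac{p}{p + \frac{1 - p}{w}}
```

Quick check against the example above: with w = 0.01 the down-sampled empirical CTR is roughly 10%, and q = 0.1 / (0.1 + 0.9 / 0.01) ≈ 0.001, i.e. back to roughly 0.1%.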
Discussion
Hybrid model architecture for click prediction
- Data freshness matters: worth retraining at least daily.
- Transforming real-valued input features with boosted decision trees significantly increases the prediction accuracy of probabilistic linear classifiers.
- Best online learning method: LR with per-coordinate learning rate, which ends up being comparable in performance with BOPR.
tradeoff between memory and latency
- tradeoff between the number of boosted decision trees and accuracy
- aggressively reduce the number of active features by measuring feature importance
- historical features are superior to contextual features
References
- Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams
- L. Bottou. Online algorithms and stochastic approximations. In On-Line Learning in Neural Networks, Cambridge University Press. http://www.cambridge.org/catalogue/catalogue.asp?isbn=0511836589
- O. Chapelle and L. Li. An empirical evaluation of Thompson sampling
- Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords
- Greedy function approximation: A gradient boosting machine
- Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine
- Ad click prediction: a view from the trenches
- Predicting clicks: Estimating the click-through rate for new ads
- Data warehousing and analytics infrastructure at Facebook
- Position auctions
- Predictive model performance: Offline and online evaluations