Trustworthy online experiment notes - sophiekeke/casestudy GitHub Wiki

Chapter 6:

Measuring Metrics, Pavel Dmitriev and Xian Wu, 2016
[Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix](https://www.kdd.org/kdd2016/papers/files/adp0945-xieA.pdf)

4.3.1 Sensitivity

In the context of a controlled experiment, sensitivity of a metric refers to the amount of data needed for the metric to show that a treatment-control delta of a specific magnitude is statistically significant. Sensitivity is important because more sensitive metrics allow detecting small changes sooner, shortening the time required for running an experiment and improving experimentation and decision-making agility. While, as we discuss later, sensitivity is not the only aspect to consider when deciding which metric to focus on, comparing metrics on the sensitivity axis provides useful insights.

Assuming we are not changing the statistical test used, sensitivity depends on three factors:

- The amount of data (number of users or queries in our case)
- The variance of the metric
- The effect size (treatment-control delta)
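
To make the interplay of these three factors concrete, here is a minimal sketch based on the standard sample-size approximation for a two-sided, two-sample z-test, n ≈ 2(z_{α/2} + z_β)² σ² / δ² per group. The function name `required_n_per_group` and the example numbers are illustrative assumptions, not taken from the paper.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(variance, delta, alpha=0.05, power=0.8):
    """Approximate units per group needed to detect `delta`.

    Assumes a two-sided two-sample z-test with equal group sizes and
    equal variance in both groups (a textbook approximation).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Hypothetical numbers, chosen only to show the scaling.
n_base = required_n_per_group(variance=4.0, delta=0.1)
n_low_var = required_n_per_group(variance=1.0, delta=0.1)    # 4x less variance
n_big_delta = required_n_per_group(variance=4.0, delta=0.2)  # 2x larger effect

print(n_base, n_low_var, n_big_delta)
```

Under this approximation, cutting the variance by a factor of four or doubling the effect size both reduce the required data by roughly a factor of four, which is why variance reduction is the usual lever when the effect size itself cannot be changed.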

Normalization:

Reduction in Variance:
- Normalization helps in reducing the variance of the data, which directly increases the sensitivity of the metrics. When metrics are normalized, such as dividing queries with clicks by total queries to get the Query Click Rate, the values are bounded within a specific range (0 to 1 in this case). This bounded range reduces the influence of extreme values (outliers) and leads to a more consistent and sensitive measure.
Bounded Values:
- Metrics like Queries per Session are more sensitive because the number of queries per session is naturally limited compared to the total number of queries per user, which can grow very large. By focusing on sessions, the metric becomes more stable and sensitive to changes within those smaller, bounded periods.
Example:
- "Queries per Session" is more sensitive than "Queries per User" because sessions are smaller and more defined time periods, leading to less variability in the number of queries, thus making it easier to detect changes.
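
The effect of normalization can be illustrated with a small simulation. This is only a sketch on synthetic data: the activity distribution (log-normal session counts, 1 to 5 queries per session) is an assumption made for illustration, and the coefficient of variation (standard deviation divided by mean) is used here as a rough proxy for how noisy a metric is relative to its size.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the sketch is reproducible

users = []
for _ in range(5000):
    # Assumed heavy-tailed activity: most users have few sessions, a few have many.
    sessions = max(1, round(rng.lognormvariate(1.0, 1.0)))
    # Assumed bounded per-session behavior: 1 to 5 queries in each session.
    queries = sum(rng.randint(1, 5) for _ in range(sessions))
    users.append((sessions, queries))


def coeff_of_variation(values):
    """Standard deviation relative to the mean (lower means less relative noise)."""
    return pstdev(values) / mean(values)


queries_per_user = [q for _, q in users]
queries_per_session = [q / s for s, q in users]  # normalized by session count

cv_user = coeff_of_variation(queries_per_user)
cv_session = coeff_of_variation(queries_per_session)

print(f"CV of Queries per User:    {cv_user:.2f}")
print(f"CV of Queries per Session: {cv_session:.2f}")
```

In this synthetic setup the per-user total inherits the heavy tail of the session counts, while the normalized metric stays within the 1 to 5 range, so its relative variability is much lower.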

Truncation:

Elimination of Outliers:
- Truncation involves capping the values of a metric to remove extreme values that can distort the analysis. By eliminating these outliers, the overall variance of the metric is reduced, making the metric more sensitive to actual changes in the data.
Example:
- The paper mentions capping the "Revenue per User" metric at $10. This means that any revenue amount above $10 is treated as $10. This cap reduces the variance caused by extremely high revenue values and makes it easier to detect smaller changes in user spending patterns.
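
A minimal sketch of truncation, using the $10 cap from the example above. The revenue values are made up for illustration, and `cap` is a hypothetical helper name.

```python
from statistics import pvariance


def cap(values, limit):
    """Truncate each value at `limit`; values at or below it are unchanged."""
    return [min(v, limit) for v in values]


# Hypothetical per-user revenue with two extreme spenders.
revenue = [0.0, 0.0, 1.5, 2.0, 3.0, 4.5, 8.0, 12.0, 250.0, 1200.0]
capped = cap(revenue, 10.0)

var_raw = pvariance(revenue)
var_capped = pvariance(capped)

print(f"variance before capping: {var_raw:.1f}")
print(f"variance after capping:  {var_capped:.1f}")
```

One design consideration: capping also shrinks any treatment effect that lives entirely above the cap, so the cap value is a trade-off between lower variance and a possibly smaller measured delta.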

Variance Reduction:

The primary reason these techniques increase sensitivity is that they reduce the noise in the data. High-variance metrics are dominated by extreme values and outliers, which makes genuine changes or trends hard to detect. Normalizing or truncating the data reduces the variance, so the metric becomes more stable and more sensitive to actual changes in user behavior or system performance.

4.3.2 Alignment with User Value

Direction of Label Agreement:

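- Positive (+): The metric agrees with the labels more often when an increase in its value is read as an improvement.
- Negative (-): The metric agrees with the labels more often when a decrease in its value is read as an improvement.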

Examples:

Ads Click Rate: Has a negative direction. Increased engagement with ads typically signals degraded user value, because poorer search result quality pushes users toward the ads.
Queries per User and Queries per Session: Show better agreement in the negative direction, as improved search quality reduces the need for query reformulations, leading to fewer queries.
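
The idea of directional agreement can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: it assumes each experiment carries a label (+1 if it is known to improve user value, -1 if it is known to hurt) and an observed metric delta, and it ignores statistical significance. The function name `agreement` and the data are hypothetical.

```python
def agreement(experiments, direction):
    """Fraction of experiments whose metric movement matches the label.

    `experiments` is a list of (label, delta) pairs, label in {+1, -1}.
    `direction` is +1 if an increase is read as good, -1 if a decrease is.
    Experiments with a zero delta are skipped.
    """
    moved = [(label, delta) for label, delta in experiments if delta != 0]
    hits = sum(
        1 for label, delta in moved
        if (delta * direction > 0) == (label > 0)
    )
    return hits / len(moved)


# Hypothetical labeled experiments with Queries per Session deltas.
queries_per_session = [
    (+1, -0.02),  # good change, fewer queries
    (+1, -0.01),  # good change, fewer queries
    (-1, +0.03),  # bad change, more queries
    (+1, +0.01),  # good change, more queries (disagrees with the negative reading)
    (-1, +0.02),  # bad change, more queries
]

positive = agreement(queries_per_session, direction=+1)
negative = agreement(queries_per_session, direction=-1)
print(positive, negative)  # 0.2 0.8
```

With this made-up data the metric agrees with the labels far more often in the negative direction, which is the pattern described above for Queries per User and Queries per Session.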