Research Problem: The paper investigates the apparent contradiction between the classical bias-variance tradeoff (where models that are too complex overfit and perform worse) and the modern deep learning practice where larger models often achieve better results. It seeks to understand the conditions under which bigger models and more data can surprisingly lead to worse performance.
Key Contributions:
It demonstrates that the "double descent" phenomenon, in which test error first falls, then rises as the model approaches the interpolation threshold, and then falls again as complexity grows further, is a robust and widespread occurrence in modern deep learning.
It shows this phenomenon occurs not only as a function of model size (model-wise double descent) but also as a function of training time (epoch-wise double descent).
It introduces the concept of Effective Model Complexity (EMC) to provide a unified framework for understanding these behaviors (the formal definition is reproduced after this list).
It identifies a "sample-wise non-monotonicity" regime where, counter-intuitively, increasing the amount of training data can actually harm test performance.
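For reference, the paper defines the EMC of a training procedure T (which maps a training set S to a model), with respect to a data distribution D and a small tolerance ε, as the largest sample size the procedure can still fit to training error at most ε. Up to notation:

```latex
\[
\mathrm{EMC}_{\mathcal{D},\epsilon}(\mathcal{T})
  \;=\; \max\Bigl\{\, n \;\Bigm|\;
  \mathbb{E}_{S \sim \mathcal{D}^{n}}\bigl[\operatorname{Error}_{S}\bigl(\mathcal{T}(S)\bigr)\bigr] \le \epsilon \,\Bigr\},
\]
```

where Error_S(M) denotes the mean training error of model M on the training samples S.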
Methodology/Approach: The study is primarily empirical, conducting extensive experiments across various architectures (ResNets, CNNs, Transformers), datasets (CIFAR-10/100, IWSLT), and training configurations (e.g., optimizers, label noise, data augmentation) to validate the double descent hypothesis.
Results: The test error consistently peaks when the model's EMC is approximately equal to the number of training samples. This peak sits at the "interpolation threshold", the point where the model is just barely complex enough to fit the training data; the paper calls the surrounding window the "critical regime". Beyond this peak, in the over-parameterized regime, increasing complexity (either by model size or training time) again improves performance.
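The same peak at the interpolation threshold can be reproduced in a much simpler setting than the paper's deep-learning experiments. The sketch below is only an illustration under assumed settings (random ReLU features, a noisy linear target, min-norm least squares), not code from the paper: it sweeps the number of features past the number of training samples and records test error, which typically spikes near the threshold and falls again beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task: noisy linear targets in d dimensions.
d, n_train, n_test, noise = 20, 100, 1000, 0.5
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_true + noise * rng.normal(size=n_train)
y_test = X_test @ w_true + noise * rng.normal(size=n_test)

def random_relu_features(X, W):
    """Fixed random ReLU features: phi(x) = max(0, x W)."""
    return np.maximum(X @ W, 0.0)

test_mse = {}
for n_features in [10, 25, 50, 75, 90, 100, 110, 150, 300, 1000, 3000]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    Phi_train = random_relu_features(X_train, W)
    Phi_test = random_relu_features(X_test, W)
    # lstsq returns the minimum-norm solution when the system is
    # underdetermined, i.e. in the over-parameterized regime.
    beta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_mse[n_features] = np.mean((Phi_test @ beta - y_test) ** 2)

for k, v in test_mse.items():
    print(f"features={k:5d}  test MSE={v:.3f}")
# Test MSE typically peaks near n_features ≈ n_train (the interpolation
# threshold) and descends again as the feature count grows well beyond it.
```

The number of random features plays the role that model width or training time plays in the paper's experiments: it is one concrete knob on effective complexity.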
Discussion Points
Strengths:
The paper provides extensive empirical evidence showing that double descent is a general phenomenon, not specific to one type of model or dataset.
The introduction of EMC offers a compelling, unified way to think about the combined effects of model size, training time, and other factors.
Weaknesses:
The paper is largely observational ("this happens") and provides an intuition rather than a formal theoretical proof for why double descent occurs in complex deep networks.
The definition of EMC itself is acknowledged as informal and abstract, making it difficult to measure precisely.
Key Questions:
Why does more data sometimes hurt? The discussion centered on the paper's intuition: in the "critical regime," the model is forced into a single, brittle solution that just barely fits the data. Adding data can push a fixed-capacity model back into this regime, increasing confusion rather than aiding generalization. This was likened to 깔딱고개 (a steep final stretch just below the top of a hill), where the model struggles before reaching a better state.
Is double descent just a modern discovery? The group speculated that this phenomenon might have always existed, but past limitations in computational power and data prevented researchers from training models long or large enough to move past the initial error peak.
Applications:
It provides a crucial mental model for practitioners: if a model's performance is poor and it seems to be just barely fitting the training data, small changes (like adding a bit more data or making the model slightly larger) could unexpectedly make things worse.
It challenges the universal applicability of "early stopping," suggesting that if one can afford the computation, training past the initial overfitting peak can reach a second, often better, performance minimum (see the monitoring sketch after this list).
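One practical way to act on this is to log the full test-error curve rather than halting at the first upturn. A minimal sketch in PyTorch style, assuming a hypothetical eval_test_error helper and standard model/optimizer/loss/loader objects supplied by the caller:

```python
import copy

def train_full_budget(model, optimizer, loss_fn, train_loader, test_loader,
                      epochs, eval_test_error):
    """Train for the whole epoch budget while logging test error each epoch,
    so the early-stopping minimum can be compared against any later minimum
    that appears after the epoch-wise double-descent peak."""
    history = []
    best_err, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        err = eval_test_error(model, test_loader)  # assumed evaluation helper
        history.append(err)
        if err < best_err:  # keep the best checkpoint wherever it occurs
            best_err, best_state = err, copy.deepcopy(model.state_dict())
    return history, best_err, best_state
```

Whether the later minimum actually beats the early-stopping one depends on the setting (the paper observes the effect most clearly under label noise), so this is a monitoring strategy rather than a blanket argument against early stopping.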
Connections:
One participant connected the findings to their own research on interpreting model behavior through loss curves.
The group drew an analogy to human learning: knowing a moderate amount about a topic can lead to more confusion and mistakes than knowing very little or having achieved mastery.
Notes and Reflections
Interesting Insights:
The most surprising realization for the group was that double descent applies to model size, not just training epochs. This shifted the understanding from a temporal phenomenon to a more fundamental property of model complexity.
The idea that "more data can be worse" was highly counter-intuitive but became a central point of discussion, highlighting the non-linear and sometimes unpredictable nature of deep learning.
Lessons Learned:
The classical U-shaped bias-variance curve is an incomplete picture for modern deep learning.
The "complexity" of a training process is a multifaceted concept that EMC attempts to capture, unifying factors like architecture, training duration, and regularization.
Future Directions:
Developing a more formal, theoretical understanding of EMC and the mechanisms behind double descent in deep neural networks remains a key open problem.