[25.05.29] Group Normalization - Paper-Reading-Study/2025 GitHub Wiki
Paper Reading Study Notes
General Information
- Paper Title: Group Normalization
- Authors: Yuxin Wu, Kaiming He
- Published In: arXiv:1803.08494 [cs.CV] (Work from Facebook AI Research - FAIR)
- Year: 2018
- Link: https://arxiv.org/abs/1803.08494
- Date of Discussion: 2025.05.29
Summary
- Research Problem: Batch Normalization (BN) is dependent on batch size, leading to increased error and instability with smaller batch sizes. This limits its use in tasks requiring small batches due to memory constraints (e.g., detection, segmentation, video).
- Key Contributions: Group Normalization (GN) is proposed as an alternative that is independent of batch size. It divides channels into groups and normalizes within these groups, providing stable accuracy across a wide range of batch sizes.
- Methodology/Approach: GN normalizes each sample in the batch independently: the C channels are divided into G groups of C/G channels, and the mean and variance are computed over the (C/G, H, W) elements of each group. Like other normalization methods, it then applies learnable per-channel affine parameters (gamma and beta).
- Results: GN shows significantly better performance than BN with small batch sizes (e.g., 10.6% lower error on ResNet-50 ImageNet with batch size 2). It is comparable to BN with typical batch sizes and outperforms other non-batch-dependent methods like Instance Normalization (IN) and Layer Normalization (LN) in visual recognition tasks. It also transfers well to downstream tasks.
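The grouping described in the methodology above can be sketched in a few lines of NumPy. This is a minimal illustration (not the authors' reference code), assuming NCHW layout and treating `gamma`/`beta` as given per-channel arrays:

```python
import numpy as np

def group_norm(x, G, gamma, beta, eps=1e-5):
    """Group Normalization sketch. x: (N, C, H, W); gamma, beta: (C,)."""
    N, C, H, W = x.shape
    assert C % G == 0, "C must be divisible by G"
    # Split channels into G groups of C//G channels each
    xg = x.reshape(N, G, C // G, H, W)
    # Statistics are computed per sample, per group, over (C//G, H, W)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    x = xg.reshape(N, C, H, W)
    # Learnable per-channel affine transform restores representational power
    return x * gamma.reshape(1, C, 1, 1) + beta.reshape(1, C, 1, 1)
```

Note that no statistic involves the batch axis N, which is exactly why GN is insensitive to batch size. Setting G=1 recovers Layer Normalization and G=C recovers Instance Normalization.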
Discussion Points
- Strengths:
- Batch Independence: The primary advantage, making it robust to small batch sizes (Attendees 1 00:14, Attendees 2 01:31).
- Simplicity: Conceptually a generalization of existing methods: LN is GN with G=1, and IN is GN with G=C (Attendees 1 01:12).
- Clear Visualization: Figure 2 in the paper effectively illustrates the different normalization strategies (Attendees 1 02:32, Attendees 2 04:18).
- Effectiveness in Training/Fine-tuning: Works well for both (Attendees 1 00:14).
- Affine Transformation: The learnable gamma and beta parameters are crucial for restoring representational power after normalization (Attendees 1 21:15, referencing Karpathy).
- Weaknesses:
- Initial Complexity of Figure 2: Can be slightly confusing at first glance (Attendees 1 02:32, Attendees 2 02:26).
- Nuance in Application (Image vs. NLP): The discussion highlighted that how normalization (especially LayerNorm, and by extension GN) is applied differs between image data (normalizing over C, H, W) and NLP data (often normalizing only over the embedding dimension, not sequence length) (Attendees 2 05:18 - 09:57).
- Potential for "Destructive" Behavior / Emergent Issues: Attendees 1 discussed an example (25:00 onwards) from Stable Diffusion's VAE where GN can produce "splotches" (extreme neuron activations). The interpretation: to preserve certain features, the model makes specific neurons fire with very high magnitude to "escape" the group-wise normalization, and a subsequent linear layer with small weights scales them back down. This suggests GN can force models into convoluted, undesirable workarounds (Attendees 1 28:29).
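The vision-vs-NLP distinction raised above comes down to which axes the statistics are computed over. A minimal sketch, assuming NCHW images and (batch, seq_len, d_model) token tensors (variable names are illustrative):

```python
import numpy as np

def layer_norm_last(x, eps=1e-5):
    # Transformer-style LN: each token's embedding vector is normalized
    # independently; the sequence axis is never mixed in
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Vision-style LN (equivalently GN with G=1): statistics over all of C, H, W
img = np.random.randn(2, 3, 4, 4)            # (N, C, H, W)
flat = img.reshape(2, -1)
img_ln = ((flat - flat.mean(1, keepdims=True))
          / np.sqrt(flat.var(1, keepdims=True) + 1e-5)).reshape(img.shape)

# NLP-style LN: statistics over the embedding dim only, not sequence length
seq = np.random.randn(2, 5, 8)               # (batch, seq_len, d_model)
seq_ln = layer_norm_last(seq)
```

Normalizing over the sequence axis would mix statistics across tokens (and break with variable-length sequences), which is why the per-token variant is standard in Transformers.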
- Key Questions:
- How does the application of GN/LN differ fundamentally between vision (CHW) and NLP (sequence length, embedding dimension)? (Attendees 2 05:18)
- Is normalizing over the sequence length in NLP beneficial or meaningful? (Attendees 2 09:57)
- Why do some newer architectures (e.g., Qwen models) use RMSNorm instead of LayerNorm or GN? (Attendees 2 23:34)
- What causes the "splotches" (extreme neuron activations) phenomenon observed with GN in some specific models like Stable Diffusion VAE? (Attendees 1 25:00)
- How does one practically decide which normalization technique is optimal for a given task/model, beyond general guidelines? (Attendees 1 24:30 - "don't you just have to know from experience?" - i.e., it seems to come down to accumulated intuition.)
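Regarding the RMSNorm question: RMSNorm drops the mean-centering (and usually the beta shift) of LayerNorm, rescaling only by the root mean square, which makes it cheaper and has proven sufficient in many LLMs. A minimal sketch (not any specific model's implementation):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square over the last axis.
    # No mean subtraction, no beta -- fewer ops than LayerNorm
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma
```

Compared with LayerNorm, only the learnable scale `gamma` remains; the empirical finding motivating its adoption is that re-centering contributes little in practice.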
- Applications:
- Image classification (ImageNet).
- Object detection and segmentation (COCO, Mask R-CNN), especially when fine-tuning with small batch sizes.
- Video classification (Kinetics).
- Potentially NLP, though the exact application (which dimensions to normalize) was debated.
- Generative models (as seen in the Stable Diffusion VAE example, though with caveats).
- Connections:
- Directly compared with Batch Normalization (BN), Layer Normalization (LN), and Instance Normalization (IN).
- RMSNorm mentioned as a more recent alternative used in some LLMs (Attendees 2 23:34).
- The "splotches" discussion connects to model interpretability and understanding unintended emergent behaviors in deep networks.
Notes and Reflections
- Interesting Insights:
- The detailed discussion on how normalization applies differently to image (H, W, C) versus NLP (sequence length, embedding dimension) was a key point of clarification (Attendees 2 05:18 onwards).
- The "splotches" phenomenon (Attendees 1 25:00 onwards) was a deep dive into a potential side-effect of GN, where neurons might fire with extreme magnitudes to counteract normalization, only to be scaled down later. This suggests models can find complex, sometimes pathological, ways to preserve information.
- The re-emphasis on the role of learnable scale (gamma) and shift (beta) parameters in "recalibrating" the distribution after normalization, as explained by Karpathy for BN, applies here too (Attendees 1 21:15).
- Lessons Learned:
- The choice of normalization layer is highly context-dependent (data modality, batch size constraints, specific architecture).
- Even seemingly simple components like normalization layers can lead to complex and sometimes counter-intuitive emergent behaviors in deep neural networks.
- A clear understanding of which dimensions are being normalized over is crucial for both implementation and debugging.
- Future Directions:
- Further investigation into why RMSNorm is gaining traction over LayerNorm/GN in some modern architectures.
- Deeper analysis of emergent phenomena like the "splotches" to understand if they are bugs, features, or controllable aspects of training with certain normalizations.
- Developing more robust heuristics or automated methods for selecting the optimal normalization strategy for a given problem.