[25.05.08] Layers at Similar Depths Generate Similar Activations Across LLM Architectures

Paper Reading Study Notes

General Information

  • Paper Title: Layers at Similar Depths Generate Similar Activations Across LLM Architectures
  • Authors: Christopher Wolfram, Aaron Schein
  • Published In: Preprint (arXiv:2504.08775v1 [cs.CL])
  • Year: 2025 (Preprint submission date: 3 Apr 2025)
  • Link: https://arxiv.org/abs/2504.08775
  • Date of Discussion: 2025.05.08

Summary

  • Research Problem: The paper investigates how the latent spaces (specifically, nearest neighbor relationships of activations) of independently-trained Large Language Models (LLMs) relate to one another. It explores whether there are universal properties in how different LLMs represent information at various layers.
  • Key Contributions:
    1. Claim 1: Activations collected at different depths within the same model tend to have different nearest neighbor relationships. (The set of nearest neighbors for a given input changes as you move through the layers of a single model).
    2. Claim 2: Activations collected at corresponding depths of different models tend to have similar nearest neighbor relationships. (Despite architectural differences, different models show similar nearest neighbor patterns for the same input at proportionally similar layer depths).
    • Together, these suggest LLMs generate a progression of activation geometries from layer to layer, and this entire progression is largely shared between models, "stretched and squeezed" to fit different architectures.
  • Methodology/Approach:
    • Used 24 open-weight LLMs.
    • Fed a dataset of 2048 texts (primarily OpenWebText) to each model.
    • Collected activations from the end of each decoder module (referred to as a "layer") for the last token of the input.
    • For each input text and each layer, identified the k=10 nearest neighbor texts based on the cosine similarity of their activations.
    • Compared these sets of nearest neighbors (mutual k-NN) between layers (within the same model and across different models) to create affinity matrices.
    • Visualized these affinity matrices as heatmaps to observe patterns (a minimal sketch of this pipeline follows this summary).
  • Results:
    • Affinity matrices comparing layers from different models consistently exhibit a strong diagonal structure. This indicates that layers at proportionally similar depths across diverse models share similar nearest neighbor relationships for their activations.
    • This diagonal structure is statistically significant.
    • The nearest neighbor relationships within a single model change from layer to layer (supporting Claim 1).
    • Instruction tuning primarily changes the activation structure in late layers when models are given instruction-following tasks, while early layers remain similar to base models.
    • Using random alphanumeric strings as input does not produce the diagonal structure, suggesting the phenomenon is tied to meaningful linguistic input processing.
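A minimal sketch of the pipeline described in the Methodology above. It assumes last-token activations have already been collected into one array per layer, and it uses a simple neighbor-overlap fraction as the affinity score; the paper's exact mutual k-NN statistic may be defined differently. The helper names (`knn_sets`, `affinity`, `affinity_matrix`) and the stand-in data are hypothetical, not from the paper.

```python
# Sketch only, not the authors' code.
# acts_a / acts_b: lists of numpy arrays, one per layer, each of shape
# (num_texts, hidden_dim) holding last-token activations for the same texts.
import numpy as np
import matplotlib.pyplot as plt

K = 10  # number of nearest neighbors per text, as in the paper

def knn_sets(acts: np.ndarray, k: int = K) -> np.ndarray:
    """Indices of the k nearest neighbors (by cosine similarity) of each row."""
    normed = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)          # a text is not its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]  # shape (num_texts, k)

def affinity(nn_a: np.ndarray, nn_b: np.ndarray) -> float:
    """Mean fraction of shared neighbors between two layers' k-NN sets."""
    overlaps = [len(set(a) & set(b)) / K for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

def affinity_matrix(acts_a, acts_b) -> np.ndarray:
    """Layer-by-layer affinity between two models (or one model with itself)."""
    nns_a = [knn_sets(a) for a in acts_a]
    nns_b = [knn_sets(b) for b in acts_b]
    return np.array([[affinity(na, nb) for nb in nns_b] for na in nns_a])

# Stand-in usage with random activations (the paper used 2048 real texts);
# two hypothetical "models" with 32 and 24 layers of different widths.
rng = np.random.default_rng(0)
acts_a = [rng.normal(size=(256, 512)) for _ in range(32)]
acts_b = [rng.normal(size=(256, 384)) for _ in range(24)]
M = affinity_matrix(acts_a, acts_b)

plt.imshow(M, aspect="auto", origin="lower")
plt.xlabel("layers of model B")
plt.ylabel("layers of model A")
plt.colorbar(label="mean k-NN overlap")
plt.show()
```

In this framing, Claim 2 corresponds to the cross-model affinity matrix having its largest values along a (proportionally rescaled) diagonal, and the reported statistical significance could be checked against a null baseline obtained by shuffling which texts' neighbor sets are compared.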

Discussion Points

  • Strengths:
    • The methodology is very simple and intuitive (calculating nearest neighbors of activations using cosine similarity).
    • The findings, especially the consistent diagonal similarity across diverse models, are surprising and insightful.
    • Provides a novel perspective on shared internal mechanisms of LLMs.
    • The paper focuses on demonstrating a phenomenon rather than proving a complex hypothesis, which keeps its core findings clear.
    • The results offer good "interpretability" insights without overly complex methods.
  • Weaknesses:
    • The paper is largely observational; it shows that these patterns exist but doesn't deeply investigate or prove why they emerge.
    • Many of the interpretations (e.g., specific reasons for certain neighbor groupings, functional roles of layers) are plausible hypotheses derived from the observations but are not experimentally validated within this paper.
    • The speaker noted the paper doesn't make very strong claims beyond presenting the observed phenomena.
  • Key Questions:
    • Why do independently trained models, even with different architectures, converge to such similar representational structures (nearest neighbor geometries) at corresponding layers?
    • What specific computations or features are being captured at different layers that lead to these distinct yet cross-model-consistent patterns?
    • How exactly does instruction tuning modify the functionality of later layers?
  • Applications:
    • Could provide insights into why model pruning can be effective (if layers perform proportionally similar tasks, some might be compressible).
    • May help explain successes or failures in model merging or ensembling techniques.
    • Contributes to the broader field of mechanistic interpretability of LLMs.
  • Connections:
    • Relates to work by Anthropic on interpreting transformer layers (e.g., early layers handling syntax, later layers handling more abstract concepts).
    • Builds upon the general idea of representational similarity analysis (RSA) in neural networks, though using a simpler k-NN approach than methods like CKA or SVCCA (see the CKA sketch after this list for contrast).
    • The findings about instruction tuning align with the idea that alignment primarily targets higher-level, more abstract functionalities typically handled by later layers.
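For contrast with the k-NN overlap used in the paper, here is a minimal linear CKA sketch (Kornblith et al.'s centered kernel alignment, one of the methods mentioned in the Connections above). This is not what the paper uses; it only illustrates the alternative similarity measure being compared against.

```python
# Minimal linear CKA between two activation matrices; for illustration only.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activations x, y of shape (num_texts, dim)."""
    x = x - x.mean(axis=0)                      # center each feature
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, "fro") ** 2  # unnormalized alignment
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(hsic / (norm_x * norm_y))
```

CKA compares the full (linear) similarity structure of the activations, whereas the paper's mutual k-NN affinity depends only on each text's local neighborhood, which is part of what makes the paper's approach simpler and more intuitive.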

Notes and Reflections

  • Interesting Insights:
    • The "stretching and squeezing" of a shared representational progression to fit models of different depths is a powerful visual and conceptual takeaway.
    • Early layers across models of different sizes tend to align more directly (1-to-1), while middle layers of larger models might "compress" or represent more abstract functions compared to smaller models.
    • Instruction tuning affects later layers much more than earlier ones, and this effect depends on the input data (instruction-following prompts vs. general text).
    • The disappearance of the diagonal structure with random input strings strongly suggests the observed similarities are tied to the models processing meaningful linguistic patterns.
  • Lessons Learned:
    • Simple, well-designed experiments can yield significant and surprising insights into complex systems like LLMs.
    • Observing and clearly presenting a phenomenon can be a valuable contribution, even if the underlying causes are not fully elucidated in the same work.
  • Future Directions:
    • Experimentally verify the hypothesized functional roles of layers based on these similarity patterns (e.g., probing for specific features).
    • Investigate the causal mechanisms behind the formation of these shared activation geometries.
    • Explore if these findings can be leveraged for more efficient cross-architecture knowledge transfer, model distillation, or better pruning techniques.