[25.05.12] All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Paper Reading Study Notes

General Information

  • Paper Title: All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
  • Authors: Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell
  • Published In: arXiv preprint (cs.LG)
  • Year: 2025 (arXiv:2503.01067v1, submitted 3 Mar 2025)
  • Link: https://arxiv.org/abs/2503.01067
  • Date of Discussion: May 12, 2025 (based on transcript metadata)

Summary

  • Research Problem: Why does two-stage reinforcement learning (RL) based fine-tuning (RLHF/online FT) empirically outperform direct offline fine-tuning on preference data (e.g., DPO/MLE), even though, from an information-theoretic perspective, compressing the data through a reward model (RM) should lose information, and on-policy sampling cannot create new information? (The standard online and offline objectives are sketched after this summary.)
  • Key Contributions:
    1. Proves theoretical equivalence between online (RLHF) and offline (MLE/DPO) Preference Fine-Tuning (PFT) under idealized assumptions (e.g., isomorphic policy and reward model classes).
    2. Systematically evaluates and provides evidence against several existing or novel hypotheses for the observed online-offline performance gap.
    3. Proposes and supports a new hypothesis (H6): The "generation-verification gap." Online FT excels because it's easier to learn a relatively simple verifier (reward model) from preference data. The subsequent RL procedure then filters its search space to policies optimal for this simpler verifier, effectively performing "proper learning" on a reduced policy space, which is more effective than "improper learning" over the entire policy space as done by offline methods.
  • Methodology/Approach: The paper uses theoretical analysis based on information geometry and conducts controlled experiments comparing online DPO (as a form of online PFT) and offline DPO on summarization tasks. It systematically tests various hypotheses for the performance gap.
  • Results:
    • Online PFT (using an online DPO variant) consistently outperforms offline DPO, even when controlling for optimizers, base models, initial data, and the number of gradient steps.
    • The "generation-verification gap" hypothesis (H6) is found to be the most consistent explanation for the empirical observations. When experiments are designed to close this gap (e.g., by making the generation task very simple or the verification task very complex), the performance advantage of online PFT over offline PFT diminishes, as predicted by H6.

Discussion Points

  • Strengths:
    • Addresses a fundamental and practically relevant question in LLM alignment.
    • The systematic approach to testing multiple hypotheses is rigorous.
    • The "generation-verification gap" is an intuitive and compelling explanation, drawing parallels to concepts like P vs. NP.
    • The paper is thought-provoking and encourages a deeper understanding of RLHF mechanisms.
  • Weaknesses:
    • The paper is very dense and mathematically involved, making it difficult to read and fully grasp.
    • Major Point of Contention / Alternative Explanation: The experimental setup for "online DPO" samples 25 completions per prompt, ranks them with the learned RM, and uses the top-ranked and bottom-ranked completions as a new preference pair for DPO training (a minimal sketch of this loop appears after this section). This process itself may produce higher-quality, more clearly separated preference pairs than the original offline dataset, and this difference in data quality, rather than the "proper learning" aspect of H6, could be a simpler and more direct explanation for the observed performance gap. The paper acknowledges that the data is the only difference but attributes the benefit to the RL procedure filtering the policy space.
    • The connection to complex concepts like P vs. NP or detailed information geometry might obscure potentially simpler underlying mechanisms.
  • Key Questions:
    • Is the superior performance of online PFT primarily due to the "generation-verification gap" and "proper learning" (H6), or is it more significantly influenced by the data augmentation/selection effect from the 25x sampling and top/bottom selection in the online DPO setup?
    • How can the information loss from using an RM be reconciled with the improved performance, if not solely by H6?
    • What are the precise practical implications of forward vs. reverse KL divergence when projecting data onto policy/reward-model classes during fine-tuning? (Both divergences are written out after this section.)
  • Applications:
    • A better understanding of why RLHF works can lead to the development of more efficient, robust, and data-effective LLM alignment strategies.
    • Could inform the design of better offline methods that can match online performance.
  • Connections:
    • Directly relates to DPO, PPO, SFT, and the broader field of RLHF.
    • Connects to fundamental concepts in information theory, statistical learning (proper vs. improper learning), and computational complexity.
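
A minimal sketch of the online-DPO data-generation loop debated above, assuming hypothetical `policy.generate`, `reward_model.score`, and `policy.dpo_update` interfaces (these names are placeholders, not the paper's code):

```python
# Sketch of the discussed "online DPO" loop: sample N completions per prompt,
# score them with the learned reward model, and run a DPO update on the
# (best, worst) pair. All object interfaces here are hypothetical placeholders.
N_SAMPLES = 25  # the 25x sampling discussed above

def online_dpo_step(policy, reward_model, prompts):
    new_pairs = []
    for x in prompts:
        # On-policy sampling: draw N completions from the current policy.
        completions = [policy.generate(x) for _ in range(N_SAMPLES)]
        # Verification: rank completions by the learned RM's scalar score.
        ranked = sorted(completions, key=lambda y: reward_model.score(x, y))
        y_worst, y_best = ranked[0], ranked[-1]
        # The top/bottom pair becomes a fresh, high-contrast preference example.
        new_pairs.append((x, y_best, y_worst))
    # Standard (offline-style) DPO update, but on the freshly constructed pairs.
    policy.dpo_update(new_pairs)
    return new_pairs
```

The study group's alternative reading is that these `(y_best, y_worst)` pairs carry a much clearer preference signal than the original offline pairs, so data quality alone could account for part of the gap; the future-direction item below (constructing such high-contrast pairs offline) would test this directly.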

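On the forward- vs. reverse-KL question above, the two divergences being contrasted are standard (here $p$ is the target/data distribution and $\pi_\theta$ the model being fit):

```latex
% Forward KL (mass-covering): the quantity MLE-style offline fitting minimizes
\mathbb{D}_{\mathrm{KL}}(p \,\|\, \pi_\theta)
  = \mathbb{E}_{y \sim p}\big[\log p(y) - \log \pi_\theta(y)\big]

% Reverse KL (mode-seeking): expectations are taken under the model itself,
% as happens when sampling on-policy (in RLHF the regularizer is a reverse KL
% to the reference policy rather than to the data)
\mathbb{D}_{\mathrm{KL}}(\pi_\theta \,\|\, p)
  = \mathbb{E}_{y \sim \pi_\theta}\big[\log \pi_\theta(y) - \log p(y)\big]
```

Roughly, forward KL punishes the model for missing modes of the data, while reverse KL punishes it for putting mass where the target does not, which matches the intuition that on-policy methods concentrate on a narrower set of high-scoring behaviors.
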
Notes and Reflections

  • Interesting Insights:
    • The formalization of the intuition that "verifying is easier than generating" is a core theme.
    • Even if reward models and policies are theoretically "isomorphic" (can represent the same functions), the difficulty of learning them from finite data can differ significantly.
    • The paper argues that local RMs (as implicitly used in DPO) are akin to Q-functions: optimizing them is close to direct policy optimization and thus does not fully escape the difficulty of learning the generator, unlike a two-stage approach with a global RM that may be simpler to learn (the underlying identity is sketched at the end of these notes).
    • The discussion on information loss via the RM (e.g., reducing rich preference data to scalar rewards) is a critical starting point.
  • Lessons Learned:
    • Subtle differences in experimental setups, especially in how data is processed or generated (like the 25x sampling for online DPO), can have significant impacts on results and their interpretation.
    • Theoretical equivalences often rely on idealized assumptions that may not hold in practical, complex systems like LLMs.
  • Future Directions:
    • Design experiments to explicitly disentangle the effect of the 25x sampling (potential data quality improvement) from the "proper learning on a simpler verifier" aspect of H6. For instance, could one generate such "high-contrast" preference data offline and see if it allows offline DPO to match online DPO performance?
    • Further investigation into the "complexity" (e.g., circuit complexity, learnability) of reward models versus policies in practical LLM fine-tuning scenarios.
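
On the "local RMs are akin to Q-functions" point, the relevant identity is the standard DPO reparameterization (notation may differ from the paper's): the optimal policy of the KL-regularized objective expresses the reward through the policy's own log-probabilities.

```latex
% Optimal policy of the KL-regularized objective for a reward r:
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
      \exp\!\big(r(x, y)/\beta\big)

% Inverting this gives the "implicit" (local) reward that DPO optimizes directly:
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)
```

Because this reward is defined through the policy itself, learning it is essentially as hard as learning the generator, which is the sense in which such local RMs behave like Q-function-style objects, in contrast to a separately parameterized global RM that may be simpler to fit.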