[25.04.12] Inference-Time Scaling for Generalist Reward Modeling

Paper Reading Study Notes

General Information

  • Paper Title: Inference-Time Scaling for Generalist Reward Modeling
  • Authors: Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu
  • Published In: Preprint. Under review. (arXiv:2504.02495v2 [cs.CL])
  • Year: 2025 (Preprint date)
  • Link: https://arxiv.org/abs/2504.02495v2
  • Date of Discussion: 2025.04.12 (Attendees: 허진호, 김훈태)

Summary

  • Research Problem: The paper addresses the challenge of creating accurate and scalable reward models (RMs) for large language models (LLMs) in general domains, where ground truth is often unavailable or complex. Specifically, it focuses on improving RM performance by scaling inference-time computation rather than training compute or model size, and on overcoming limitations of existing methods in input flexibility, accuracy, and inference-time scalability.
  • Key Contributions: The paper introduces:
    1. Self-Principled Critique Tuning (SPCT): A learning method for Pointwise Generative Reward Models (GRMs) that uses rule-based online RL (GRPO) to adaptively generate principles and critiques, aiming for better reward quality.
    2. Inference-Time Scaling: Techniques using parallel sampling and voting (optionally guided by a Meta RM) to improve reward accuracy by using more compute at inference time.
    3. DeepSeek-GRM: Models trained using this approach.
  • Methodology/Approach: The core approach is Pointwise Generative Reward Modeling (GRM), which outputs textual critiques and scores. SPCT involves a rejective fine-tuning cold start (removing incorrect and "too easy" trajectories) followed by rule-based online RL (GRPO) to optimize principle/critique generation. Inference scaling involves sampling k critiques/rewards in parallel and aggregating them via voting (summing scores) or a trained Meta RM (a minimal sketch of this voting step follows this summary).
  • Results: SPCT is shown to improve GRM quality and scalability, outperforming baseline methods on several RM benchmarks with less bias. Inference-time scaling yields significant gains and can outperform training-time model scaling (e.g., a 27B model with inference-time scaling competing with much larger models).
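
A minimal sketch of the inference-time voting step described in the methodology above. The callables `sample_critique` (one GRM pass returning a critique and a per-response score list) and `meta_rm_score` (rating how reliable a sampled critique is) are hypothetical stand-ins, not the paper's actual interface; only the aggregation logic follows the notes.

```python
from typing import Callable, Optional

def vote_rewards(
    sample_critique: Callable[[], tuple[str, list[int]]],    # one GRM pass -> (critique text, score per response)
    k: int = 8,                                               # number of parallel samples
    meta_rm_score: Optional[Callable[[str], float]] = None,   # hypothetical Meta RM call
    k_meta: int = 4,                                          # samples kept when the Meta RM is used
) -> list[int]:
    samples = [sample_critique() for _ in range(k)]  # sequential here; parallel in practice

    if meta_rm_score is not None:
        # Meta-RM-guided voting: keep only the k_meta critiques the Meta RM rates highest.
        samples = sorted(samples, key=lambda s: meta_rm_score(s[0]), reverse=True)[:k_meta]

    n_responses = len(samples[0][1])
    # Voting = summing each response's score across the kept samples.
    return [sum(scores[i] for _, scores in samples) for i in range(n_responses)]
```

Because each individual score comes from a small discrete range, summing over k samples widens the effective reward granularity; for a fixed k, summing and averaging induce the same ranking, which bears on the voting question raised under Key Questions below.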

Discussion Points

  • Strengths:
    • Addresses a relevant and important problem: improving generalist reward models for RLAIF/RLHF (0:27, 1:01).
    • The paper is well-written and sets up the problem clearly (1:01, 1:07).
    • The idea of dynamically generating principles per query is sound (vs. static Constitutional AI) (10:38, 21:25).
    • The inference-time scaling approach shows promising results and offers a different scaling dimension (0:39, Fig 4 discussion).
    • The GRM approach offers flexibility over scalar/pairwise RMs (7:26).
    • The rejection sampling strategy (removing "too easy" examples) was found interesting (22:20, 24:05).
  • Weaknesses:
    • Perceived as less algorithmically novel or impactful compared to prior work like GRPO; more of a framework built on existing ideas (0:03, 0:15, 4:01, 4:22).
    • Some explanations were found confusing, particularly the summation logic for voting (vs. averaging) and details on principle filtering (16:30, 27:26, 29:15, 33:34).
    • Generative RMs are inherently slower than scalar RMs (40:01).
    • Still lags behind scalar RMs on verifiable tasks (e.g., math), though providing a reference answer helps (40:22, Appendix E.1.3).
    • The comparison between DeepSeek-GRM and DeepSeek-R1 regarding token usage might compare different kinds of models (task-finetuned vs. general-purpose) (46:19).
  • Key Questions:
    • How exactly are the self-generated principles filtered or selected (13:04, 15:26, 16:30)? The preliminaries mention filtering principles for alignment with ground truth, but the main method is less explicit about this.
    • What is the precise benefit/rationale for rejecting "all correct" (too easy) trajectories during RFT (24:17)? (Discussion suggested it forces learning on harder examples where ranking is non-trivial; see the filter sketch after this list.)
    • Why present voting aggregation as summing scores (expanding the score range) rather than averaging, which attendees found more intuitive, even though the two are equivalent for ranking at a fixed number of samples (29:15, 33:34)?
    • Is there a clearer correlation between response length changes (Fig 7) and actual quality improvement (48:42)?
  • Applications:
    • Improving RLAIF pipelines by creating better, more scalable reward models.
    • Potential for offline LLM evaluation using the generated principles/critiques (42:03).
    • Foundational work for improving future large models (e.g., potential Llama 3 competitors) (52:03, 52:21).
  • Connections:
    • Builds directly on GRPO (implementation, online RL) (4:01, 27:06).
    • Positions itself relative to RLHF, RLAIF, Constitutional AI (11:37), DPO.
    • Compares against LLM-as-a-Judge, PairRM, scalar RMs (Table 2, Fig 2).
    • The work is situated within DeepSeek's ongoing research (DeepSeek-R1, DeepSeek-V2/V3 mentioned) (0:03, 45:59, 52:03).
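
A minimal sketch of the rejective fine-tuning filter questioned above, under the reading from the notes: per training example, sample several GRM trajectories, keep only those whose judgment matches the ground truth, and drop the example entirely if every trajectory is already correct ("too easy"). `sample_trajectory`, `predicted_best`, and `ground_truth_best` are hypothetical names, not the paper's actual interface.

```python
def rejective_filter(examples, n_samples: int = 4):
    # Collect (example, trajectory) pairs suitable for the fine-tuning cold start.
    kept = []
    for example in examples:
        trajectories = [example.sample_trajectory() for _ in range(n_samples)]
        correct = [t for t in trajectories
                   if t.predicted_best == example.ground_truth_best]

        if len(correct) == len(trajectories):
            continue  # every sample already correct -> "too easy", drop the example
        # Incorrect trajectories are rejected; only correct ones become targets.
        kept.extend((example, t) for t in correct)
    return kept
```

One plausible rationale, echoing the discussion: if every sampled trajectory already gets the example right, keeping it mostly reinforces behavior the cold-start model already has, rather than teaching it to discriminate harder cases.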

Notes and Reflections

  • Interesting Insights:
    • The clear distinction made between improving the reward model (this paper) and improving the policy model directly (as in standard GRPO) (4:22).
    • Inference-time compute scaling can be a very effective alternative/complement to training-time model size scaling (Fig 4 discussion).
    • The idea of rejecting perfectly solved simple problems ("too easy") during training to potentially force learning on harder examples (22:20).
    • Generative RMs provide richer feedback but come with efficiency trade-offs (7:26, 40:01).
  • Lessons Learned:
    • Reward model quality is critical and improving it is an active research area.
    • There are multiple paradigms for reward modeling (scalar, semi-scalar, generative, pointwise, pairwise), each with pros and cons regarding flexibility, scalability, and bias (see the interface sketch at the end of these notes).
    • "Scaling" can refer to training data, model size, or inference compute, each yielding different results.
  • Future Directions:
    • Improving the efficiency of GRMs (40:01, Appendix B).
    • Applying SPCT/GRM to different base models and tasks.
    • Further investigation into the principle generation and filtering mechanism (16:30).
    • Exploring the use of tools (code interpreters, search) to enhance critique accuracy (Appendix B).
    • Using GRMs for more interpretable offline model evaluation (42:03).
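
For the reward-modeling paradigms contrasted in the lessons above, a minimal interface sketch of the scalar vs. pointwise generative distinction. `scalar_rm` and `generate_critique` are hypothetical model calls, and the "Scores:" parsing convention is an assumption for illustration, not the paper's output format.

```python
import re

def scalar_reward(scalar_rm, query: str, response: str) -> float:
    # Scalar RM: one forward pass -> one number. Fast, but no rationale and a
    # fixed (query, single response) input format.
    return scalar_rm(query, response)

def pointwise_generative_reward(generate_critique, query: str, responses: list[str]) -> list[int]:
    # Pointwise GRM: generate a textual critique that ends with per-response
    # scores, then parse the scores back out. Slower, but flexible (any number
    # of responses) and the critique itself is inspectable.
    critique = generate_critique(query, responses)      # free-form text
    score_part = critique.rsplit("Scores:", 1)[-1]      # assumed output convention
    scores = [int(s) for s in re.findall(r"-?\d+", score_part)]
    assert len(scores) == len(responses), "critique must score every response"
    return scores
```

The trade-off noted in the discussion is visible in the signatures: the scalar RM is one cheap forward pass with no rationale, while the GRM accepts any number of responses and returns an inspectable critique at higher inference cost.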
