Mitigating Reward Hacking in RLHF for Robust LLM Alignment

#LLM #RLHF #Reward Hacking #Spurious Correlation

Zerui Xu, Zhaorun Chen

Abstract

Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has proven effective in aligning LLMs with human preferences, it remains susceptible to spurious correlations in reward modeling. Consequently, this approach often introduces biases—such as length bias, sycophancy, conceptual bias, and discrimination—that hinder the model’s ability to capture true causal relationships. To address this issue, we propose a novel causal reward modeling approach that integrates causal inference to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring that reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we demonstrate that our approach effectively mitigates various types of spurious correlations, resulting in a more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling offers a practical means to improve the trustworthiness and fairness of LLM finetuning.

What this project is about

This project addresses a critical challenge in aligning large language models (LLMs) with human values: reward hacking in Reinforcement Learning from Human Feedback (RLHF). In current RLHF systems, reward models—designed to capture human preferences—are often misled by spurious correlations. These unintended shortcuts lead to biases such as sycophancy (overly agreeable responses), length bias (favoring longer outputs), concept bias, and even discrimination. Such biases not only degrade model performance but also undermine trust in real-world applications.

To address this issue, we introduce a simple but effective approach by leveraging causal inference to develop a novel Causal Reward Modeling (CRM) framework. The central idea is to enforce counterfactual invariance [1] in the reward model so that its predictions remain stable even when irrelevant features (e.g., response length or flattering language) vary. This is achieved by integrating Maximum Mean Discrepancy (MMD) regularization into the training objective, effectively penalizing differences in reward predictions across different conditions. The result is a reward model that focuses on true causal relationships, reducing the tendency of models to exploit spurious correlations.

To validate our method, we evaluate it on both synthetic and real-world datasets. Specifically, we test against:

  • Sycophantic bias using semi-synthetic prompts designed to induce agreement-related shortcuts.
  • Length bias with datasets like AlpacaEval to assess whether our model can avoid favoring verbosity.
  • Concept bias and discrimination bias using reformatted sentiment datasets and demographic-specific evaluation sets.

By seamlessly integrating CRM into existing RLHF pipelines, our work offers a practical and scalable solution to enhance the fairness, reliability, and overall performance of LLMs. Ultimately, this project aims to pave the way for more robust AI systems that can faithfully align with human preferences without falling prey to reward hacking.

Progress made so far

Since the original proposal, we have successfully integrated causal inference into the RLHF pipeline by developing a Causal Reward Modeling (CRM) framework. Our approach centers on enforcing counterfactual invariance in the reward model using Maximum Mean Discrepancy (MMD) regularization. This regularizer penalizes shifts in reward predictions when irrelevant features—such as response length or overly agreeable language—change, ensuring the model focuses on true causal relationships rather than spurious associations.

Approach

Main approach

We modify the standard RLHF reward model and propose causal reward modeling (CRM), which incorporates an MMD-based loss term into its training objective. This term minimizes the discrepancy between reward predictions across different bins of spurious factors, effectively reducing biases such as sycophancy, length bias, concept bias, and discrimination.

Specifically, CRM leverages a causal diagram in which the spurious factor is represented as $Z$ (e.g., response length). The key insight is to enforce counterfactual invariance by making the reward model's latent representation $f(T)$ of the input text $T$ independent of $Z$.

In practice, $T$ can be decomposed into three components, $T^{Z,\perp}$, $T^{Z\wedge L}$, and $T^{L,\perp}$, based on their causal relationships with $Z$ and the preference label $L$. Since $T^{Z,\perp}$ is unaffected by $Z$ yet still informative about $L$, restricting the reward model to these invariant features mitigates spurious correlations.

Therefore, we propose to employ Maximum Mean Discrepancy (MMD) to ensure distributional invariance across different values (or bins) of $Z$. Concretely, MMD measures the difference between distributions $P$ and $Q$:

$$ \mathrm{MMD}(P, Q; \mathcal{H}_k) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right)^2, \tag{1} $$

where $\mathcal{F}$ denotes the unit ball of the reproducing kernel Hilbert space $\mathcal{H}_k$ induced by the kernel $k$.
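In practice, the supremum in Eqn. (1) is not computed directly; the standard kernel trick gives a closed-form sample estimate of the (squared) MMD. Below is a minimal PyTorch sketch of this estimator with an RBF kernel; the biased (V-statistic) form and the fixed bandwidth are our own simplifications, not details prescribed by the method.

```python
import torch

def rbf_kernel(a: torch.Tensor, b: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    diff = a.unsqueeze(1) - b.unsqueeze(0)        # (n, m, d)
    sq_dist = diff.pow(2).sum(dim=-1)             # (n, m)
    return torch.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(x: torch.Tensor, y: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Biased (V-statistic) estimate of MMD^2 between samples x ~ P and y ~ Q.

    x: (n, d) samples from P; y: (m, d) samples from Q. For scalar rewards, d = 1.
    """
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy
```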

By penalizing discrepancies in reward predictions across bins of $Z$, the model is guided toward an invariant representation. The overall objective combines a Bradley–Terry (BT) reward loss with the MMD penalty:

$$ -\mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}}\left[\log\sigma\big(r_\phi(x,y_w)-r_\phi(x,y_l)\big)\right] + \lambda \sum_{m,m'\in[M]} \mathrm{MMD}\big(p_m(r(x,y)),\, p_{m'}(r(x,y))\big), \tag{2} $$

where $\sigma(x) = 1 / (1 + e^{-x})$ and $p_m$ denotes the conditional distribution of $r(x,y)$ within the $m$-th bin of $Z$. Consequently, by augmenting the original BT-based learning objective with the additional MMD regularization, Eqn. (2) effectively enforces counterfactual invariance, reducing biases caused by spurious factors.
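The following is a minimal sketch of how Eqn. (2) can be turned into a training loss, reusing the hypothetical `mmd2` estimator above. The function names, the discrete bin indices for $Z$, and the pooling of chosen and rejected responses (the unconditional setting described later) are illustrative assumptions, not our exact training code.

```python
import torch
import torch.nn.functional as F

def mmd_penalty_over_bins(rewards: torch.Tensor, z_bins: torch.Tensor) -> torch.Tensor:
    """Sum of MMD^2 between the reward distributions of every pair of Z-bins."""
    penalty = rewards.new_zeros(())
    bins = torch.unique(z_bins)
    for i, m in enumerate(bins):
        for m_prime in bins[i + 1:]:               # each unordered pair (m, m') once
            p_m = rewards[z_bins == m].unsqueeze(-1)
            p_m_prime = rewards[z_bins == m_prime].unsqueeze(-1)
            if len(p_m) > 1 and len(p_m_prime) > 1:
                penalty = penalty + mmd2(p_m, p_m_prime)
    return penalty

def crm_loss(r_chosen, r_rejected, z_chosen, z_rejected, lam: float = 0.1):
    """Bradley-Terry loss plus lambda times the pairwise MMD penalty (Eqn. (2)).

    r_chosen, r_rejected: (B,) rewards r_phi(x, y_w) and r_phi(x, y_l).
    z_chosen, z_rejected: (B,) integer bin indices of the spurious factor Z.
    """
    bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Pool chosen and rejected responses before binning (unconditional variant).
    rewards = torch.cat([r_chosen, r_rejected])
    z_bins = torch.cat([z_chosen, z_rejected])
    return bt_loss + lam * mmd_penalty_over_bins(rewards, z_bins)
```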

We evaluate this method for RLHF reward modeling on four bias-related tasks (sycophancy, length, concept, and discrimination biases). Experiments on both synthetic and real-world datasets validate that CRM significantly reduces these biases while preserving overall model performance.

Baselines

For comparison, we use the vanilla RLHF reward model that lacks any causal regularization. Our experimental results indicate that both unconditional and conditional variants of CRM outperform the baseline in mitigating reward hacking. For instance, our tests on semi-synthetic sycophantic prompts, AlpacaEval for length bias, and reformatted sentiment datasets for concept and discrimination biases consistently show lower bias metrics and improved alignment with human preferences.

Novelty

The originality of our work lies in directly addressing spurious correlations through causal regularization—a strategy not commonly applied in RLHF. By enforcing counterfactual invariance, our approach focuses the reward model on true causal relationships, leading to more reliable and fair model alignment. This innovative integration of causal inference into reward modeling paves the way for creating AI systems that align more faithfully with human values.

Experiments

We evaluate the effectiveness of the proposed CRM in mitigating biases across four different scenarios: sycophantic bias, length bias, concept bias, and discrimination bias. Our experiments compare conditional and unconditional CRM variants against a vanilla reward model (RM).

Data

We use multiple datasets tailored to the biases under investigation:

  • Sycophantic Bias: A semi-synthetic dataset adapted from [2], where responses are artificially correlated with agreement phrases (e.g., "Yes, you are right.").
  • Length Bias: The Alpaca dataset [3], used to examine the relationship between response length and model preference.
  • Concept Bias: Yelp, IMDB, and Amazon Shoe Review datasets [4,5,6] with additional concept labels [7].
  • Discrimination Bias: The Anthropic HH-RLHF dataset (Bai et al., 2022), filtered for demographic attributes, with the Discrim-eval dataset [8] used for evaluation.

Each dataset is formatted to align with preference-based learning by structuring prompts with chosen and rejected responses.
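For concreteness, the snippet below illustrates the kind of preference record we assume after reformatting; the field names (and the precomputed `z_bin` index for the spurious factor) are our own illustrative choices rather than a fixed schema.

```python
# One illustrative preference record after reformatting (field names are hypothetical).
example = {
    "prompt": "Is the following review positive or negative? ...",
    "chosen": "The review is negative: the customer complains about ...",
    "rejected": "Yes, you are right, it is positive.",  # e.g., a sycophantic shortcut
    "z_bin": 2,  # assumed: bin index of the spurious factor Z (e.g., response-length bin)
}
```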

Evaluation Method

We use bias-specific evaluation metrics:

  • Sycophantic Bias: Percentage of test prompts for which all 50 generated responses exhibit sycophantic behavior.
  • Length Bias: Win rate based on the proportion of responses outperforming the SFT model, computed as $\text{Score} = 50 + \frac{n_{\text{win}} - n_{\text{lose}}}{N} \times 100$; we additionally analyze response rankings as a function of length (a small numeric example follows this list).
  • Concept Bias:
    • Acc@C / Acc@NoC: Accuracy with and without the concept present.
    • Bias@C: Measures the strength of the spurious correlation; values closer to zero indicate lower bias.
  • Discrimination Bias:
    • Explicit / Implicit Bias Scores: Mixed-effects regression coefficients for demographic variables.
    • General Utility: Win rate against the vanilla PPO model, evaluated by GPT-4.
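As a quick numeric illustration of the length-bias score above (a sketch, not our evaluation code):

```python
def length_bias_score(n_win: int, n_lose: int, n_total: int) -> float:
    """Score = 50 + (n_win - n_lose) / N * 100; 50 means parity with the SFT model."""
    return 50 + (n_win - n_lose) / n_total * 100

# Example: 120 wins, 60 losses, 20 ties out of N = 200 comparisons -> 80.0
print(length_bias_score(120, 60, 200))
```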

Experimental Details

All models are initialized from Llama-3 8B and fine-tuned using a supervised fine-tuning (SFT) pipeline. The reward models are trained using chosen/rejected pairs, followed by PPO training via OpenRLHF[9].

Specifically, we investigate two variants of CRM, distinguished by whether the regularization is applied separately to or jointly across the chosen and rejected subsets; a minimal sketch contrasting the two follows the list below.

  • Conditional CRM: applies independence regularization separately to chosen and rejected response subsets, explicitly disentangling spurious correlations.

  • Unconditional CRM: enforces independence across all responses without distinguishing chosen and rejected subsets, balancing bias mitigation with overall model utility.
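Continuing the sketch from the Approach section (and reusing the hypothetical `mmd_penalty_over_bins` helper defined there), the two variants differ only in how responses are grouped before the MMD penalty is computed; this is our reading of the two settings, not the exact training code.

```python
# Assumes torch and mmd_penalty_over_bins from the earlier sketch are in scope.

def unconditional_penalty(r_chosen, r_rejected, z_chosen, z_rejected):
    """Unconditional CRM: one penalty over all responses pooled together."""
    return mmd_penalty_over_bins(torch.cat([r_chosen, r_rejected]),
                                 torch.cat([z_chosen, z_rejected]))

def conditional_penalty(r_chosen, r_rejected, z_chosen, z_rejected):
    """Conditional CRM: separate penalties within the chosen and rejected subsets."""
    return (mmd_penalty_over_bins(r_chosen, z_chosen)
            + mmd_penalty_over_bins(r_rejected, z_rejected))
```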

Results

Sycophantic Bias

Table 1 shows that Conditional CRM significantly reduces sycophantic behavior (19.78%) compared to Vanilla RM (92.67%) and Unconditional CRM (62.64%). This suggests that conditional CRM effectively disentangles spurious correlations between agreement and correctness.

Table 1 Results on semi-synthetic sycophantic dataset. The conditional CRM outperforms other methods. Bold values indicate the best performance. Results are averaged over three runs of PPO.

| Model | Average Percentage (%) |
| --- | --- |
| Vanilla RM | 92.67 |
| Conditional CRM | 19.78 |
| Unconditional CRM | 62.64 |

Length Bias

Figure 1 illustrates performance on length bias. Both conditional and unconditional CRM outperform the vanilla RM with a length penalty, as seen in both the EMA curves and the Pareto-front analysis. Models trained with higher regularization coefficients assign higher ranks to shorter responses, reducing the bias toward verbosity.

Figure 1 Results on Length Bias, where each dot represents models trained with different regularization coefficients and PPO hyperparameters. The leftmost figure displays the results as an exponential moving average (EMA) curve, the middle plot illustrates the Pareto front, and the rightmost figure shows the correlation between length and rank based on reward values for different causal reward models.

Concept Bias

Table 2 demonstrates that CRM consistently reduces concept bias across Yelp, IMDB, and Amazon datasets. The conditional CRM reduces Bias@C values by up to 97% (e.g., "Price" concept in Yelp). However, unconditional CRM achieves higher Acc@C and Acc@NoC, indicating a trade-off between bias reduction and utility preservation.

Table 2 Model performance after fine-tuning with PPO using both the vanilla and the proposed causal reward models across concept-biased Yelp, IMDB, and Amazon Shoe Review datasets. Bold values indicate the best performance.

Yelp

| Model | Price Acc@NoC | Price Acc@C | Price Bias@C | Service Acc@NoC | Service Acc@C | Service Bias@C | Food Acc@NoC | Food Acc@C | Food Bias@C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla RM | 59.26 | 71.47 | 18.88 | 69.09 | 71.43 | -15.54 | 78.77 | 67.48 | 7.31 |
| Conditional CRM | 97.22 | 99.04 | 0.52 | 99.45 | 97.56 | -0.61 | 97.77 | 99.09 | 0.71 |
| Unconditional CRM | 94.44 | 98.35 | 6.86 | 98.18 | 97.21 | -3.56 | 98.88 | 97.57 | -0.86 |

IMDB

| Model | Music Acc@NoC | Music Acc@C | Music Bias@C | Acting Acc@NoC | Acting Acc@C | Acting Bias@C | Comedy Acc@NoC | Comedy Acc@C | Comedy Bias@C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla RM | 77.78 | 73.98 | 13.49 | 75.54 | 71.81 | -20.94 | 69.93 | 75.78 | 20.09 |
| Conditional CRM | 68.89 | 55.73 | 2.86 | 54.84 | 60.64 | -7.68 | 58.04 | 56.35 | 7.99 |
| Unconditional CRM | 88.89 | 88.35 | 9.52 | 89.52 | 86.17 | -13.24 | 85.31 | 89.45 | 12.41 |

Amazon Shoe Review

| Model | Size Acc@NoC | Size Acc@C | Size Bias@C | Color Acc@NoC | Color Acc@C | Color Bias@C | Style Acc@NoC | Style Acc@C | Style Bias@C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla RM | 76.17 | 54.08 | -4.05 | 63.88 | 72.47 | 15.48 | 38.30 | 74.35 | -10.16 |
| Conditional CRM | 79.95 | 85.87 | -2.37 | 84.58 | 80.73 | 2.45 | 87.94 | 80.64 | -0.70 |
| Unconditional CRM | 73.89 | 53.26 | -1.58 | 62.56 | 70.41 | 3.93 | 38.30 | 72.20 | -1.49 |

Discrimination Bias

As shown in Table 3 and Figure 2, CRM significantly lowers discrimination scores for both explicit and implicit bias cases. Notably, unconditional CRM achieves the lowest implicit bias score (0.107) and the lowest overall average (0.058). The win-rate analysis confirms that CRM mitigates bias without degrading general performance.

Table 3 Discrimination evaluation over a diverse set of both explicit and implicit discrimination scenarios using the Discrim-eval dataset [8]. The scores are the mixed-effects coefficients for each demographic variable, where lower values indicate less discrimination. Bold values indicate the best performance.

| Model | Explicit Gender | Explicit Race | Explicit Age | Explicit Avg | Implicit Gender | Implicit Race | Implicit Age | Implicit Avg | Overall Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SFT | 0.003 | 0.002 | 0.015 | 0.007 | 0.227 | 0.251 | 0.523 | 0.334 | 0.171 |
| Vanilla RM | 0.032 | 0.016 | 0.007 | 0.018 | 0.181 | 0.230 | 0.261 | 0.224 | 0.121 |
| Conditional CRM | 0.008 | 0.002 | 0.018 | 0.009 | 0.264 | 0.181 | 0.060 | 0.158 | 0.084 |
| Unconditional CRM | 0.009 | 0.002 | 0.018 | 0.009 | 0.070 | 0.213 | 0.036 | 0.107 | 0.058 |


Figure 2 Comparison of the discrimination and utility performance of CRMs on the hh-rlhf dataset, in both conditional and unconditional settings, with different MMD coefficients. A larger coefficient indicates a higher weight on the MMD loss. We evaluate discrimination scores for both explicit and implicit discrimination types, and the win rate is evaluated by GPT-4o against the vanilla RM.

Conclusion

In this project, we investigated causal reward modeling (CRM), a novel method for mitigating the spurious correlations that misalign LLMs with human preferences. By incorporating counterfactual invariance into reward learning, CRM reduces biases such as sycophancy, length bias, concept bias, and discrimination. Extensive experiments on synthetic and real-world datasets demonstrate its effectiveness in improving fairness, reliability, and trustworthiness across tasks. Seamlessly integrating into existing RLHF workflows, CRM enhances LLM alignment without added complexity. As LLMs expand into sensitive applications, ensuring ethical and unbiased behavior is crucial. Our work bridges causality and reward modeling, paving the way for future research on broader domains, deeper causal structures, and refined regularization techniques.

Reference

[1] Veitch, V., D'Amour, A., Yadlowsky, S., & Eisenstein, J. (2021). Counterfactual invariance to spurious correlations in text classification. Advances in neural information processing systems, 34, 16196-16208.

[2] Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., ... & Perez, E. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548.

[3] Dubois, Y., Li, C. X., Taori, R., Zhang, T., Gulrajani, I., Ba, J., ... & Hashimoto, T. B. (2023). Alpacafarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems, 36, 30039-30069.

[4] Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text classification. Advances in neural information processing systems, 28.

[5] Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies (pp. 142-150).

[6] He, R., & McAuley, J. (2016, April). Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web (pp. 507-517).

[7] Zhou, Y., Xu, P., Liu, X., An, B., Ai, W., & Huang, F. (2023). Explore spurious correlations at the concept level in language models for text classification. arXiv preprint arXiv:2311.08648.

[8] Tamkin, A., Askell, A., Lovitt, L., Durmus, E., Joseph, N., Kravec, S., ... & Ganguli, D. (2023). Evaluating and mitigating discrimination in language model decisions. arXiv preprint arXiv:2312.03689.

[9] Hu, J., Wu, X., Zhu, Z., Wang, W., Zhang, D., & Cao, Y. (2024). Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143.