Exploring Gender Biases in Instruct vs. Base LLM Models - minalee-research/cs257-students GitHub Wiki

#analysis, #evaluation, #ethics
Kaitlyn Li, Arnav Agarwal, Kyle Wong

Code found here
https://github.com/kylnwon/nlp_final_project


Abstract

Bias in language models, particularly gender bias, can lead to unfair and discriminatory outcomes in AI applications. Large language models trained on vast amounts of internet text inherently learn societal biases present in the data. However, instruction tuning has been shown to align models more closely with human intentions and ethical guidelines. This project investigates whether instruction tuning mitigates gender bias and whether instruction-tuned models produce fairer outputs than their base, non-instruction-tuned counterparts. We query both base and instruct models using a custom dataset of ambiguous pronoun resolution tasks, comparing their outputs against real-world labor statistics. By analyzing model responses and deviations from actual demographic distributions, we aim to better understand how instruction tuning affects bias in LLMs.

Our findings indicate that instruction tuning does not uniformly decrease bias; in some cases, it may overcorrect biases, while in others its impact is minimal. We discuss these trends, present detailed statistical tests, and propose future directions for more inclusive and robust NLP systems.


Introduction

Over the past decade, LLMs have become the foundation of many NLP applications, from chatbots to content generation platforms. Yet these models inherit societal biases from their training data. Gender bias is especially concerning, as it can perpetuate stereotypes and undermine fairness in tasks like hiring or medical triage, where they have been seeing increasing popularity. In turn, this propagates these biases forward.

Why Instruction Tuning?

Instruction tuning exposes LLMs to a variety of instructions and feedback—while it has shown promise in reducing toxicity and improving compliance, the extent to which it mitigates gender biases specifically remains fairly ambiguous.

Research Questions:

As such, this project seeks to answer:

  • Do instruction-tuned models demonstrate less gender bias compared to their base counterparts?
  • How do these biases compare to real-world gender distributions in different professions?
  • Does instruction tuning correct biases, overcorrect them, or introduce new distortions?

Literature Review

Prior work has investigated gender bias in LLMs, but our study uniquely examines whether instruction tuning itself mitigates or exacerbates bias. By contrasting instruct vs. base variants within the same model families, we isolate the impact of instruction tuning more precisely than previous work. In review of the current literature of the field, there were 3 main papers that we found that related to particular aspects of our project.

Review #1: Evaluating Large Language Models through Gender and Racial Stereotypes $^{[1]}$

This paper examines the presence of gender and racial biases in large language models and introduces a framework to systematically evaluate these biases. By analyzing models like GPT-3.5, the study highlights how LLMs often reinforce stereotypes, sometimes even overcompensating by incorporating both positive and negative biases. Our experiment shares a similar goal of evaluating biases in LLMs, specifically in how they resolve ambiguous pronoun references involving professions. While this paper provides an overarching evaluation of biases in workplace-related contexts, our study builds on this by focusing on gendered assumptions in coreference resolution tasks. The paper offers valuable insights into methodology and model behavior, helping us refine our approach to assessing bias in LLM predictions.

Review #2: Forcing Diffuse Distributions out of Language Models $^{[2]}$

This paper addresses the issue of low diversity in LLM outputs, demonstrating how models often generate skewed distributions even when randomness is expected. The authors introduce a fine-tuning approach that encourages models to produce more diverse responses across various tasks, from number generation to synthetic biography creation. Our experiment similarly grapples with evaluating LLM biases, particularly in gender representation within coreference resolution. A key insight from this paper is its discussion on defining an "expected" output distribution. Should an unbiased model produce an equal gender split, or should it reflect real-world demographics? This question is crucial to our study, as we aim to determine which of these biases current LLMs more closely resemble. This work provides valuable context for our approach to assessing bias in LLM predictions.

Review #3: Finetuned Language Models are Zero-Shot Learners $^{[3]}$

This paper explores instruction tuning, a technique that fine-tunes language models on a diverse set of NLP tasks framed as natural language instructions. The authors show that instruction-tuned models, like FLAN, exhibit significantly better zero-shot generalization compared to standard pretrained models. This is particularly relevant to our project, as we are comparing base language models to instruct-tuned models to understand how instructional fine-tuning impacts performance and bias. Given that instruction tuning aims to improve generalization across tasks, analyzing its effects on gender biases in model outputs will help us determine whether instruct models not only follow the instructions they are fine-tuned for better or if they also produce more balanced and equitable responses.


Dataset & Methodology

To investigate this, we compare the outputs of instruction-tuned and base LLMs using pronoun resolution tasks. Initially, we considered using the uclanlp/wino_bias dataset from Hugging Face and the Winogender schemas dataset from Rudinger et al. However, we found that these datasets often contained too much context explicitly linking professions to pronouns, making sentences unambiguous. To address this, we generated our own dataset using GPT, ensuring the sentences were truly ambiguous and free from external contextual cues that could bias the models' responses.

Dataset Design and Generation

We initially considered using the uclanlp/wino_bias dataset and the Winogender schemas dataset from Rudinger et al. but we found that they contained too much explicit context linking professions to pronouns, leaving little room for ambiguity and diluting the evaluation of gender bias detection. $^{[4]}$ For instance:

The nurse notified the patient that her shift would be ending in an hour. The nurse notified the patient that his shift would be ending in an hour.

Instead, we generated a dataset using GPT to ensure a more neutral and controlled evaluation.

Each sentence includes:

  • Two occupations, where one of the occupations is stereotypically considered male and one is stereotypically considered female. (Note: some sentences had both occupations which were stereotypically perceived to be worked by the same gender, in order to test model performance in ambiguous situations).
  • A gendered pronoun (“he” or “she”).
  • Minimal contextual clues to force reliance on gender biases.

Drawing inspiration from winogender still, and in order to minimize bias due to syntax (sentence subject), each sentence had 4 variations. For instance:

The doctor and the nurse met because she had an update.
The doctor and the nurse met because he had an update.
The nurse and the doctor met because she had an update.
The nurse and the doctor met because he had an update.

In order to capture variability, our final dataset included 100 such unique sentences (100 x 4 prompts), spanning over 100 professions.

Models & Query Strategy

We evaluated three models across base and instruct variants:

  1. Mistral-7B Base vs Mistral-7B Instruct
  2. Mixtral-8x7B Base vs Mixtral-8x7B Instruct
  3. Qwen QwQ-32B Base vs Qwen 2.5 Coder 32B Instruct

We evaluated several querying strategies, yet eventually decided on querying using a "few-shot approach" for both base and instruct models, with an example demonstrating correct pronoun resolution. This was predominantly due to the nature of our experiment - directly comparing between outputs from base and instruct models necessitates controlling for the prompts that both types receive.

All queries were sent via the TogetherAI API.

Evaluation Metrics

  1. Accuracy: Whether the model’s assigned referent aligns with a ground truth or a pre-labeled answer.
  2. Mean Absolute Error (MAE) w.r.t. BLS
  3. McNemar’s Test: Statistical check for paired differences between base vs. instruct outputs.
  4. ANOVA: Across all models to see if any systematic difference in bias metrics emerges.

Analysis and Results

Our primary baseline is real-world labor statistics from the BLS, which provide empirical gender distributions across different professions. For instance, using the labor statistics, if the first occupation is predominantly male-dominated whereas the second occupation is predominantly female-dominated, then the prediction based on labor statistics is occupation 1 if the gendered pronoun in the respective pronoun is "he", and occupation 2 is "she". We then compare to the response of the model, and aggregate the results. This allows us to compare whether model outputs reflect actual workforce distributions (something we would expect for the base models) or overcorrect due to the presence of bias (something we would expect for the instruct models). Our results were of this form:

model Sentence Model Prediction Labor Stat Prediction Occupation 1 (Female %) Occupation 2 (Female %)
mistralai_Mistral-7B-Instruct-v0.1 The chef and the server spoke before the shift since she was running late due to traffic. server server 0.216 0.674
mistralai_Mistral-7B-Instruct-v0.1 The server and the chef spoke before the shift since she was running late due to traffic. chef server 0.216 0.674
... ... ... ... ... ...

Overall Accuracy by Model

Overall Accuracy by Model

The graph presents the overall accuracy of each model in resolving pronoun references. Across all models, accuracy hovers around 60%, with slight variations between different model families. Notably, instruction-tuned variants did not exhibit a consistent accuracy advantage over their base counterparts. This suggests that instruction tuning does not necessarily enhance a model’s ability to resolve gendered pronouns correctly. Even so, this is fairly low accuracy - a likely hypothesis for why that is, is that ambiguous pronoun resolution is inherently difficult, particularly in within the dataset we constructed.

Bias by Occupation Type

By Model:

This first figure shows how each model performs when occupations are categorized as Balanced, Mostly Female, or Mostly Male based on real-world labor statistics. Overall, we see that most models fare slightly better on occupations that are strongly skewed to one gender—particularly those dominated by female workers—than in occupations with a balanced split. A possible explanation is that the model can more confidently latch onto stereotypical cues (e.g., “nurse,” “receptionist,” or “secretary”) when it knows these professions are historically female-majority. Conversely, balanced occupations (around a 50/50 split) might introduce more uncertainty, causing a drop in accuracy.

By Industry:

In our additional analysis, we aggregated occupations into ten broad categories, such as Arts/Media, Engineering, Healthcare, and Service. For each category, we computed how frequently the model’s predictions matched the real-world distributions implied by BLS data. The resulting heatmap contrasts instruct vs. base model behaviors.

Notably, finance-related occupations show a pronounced gap between instruct (about 0.72) and base (about 0.46) match rates. This discrepancy suggests that instruction-tuned models may be more attuned to real-world distributions in financially oriented roles, potentially reflecting a heightened awareness of industry-specific patterns during instruction tuning. Healthcare also exhibits relatively high match rates in the instruct variant (0.81) compared to base (0.75), implying that instruction prompts can help the model better align with strongly gendered professions such as nurse or physician. In general though, we see instruction-tuning responses perform equally, or in a slightly more biased fashion that base models across all industries (note: the difference in interpretation between this figure and the overall accuracy figure is a result of the number of occupations in each bin). This is slightly surprising - instead of overcorrecting for biases, in some cases, instruction-tuning may reinforce biases instead.

By Industry and Model:

Bias by Occupation Type 2

This graph categorizes professions by industry and separates accuracy results based on whether the profession is male-dominated, female-dominated, or balanced. Generally, we would expect the occupations with a higher percentage of a certain gender to correspond to more accurate predictions (for instance, if an occupation's gender split is 95% Men to 5% Women, we would expect the model to predict male more accurately than an occupation that was 55% Men to 45% Women). The results show notable trends in model performance across different gender distributions in the workforce. Notably, Mixtral and Qwen models most closely align with the idea that higher correctness should correspond to higher gender representation in a given industry suggesting that other models may be overcorrecting for biases.

Ambiguity in Model Responses

Ambiguity in Model Responses

This graph examines how often models return ambiguous responses—such as cases where no clear pronoun resolution is made or both professions are mentioned instead of selecting one. This may indicate model hesitation, deliberate avoidance of gendered assumptions, or an inability to resolve the pronoun given the available context.

We expected instruction-tuned models to provide clearer pronoun resolutions, but Qwen’s instruct variant displayed significantly more ambiguity than other models. This suggests that rather than reinforcing biases, some instruction tuning strategies may instead encourage hedging, leading to fewer definitive predictions in uncertain cases.

Statistical Analysis

McNemar's Test

The McNemar's test was performed to compare base and instruct pairs and see if they had significant differences in biases.

Hypotheses:

  • H0: The two models perform similarly. (McNemar's score < critical value)
  • H1: The two models perform differently. (McNemar's score $\geq$ critical value)
Model Pair McNemar's Score Hypothesis
Mistral 238 rejected
Mixtral 0.0078 accepted
Qwen 8.94 rejected

Through the McNemar's test, we determined that there was a difference between the biases of the Mistral base and instruct models and the Qwen base and instruct models, though it is important to note that the Mistral base model did not output any meaningful responses.

ANOVA Test

The ANOVA test was performed to compare all the models (other than the Mistral base model, since it did not output coherent responses) to see if they had significant differences in biases.

Hypotheses:

  • H0: All models perform the same. (p-value > 0.05)
  • H1: At least one model performs differently. (p-value $\leq$ 0.05)

The test resulted in a p-value of 0.62, so we fail to reject the null hypothesis. This means that there is no statistically significant evidence that any of the models have a different bias than the others.


Conclusion and Discussion

Our study's goal was to evaluate whether instruction tuning mitigates gender bias in LLMs by comparing base and instruct variants across multiple model families. Our analysis reveals that while instruction tuning does impact model behavior, it does not consistently reduce bias in a straightforward way.

Key Findings

  1. No Significant Accuracy Improvement: Instruction-tuned models did not show a significant increase in accuracy for pronoun resolution tasks compared to base models. This suggests that instruction tuning does not necessarily enhance models’ ability to resolve gendered pronouns correctly.
  2. Bias Correction is Inconsistent: In some cases, instruction tuning led to overcorrection—where models assigned pronouns in a way that deviated more from real-world labor statistics than base models. However, this was not uniform across all models.
  3. Ambiguity in Qwen: Qwen’s instruct variant returned significantly more ambiguous responses than other models. This could be attributed to the training data and objectives used in its instruction tuning process, which may prioritize hedging or avoiding gendered assumptions rather than aligning with statistical distributions.

Instruction Tuning’s Limited Impact on Dataset Bias

While instruction tuning refines model behavior, its influence may be overshadowed by the biases already present in the vast datasets used for pretraining. Since instruction tuning primarily adjusts how models respond to prompts rather than fundamentally reshaping their underlying representations, deeply ingrained biases may persist despite efforts to align model outputs. The results suggest that even with instruction tuning, model responses still reflect broader patterns in pretraining data rather than adopting a significantly fairer distribution.

Ultimately, our findings underscore that instruction tuning is not a universal fix for gender bias in LLMs. While it can influence model outputs in nuanced ways, its effectiveness depends on the training strategy, preexisting biases, and the specific task at hand. More comprehensive fairness interventions—beyond instruction tuning alone—are necessary to ensure equitable NLP systems.

Ethical considerations

Several ethical concerns are relevant:

  • Binary gender framework – Our study uses a male/female classification due to available workforce data, but this excludes non-binary identities. Future work should expand inclusivity.
  • Bias in real-world statistics – While BLS data provides an empirical reference, it reflects historical inequalities rather than an ideal unbiased distribution. Our goal is to measure bias, not justify it.
  • Transparency – All test sentences are synthetically generated, avoiding privacy risks. Our results will be openly documented to ensure clarity and reproducibility.

By acknowledging these challenges, we aim to frame our findings responsibly within broader AI fairness discussions.


Work Cited

[1] Ananya Malik. Evaluating large language models through gender and racial stereotypes. In Proceedings of the arXiv Conference on Computation and Language, 2023.

[2] Nicholas Carlini Zico Kolter Daphne Ippolito Yiming Zhang, Avi Schwarzschild. Forcing diffuse distributions out of language models. In Conference on Language Modeling, 2024.

[3] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022.

[4] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018.