
🛠️ Fine-Tuning

Overview

Fine-tuning is a potential next step in adapting large language models (LLMs) for specialized use within the LLMaven platform. At the Scientific Software Engineering Center (SSEC), we are exploring the value of fine-tuning a model or set of models specifically for Research Software Engineers (RSEs). These models could act as dedicated agents, guardrails, or both—integrating seamlessly within the broader agentic system.

🔍 Why Fine-Tune?

Research Software Engineering presents a unique set of challenges that general-purpose models often handle poorly. Domain-specific language, workflows, and expectations call for tailored model behavior. By fine-tuning, we aim to:

  • Improve task relevance and specificity
  • Provide more accurate and context-aware outputs
  • Create agents that better reflect the values and patterns of RSE work

We are actively discussing which use cases would maximize impact.

🧪 Current Exploration and Planning

We are currently in the exploratory phase of this effort. No fine-tuning has begun yet. Our discussions so far include:

  • Initial Idea: Fine-tune an LLM on Research Software Engineering tasks, allowing the model to serve as a standalone agent or as a guardrail.
  • Critical Task Mapping: Identify what kinds of tasks and outputs we expect from an RSE-aware LLM (e.g., summarization, semantic diffs, code review suggestions).
  • Data Source Strategy (see the sketch after this list):
      • Use GitHub issues as inputs
      • Use associated pull requests as outputs
      • Apply filtering/cleaning to ensure quality and relevance
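To make the data strategy concrete, here is a minimal sketch of how issue/PR pairs might be collected, assuming the public GitHub REST API accessed with the `requests` library. The repository name, the closing-keyword heuristic for linking issues to PRs, and the length-based quality filter are illustrative placeholders, and whether the output should be the PR description, the diff, or both is still an open question.

```python
# Sketch: building (issue -> pull request) training pairs from a GitHub repo.
# Assumes the public GitHub REST API and the `requests` library; the repo name
# and filtering thresholds are placeholders. Unauthenticated calls are rate-limited.
import re
import requests

API = "https://api.github.com"
REPO = "uw-ssec/llmaven"  # placeholder repository

# Heuristic: PRs that say "closes/fixes/resolves #N" are linked to issue N.
CLOSES = re.compile(r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)", re.I)

def fetch(path, params=None):
    resp = requests.get(f"{API}{path}", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

def build_pairs(repo=REPO, max_pages=5):
    """Pair issue text (input) with the PR that closed it (output)."""
    pairs = []
    for page in range(1, max_pages + 1):
        pulls = fetch(f"/repos/{repo}/pulls",
                      {"state": "closed", "per_page": 100, "page": page})
        if not pulls:
            break
        for pr in pulls:
            if not pr.get("merged_at"):
                continue  # only merged PRs count as accepted outputs
            for issue_number in CLOSES.findall(pr.get("body") or ""):
                issue = fetch(f"/repos/{repo}/issues/{issue_number}")
                if "pull_request" in issue:
                    continue  # skip PRs that show up as issues
                pairs.append({
                    "input": f"{issue['title']}\n\n{issue.get('body') or ''}",
                    # Using the PR description as the output; the diff could be used instead.
                    "output": pr.get("body") or "",
                    "pr_url": pr["html_url"],
                })
    # Basic quality filter: drop pairs with very short inputs or outputs.
    return [p for p in pairs if len(p["input"]) > 80 and len(p["output"]) > 40]
```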

🧱 Evaluating Base Models

Carlos proposed starting by evaluating a small set of open-source base models to understand how they perform on representative tasks. This includes:

  • Selecting general-purpose models and testing them on RSE-aligned input/output tasks
  • Comparing their performance to how an RSE might approach the problem
  • Conducting human evaluations to identify where models succeed or fall short

If deficiencies are found, fine-tuning will be considered to improve specific areas.
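As a starting point for this evaluation, the sketch below collects side-by-side generations from a few open models on RSE-style prompts for later human review. It assumes the Hugging Face `transformers` library; the model names and prompts are placeholders rather than a settled shortlist.

```python
# Sketch: side-by-side generation from a few open base models on RSE-style tasks,
# to be scored by human reviewers. Model names and prompts are illustrative only.
import json
from transformers import pipeline

CANDIDATE_MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.3",   # example candidates only
    "meta-llama/Llama-3.1-8B-Instruct",
]

RSE_TASKS = [
    "Summarize this GitHub issue for a maintainer: <issue text>",
    "Suggest a code review comment for this diff: <diff>",
]

def collect_outputs(models=CANDIDATE_MODELS, tasks=RSE_TASKS):
    """Generate one completion per (model, task) pair for later human scoring."""
    rows = []
    for name in models:
        generator = pipeline("text-generation", model=name, device_map="auto")
        for prompt in tasks:
            out = generator(prompt, max_new_tokens=256, do_sample=False)
            rows.append({"model": name, "prompt": prompt,
                         "output": out[0]["generated_text"]})
    return rows

if __name__ == "__main__":
    print(json.dumps(collect_outputs(), indent=2))
```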

🧠 Inspiration from Prior Work

Carlos previously fine-tuned a domain-specific embedding model in a B2B sales setting, which improved the relevance of word associations significantly. The same logic may apply here:

  • In general-purpose models, the word “deck” might relate to “patio” or “cards”
  • In the domain-specific version, “deck” mapped to “slides” or “presentation”

We believe scientific software engineering has a similarly unique vocabulary and structure that could benefit from domain tuning.
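The shift in word associations can be illustrated with a small embedding comparison, sketched below under the assumption that the `sentence-transformers` library is used. The general-purpose model named here is real; the domain-tuned model name is a hypothetical placeholder.

```python
# Sketch: how domain tuning can shift nearest-neighbor word associations.
# Assumes `sentence-transformers`; "our-org/b2b-sales-embeddings" is hypothetical.
from sentence_transformers import SentenceTransformer, util

def nearest_terms(model_name, query="deck",
                  candidates=("patio", "cards", "slides", "presentation")):
    model = SentenceTransformer(model_name)
    query_emb = model.encode(query, convert_to_tensor=True)
    cand_embs = model.encode(list(candidates), convert_to_tensor=True)
    scores = util.cos_sim(query_emb, cand_embs)[0]
    return sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1])

# General-purpose model: "deck" tends to sit near "patio" and "cards".
print(nearest_terms("sentence-transformers/all-MiniLM-L6-v2"))
# Domain-tuned model (placeholder): "deck" should move toward "slides"/"presentation".
# print(nearest_terms("our-org/b2b-sales-embeddings"))
```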

🧭 What Comes Next

The following milestones are under discussion:

  1. Model Selection: Shortlist a few performant open models to benchmark.
  2. Task Benchmarking: Define representative RSE tasks and evaluate model performance.
  3. Human Evaluation: Use researcher review to identify gaps.
  4. Fine-Tuning Decision: Determine whether fine-tuning is necessary based on results.
  5. Data Strategy Expansion: Once targets are known, curate training data specifically for weak areas.
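If the benchmarking does point to gaps, a parameter-efficient approach such as LoRA is one plausible route for the fine-tuning step. The sketch below assumes the `transformers`, `peft`, and `datasets` libraries; the base model, hyperparameters, and data file are placeholders.

```python
# Sketch: parameter-efficient (LoRA) fine-tuning on issue -> PR pairs, only if
# benchmarking reveals clear gaps. Base model, hyperparameters, and data path
# are placeholders, not final choices.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"  # example base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Each record pairs an issue (input) with its resolving PR (output); see the
# data-strategy sketch above for how such pairs might be collected.
dataset = load_dataset("json", data_files="rse_issue_pr_pairs.jsonl")["train"]

def tokenize(example):
    text = f"### Issue:\n{example['input']}\n\n### Resolution:\n{example['output']}"
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rse-lora", num_train_epochs=1,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```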

🔮 Future Directions

We are also exploring the potential of a context sufficiency module, which would:

  • Evaluate whether current memory/context provides enough information for the task
  • Act as a validation layer before agent execution
  • Help guide user interaction or agent decision-making when information is incomplete

This would improve the agent’s ability to ask clarifying questions or initiate fallback flows.
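A minimal sketch of such a validation layer is shown below. The `llm_judge` callable and the prompt format are hypothetical placeholders, since how the verdict would actually be produced (an LLM prompt, a classifier, or heuristics) is still undecided.

```python
# Sketch: a context-sufficiency check run before agent execution. The judging
# model is abstracted behind `llm_judge`, a hypothetical callable that takes a
# prompt string and returns the judge's reply as a string.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SufficiencyVerdict:
    sufficient: bool
    missing: list[str]          # what information appears to be missing
    clarifying_question: str    # question to send back to the user, if any

def check_context(task: str, context: str,
                  llm_judge: Callable[[str], str]) -> SufficiencyVerdict:
    """Ask a judging model whether `context` is enough to attempt `task`."""
    prompt = (
        "Task:\n" + task + "\n\nAvailable context:\n" + context +
        "\n\nReply with SUFFICIENT or INSUFFICIENT, then list any missing details."
    )
    reply = llm_judge(prompt)
    sufficient = reply.strip().upper().startswith("SUFFICIENT")
    missing = [] if sufficient else [line.strip("- ").strip()
                                     for line in reply.splitlines()[1:]
                                     if line.strip()]
    question = "" if sufficient else (
        "Could you clarify: " + ("; ".join(missing[:3]) or "the request") + "?")
    return SufficiencyVerdict(sufficient, missing, question)

# Agent-loop usage: run the check, then either execute or fall back to the user.
# verdict = check_context(task, memory.render(), llm_judge=call_model)
# if not verdict.sufficient:
#     ask_user(verdict.clarifying_question)
```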