# 🛠️ Fine-Tuning

## Overview
Fine-tuning is a potential next step in adapting large language models (LLMs) for specialized use within the LLMaven platform. At the Scientific Software Engineering Center (SSEC), we are exploring the value of fine-tuning a model or set of models specifically for Research Software Engineers (RSEs). These models could act as dedicated agents, guardrails, or both—integrating seamlessly within the broader agentic system.
## 🔍 Why Fine-Tune?

Research Software Engineering presents a unique set of challenges that generic models are often poorly equipped to handle. Domain-specific language, workflows, and expectations require tailored model behavior. By fine-tuning, we aim to:
- Improve task relevance and specificity
- Provide more accurate and context-aware outputs
- Create agents that better reflect the values and patterns of RSE work
We are actively discussing which use cases would deliver the most impact.

## 🧪 Current Exploration and Planning
We are currently in the exploratory phase of this effort. No fine-tuning has begun yet. Our discussions so far include:
- Initial Idea: Fine-tune an LLM on Research Software Engineering tasks, allowing the model to serve as a standalone agent or as a guardrail.
- Critical Task Mapping: Identify what kinds of tasks and outputs we expect from an RSE-aware LLM (e.g., summarization, semantic diffs, code review suggestions).
- Data Source Strategy (a minimal sketch of this pairing appears after this list):
  - Use GitHub issues as inputs
  - Use associated pull requests as outputs
  - Apply filtering/cleaning to ensure quality and relevance
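As a rough sketch of what that data strategy could look like in practice, the snippet below pairs already-fetched issue and pull-request records into instruction-tuning examples and applies simple length-based filters. The field names and thresholds are illustrative assumptions, not a finalized schema, and fetching the records from GitHub is left out.

```python
# Sketch: turning (issue, linked PR) records into instruction-tuning pairs.
# Assumes issues and PRs were already fetched (e.g. via the GitHub API) into
# plain dicts; field names and filter thresholds are illustrative only.
import json

MIN_BODY_CHARS = 200      # drop issues with too little context to learn from
MAX_DIFF_CHARS = 20_000   # drop PRs whose diffs are too large for a context window

def make_example(issue: dict, pr: dict) -> dict | None:
    """Pair a GitHub issue (input) with its linked pull request (output)."""
    body = (issue.get("body") or "").strip()
    diff = (pr.get("diff") or "").strip()
    if len(body) < MIN_BODY_CHARS or not diff or len(diff) > MAX_DIFF_CHARS:
        return None  # basic quality/relevance filtering
    return {
        "instruction": f"Issue: {issue['title']}\n\n{body}",
        "response": f"{pr['title']}\n\n{pr.get('description', '')}\n\n{diff}",
    }

def write_dataset(pairs: list[tuple[dict, dict]], path: str) -> int:
    """Write filtered issue-to-PR pairs as JSONL, one training example per line."""
    kept = 0
    with open(path, "w", encoding="utf-8") as f:
        for issue, pr in pairs:
            example = make_example(issue, pr)
            if example is not None:
                f.write(json.dumps(example) + "\n")
                kept += 1
    return kept
```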
## 🧱 Evaluating Base Models
Carlos proposed starting by evaluating a small set of open-source base models to understand how they perform on representative tasks. This includes:
- Selecting general-purpose models and testing them on RSE-aligned input/output tasks
- Comparing their performance to how an RSE might approach the problem
- Conducting human evaluations to identify where models succeed or fall short
If deficiencies are found, fine-tuning will be considered to improve specific areas.
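As an illustration, the sketch below shows one way such a side-by-side comparison might be wired up with the Hugging Face transformers pipeline. The candidate model names and the task prompt are placeholders, not a committed selection; the collected outputs would then go to RSE reviewers for the human evaluation described above.

```python
# Sketch: probing a few open base models on a representative RSE task
# (here, summarizing an issue) so humans can compare outputs side by side.
# Model names and the prompt are placeholders, not a final selection.
from transformers import pipeline

CANDIDATE_MODELS = [
    "mistralai/Mistral-7B-Instruct-v0.3",
    "meta-llama/Llama-3.1-8B-Instruct",
]

TASK_PROMPT = (
    "Summarize the following GitHub issue for a research software engineer, "
    "then suggest the first debugging step:\n\n{issue_text}"
)

def collect_outputs(issue_text: str) -> dict[str, str]:
    """Run the same RSE-style prompt through each candidate model."""
    outputs = {}
    for name in CANDIDATE_MODELS:
        generator = pipeline("text-generation", model=name)
        result = generator(TASK_PROMPT.format(issue_text=issue_text), max_new_tokens=256)
        outputs[name] = result[0]["generated_text"]
    return outputs
```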
## 🧠 Inspiration from Prior Work
Carlos previously fine-tuned a domain-specific embedding model in a B2B sales setting, which improved the relevance of word associations significantly. The same logic may apply here:
- In a general-purpose model, the word “deck” tends to associate with “patio” or “cards”
- In the domain-tuned model, “deck” mapped to “slides” or “presentation”
We believe scientific software engineering has a similarly unique vocabulary and structure that could benefit from domain tuning.
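To make that kind of vocabulary shift concrete, a small script along the following lines could rank candidate terms by embedding similarity. The general-purpose model name below is real; the domain-tuned checkpoint is hypothetical and stands in for whatever model we might eventually train.

```python
# Sketch: measuring how a general vs. domain-tuned embedding model associates
# a term like "deck". "all-MiniLM-L6-v2" is a real general-purpose model;
# the domain-tuned checkpoint name is hypothetical.
from sentence_transformers import SentenceTransformer, util

def nearest_terms(model_name: str, query: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Rank candidate terms by cosine similarity to the query embedding."""
    model = SentenceTransformer(model_name)
    query_vec = model.encode(query, convert_to_tensor=True)
    cand_vecs = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, cand_vecs)[0]
    return sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)

terms = ["patio", "cards", "slides", "presentation"]
print(nearest_terms("all-MiniLM-L6-v2", "deck", terms))               # general-purpose baseline
# print(nearest_terms("our-org/rse-tuned-embeddings", "deck", terms)) # hypothetical domain-tuned model
```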
## 🧭 What Comes Next
The following milestones are under discussion:
- Model Selection: Shortlist a few performant open models to benchmark.
- Task Benchmarking: Define representative RSE tasks and evaluate model performance.
- Human Evaluation: Use researcher review to identify gaps.
- Fine-Tuning Decision: Determine whether fine-tuning is necessary based on results.
- Data Strategy Expansion: Once targets are known, curate training data specifically for weak areas.
## 🔮 Future Directions
We are also exploring the potential of a context sufficiency module, which would:
- Evaluate whether current memory/context provides enough information for the task
- Act as a validation layer before agent execution
- Help guide user interaction or agent decision-making when information is incomplete
This would improve the agent’s ability to ask clarifying questions or initiate fallback flows.
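As a rough sketch (not a committed design), such a module could be a thin validation layer that asks an LLM judge whether the gathered context suffices before the agent executes. The prompt wording, return structure, and callable interfaces below are illustrative assumptions; the judge call is passed in so the control flow stays self-contained.

```python
# Sketch: a context-sufficiency gate that runs before agent execution.
# An LLM judge is asked whether the gathered context is enough for the task;
# the judge (ask_llm) and the agent step (execute) are supplied as callables.
from dataclasses import dataclass

@dataclass
class SufficiencyResult:
    sufficient: bool
    missing: list[str]  # questions to ask the user when context is insufficient

JUDGE_PROMPT = (
    "Task: {task}\n\nAvailable context:\n{context}\n\n"
    "Is this context sufficient to complete the task? Answer YES or NO, "
    "then list any missing pieces of information, one per line."
)

def check_sufficiency(task: str, context: str, ask_llm) -> SufficiencyResult:
    """Ask the LLM judge whether the current context is enough for the task."""
    reply = ask_llm(JUDGE_PROMPT.format(task=task, context=context))
    sufficient = reply.strip().upper().startswith("YES")
    missing = [] if sufficient else [line for line in reply.splitlines()[1:] if line.strip()]
    return SufficiencyResult(sufficient=sufficient, missing=missing)

def run_agent(task: str, context: str, ask_llm, execute):
    """Validation layer: only execute the agent when context is judged sufficient."""
    result = check_sufficiency(task, context, ask_llm)
    if not result.sufficient:
        return {"action": "ask_user", "questions": result.missing}  # fallback flow
    return {"action": "executed", "output": execute(task, context)}
```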