How to fine-tune an LLM
Finetuning
Fine-tuning a large language model (LLM) means starting with a pre-trained model (GPT, Llama, Mistral, etc.) and training it further on your own data so the model better matches your needs (tone, domain jargon, task format, etc.). Conceptually you are “refining” an already fluent model rather than teaching it language from scratch.
1. Collect & prepare training data
Data type | Typical size | Format |
---|---|---|
Instruction → answer pairs | 500 – 20 k rows | JSONL, CSV |
Chat transcripts | 1 k – 100 k turns | OpenAI “messages” or Alpaca format |
Domain texts (DAPT) | 10 M + tokens | Plain text, one doc per line |
Best practices
- Curate high-quality answers – the model will copy them.
- Deduplicate, remove copyrighted or PII content.
- Keep the input/output style identical to production prompts.
- For classification tasks store both the raw text and the desired label.
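To make the format concrete, here is a minimal sketch (in Python, with a hypothetical file name and Alpaca-style field names) of how instruction → answer pairs end up as one JSON object per line:

```python
import json

# One training example per line ("JSONL"). The instruction/input/output keys
# follow the common Alpaca-style schema; adapt the field names to whatever
# your chosen trainer expects.
examples = [
    {
        "instruction": "Write a pytest unit test for the following function.",
        "input": "def add(a, b):\n    return a + b",
        "output": "def test_add():\n    assert add(2, 3) == 5",
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```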
2. Select a fine-tuning framework and strategy
A fine-tuning framework is the “bread machine” that loads the model, feeds it your data, handles gradients and saves the new weights—so you don’t write hundreds of lines of PyTorch glue.
Framework
For beginners: Hugging Face Transformers
- Most popular, best documentation
- Works with almost all models
- Built-in LoRA support via PEFT
For speed: Unsloth
- 2-5x faster training
- Great for consumer GPUs
- Simple setup
For GPT models: OpenAI API
- No local setup needed
- Just upload data and pay per token
- Limited to OpenAI's models
For large-scale: DeepSpeed
- Multi-GPU distributed training
- Complex but powerful
- Overkill for most projects
Recommendation
Start with Hugging Face Transformers unless you have specific needs. It handles 90% of use cases and has the best learning resources.
Strategy (what you actually change)
Strategy | Updated parameters | Typical GPU RAM | Strengths | Trade-offs | Popular libs / tools |
---|---|---|---|---|---|
LoRA / Q-LoRA | 0.1 – 1 % | 6 – 24 GB | Cheap, fast, works on consumer GPUs, keeps base weights frozen | Slight quality gap vs. full SFT on some tasks. | Hugging Face PEFT, bitsandbytes |
Adapters / Prefix-Tuning | 0.1 – 1 % | 6 – 24 GB | Very small add-on weights (can be swapped or combined at runtime) | Prefix-tuning inserts fixed tokens before each prompt, reducing the available context. | HF PEFT, AdapterHub |
Full fine-tune (supervised fine-tuning, SFT) | 100 % | ≈ 2 GB per billion parameters for the weights alone (70 B → 140 GB) | Maximum quality; can alter every weight. | Expensive; risk of over-fitting or catastrophic forgetting. | HF Trainer, DeepSpeed |
Continue pre-training (domain-adaptive pre-training, DAPT) | 100 % | Same as SFT but often sharded across many GPUs | Best when data is unlabeled but domain-specific (legal corpora, medical papers). | Requires lots of compute. | Megatron-DeepSpeed, MosaicML, Megatron-LM |
Rule: Unless you know exactly why you need something else, use Hugging Face + LoRA.
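As a rough illustration of that default path, a minimal LoRA setup with Hugging Face Transformers and PEFT could look like the following sketch (the model name, target modules and file name are assumptions to adapt to your setup):

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-Coder-3B"  # assumed base model; use whichever model you fine-tune

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA keeps the base weights frozen and trains small low-rank adapter matrices
# that are injected into selected layers (here: the attention projections).
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical choice; layer names are model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # should report well under 1 % of all parameters

# Load the instruction → answer pairs prepared in step 1.
dataset = load_dataset("json", data_files="train.jsonl")["train"]
```

The actual training loop can then be driven by the standard `Trainer` (see the hyperparameter sketch in the next section) or by TRL's `SFTTrainer`.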
3. Configure training hyperparameters
Fine-tuning requires setting several key parameters that control how the model learns from your data.
Essential parameters
Learning rate (how fast the model learns)
- Full fine-tuning: 1e-5 to 5e-5 (small steps to avoid breaking existing knowledge)
- LoRA: 1e-4 to 3e-4 (can be more aggressive since base model stays frozen)
- Too high → model forgets everything, too low → barely learns anything
Batch size (how many examples to process at once)
- Start with 1-4 per GPU (limited by memory)
- Use gradient accumulation to simulate larger batches:
effective_batch = batch_size × accumulation_steps
Epochs (how many times to see the entire dataset)
- Usually 1-3 epochs is enough
- More epochs risk overfitting (memorizing instead of learning)
Sequence length (maximum input/output length)
- Match your expected use case (512 for short tasks, 2048+ for long documents)
- Longer sequences need more GPU memory
LoRA-specific settings
- Rank: 8-16 for most tasks (cheap), 32-64 for complex domains (expensive)
- Alpha: Usually 2× the rank (r=16 → alpha=32)
The key is starting conservative and adjusting based on results – you can always retrain with different settings if needed.
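Translated into code, a conservative starting configuration for the Hugging Face `Trainer` might look like the sketch below. It continues the PEFT-wrapped model from the previous sketch; `tokenized_train` and `tokenized_eval` are placeholders for your tokenized dataset splits:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,              # LoRA range: 1e-4 to 3e-4; use ~1e-5 to 5e-5 for full SFT
    per_device_train_batch_size=2,   # limited by GPU memory
    gradient_accumulation_steps=8,   # effective batch = 2 x 8 = 16
    num_train_epochs=2,              # 1-3 epochs is usually enough
    eval_strategy="steps",           # named evaluation_strategy in older transformers versions
    logging_steps=10,
    gradient_checkpointing=True,     # trades extra compute for lower memory use
)

trainer = Trainer(
    model=model,                     # the PEFT model from the previous sketch
    args=args,
    train_dataset=tokenized_train,   # placeholder: your tokenized training split
    eval_dataset=tokenized_eval,     # placeholder: your held-out validation split
)
trainer.train()
```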
4. Run and Monitor Fine-tuning
What happens during training
- The model continues training on your dataset, adjusting its weights slightly
- Loss curve should decrease smoothly (spikes = learning rate too high)
- GPU memory usage should stay roughly constant after the first few steps
- Validation metrics show if the model is actually learning
Key metrics to watch
- Training loss: Should go down over time
- Validation loss: Should also decrease but may plateau (if it goes up = overfitting)
- Perplexity: Lower is better (measures how "surprised" the model is by test data)
Common issues
- Loss explodes → reduce learning rate
- Loss plateaus immediately → increase learning rate or check data format
- OOM errors → reduce batch size, sequence length, or use gradient checkpointing
- Model outputs garbage → check data preprocessing, ensure correct tokenizer
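Perplexity is just the exponential of the cross-entropy loss, so it falls out of the evaluation results directly. A minimal sketch, assuming the `trainer` object from the earlier sketches:

```python
import math

# eval_loss is the mean cross-entropy (in nats) on the validation split.
metrics = trainer.evaluate()
perplexity = math.exp(metrics["eval_loss"])
print(f"validation loss: {metrics['eval_loss']:.3f}, perplexity: {perplexity:.1f}")
```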
5. Evaluate
Qualitative evaluation
- Test with real examples from your use case
- Compare outputs to the base model
- Check for overfitting (does it only work on training examples?)
Quantitative evaluation
- Hold out 10-20% of data for testing
- Measure task-specific metrics (accuracy, F1, etc.)
- Run A/B tests against the base model
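For classification-style tasks, the quantitative comparison can be as simple as scoring base-model and fine-tuned predictions on the same held-out split. A sketch with scikit-learn, where the label lists are placeholders for whatever you parse out of each model's outputs:

```python
from sklearn.metrics import accuracy_score, f1_score

# Gold labels from the held-out 10-20 % split, plus predictions from both models.
y_true = ["pass", "fail", "pass", "pass"]
y_base = ["pass", "pass", "pass", "fail"]
y_finetuned = ["pass", "fail", "pass", "pass"]

for name, preds in [("base", y_base), ("fine-tuned", y_finetuned)]:
    acc = accuracy_score(y_true, preds)
    f1 = f1_score(y_true, preds, pos_label="pass")
    print(f"{name}: accuracy={acc:.2f}, F1={f1:.2f}")
```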
6. Deploy
Key considerations
- Model size: LoRA adapters are tiny (MBs), full models are large (GBs)
- Inference speed: unmerged LoRA adapters add a small runtime overhead; fully fine-tuned checkpoints generate at base-model speed but are slower to load and swap
- Version control: Track which adapter/model version is deployed
- Rollback strategy: Keep base model as fallback
Deploy with Ollama
- Export your model to GGUF format:
  - If using LoRA, merge the adapter into the base model first (see the sketch after this list)
  - Convert to GGUF using the llama.cpp conversion tools
  - Choose a quantization level (q4_K_M is a balanced size/quality choice)
- Create an Ollama Modelfile (define the model path, temperature, context window size and, if needed, a system prompt)
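Merging a LoRA adapter back into the base weights is a one-off step before GGUF conversion. A minimal sketch using PEFT, where the model name and paths are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-Coder-3B"   # assumed base model
adapter_dir = "out/adapter"          # placeholder: your trained LoRA adapter directory

base = AutoModelForCausalLM.from_pretrained(base_id)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

# Save the merged model; llama.cpp's conversion script can then turn this
# directory into a GGUF file that an Ollama Modelfile points to.
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-model")
```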
7. Common pitfalls and how to iterate
Quick diagnosis guide
Problem | Symptoms | Fix |
---|---|---|
Bad training data | Model outputs errors, weird formatting | Review 100 samples, filter low-quality examples |
Overfitting | Perfect on training, fails on new inputs | Use fewer epochs (1-2), add dropout, more diverse data |
Catastrophic forgetting | Can't do basic tasks anymore | Use LoRA instead of full fine-tune, mix in general data (10-20%) |
Out of memory | CUDA OOM errors | Enable gradient checkpointing, use QLoRA, reduce batch size |
Iteration workflow
- Start small: 500 examples + LoRA → test results
- Scale what works: More data → higher rank → full fine-tune
- Monitor production: Log failures → add to training set → retrain monthly
- Know when to stop: < 2% improvement = diminishing returns
8. When NOT to fine-tune
Fine-tuning is expensive to build, run and maintain.
Before you pull that trigger, walk through this checklist - if you answer yes to any item, use a lighter technique instead.
Question | Why it vetoes fine-tuning | Better alternative |
---|---|---|
Can prompt engineering or a few in-context examples already reach ≥ 90 % of the desired quality? | You’re paying compute to hard-code what a prompt can express for free. | System / few-shot prompts, JSON mode |
Do you have fewer than 100–200 excellent examples? | The model will memorize, not generalize. | Few-shot prompting, RAG |
Does the task depend on facts that change daily / weekly? | Fine-tuning bakes information in at training time; it is outdated the moment you finish. | RAG, function calling to live APIs |
Is the goal to “teach” brand-new factual knowledge? | FT mostly adjusts style; it does not inject large knowledge graphs. | RAG, external knowledge bases |
Will the model need to stay in sync with a frequently updated base model? | Every upstream model release forces you to repeat the FT loop. | Stick with stock model + prompts |
Is the expected traffic low (≤ a few hundred calls / day)? | Engineering overhead outweighs runtime gains. | Prompting |
Cheat-sheet of lighter methods
Need | Drop-in solution |
---|---|
Domain-specific facts | Retrieval-Augmented Generation (RAG) |
Deterministic JSON / XML output | Constrained decoding or structured prompts |
Corporate tone / style | System prompt + 2–3 in-context examples |
Up-to-the-minute data | Function calling ↔ APIs / databases |
Multi-step workflows | Orchestration frameworks (LangChain, LangGraph, Semantic Kernel) |
The golden rule:
Fine-tune only when you need the model to consistently behave differently across thousands of interactions, and simpler methods have failed. It's a powerful tool, but often a sledgehammer where a scalpel would do.
9. Fine-tuning for the AMOS AI-Driven Testing Project
Project Context
The AMOS AI-Driven Testing project develops an AI-powered testing system that automatically generates unit tests for Python code. The system:
- Uses various pre-trained LLMs (Mistral, DeepSeek, Qwen, etc.) via Ollama
- Generates structured unit tests in pytest/unittest format
- Evaluates code complexity (CCC/MCC) before and after test generation
- Collects performance metrics including syntax validity and generation time
- Provides a modular architecture for extensions
- Runs on-premise without external API dependencies
Should We Fine-tune? Probably NOT ❌
After analyzing the project requirements and current capabilities, multiple factors argue against fine-tuning:
The "No Fine-tuning" Checklist
Criterion | Applies to AMOS? | Explanation |
---|---|---|
Prompt engineering achieves >90% quality | ✅ YES | Generated tests are already structurally correct and executable |
Less than 100-200 excellent examples | ✅ YES | We have many test cases but lack curated "perfect" training pairs |
Task is specific and structured | ✅ YES | Unit test generation follows clear patterns (import → class → test methods) |
Low expected traffic | ✅ YES | Academic project, not millions of requests |
Frequent base model updates | ✅ YES | The Ollama models we use are regularly updated and improved, which means our fine-tuned models can quickly become outdated |
Better Alternatives for Our Project
Instead of fine-tuning, we should leverage lighter-weight methods:
Improvement Area | Current Status | Recommended Solution |
---|---|---|
Prompt Optimization | Basic prompts implemented | Systematic prompt engineering with few-shot examples |
Context Enhancement | Context-size calculator available | RAG system for similar test patterns |
Quality Control | Syntax validation implemented | Extended post-processing modules (Compiler + Test Runner) |
Domain Patterns | Single function handling | Template-based generation for different test types |
Data Collection | Test examples in ExampleTests/ | Curate high-quality examples for future use |
LLM Chaining: A Superior Alternative
Rather than fine-tuning a single model, implementing LLM chaining as described in our Benefits of Chaining LLMs documentation could yield better results. This approach offers:
- Iterative refinement through compiler and test runner feedback loops
- Specialized models for different subtasks (test generation, error fixing, merging)
- Measurable improvements with mutation coverage increasing from 0% to >80%
- No training required - works with existing pre-trained models
The chaining architecture with feedback loops naturally addresses many quality issues that fine-tuning attempts to solve, but without the associated overhead of training, data curation, and model maintenance. For our use case, the chance of a performance improvement is also significantly higher with LLM chaining than with LLM fine-tuning.
Recommendation Timeline
Short-term (Current Sprints)
- Focus on prompt optimization and RAG implementation
- Enhance post-processing modules
- Collect and manually curate high-quality test examples
- No fine-tuning investment
Medium-term (After Data Collection)
- If we accumulate 1000+ verified examples, consider a LoRA pilot on a small model (e.g., Qwen2.5-Coder 3B)
- Success criterion: ≥5% improvement in test pass rate vs. prompt-only approach
Long-term
- Full fine-tuning only if we have 10k+ verified examples
- Consider domain-adaptive pre-training for Python testing specifically
10. Conclusion
For the AMOS AI-Driven Testing project, fine-tuning represents engineering overhead without proportional benefits. The structured nature of unit test generation, already good baseline performance, and academic project constraints all argue against the complexity of a fine-tuning setup.
Final Recommendation: Focus efforts on prompt engineering, RAG integration, and enhanced post-processing modules. These approaches are:
- Faster to implement
- More flexible to adapt
- Free of special hardware requirements
- Likely to deliver better results for our specific use case
Fine-tuning remains a future option once we have accumulated sufficient high-quality training data, but it should not be a priority for the current project phase.