How to fine-tune an LLM
Finetuning
Fine-tuning a large language model (LLM) means starting with a pre-trained model (GPT, Llama, Mistral, etc.) and training it further on your own data so the model better matches your needs (tone, domain jargon, task format, etc.). Conceptually you are “refining” an already fluent model rather than teaching it language from scratch.
1. Collect & prepare training data
Data type | Typical size | Format |
---|---|---|
Instruction → answer pairs | 500 – 20 k rows | JSONL, CSV |
Chat transcripts | 1 k – 100 k turns | OpenAI “messages” or Alpaca format |
Domain texts (DAPT) | 10 M + tokens | Plain text, one doc per line |
Best practices
- Curate high-quality answers – the model will copy them.
- Deduplicate, remove copyrighted or PII content.
- Keep the input/output style identical to production prompts.
- For classification tasks store both the raw text and the desired label.
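To make the format concrete, here is a minimal sketch (in Python, with a hypothetical file name and Alpaca-style field names) of how instruction → answer pairs end up as one JSON object per line:

```python
import json

# One training example per line ("JSONL"). The instruction/input/output keys
# follow the common Alpaca-style schema; adapt the field names to whatever
# your chosen trainer expects.
examples = [
    {
        "instruction": "Write a pytest unit test for the following function.",
        "input": "def add(a, b):\n    return a + b",
        "output": "def test_add():\n    assert add(2, 3) == 5",
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```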
2. Select a fine-tuning framework and strategy
A fine-tuning framework is the “bread machine” that loads the model, feeds it your data, handles gradients and saves the new weights—so you don’t write hundreds of lines of PyTorch glue.
Framework
For beginners: Hugging Face Transformers
- Most popular, best documentation
- Works with almost all models
- Built-in LoRA support via PEFT
For speed: Unsloth
- 2-5x faster training
- Great for consumer GPUs
- Simple setup
For GPT models: OpenAI API
- No local setup needed
- Just upload data and pay per token
- Limited to OpenAI's models
For large-scale: DeepSpeed
- Multi-GPU distributed training
- Complex but powerful
- Overkill for most projects
Recommendation
Start with Hugging Face Transformers unless you have specific needs. It handles 90% of use cases and has the best learning resources.
Strategy (what you actually change)
Strategy | Updated parameters | Typical GPU RAM | Strengths | Trade-offs | Popular libs / tools |
---|---|---|---|---|---|
LoRA / Q-LoRA | 0.1 – 1 % | 6 – 24 GB | Cheap, fast, works on consumer GPUs, keeps base weights frozen | Slight quality gap vs. full SFT on some tasks. | Hugging Face PEFT, bitsandbytes |
Adapters / Prefix-Tuning | 0.1 – 1 % | 6 – 24 GB | Very small add-on weights (can be swapped or combined at runtime) | Prefix-tuning inserts fixed tokens before each prompt, reducing the available context. | HF PEFT, AdapterHub |
Full fine-tune (supervised fine-tuning, SFT) | 100 % | ≈ 2 GB per billion parameters for the weights alone (70 B → 140 GB) | Maximum quality; can alter every weight. | Expensive; risk of over-fitting or catastrophic forgetting. | HF Trainer, DeepSpeed |
Continue pre-training (domain-adaptive pre-training, DAPT) | 100 % | Same as SFT but often sharded across many GPUs | Best when data is unlabeled but domain-specific (legal corpora, medical papers). | Requires lots of compute. | Megatron-DeepSpeed, MosaicML, Megatron-LM |
Rule: Unless you know exactly why you need something else, use Hugging Face + LoRA.
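As a rough illustration of that default path, a minimal LoRA setup with Hugging Face Transformers and PEFT could look like the following sketch (the model name, target modules and file name are assumptions to adapt to your setup):

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-Coder-3B"  # assumed base model; use whichever model you fine-tune

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA keeps the base weights frozen and trains small low-rank adapter matrices
# that are injected into selected layers (here: the attention projections).
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical choice; layer names are model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # should report well under 1 % of all parameters

# Load the instruction → answer pairs prepared in step 1.
dataset = load_dataset("json", data_files="train.jsonl")["train"]
```

The actual training loop can then be driven by the standard `Trainer` (see the hyperparameter sketch in the next section) or by TRL's `SFTTrainer`.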
3. Configure training hyperparameters
Fine-tuning requires setting several key parameters that control how the model learns from your data.
Essential parameters
Learning rate (how fast the model learns)
- Full fine-tuning: 1e-5 to 5e-5 (small steps to avoid breaking existing knowledge)
- LoRA: 1e-4 to 3e-4 (can be more aggressive since base model stays frozen)
- Too high → model forgets everything, too low → barely learns anything
Batch size (how many examples to process at once)
- Start with 1-4 per GPU (limited by memory)
- Use gradient accumulation to simulate larger batches:
effective_batch = batch_size × accumulation_steps
Epochs (how many times to see the entire dataset)
- Usually 1-3 epochs is enough
- More epochs risk overfitting (memorizing instead of learning)
Sequence length (maximum input/output length)
- Match your expected use case (512 for short tasks, 2048+ for long documents)
- Longer sequences need more GPU memory
LoRA-specific settings
- Rank: 8-16 for most tasks (cheap), 32-64 for complex domains (expensive)
- Alpha: Usually 2× the rank (r=16 → alpha=32)
The key is starting conservative and adjusting based on results – you can always retrain with different settings if needed.
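Translated into code, a conservative starting configuration for the Hugging Face `Trainer` might look like the sketch below. It continues the PEFT-wrapped model from the previous sketch; `tokenized_train` and `tokenized_eval` are placeholders for your tokenized dataset splits:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,              # LoRA range: 1e-4 to 3e-4; use ~1e-5 to 5e-5 for full SFT
    per_device_train_batch_size=2,   # limited by GPU memory
    gradient_accumulation_steps=8,   # effective batch = 2 x 8 = 16
    num_train_epochs=2,              # 1-3 epochs is usually enough
    eval_strategy="steps",           # named evaluation_strategy in older transformers versions
    logging_steps=10,
    gradient_checkpointing=True,     # trades extra compute for lower memory use
)

trainer = Trainer(
    model=model,                     # the PEFT model from the previous sketch
    args=args,
    train_dataset=tokenized_train,   # placeholder: your tokenized training split
    eval_dataset=tokenized_eval,     # placeholder: your held-out validation split
)
trainer.train()
```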
4. Run and Monitor Fine-tuning
What happens during training
- The model continues training on your dataset, adjusting its weights slightly
- Loss curve should decrease smoothly (spikes = learning rate too high)
- GPU memory usage should stay roughly constant after the first few steps
- Validation metrics show if the model is actually learning
Key metrics to watch
- Training loss: Should go down over time
- Validation loss: Should also decrease but may plateau (if it goes up = overfitting)
- Perplexity: Lower is better (measures how "surprised" the model is by test data)
Common issues
- Loss explodes → reduce learning rate
- Loss plateaus immediately → increase learning rate or check data format
- OOM errors → reduce batch size, sequence length, or use gradient checkpointing
- Model outputs garbage → check data preprocessing, ensure correct tokenizer
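Perplexity is just the exponential of the cross-entropy loss, so it falls out of the evaluation results directly. A minimal sketch, assuming the `trainer` object from the earlier sketches:

```python
import math

# eval_loss is the mean cross-entropy (in nats) on the validation split.
metrics = trainer.evaluate()
perplexity = math.exp(metrics["eval_loss"])
print(f"validation loss: {metrics['eval_loss']:.3f}, perplexity: {perplexity:.1f}")
```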
5. Evaluate
Qualitative evaluation
- Test with real examples from your use case
- Compare outputs to the base model
- Check for overfitting (does it only work on training examples?)
Quantitative evaluation
- Hold out 10-20% of data for testing
- Measure task-specific metrics (accuracy, F1, etc.)
- Run A/B tests against the base model
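For classification-style tasks, the quantitative comparison can be as simple as scoring base-model and fine-tuned predictions on the same held-out split. A sketch with scikit-learn, where the label lists are placeholders for whatever you parse out of each model's outputs:

```python
from sklearn.metrics import accuracy_score, f1_score

# Gold labels from the held-out 10-20 % split, plus predictions from both models.
y_true = ["pass", "fail", "pass", "pass"]
y_base = ["pass", "pass", "pass", "fail"]
y_finetuned = ["pass", "fail", "pass", "pass"]

for name, preds in [("base", y_base), ("fine-tuned", y_finetuned)]:
    acc = accuracy_score(y_true, preds)
    f1 = f1_score(y_true, preds, pos_label="pass")
    print(f"{name}: accuracy={acc:.2f}, F1={f1:.2f}")
```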
6. Deploy
Key considerations
- Model size: LoRA adapters are tiny (MBs), full models are large (GBs)
- Inference speed: unmerged LoRA adapters add a small runtime overhead; fully fine-tuned checkpoints generate at base-model speed but are slower to load and swap
- Version control: Track which adapter/model version is deployed
- Rollback strategy: Keep base model as fallback
Deploy with Ollama
- Export your model to GGUF format:
  - If using LoRA, merge the adapter into the base model first (see the sketch after this list)
  - Convert to GGUF using the llama.cpp conversion tools
  - Choose a quantization level (q4_K_M is a balanced size/quality choice)
- Create an Ollama Modelfile (define the model path, temperature, context window size and, if needed, a system prompt)
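Merging a LoRA adapter back into the base weights is a one-off step before GGUF conversion. A minimal sketch using PEFT, where the model name and paths are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-Coder-3B"   # assumed base model
adapter_dir = "out/adapter"          # placeholder: your trained LoRA adapter directory

base = AutoModelForCausalLM.from_pretrained(base_id)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

# Save the merged model; llama.cpp's conversion script can then turn this
# directory into a GGUF file that an Ollama Modelfile points to.
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-model")
```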
7. Common pitfalls and how to iterate
Quick diagnosis guide
Problem | Symptoms | Fix |
---|---|---|
Bad training data | Model outputs errors, weird formatting | Review 100 samples, filter low-quality examples |
Overfitting | Perfect on training, fails on new inputs | Use fewer epochs (1-2), add dropout, more diverse data |
Catastrophic forgetting | Can't do basic tasks anymore | Use LoRA instead of full fine-tune, mix in general data (10-20%) |
Out of memory | CUDA OOM errors | Enable gradient checkpointing, use QLoRA, reduce batch size |
Iteration workflow
- Start small: 500 examples + LoRA → test results
- Scale what works: More data → higher rank → full fine-tune
- Monitor production: Log failures → add to training set → retrain monthly
- Know when to stop: < 2% improvement = diminishing returns
8. When NOT to fine-tune
Fine-tuning is expensive to build, run and maintain.
Before you pull that trigger, walk through this checklist - if you answer yes to any item, use a lighter technique instead.
Question | Why it vetoes fine-tuning | Better alternative |
---|---|---|
Can prompt engineering or a few in-context examples already reach ≥ 90 % of the desired quality? | You’re paying compute to hard-code what a prompt can express for free. | System / few-shot prompts, JSON mode |
Do you have fewer than 100–200 excellent examples? | The model will memorize, not generalize. | Few-shot prompting, RAG |
Does the task depend on facts that change daily / weekly? | Fine-tuning bakes information in at training time; it is outdated the moment you finish. | RAG, function calling to live APIs |
Is the goal to “teach” brand-new factual knowledge? | FT mostly adjusts style; it does not inject large knowledge graphs. | RAG, external knowledge bases |
Will the model need to stay in sync with a frequently updated base model? | Every upstream model release forces you to repeat the FT loop. | Stick with stock model + prompts |
Is the expected traffic low (≤ a few hundred calls / day)? | Engineering overhead outweighs runtime gains. | Prompting |
Cheat-sheet of lighter methods
Need | Drop-in solution |
---|---|
Domain-specific facts | Retrieval-Augmented Generation (RAG) |
Deterministic JSON / XML output | Constrained decoding or structured prompts |
Corporate tone / style | System prompt + 2–3 in-context examples |
Up-to-the-minute data | Function calling ↔ APIs / databases |
Multi-step workflows | Orchestration frameworks (LangChain, LangGraph, Semantic Kernel) |
The golden rule:
Fine-tune only when you need the model to consistently behave differently across thousands of interactions, and simpler methods have failed. It's a powerful tool, but often a sledgehammer where a scalpel would do.
9. Fine-tuning for the AMOS AI-Driven Testing Project
Project Context
The AMOS AI-Driven Testing project develops an AI-powered testing system that automatically generates unit tests for Python code. The system:
- Uses various pre-trained LLMs (Mistral, DeepSeek, Qwen, etc.) via Ollama
- Generates structured unit tests in pytest/unittest format
- Evaluates code complexity (CCC/MCC) before and after test generation
- Collects performance metrics including syntax validity and generation time
- Provides a modular architecture for extensions
- Runs on-premise without external API dependencies
Should We Fine-tune? Probably NOT ❌
After analyzing the project requirements and current capabilities, multiple factors argue against fine-tuning:
The "No Fine-tuning" Checklist
Criterion | Applies to AMOS? | Explanation |
---|---|---|
Prompt engineering achieves >90% quality | ✅ YES | Generated tests are already structurally correct and executable |
Less than 100-200 excellent examples | ✅ YES | We have many test cases but lack curated "perfect" training pairs |
Task is specific and structured | ✅ YES | Unit test generation follows clear patterns (import → class → test methods) |
Low expected traffic | ✅ YES | Academic project, not millions of requests |
Frequent base model updates | ✅ YES | The Ollama models we use are regularly updated and improved, which means our fine-tuned models can quickly become outdated |
Better Alternatives for Our Project
Instead of fine-tuning, we should leverage lighter-weight methods:
Improvement Area | Current Status | Recommended Solution |
---|---|---|
Prompt Optimization | Basic prompts implemented | Systematic prompt engineering with few-shot examples |
Context Enhancement | Context-size calculator available | RAG system for similar test patterns |
Quality Control | Syntax validation implemented | Extended post-processing modules (Compiler + Test Runner) |
Domain Patterns | Single function handling | Template-based generation for different test types |
Data Collection | Test examples in ExampleTests/ | Curate high-quality examples for future use |
LLM Chaining: A Superior Alternative
Rather than fine-tuning a single model, implementing LLM chaining as described in our Benefits of Chaining LLMs documentation could yield better results. This approach offers:
- Iterative refinement through compiler and test runner feedback loops
- Specialized models for different subtasks (test generation, error fixing, merging)
- Measurable improvements with mutation coverage increasing from 0% to >80%
- No training required - works with existing pre-trained models
The chaining architecture with feedback loops naturally addresses many quality issues that fine-tuning attempts to solve, but without the associated overhead of training, data curation, and model maintenance. For our use case, the chance of a performance improvement is also significantly higher with LLM chaining than with LLM fine-tuning.
Recommendation Timeline
Short-term (Current Sprints)
- Focus on prompt optimization and RAG implementation
- Enhance post-processing modules
- Collect and manually curate high-quality test examples
- No fine-tuning investment
Medium-term (After Data Collection)
- If we accumulate 1000+ verified examples, consider a LoRA pilot on a small model (e.g., Qwen2.5-Coder 3B)
- Success criterion: ≥5% improvement in test pass rate vs. prompt-only approach
Long-term
- Full fine-tuning only if we have 10k+ verified examples
- Consider domain-adaptive pre-training for Python testing specifically
10. Conclusion
For the AMOS AI-Driven Testing project, fine-tuning represents engineering overhead without proportional benefits. The structured nature of unit test generation, already good baseline performance, and academic project constraints all argue against the complexity of a fine-tuning setup.
Final Recommendation: Focus efforts on prompt engineering, RAG integration, and enhanced post-processing modules. These approaches are:
- Faster to implement
- More flexible to adapt
- Free of special hardware requirements
- Likely to deliver better results for our specific use case
Fine-tuning remains a future option once we have accumulated sufficient high-quality training data, but it should not be a priority for the current project phase.