How to fine-tune an LLM

Fine-tuning

Fine-tuning a large language model (LLM) means starting with a pre-trained model (GPT, Llama, Mistral, etc.) and training it further on your own data so the model better matches your needs (tone, domain jargon, task format, etc.). Conceptually you are “refining” an already fluent model rather than teaching it language from scratch.


1. Collect & prepare training data

| Data type | Typical size | Format |
|---|---|---|
| Instruction → answer pairs | 500 – 20 k rows | JSONL, CSV |
| Chat transcripts | 1 k – 100 k turns | OpenAI “messages” or Alpaca format |
| Domain texts (DAPT) | 10 M+ tokens | Plain text, one doc per line |

Best practices

  • Curate high-quality answers – the model will copy them.
  • Deduplicate, remove copyrighted or PII content.
  • Keep the input/output style identical to production prompts.
  • For classification tasks store both the raw text and the desired label.
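As a concrete illustration of the formats above, a single instruction → answer pair in the OpenAI “messages” style could look like the (entirely hypothetical) record below; in a real JSONL file each record occupies exactly one line:

```json
{"messages": [
  {"role": "system", "content": "You generate pytest unit tests for Python functions."},
  {"role": "user", "content": "Write a test for: def add(a, b): return a + b"},
  {"role": "assistant", "content": "def test_add():\n    assert add(2, 3) == 5"}
]}
```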

2. Select a fine-tuning framework and strategy

A fine-tuning framework is the “bread machine” that loads the model, feeds it your data, handles gradients and saves the new weights—so you don’t write hundreds of lines of PyTorch glue.

Framework

For beginners: Hugging Face Transformers

  • Most popular, best documentation
  • Works with almost all models
  • Built-in LoRA support via PEFT

For speed: Unsloth

  • 2-5x faster training
  • Great for consumer GPUs
  • Simple setup

For GPT models: OpenAI API

  • No local setup needed
  • Just upload data and pay per token
  • Limited to OpenAI's models

For large-scale: DeepSpeed

  • Multi-GPU distributed training
  • Complex but powerful
  • Overkill for most projects

Recommendation

Start with Hugging Face Transformers unless you have specific needs. It handles 90% of use cases and has the best learning resources.

Strategy (what you actually change)

| Strategy | Updated parameters | Typical GPU RAM | Strengths | Trade-offs | Popular libs / tools |
|---|---|---|---|---|---|
| LoRA / Q-LoRA | 0.1 – 1 % | 6 – 24 GB | Cheap, fast, works on consumer GPUs, keeps base weights frozen | Slight quality gap vs. full SFT on some tasks | Hugging Face PEFT, bitsandbytes |
| Adapters / Prefix-Tuning | 0.1 – 1 % | 6 – 24 GB | Very small add-on weights (can be swapped/combined at runtime) | Prefix-tuning inserts fixed tokens before each prompt, reducing the available context | HF PEFT, AdapterHub |
| Full fine-tune (supervised fine-tuning, SFT) | 100 % | ≈ 2 GB × (model size in B params), e.g. 70 B → 140 GB | Maximum quality; can alter every weight | Expensive; risk of over-fitting or catastrophic forgetting | HF Trainer, DeepSpeed |
| Continue pre-training (domain-adaptive pre-training, DAPT) | 100 % | Same as SFT, often sharded across many GPUs | Best when data is unlabeled but domain-specific (legal corpora, medical papers) | Requires lots of compute | DeepSpeed-Megatron, MosaicML, Megatron-LM |

Rule: Unless you know exactly why you need something else, use Hugging Face + LoRA.
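To make that recommendation concrete, here is a minimal sketch of the Hugging Face + LoRA path using Transformers and PEFT. The model name, data file and choice of target modules are placeholders, and exact APIs can differ slightly between library versions:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"                      # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Wrap the frozen base model with small trainable LoRA adapter matrices
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()                      # typically well under 1 % of all weights

dataset = load_dataset("json", data_files="train.jsonl")  # the data prepared in step 1
```

The training loop itself is driven by the hyperparameters discussed in the next step.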


3. Configure training hyperparameters

Fine-tuning requires setting several key parameters that control how the model learns from your data.

Essential parameters

Learning rate (how fast the model learns)

  • Full fine-tuning: 1e-5 to 5e-5 (small steps to avoid breaking existing knowledge)
  • LoRA: 1e-4 to 3e-4 (can be more aggressive since base model stays frozen)
  • Too high → model forgets everything, too low → barely learns anything

Batch size (how many examples to process at once)

  • Start with 1-4 per GPU (limited by memory)
  • Use gradient accumulation to simulate larger batches: effective_batch = batch_size × accumulation_steps

Epochs (how many times to see the entire dataset)

  • Usually 1-3 epochs is enough
  • More epochs risk overfitting (memorizing instead of learning)

Sequence length (maximum input/output length)

  • Match your expected use case (512 for short tasks, 2048+ for long documents)
  • Longer sequences need more GPU memory

LoRA-specific settings

  • Rank: 8-16 for most tasks (cheap), 32-64 for complex domains (expensive)
  • Alpha: Usually 2× the rank (r=16 → alpha=32)

The key is starting conservative and adjusting based on results – you can always retrain with different settings if needed.
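Translated into code, these conservative starting values might look like the following Transformers `TrainingArguments` sketch (argument names occasionally change between library versions, so treat it as a starting point rather than a recipe):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,               # LoRA range: 1e-4 – 3e-4
    per_device_train_batch_size=2,    # limited by GPU memory
    gradient_accumulation_steps=8,    # effective batch = 2 × 8 = 16
    num_train_epochs=2,               # 1–3 epochs is usually enough
    logging_steps=10,
    evaluation_strategy="steps",      # track validation loss while training
    eval_steps=100,
    save_strategy="epoch",
)
```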


4. Run and Monitor Fine-tuning

What happens during training

  • The model continues training on your dataset, adjusting its weights slightly
  • Loss curve should decrease smoothly (spikes = learning rate too high)
  • GPU memory usage stays constant
  • Validation metrics show if the model is actually learning

Key metrics to watch

  • Training loss: Should go down over time
  • Validation loss: Should also decrease but may plateau (if it goes up = overfitting)
  • Perplexity: Lower is better (measures how "surprised" the model is by test data)
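Perplexity is simply the exponential of the average cross-entropy loss, so it can be computed directly from the validation loss your trainer reports; a minimal helper:

```python
import math

def perplexity(mean_cross_entropy_loss: float) -> float:
    """For a causal LM, perplexity = exp(mean cross-entropy loss)."""
    return math.exp(mean_cross_entropy_loss)

# Example: a validation loss of 1.80 corresponds to a perplexity of ~6.05
print(perplexity(1.80))
```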

Common issues

  • Loss explodes → reduce learning rate
  • Loss plateaus immediately → increase learning rate or check data format
  • OOM errors → reduce batch size, sequence length, or use gradient checkpointing
  • Model outputs garbage → check data preprocessing, ensure correct tokenizer

5. Evaluate

Qualitative evaluation

  • Test with real examples from your use case
  • Compare outputs to the base model
  • Check for overfitting (does it only work on training examples?)

Quantitative evaluation

  • Hold out 10-20% of data for testing
  • Measure task-specific metrics (accuracy, F1, etc.)
  • Run A/B tests against the base model
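A minimal sketch of the hold-out step, assuming the data lives in a JSONL file (the file name and the exact-match metric are illustrative; a real project would use a task-specific metric such as test pass rate):

```python
from datasets import load_dataset

# Split off a test set BEFORE training so evaluation examples are never seen
data = load_dataset("json", data_files="pairs.jsonl")["train"]
splits = data.train_test_split(test_size=0.15, seed=42)   # ~15 % held out
train_set, test_set = splits["train"], splits["test"]

def exact_match(predictions, references):
    """Toy metric: fraction of predictions that match the reference exactly."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```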

6. Deploy

Key considerations

  • Model size: LoRA adapters are tiny (MBs), full models are large (GBs)
  • Inference speed: Full fine-tuned models may be slower due to larger size
  • Version control: Track which adapter/model version is deployed
  • Rollback strategy: Keep base model as fallback

Deploy with Ollama

  • Export your model to GGUF format:
      • If using LoRA, merge the adapter into the base model first
      • Convert to GGUF using the llama.cpp conversion tools
      • Choose a quantization level (q4_K_M for a balanced size/quality trade-off)
  • Create an Ollama Modelfile (define the model path, temperature, context window size and, if needed, a system prompt); see the example below
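If LoRA was used, PEFT's `merge_and_unload()` is the usual way to fold the adapter into the base weights before conversion. A minimal Modelfile for the last step might then look like this (file name, parameter values and system prompt are placeholders):

```
# Modelfile – registers the quantized model with Ollama
FROM ./my-model-q4_K_M.gguf

PARAMETER temperature 0.2
PARAMETER num_ctx 4096

SYSTEM """You generate pytest unit tests for the given Python code."""
```

Registering and running it is then a matter of `ollama create my-model -f Modelfile` followed by `ollama run my-model`.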

7. Common pitfalls and how to iterate

Quick diagnosis guide

| Problem | Symptoms | Fix |
|---|---|---|
| Bad training data | Model outputs errors, weird formatting | Review 100 samples, filter low-quality examples |
| Overfitting | Perfect on training, fails on new inputs | Use fewer epochs (1-2), add dropout, more diverse data |
| Catastrophic forgetting | Can't do basic tasks anymore | Use LoRA instead of full fine-tune, mix in general data (10-20%) |
| Out of memory | CUDA OOM errors | Enable gradient checkpointing, use QLoRA, reduce batch size (see the sketch below) |
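The out-of-memory fixes map to only a few lines of code; a sketch assuming QLoRA via bitsandbytes (the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA: load the frozen base model in 4-bit to cut memory use roughly 4x
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",       # placeholder base model
    quantization_config=bnb,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade extra compute for lower activation memory
```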

Iteration workflow

  1. Start small: 500 examples + LoRA → test results
  2. Scale what works: More data → higher rank → full fine-tune
  3. Monitor production: Log failures → add to training set → retrain monthly
  4. Know when to stop: < 2% improvement = diminishing returns

8. When NOT to fine-tune

Fine-tuning is expensive to build, run and maintain.
Before you pull that trigger, walk through this checklist - if you answer yes to any item, use a lighter technique instead.

| Question | Why it vetoes fine-tuning | Better alternative |
|---|---|---|
| Can prompt engineering or a few-shot prompt already reach ≥ 90 % of the desired quality? | You're paying compute to hard-code what a prompt can express for free. | System / few-shot prompts, JSON mode |
| Do you have fewer than 100–200 excellent examples? | The model will memorize, not generalize. | Few-shot prompting, RAG |
| Does the task depend on facts that change daily / weekly? | Fine-tuning bakes information in at train time; it is outdated the moment you finish. | RAG, function calling to live APIs |
| Is the goal to "teach" brand-new factual knowledge? | FT mostly adjusts style; it does not inject large knowledge graphs. | RAG, external knowledge bases |
| Will the model need to stay in sync with a frequently updated base model? | Every upstream model release forces you to repeat the FT loop. | Stick with the stock model + prompts |
| Is the expected traffic low (≤ a few hundred calls / day)? | Engineering overhead outweighs runtime gains. | Prompting |

Cheat-sheet of lighter methods

| Need | Drop-in solution |
|---|---|
| Domain-specific facts | Retrieval-Augmented Generation (RAG) |
| Deterministic JSON / XML output | Constrained decoding or structured prompts |
| Corporate tone / style | System prompt + 2–3 in-context examples |
| Up-to-the-minute data | Function calling ↔ APIs / databases |
| Multistep workflows | Orchestration frameworks (LangChain, LangGraph, Semantic Kernel) |
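As a reminder of how lightweight the prompting alternative is, a corporate-tone system prompt plus two in-context examples is just a message list (a hypothetical sketch in the chat format most APIs and local runtimes accept):

```python
messages = [
    {"role": "system", "content": "You write concise pytest unit tests."},
    # Two in-context examples establish tone and output format
    {"role": "user", "content": "def inc(x): return x + 1"},
    {"role": "assistant", "content": "def test_inc():\n    assert inc(1) == 2"},
    {"role": "user", "content": "def is_even(n): return n % 2 == 0"},
    {"role": "assistant", "content": "def test_is_even():\n    assert is_even(4)\n    assert not is_even(3)"},
    # The actual request comes last
    {"role": "user", "content": "def add(a, b): return a + b"},
]
```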

The golden rule:

Fine-tune only when you need the model to consistently behave differently across thousands of interactions, and simpler methods have failed. It's a powerful tool, but often a sledgehammer where a scalpel would do.


9. Fine-tuning for the AMOS AI-Driven Testing Project

Project Context

The AMOS AI-Driven Testing project develops an AI-powered testing system that automatically generates unit tests for Python code. The system:

  • Uses various pre-trained LLMs (Mistral, DeepSeek, Qwen, etc.) via Ollama
  • Generates structured unit tests in pytest/unittest format
  • Evaluates code complexity (CCC/MCC) before and after test generation
  • Collects performance metrics including syntax validity and generation time
  • Provides a modular architecture for extensions
  • Runs on-premise without external API dependencies

Should We Fine-tune? Probably NOT

After analyzing the project requirements and current capabilities, multiple factors argue against fine-tuning:

The "No Fine-tuning" Checklist

| Criterion | Applies to AMOS? | Explanation |
|---|---|---|
| Prompt engineering achieves >90% quality | ✅ YES | Generated tests are already structurally correct and executable |
| Fewer than 100–200 excellent examples | ✅ YES | We have many test cases but lack curated "perfect" training pairs |
| Task is specific and structured | ✅ YES | Unit test generation follows clear patterns (import → class → test methods) |
| Low expected traffic | ✅ YES | Academic project, not millions of requests |
| Frequent base model updates | ✅ YES | The Ollama models we use are regularly updated and improved, so a fine-tuned model would quickly become outdated |

Better Alternatives for Our Project

Instead of fine-tuning, we should leverage lighter-weight methods:

| Improvement Area | Current Status | Recommended Solution |
|---|---|---|
| Prompt Optimization | Basic prompts implemented | Systematic prompt engineering with few-shot examples |
| Context Enhancement | Context-size calculator available | RAG system for similar test patterns |
| Quality Control | Syntax validation implemented | Extended post-processing modules (Compiler + Test Runner) |
| Domain Patterns | Single function handling | Template-based generation for different test types |
| Data Collection | Test examples in ExampleTests/ | Curate high-quality examples for future use |

LLM Chaining: A Superior Alternative

Rather than fine-tuning a single model, implementing LLM chaining as described in our Benefits of Chaining LLMs documentation could yield better results. This approach offers:

  • Iterative refinement through compiler and test runner feedback loops
  • Specialized models for different subtasks (test generation, error fixing, merging)
  • Measurable improvements with mutation coverage increasing from 0% to >80%
  • No training required - works with existing pre-trained models

The chaining architecture with feedback loops naturally addresses many of the quality issues that fine-tuning attempts to solve, but without the associated overhead of training, data curation, and model maintenance. For our use case, the chance of a performance improvement is also significantly higher with LLM chaining than with LLM fine-tuning.

Recommendation Timeline

Short-term (Current Sprints)

  • Focus on prompt optimization and RAG implementation
  • Enhance post-processing modules
  • Collect and manually curate high-quality test examples
  • No fine-tuning investment

Medium-term (After Data Collection)

  • If we accumulate 1000+ verified examples, consider a LoRA pilot on a small model (e.g., Qwen2.5-Coder 3B)
  • Success criterion: ≥5% improvement in test pass rate vs. prompt-only approach

Long-term

  • Full fine-tuning only if we have 10k+ verified examples
  • Consider domain-adaptive pre-training for Python testing specifically

10. Conclusion

For the AMOS AI-Driven Testing project, fine-tuning represents engineering overhead without proportional benefits. The structured nature of unit test generation, already good baseline performance, and academic project constraints all argue against the complexity of a fine-tuning setup.

Final Recommendation: Focus efforts on prompt engineering, RAG integration, and enhanced post-processing modules. These approaches are:

  • Faster to implement
  • More flexible to adapt
  • Free of special hardware requirements
  • Likely to deliver better results for our specific use case

Fine-tuning remains a future option once we have accumulated sufficient high-quality training data, but it should not be a priority for the current project phase.