fine-tune models
Fine-tuning Gemma 3 27B (27 billion parameters) requires a substantial amount of GPU memory, and the exact requirement depends on the fine-tuning method, sequence length, batch size, and precision (e.g., fp16, bf16, or quantized formats).
Here’s a detailed breakdown based on different approaches:
⸻
🔹 1. Full Fine-tuning (FP16 or BF16)
This is the most GPU-intensive approach.
- Memory per parameter: 2 bytes (fp16/bf16) → ~54 GB just for the model weights.
- Additional memory (activations, gradients, optimizer states): typically 3–4x the model size.
- Total GPU memory needed: ~200–250 GB in aggregate
  ➤ Typically requires multiple A100 80GB GPUs (e.g., 4–8 A100 80GBs or more); the arithmetic is sketched below.
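As a rough sanity check, the arithmetic behind these numbers can be written out in a few lines of Python. The 3–4x overhead multiplier is taken from the rule of thumb above rather than measured, so treat the output as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope memory estimate for full fine-tuning.
# Assumption: 2 bytes/param for fp16/bf16 weights, plus roughly 3-4x
# the weight size for gradients, optimizer states, and activations.

GB = 1e9  # decimal gigabytes, matching the ~54 GB weight figure above

def full_finetune_memory_gb(num_params: float, overhead: float) -> float:
    weights_gb = 2 * num_params / GB        # fp16/bf16 weights
    return weights_gb * (1 + overhead)      # + gradients, optimizer, activations

if __name__ == "__main__":
    for overhead in (3.0, 4.0):
        total = full_finetune_memory_gb(27e9, overhead)  # Gemma 3 27B
        print(f"overhead {overhead:.0f}x -> ~{total:.0f} GB total")
```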
⸻
🔹 2. Parameter-Efficient Fine-Tuning (PEFT) – e.g., LoRA
This reduces memory usage significantly by freezing most weights and training small adapter matrices.
- Estimated GPU memory: 48–80 GB, depending on batch size and sequence length
  ➤ Can run on 1–2 A100 80GB or similar GPUs (e.g., H100, MI300X)
- With 4-bit quantization (QLoRA): roughly 24–48 GB, so a single 48 GB card (e.g., A6000, L40S) or even a 24 GB consumer GPU (e.g., RTX 4090) can handle it with small batch sizes and short sequences; a minimal sketch follows below.
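For the QLoRA path, a minimal sketch with Hugging Face transformers, peft, and bitsandbytes might look like the following. The model id, LoRA rank, and target module names are illustrative assumptions, not verified settings for Gemma 3.

```python
# Minimal QLoRA sketch: 4-bit quantized base model + trainable LoRA adapters.
# Assumes transformers, peft, and bitsandbytes are installed and the GPU
# has enough memory for the chosen batch size and sequence length.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3-27b-it"  # assumed Hugging Face model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # spread layers over available GPUs
)

lora_config = LoraConfig(
    r=16,                                   # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the small adapters train
```

From here, training would typically proceed with a standard Trainer or SFTTrainer loop; batch size and sequence length are what push the 24–48 GB figure up or down.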
⸻
🔹 3. Inference (not fine-tuning)
Just for context:
- Can often run on a single A100 80GB, or split across multiple 40 GB GPUs; a sharded-inference sketch follows below.
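Splitting 16-bit inference across several GPUs is usually just a matter of letting accelerate place the layers. Again, the model id is an assumption.

```python
# 16-bit inference sketch, sharding the model across visible GPUs.
# device_map="auto" lets accelerate split layers over multiple devices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-27b-it"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # ~2 bytes/param -> ~54 GB of weights
    device_map="auto",            # shard across all visible GPUs
)

inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```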
⸻
Summary Table:
| Fine-tuning Type    | GPU Memory Needed | Notes                    |
|---------------------|-------------------|--------------------------|
| Full Fine-tuning    | 200–250 GB        | Requires multi-GPU setup |
| LoRA (16-bit)       | 48–80 GB          | 1–2 high-end GPUs        |
| LoRA (4-bit QLoRA)  | 24–48 GB          | Consumer GPUs possible   |
| Inference (16-bit)  | 64–80 GB          | Usually single A100/H100 |