# Comparison of 1B, 3B, 7B and 14B Models

_From the amosproj/amos2025ss04-ai-driven-testing wiki._

This page documents the performance, resource usage, and test results of large language models (LLMs) ranging from 1B to 14B parameters.
## Evaluated Models
| Metric/Characteristic | 1B Models | 3B Models | 7B Models | 14B Models |
|---|---|---|---|---|
| Model | tinyllama:1.1b | qwen2.5-coder:3b-instruct-q8_0 | mistral_7b-instruct-v0.3-q3_K_M | phi4-reasoning |
| Generation time | 35.85 s | 96.0 s | 207.65 s | 3129.37 s |
| CCC (calculated output test score) | 67 | 102 | 32 | 0 |
| MCC (calculated output test score) | 5 | 9 | 4 | 0 |
| Max RAM used by Docker container | 0.98 GB | 3.90 GB | 4.90 GB | 13.20 GB |
| Download size (LLM image) | 638 MB | 3.29 GB | 3.52 GB | 11.1 GB |
| GitHub Actions run | Link | Link | Link | Link |
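As a worked example of reading the table, the raw measurements can be turned into simple efficiency ratios. The metrics below (CCC per minute of generation, CCC per GB of peak RAM) are illustrative additions, not part of the original evaluation:

```python
# Restates the measured values from the table above and derives
# illustrative efficiency ratios (not part of the original evaluation).

results = {
    "tinyllama:1.1b":                  {"gen_s": 35.85,   "ccc": 67,  "ram_gb": 0.98},
    "qwen2.5-coder:3b-instruct-q8_0":  {"gen_s": 96.0,    "ccc": 102, "ram_gb": 3.90},
    "mistral_7b-instruct-v0.3-q3_K_M": {"gen_s": 207.65,  "ccc": 32,  "ram_gb": 4.90},
    "phi4-reasoning":                  {"gen_s": 3129.37, "ccc": 0,   "ram_gb": 13.20},
}

def efficiency(row):
    """CCC points gained per minute of generation and per GB of peak RAM."""
    return {
        "ccc_per_min": row["ccc"] / (row["gen_s"] / 60),
        "ccc_per_gb": row["ccc"] / row["ram_gb"],
    }

for name, row in results.items():
    e = efficiency(row)
    print(f"{name}: {e['ccc_per_min']:.1f} CCC/min, {e['ccc_per_gb']:.1f} CCC/GB")
```

By either ratio, the 1B and 3B models dominate this particular benchmark, which matches the summary below.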
## Summary
- Smaller models (1B–3B) are significantly faster and more memory-efficient, while still achieving usable output quality (CCC/MCC).
- Larger models (7B–14B) require much more memory and time, yet did not show better test performance in this case.
- Phi‑4 (14B) had the highest resource demand but failed the test-case evaluations, resulting in CCC and MCC scores of 0: the application was unable to extract source code from the model's output.
## Environment
All tests were executed on GitHub Actions CI runners to ensure:
- A controlled, reproducible environment.
- Consistent hardware and memory limitations across model runs.
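The CI workflow's actual measurement method isn't shown on this page. As a minimal sketch of one way to capture a container's peak RAM during a run, the snippet below polls `docker stats`; it assumes the Docker CLI is on PATH, and the container name passed in is hypothetical:

```python
# Sketch: sample a container's peak memory usage via `docker stats`.
# Assumes the Docker CLI is available; the container name is hypothetical.
import re
import subprocess
import time

def parse_mem_usage(stats_field: str) -> float:
    """Parse the 'used' part of a `docker stats` MemUsage field
    (e.g. '1.002GiB / 15.6GiB') into GiB."""
    used = stats_field.split("/")[0].strip()
    m = re.fullmatch(r"([\d.]+)\s*([KMG]i?B)", used)
    if not m:
        raise ValueError(f"unrecognized memory field: {stats_field!r}")
    value, unit = float(m.group(1)), m.group(2)
    factor = {"KiB": 1 / 1024**2, "KB": 1e3 / 1024**3,
              "MiB": 1 / 1024, "MB": 1e6 / 1024**3,
              "GiB": 1.0, "GB": 1e9 / 1024**3}[unit]
    return value * factor

def sample_peak_memory(container: str, duration_s: float,
                       interval_s: float = 1.0) -> float:
    """Poll `docker stats --no-stream` for duration_s seconds and
    return the peak memory usage seen, in GiB."""
    peak = 0.0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["docker", "stats", "--no-stream", "--format",
             "{{.MemUsage}}", container],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        peak = max(peak, parse_mem_usage(out))
        time.sleep(interval_s)
    return peak
```

Polling at one-second intervals can miss short memory spikes; cgroup counters such as `memory.peak` would be more precise if the runner exposes them.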
## 🧠 tinyllama:1.1b (1B model)

- **Architecture & Training:** Pretrained on ~1 trillion tokens (3 epochs), built on the Llama 2 architecture, using FlashAttention and Lit‑GPT for inference speed.
- **Benchmarks:**
  - Average ≈ 52.99 across HellaSwag, WinoGrande, ARC, MMLU and others.
  - MMLU: ~25.9; BBH: ~29.3; HumanEval: ~9.15.
## 🧑‍💻 qwen2.5-coder:3b-instruct-q8_0 (3B model)

- **Family & Training:** Part of the Qwen 2.5‑Coder series, pretrained on more than 5.5 trillion tokens; the family spans 0.5B–32B sizes.
- **Specialization:** Instruction-tuned for code (generation, reasoning, repair) across 40+ programming languages.
- **Benchmarks:** Achieves state-of-the-art results on 10+ code tasks; community reports put Qwen2.5‑Coder‑3B at ≈ 45.1% pass@1 on HumanEval and ≈ 30.2% on MBPP.
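HumanEval figures like the pass@1 quoted above are conventionally computed with the unbiased pass@k estimator introduced in the HumanEval paper (Chen et al., 2021); a minimal sketch:

```python
# Unbiased pass@k estimator (Chen et al., 2021): given n generated
# samples of which c pass the unit tests, estimate the probability
# that at least one of k randomly drawn samples passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For pass@1 this reduces to the plain pass rate c/n, which is why single-sample evaluations can report it directly.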
## mistral_7b-instruct-v0.3-q3_K_M (7B model)

- **Architecture & Licensing:** Fine‑tuned instruct model based on Mistral‑7B‑v0.3, with a 32K context window and function-calling support; Apache 2.0 license.
- **Performance Highlights:** Outperforms Llama 2 13B and even Llama 1 34B on multiple benchmarks; strong on both code and general tasks.
- **Efficiency Benchmarks:** LLM‑Explorer score ≈ 0.44; VRAM footprint ≈ 14.5 GB.
## phi4-reasoning (14B model)

- **Development & Purpose:** Open-source 14B reasoning model from Microsoft's Phi‑4 series (available via Azure), trained with chain-of-thought SFT and, for the "Plus" variant, RLHF.
- **Benchmarks & Comparison:** Excels at complex reasoning: AIME (2022–25), GPQA, OmniMath, Maze, 3SAT, TSP, HumanEvalPlus, MMLU‑Pro, etc.
- **Notable Results:** Outperforms far larger models (e.g., DeepSeek‑R1 at 671B) on AIME and is competitive with o3‑mini and Claude 3.7 Sonnet. Benchmarks also indicate strong edge-device readiness with quantization support.
## 📚 Sources
### 🧠 TinyLlama
- All About TinyLlama 1.1B – Analytics Vidhya
- TinyLlama Evaluation Results – GitHub
- TinyLlama Paper – arXiv:2401.02385
### 👨‍💻 Qwen2.5-Coder-3B
- Qwen2.5-Coder-3B-Instruct GGUF – Hugging Face
- Qwen2.5-Coder Family Blog – QwenLM
- Qwen2.5 Paper – arXiv:2409.12186