# Comparison of 1B, 3B, 7B and 14B Models

_From the amosproj/amos2025ss04-ai-driven-testing wiki._

This page documents the performance, resource usage, and test results of large language models (LLMs) ranging from 1B to 14B parameters.
## Evaluated Models
| Metric/Characteristic | 1B Models | 3B Models | 7B Models | 14B Models |
|---|---|---|---|---|
| Model | tinyllama:1.1b | qwen2.5-coder:3b-instruct-q8_0 | mistral_7b-instruct-v0.3-q3_K_M | phi4-reasoning |
| Generation time | 35.85 s | 96.0 s | 207.65 s | 3129.37 s |
| CCC (calculated output test score) | 67 | 102 | 32 | 0 |
| MCC (calculated output test score) | 5 | 9 | 4 | 0 |
| Max RAM used by Docker container | 0.98 GB | 3.90 GB | 4.90 GB | 13.20 GB |
| Download size (LLM image) | 638 MB | 3.29 GB | 3.52 GB | 11.1 GB |
| GitHub Actions run | Link | Link | Link | Link |
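As a worked example of reading the table, the raw measurements can be turned into simple efficiency ratios. The metrics below (CCC per minute of generation, CCC per GB of peak RAM) are illustrative additions, not part of the original evaluation:

```python
# Restates the measured values from the table above and derives
# illustrative efficiency ratios (not part of the original evaluation).

results = {
    "tinyllama:1.1b":                  {"gen_s": 35.85,   "ccc": 67,  "ram_gb": 0.98},
    "qwen2.5-coder:3b-instruct-q8_0":  {"gen_s": 96.0,    "ccc": 102, "ram_gb": 3.90},
    "mistral_7b-instruct-v0.3-q3_K_M": {"gen_s": 207.65,  "ccc": 32,  "ram_gb": 4.90},
    "phi4-reasoning":                  {"gen_s": 3129.37, "ccc": 0,   "ram_gb": 13.20},
}

def efficiency(row):
    """CCC points gained per minute of generation and per GB of peak RAM."""
    return {
        "ccc_per_min": row["ccc"] / (row["gen_s"] / 60),
        "ccc_per_gb": row["ccc"] / row["ram_gb"],
    }

for name, row in results.items():
    e = efficiency(row)
    print(f"{name}: {e['ccc_per_min']:.1f} CCC/min, {e['ccc_per_gb']:.1f} CCC/GB")
```

By either ratio, the 1B and 3B models dominate this particular benchmark, which matches the summary below.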
## Summary
- Smaller models (1B–3B) are significantly faster and more memory-efficient, while still achieving usable output quality (CCC/MCC).
- Larger models (7B–14B) require much more memory and time, yet did not show better test performance in this case.
- Phi‑4 (14B) had the highest resource demand but failed the test-case evaluations, resulting in CCC and MCC scores of 0: the application was unable to extract source code from the model's output.
## Environment
All tests were executed on GitHub Actions CI runners to ensure:
- A controlled, reproducible environment.
- Consistent hardware and memory limitations across model runs.
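The CI workflow's actual measurement method isn't shown on this page. As a minimal sketch of one way to capture a container's peak RAM during a run, the snippet below polls `docker stats`; it assumes the Docker CLI is on PATH, and the container name passed in is hypothetical:

```python
# Sketch: sample a container's peak memory usage via `docker stats`.
# Assumes the Docker CLI is available; the container name is hypothetical.
import re
import subprocess
import time

def parse_mem_usage(stats_field: str) -> float:
    """Parse the 'used' part of a `docker stats` MemUsage field
    (e.g. '1.002GiB / 15.6GiB') into GiB."""
    used = stats_field.split("/")[0].strip()
    m = re.fullmatch(r"([\d.]+)\s*([KMG]i?B)", used)
    if not m:
        raise ValueError(f"unrecognized memory field: {stats_field!r}")
    value, unit = float(m.group(1)), m.group(2)
    factor = {"KiB": 1 / 1024**2, "KB": 1e3 / 1024**3,
              "MiB": 1 / 1024, "MB": 1e6 / 1024**3,
              "GiB": 1.0, "GB": 1e9 / 1024**3}[unit]
    return value * factor

def sample_peak_memory(container: str, duration_s: float,
                       interval_s: float = 1.0) -> float:
    """Poll `docker stats --no-stream` for duration_s seconds and
    return the peak memory usage seen, in GiB."""
    peak = 0.0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["docker", "stats", "--no-stream", "--format",
             "{{.MemUsage}}", container],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        peak = max(peak, parse_mem_usage(out))
        time.sleep(interval_s)
    return peak
```

Polling at one-second intervals can miss short memory spikes; cgroup counters such as `memory.peak` would be more precise if the runner exposes them.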
## 🧠 tinyllama:1.1b (1B model)

- **Architecture & Training:** Pretrained on ~1 trillion tokens (3 epochs), built on the Llama 2 architecture, using FlashAttention and Lit‑GPT for inference speed.
- **Benchmarks:**
  - Average ≈ 52.99 across HellaSwag, WinoGrande, ARC, MMLU and others.
  - MMLU: ~25.9; BBH: ~29.3; HumanEval: ~9.15.
## 🧑‍💻 qwen2.5-coder:3b-instruct-q8_0 (3B model)

- **Family & Training:** Part of the Qwen 2.5‑Coder series, pretrained on more than 5.5 trillion tokens; the family spans 0.5B–32B sizes.
- **Specialization:** Instruction-tuned for code (generation, reasoning, repair) across 40+ programming languages.
- **Benchmarks:** Achieves state-of-the-art results on 10+ code tasks; community reports put Qwen2.5‑Coder‑3B at ≈ 45.1% pass@1 on HumanEval and ≈ 30.2% on MBPP.
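HumanEval figures like the pass@1 quoted above are conventionally computed with the unbiased pass@k estimator introduced in the HumanEval paper (Chen et al., 2021); a minimal sketch:

```python
# Unbiased pass@k estimator (Chen et al., 2021): given n generated
# samples of which c pass the unit tests, estimate the probability
# that at least one of k randomly drawn samples passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For pass@1 this reduces to the plain pass rate c/n, which is why single-sample evaluations can report it directly.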
## mistral_7b-instruct-v0.3-q3_K_M (7B model)

- **Architecture & Licensing:** Fine‑tuned instruct model based on Mistral‑7B‑v0.3, with a 32K context window and function-calling support; Apache 2.0 license.
- **Performance Highlights:** Outperforms Llama 2 13B and even Llama 1 34B on multiple benchmarks; strong on both code and general tasks.
- **Efficiency Benchmarks:** LLM‑Explorer score ≈ 0.44; VRAM footprint ≈ 14.5 GB.
## phi4-reasoning (14B model)

- **Development & Purpose:** Open-source 14B reasoning model from Microsoft's Phi‑4 series (available via Azure), trained with chain-of-thought SFT and, for the "Plus" variant, RLHF.
- **Benchmarks & Comparison:** Excels at complex reasoning: AIME (2022–25), GPQA, OmniMath, Maze, 3SAT, TSP, HumanEvalPlus, MMLU‑Pro, etc.
- **Notable Results:** Outperforms far larger models (e.g., DeepSeek‑R1 at 671B) on AIME and is competitive with o3‑mini and Claude 3.7 Sonnet. Benchmarks also indicate strong edge-device readiness with quantization support.
## 📚 Sources
### 🧠 TinyLlama
- All About TinyLlama 1.1B – Analytics Vidhya
- TinyLlama Evaluation Results – GitHub
- TinyLlama Paper – arXiv:2401.02385
### 👨‍💻 Qwen2.5-Coder-3B
- Qwen2.5-Coder-3B-Instruct GGUF – Hugging Face
- Qwen2.5-Coder Family Blog – QwenLM
- Qwen2.5 Paper – arXiv:2409.12186