Comparison of 1b, 3b, 7b and 14b models - amosproj/amos2025ss04-ai-driven-testing GitHub Wiki

Comparison of LLM Sizes and Performance

This page documents the performance, resource usage, and test results of various large language models (LLMs) of different sizes, ranging from 1B to 14B parameters.

Evaluated Models

| Metric/Characteristic | 1B Models | 3B Models | 7B Models | 14B Models |
|---|---|---|---|---|
| Model | tinyllama:1.1b | qwen2.5-coder:3b-instruct-q8_0 | mistral_7b-instruct-v0.3-q3_K_M | phi4-reasoning |
| Generation time | 35.85 s | 96.0 s | 207.65 s | 3129.37 s |
| CCC (calculated output test score) | 67 | 102 | 32 | 0 |
| MCC (calculated output test score) | 5 | 9 | 4 | 0 |
| Max RAM used by Docker container | 0.98 GB | 3.90 GB | 4.9 GB | 13.20 GB |
| Download size (LLM image) | 638 MB | 3.29 GB | 3.52 GB | 11.1 GB |
| GitHub Actions run | Link | Link | Link | Link |
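As a rough cost-effectiveness view of the figures above, the CCC score can be related to generation time. A minimal Python sketch using the table's values (the dictionary layout and the CCC-per-second metric are illustrative, not part of the project's tooling):

```python
# Benchmark figures copied from the comparison table (time in seconds, RAM in GB).
results = {
    "tinyllama:1.1b":                  {"time_s": 35.85,   "ccc": 67,  "ram_gb": 0.98},
    "qwen2.5-coder:3b-instruct-q8_0":  {"time_s": 96.0,    "ccc": 102, "ram_gb": 3.90},
    "mistral_7b-instruct-v0.3-q3_K_M": {"time_s": 207.65,  "ccc": 32,  "ram_gb": 4.9},
    "phi4-reasoning":                  {"time_s": 3129.37, "ccc": 0,   "ram_gb": 13.20},
}

def ccc_per_second(entry: dict) -> float:
    """CCC test score divided by generation time: higher means more score per second spent."""
    return entry["ccc"] / entry["time_s"]

# Rank the models by cost-effectiveness, best first.
for name, entry in sorted(results.items(), key=lambda kv: ccc_per_second(kv[1]), reverse=True):
    print(f"{name:34s} {ccc_per_second(entry):.3f} CCC/s")
```

By this measure the 1B model is the most cost-effective despite the 3B model's higher absolute CCC, and phi4-reasoning scores zero regardless of its runtime.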

Summary

  • Smaller models (1B–3B) ran significantly faster and used far less memory, while still producing usable output quality (CCC/MCC).
  • Larger models (7B–14B) required much more memory and time but did not deliver better test performance in this comparison.
  • phi4-reasoning (14B) had the highest resource demand yet failed the test-case evaluations (CCC and MCC of 0): the application was unable to extract source code from its output.

Environment

All tests were executed on GitHub Actions CI runners to ensure:

  • A controlled, reproducible environment.
  • Consistent hardware and memory limitations across model runs.

🧠 tinyllama:1.1b (1B model)

  • Architecture & Training
    Pretrained on ~1 trillion tokens (3 epochs), built on Llama 2 architecture using FlashAttention and Lit‑GPT for inference speed.
  • Benchmarks
    • Average ≈ 52.99 on HellaSwag, WinoGrande, ARC, MMLU and others.
    • MMLU: ~25.9; BBH: ~29.3; HumanEval: ~9.15.

🧑‍💻 qwen2.5-coder:3b-instruct-q8_0 (3B model)

  • Family & Training
    Part of the Qwen2.5‑Coder series, pretrained on more than 5.5 trillion tokens; the series is released in sizes from 0.5B to 32B parameters.
  • Specialization
    Instruction-tuned for code (generation, reasoning, repair) spanning 40+ languages.
  • Benchmarks
    Achieves state-of-the-art results across ten or more code tasks; community reports put Qwen2.5‑Coder‑3B at ≈ 45.1% Pass@1 on HumanEval and ≈ 30.2% on MBPP.
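Pass@1, cited above, is the fraction of problems solved by a single sampled completion. When several samples are generated per task, the standard unbiased pass@k estimator (introduced with HumanEval) can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generated samples of which c are correct,
    solves the problem."""
    if n - c < k:
        # Fewer incorrect samples than k: some draw must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples per task, 3 of them correct, k = 1 -> 3/10
print(pass_at_k(10, 3, 1))
```

Averaging `pass_at_k` over all benchmark tasks yields the percentage figures reported for HumanEval and MBPP.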

mistral_7b-instruct-v0.3-q3_K_M (7B model)

  • Architecture & Licensing
    Fine‑tuned instruct model based on Mistral‑7B‑v0.3, with 32K context window and function-calling support; Apache 2.0 license.
  • Performance Highlights
    Outperforms Llama 2 13B and even Llama 1 34B on multiple benchmarks; strong across code and general tasks.
  • Efficiency Benchmarks
    LLM‑Explorer score ≈ 0.44; VRAM footprint ≈ 14.5 GB.

phi4-reasoning (14B model)

  • Development & Purpose
    Open-weight 14B reasoning model from Microsoft's Phi‑4 series, trained with chain-of-thought SFT; the "Plus" variant adds RLHF.
  • Benchmarks & Comparison
    Excelled on complex reasoning: AIME (2022–25), GPQA, OmniMath, Maze, 3SAT, TSP, HumanEvalPlus, MMLU‑Pro, etc.
  • Notable Results
    Outperforms much larger models (e.g., DeepSeek‑R1 at 671B parameters) on AIME and is competitive with o3‑mini and Claude 3.7 Sonnet.
    Benchmarks also indicate good edge-device readiness thanks to quantization support.

📚 Sources

🧠 TinyLlama

👨‍💻 Qwen2.5-Coder-3B

🧾 Mistral-7B-Instruct v0.3

🧠 Phi-4 Reasoning