# AI Models Guide
Comprehensive comparison of AI models covered in the Prompt Library: pricing, capabilities, latency, rate limits, and selection guidance.
- Claude (Anthropic)
- GPT (OpenAI)
- Gemini (Google)
- Llama (Meta)
- Mistral
- Pricing Comparison Table
- Benchmark Comparisons by Task
- Latency Comparison
- Rate Limits
- Decision Tree: Which Model to Use
- Cost Optimization Strategies
- Retired Models
## Claude (Anthropic)

Claude models excel at instruction following, long-context analysis, creative writing, and coding. They respond exceptionally well to structured prompts using XML tags.
| Model | Context | Input Price | Output Price | Best For |
|---|---|---|---|---|
| Opus 4.6 | 1M tokens | $5.00/M | $25.00/M | Deep reasoning, multi-agent, massive context analysis |
| Sonnet 4.6 | 200K tokens | $3.00/M | $15.00/M | Balanced workhorse: coding, design, knowledge work |
| Haiku 4.5 | 200K tokens | $1.00/M | $5.00/M | Fast response, near-frontier quality, high-volume |
**XML Tags** – Claude treats XML tags as semantic containers. High-value patterns:

```xml
<instructions>Primary task definition</instructions>
<context>Background information</context>
<constraints>Hard rules that override other instructions</constraints>
<examples>Input/output examples</examples>
```

**Extended Thinking** – Available on Opus 4.6, allows the model to "think" through complex problems before responding. Triggers deeper reasoning at the cost of higher latency and token usage.
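A minimal sketch of enabling extended thinking via the Anthropic Python SDK; the model ID follows this guide's naming and the thinking budget is an arbitrary assumption:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",  # assumption: ID per this guide's naming
    max_tokens=4096,
    # Reserve a token budget for the model's internal reasoning.
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Plan a multi-agent review pipeline."}],
)

# With thinking enabled, response.content holds thinking blocks followed by
# the final answer; the last block is the visible text.
print(response.content[-1].text)
```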
**Prefill Technique** – Start Claude's response to steer format and tone:

```json
{
  "messages": [
    {"role": "user", "content": "Analyze this data..."},
    {"role": "assistant", "content": "{\"analysis\":"}
  ]
}
```

**Choosing a Claude model:**

- Opus 4.6 – when you need the strongest reasoning, are processing documents over 200K tokens, or are running multi-agent workflows
- Sonnet 4.6 – the workhorse; use it for 90% of tasks (coding, analysis, writing, tool use)
- Haiku 4.5 – high-volume classification, routing, quick extractions, or wherever latency and cost matter most
## GPT (OpenAI)

GPT models provide strong general-purpose performance, structured outputs, and function calling.
| Model | Notes |
|---|---|
| GPT-5.4 | Most capable GPT, "Thinking" mode available |
| GPT-5.4 Pro | Maximum quality tier, premium pricing |
| GPT-5.3 Instant | Default ChatGPT model: fast everyday workhorse |
| GPT-5.3-Codex | Agentic coding model |
| GPT-5.2-Codex | Previous generation coding model |
**JSON Mode** – Force output to be valid JSON:

```json
{
  "response_format": { "type": "json_object" }
}
```

**Structured Outputs** – Define an exact JSON schema the model must follow:

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "analysis",
      "schema": {
        "type": "object",
        "properties": {
          "summary": { "type": "string" },
          "score": { "type": "number" }
        },
        "required": ["summary", "score"]
      }
    }
  }
}
```

**Function Calling** – Define tools the model can invoke, enabling agent-like behavior with structured tool use.
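A minimal sketch with the OpenAI Python SDK; the `get_weather` tool and the model ID are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, not a real API
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-5.3-instant",  # assumption: model ID per this guide's naming
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# If the model decided to call the tool, the arguments arrive as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```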
## Gemini (Google)

Gemini models offer native multimodal capabilities, Google Search grounding, and code execution.
| Model | Notes |
|---|---|
| Gemini 3 Pro | Frontier reasoning, multimodal, agentic |
| Gemini 3 Flash | New default: fast and capable |
| Gemini 2.5 Pro | $1.25/$10.00 per M tokens, 1M+ context; deprecated June 2026 |
| Gemini 2.5 Flash | $0.30/$2.50 per M tokens, 1M context; deprecated June 2026 |
**Multimodal Input** – Native support for images, video, and audio in prompts. No need to describe visual content in text.

**Google Search Grounding** – Ground responses in real-time web search results for up-to-date information.

**Code Execution** – The model can execute code as part of its response, useful for data analysis and computation tasks.
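A hedged sketch of Search grounding with the `google-genai` SDK; the model ID follows this guide's naming and is an assumption:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-flash",  # assumption: model ID per this guide's naming
    contents="Summarize this week's developments in EU AI regulation.",
    config=types.GenerateContentConfig(
        # Ground the answer in live web results; code execution is enabled
        # the same way with types.Tool(code_execution=types.ToolCodeExecution()).
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```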
## Llama (Meta)

Open-weight models you can self-host. There are no API costs; you pay only for compute.
| Model | Parameters | Context | Notes |
|---|---|---|---|
| Llama 4 Scout | MoE | 10M tokens | Fits on a single H100 |
| Llama 4 Maverick | 400B total (MoE) | Large | Beats GPT-4o on benchmarks, open-weight |
| Llama 4 Behemoth | 2T (MoE) | Large | Preview only, teacher model |
| Llama 3.3 | 70B (dense) | 128K | Best for fine-tuning, mature ecosystem |
| Llama 3.2 | 1Bโ3B | Varies | Edge/mobile deployment |
**Key characteristics:**

- Open-weight – download and run on your own infrastructure
- No API costs – only compute costs (GPU rental or owned hardware)
- Fine-tuning – customize for your specific domain
- MoE architecture (Llama 4) – high parameter count with efficient inference
- Privacy – data never leaves your infrastructure

**When to choose Llama** (a minimal self-hosting sketch follows the list):

- Data must stay on your infrastructure (regulatory, privacy)
- High-volume inference where API costs would be prohibitive
- Fine-tuning for specialized domains
- Edge/mobile deployment (Llama 3.2)
- Research and experimentation
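A minimal self-hosting sketch with vLLM; the model ID and GPU count are assumptions to adapt to your hardware:

```python
from vllm import LLM, SamplingParams

# Download the open weights from Hugging Face and shard across 4 GPUs.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain mixture-of-experts inference."], params)
print(outputs[0].outputs[0].text)
```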
## Mistral

A European AI lab offering a range of models from tiny to frontier, with excellent price-to-performance ratios.
| Model | Parameters | Notes |
|---|---|---|
| Mistral Large 3 | MoE, 41B active / 675B total | Frontier quality; scores 9.4/10 overall |
| Mistral Medium 3 | – | $0.40/$2.00 per M tokens; best value in its class |
| Codestral | – | 86.6% HumanEval, 80+ languages, 256K context |
| Devstral 2 | – | Agentic coding model |
| Magistral | – | Reasoning-focused model |
| Pixtral | – | Vision-capable model |
| Ministral 3 | – | Tiny, fast, edge deployment |
- Best price-to-performance: Mistral Medium 3 at $0.40/M input offers GPT-4-class quality at 1/5 the cost
- Multilingual excellence: Strong across European and global languages
- MoE architecture: Efficient inference with high parameter counts
- Code specialization: Codestral and Devstral for development workflows
- EU data sovereignty: European hosting options
## Pricing Comparison Table

Prices per million tokens (April 2026):
| Model | Input | Output | Context | Best For |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | Deep reasoning, multi-agent |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 200K | Balanced workhorse |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | High-volume, fast |
| GPT-5.4 | varies | varies | Large | Most capable GPT |
| GPT-5.3 Instant | mid | mid | Large | Everyday tasks |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M+ | Long context (deprecated June 2026) |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | Budget (deprecated June 2026) |
| Mistral Medium 3 | $0.40 | $2.00 | Large | Best value |
| Llama 4 Maverick | Free* | Free* | Large | Self-hosted |
| Llama 4 Scout | Free* | Free* | 10M | Extreme context |
*Llama models are open-weight; you pay only for compute (GPU hosting/rental).
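Per-request cost is token count divided by a million, times the table rate. A small helper, using the Claude Sonnet 4.6 prices from the table above:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Dollar cost of one request; prices are $ per million tokens."""
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price

# A 2,000-token prompt with an 800-token reply on Claude Sonnet 4.6:
# 2000/1M * $3.00 + 800/1M * $15.00 = $0.018 per request.
print(request_cost(2_000, 800, 3.00, 15.00))  # 0.018
```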
Batch API Discounts (a submission sketch follows the list):

- OpenAI Batch API: 50% off (24hr turnaround)
- Anthropic Batches API: 50% off (24hr turnaround)
- Google Batch API: varies by model
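As an illustration, submitting a discounted batch via the OpenAI Batch API; `requests.jsonl` is an assumed file with one JSON-encoded request per line:

```python
from openai import OpenAI

client = OpenAI()

# Upload the JSONL file of requests.
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),  # illustrative file name
    purpose="batch",
)

# Queue the batch; results come back within the 24h window at 50% off.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```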
## Benchmark Comparisons by Task

A practitioner's view of relative performance per task category:
| Task | Tier 1 (Best) | Tier 2 | Tier 3 |
|---|---|---|---|
| Complex reasoning | Claude Opus 4.6, GPT-5.4 Pro | Gemini 3 Pro, Mistral Large 3 | Llama 4 Maverick, Magistral |
| Code generation | Claude Sonnet 4.6, GPT-5.3-Codex | Codestral, Devstral 2 | Gemini 3 Flash, Llama 4 Maverick |
| Instruction following | Claude Sonnet/Opus 4.6 | GPT-5.4, Gemini 3 Pro | Mistral Large 3 |
| Creative writing | Claude Opus 4.6, GPT-5.4 | Gemini 3 Pro | Mistral Large 3 |
| Data extraction | GPT-5.4 (structured outputs) | Claude Sonnet 4.6 | Gemini 3 Flash, Mistral Medium 3 |
| Long document analysis | Claude Opus 4.6 (1M), Llama 4 Scout (10M) | Gemini 3 Pro | GPT-5.4 |
| Multilingual | Gemini 3 Pro, Mistral Large 3 | GPT-5.4, Claude 4.6 | Llama 4 |
| Vision (images) | Gemini 3 Pro, GPT-5.4 | Claude Sonnet 4.6, Pixtral | Llama 4 Maverick |
| Video understanding | Gemini 3 Pro (native) | GPT-5.4 | Llama 4 Maverick |
| Agentic coding | GPT-5.3-Codex, Devstral 2 | Claude Sonnet 4.6 | Codestral |
| Classification (volume) | Gemini 3 Flash, Mistral Medium 3 | Claude Haiku 4.5 | Ministral 3, Llama 3.2 |
| Chain-of-thought | GPT-5.4 Thinking, Magistral | Claude Opus 4.6 (extended thinking) | Gemini 3 Pro |
| Safety/refusal | Claude (most careful) | GPT-5.4 | Gemini, Mistral |
## Latency Comparison

Approximate ranges for typical requests (varies by region, load, and prompt length):
| Model Tier | TTFT (median) | Throughput | Examples |
|---|---|---|---|
| Fast/economy | ~150–300 ms | ~100–150 tok/s | Gemini 3 Flash, Claude Haiku 4.5, Mistral Medium 3, GPT-5.3 Instant |
| Balanced | ~300–600 ms | ~50–80 tok/s | Claude Sonnet 4.6, GPT-5.4, Gemini 3 Pro, Mistral Large 3 |
| Frontier/reasoning | ~500–1000 ms | ~30–50 tok/s | Claude Opus 4.6, GPT-5.4 Pro, thinking/reasoning modes |
| Self-hosted (A100/H100) | ~200–500 ms | ~40–100 tok/s | Llama 4 Scout, Llama 3.3 70B |
TTFT = Time To First Token. These are rough medians for guidance, not SLAs.
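To benchmark TTFT yourself, stream the response and time the first content token. A hedged sketch with the OpenAI SDK (the model ID is an assumption; the pattern carries over to any streaming API):

```python
import time

from openai import OpenAI

client = OpenAI()
start = time.perf_counter()

stream = client.chat.completions.create(
    model="gpt-5.3-instant",  # assumption: model ID per this guide's naming
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)

for chunk in stream:
    # The first chunk that carries content marks time-to-first-token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```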
## Rate Limits

| Provider | Free Tier | Paid Tier (Typical) | Enterprise |
|---|---|---|---|
| OpenAI | 3 RPM, 200 RPD | 500–10K RPM | Custom |
| Anthropic | 5 RPM, 300 RPD | 1K–4K RPM | Custom |
| Google | 15 RPM, 1500 RPD | 360–1000 RPM | Custom |
| Mistral | 1 RPM | 100–500 RPM | Custom |
RPM = requests per minute, RPD = requests per day. Limits vary by model within each provider.
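When a limit is exceeded, providers return HTTP 429. A provider-agnostic sketch of exponential backoff with jitter (swap in your own SDK's rate-limit error class):

```python
import random
import time

from openai import RateLimitError  # substitute your provider's 429 error class


def with_backoff(call, max_retries: int = 5):
    """Retry `call` on rate-limit errors, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... + jitter
    raise RuntimeError("Rate limit: retries exhausted")


# Usage: with_backoff(lambda: client.chat.completions.create(...))
```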
## Decision Tree: Which Model to Use

```text
START: What is your primary requirement?

[Data must stay on your infrastructure?]
  YES → Llama 3.3 70B (quality) or Llama 3.2 3B (edge/mobile)
  NO  → continue

[Processing video or audio natively?]
  YES → Gemini 3 Pro (native video/audio)
  NO  → continue

[Documents exceeding 200K tokens?]
  YES → Claude Opus 4.6 (1M) or Llama 4 Scout (10M)
  NO  → continue

[Need guaranteed JSON schema compliance?]
  YES → GPT-5.4 with structured outputs
  NO  → continue

[Complex reasoning or long-form writing?]
  YES → Claude Sonnet (value) or Opus (maximum quality)
  NO  → continue

[High-volume, cost-sensitive (>10K req/day)?]
  YES → How complex?
        Simple   → Gemini Flash or Mistral Medium 3
        Moderate → Claude Haiku 4.5
        Complex  → Claude Sonnet with batching
  NO  → continue

[Code generation or review?]
  YES → Claude Sonnet 4.6, GPT-5.3-Codex, or Codestral
  NO  → continue

[Default / general purpose]
  Budget  → Mistral Medium 3
  Quality → Claude Sonnet 4.6 or GPT-5.4
  Maximum → Claude Opus 4.6
```
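The same tree condensed into a routing function, as a sketch; the `Requirements` fields and model ID strings are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    self_hosted: bool = False       # data must stay on your infrastructure
    video_or_audio: bool = False    # native multimodal input
    context_tokens: int = 0         # size of the largest document
    strict_json_schema: bool = False
    complex_reasoning: bool = False
    high_volume: bool = False       # >10K requests/day, cost-sensitive


def choose_model(r: Requirements) -> str:
    """Walk the decision tree above top to bottom; the first match wins."""
    if r.self_hosted:
        return "llama-3.3-70b"
    if r.video_or_audio:
        return "gemini-3-pro"
    if r.context_tokens > 1_000_000:
        return "llama-4-scout"
    if r.context_tokens > 200_000:
        return "claude-opus-4.6"
    if r.strict_json_schema:
        return "gpt-5.4"
    if r.complex_reasoning:
        return "claude-sonnet-4.6"
    if r.high_volume:
        return "gemini-3-flash"
    return "claude-sonnet-4.6"  # the quality default per the tree
```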
## Cost Optimization Strategies

**Complexity routing** – Use a cheap model to classify request complexity, then route to the appropriate model:
```python
# Classify with a cheap model ("cheap_model" stands in for your SDK client).
classification = cheap_model.classify(
    request, categories=["simple", "moderate", "complex"]
)

model_map = {
    "simple": "gemini-flash",    # lowest cost
    "moderate": "claude-haiku",  # balanced
    "complex": "claude-sonnet",  # highest quality
}
model = model_map[classification]
```

**Quality escalation** – Try the cheap model first and escalate only when the output fails a quality check:

```python
response = cheap_model.generate(prompt)
if not passes_quality_check(response):
    response = expensive_model.generate(prompt)  # escalate
```

**Task-based routing** – Route by task type instead of per-request complexity:

```python
task_routing = {
    "classification": "gemini-flash",
    "extraction": "mistral-medium",
    "summarization": "claude-haiku",
    "reasoning": "claude-sonnet",
    "code": "codestral",
}
```

A well-implemented router saves 60–80% compared to sending everything to the most expensive model.
## Retired Models

Do not use these in new projects:
| Model | Status |
|---|---|
| GPT-4o | Retired |
| GPT-4.1 | Retired |
| GPT-4.1 mini | Retired |
| GPT-4 Turbo | Retired |
| o4-mini | Retired |
| Gemini 2.0 Flash | Retired |
| Claude 3.5 Haiku | Retired |
Navigation: ← Prompting Techniques | Tools: Linter, Optimizer, Recommender →