# Tokens per second
## Expectations
M1 Max MacBook:
- Llama-2 70B: ~7 TPS
- Deepseek-coder 33B: ~15 TPS
- 13B models (e.g., orca2:13b): ~35 TPS

Other hardware:
- Small ~1.5B models (e.g., deepseek-r1:1.5b): ~100 TPS on a Ryzen 5 7600X CPU-only PC
- High-end GPUs: users running NVIDIA-optimized stacks on an H100 have reported 1500+ TPS

A rough sanity check of these numbers is sketched below.
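Token generation is largely memory-bandwidth bound, so a back-of-the-envelope estimate lands close to the measured figures. The sketch below is an illustration, not a measurement; the bandwidth and bytes-per-parameter values are assumptions for an M1 Max running a 4-bit-quantized Llama-2 70B.

```python
# Rough TPS ceiling for bandwidth-bound decoding (all numbers are assumptions).
mem_bandwidth_gb_s = 400     # M1 Max unified memory bandwidth, ~400 GB/s
params_billion = 70          # Llama-2 70B
bytes_per_param = 0.56       # ~4.5 bits/param for a Q4_K_M-style quantization

weights_gb = params_billion * bytes_per_param    # ~39 GB of weights
tps_ceiling = mem_bandwidth_gb_s / weights_gb    # each new token reads all weights once

print(f"Theoretical ceiling: ~{tps_ceiling:.0f} TPS")  # ~10 TPS; ~7 TPS observed in practice
```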
## An example
```
python measure_ollama_tps.py
Testing model: gemma3:27b
Prompt length: 41 characters
Running 3 iterations...
Progress: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:37<00:00, 32.43s/it]

Results:
Average tokens generated: 1011.3
Average time: 32.43 seconds
Average tokens per second: 31.19

Testing model: gemma3:27b-it-fp16
Prompt length: 41 characters
Running 3 iterations...
Progress: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [05:24<00:00, 108.26s/it]

Results:
Average tokens generated: 1055.0
Average time: 108.26 seconds
Average tokens per second: 10.07
```

The fp16 variant of gemma3:27b runs roughly 3× slower than the default quantized tag, which is expected for memory-bandwidth-bound generation since its weights are several times larger.
## Measuring TPS

The script below posts a prompt to Ollama's `/api/generate` endpoint. In streaming mode it counts each non-empty response chunk as one token (Ollama streams roughly one token per chunk); in non-streaming mode it falls back to a word-count estimate.
```python
import time
import requests
import argparse
import json
from tqdm import tqdm


def measure_tps(model, prompt, n_runs=5, stream=True):
    """
    Measure tokens per second for an Ollama model.

    Args:
        model: Name of the Ollama model to use
        prompt: The prompt to send to the model
        n_runs: Number of runs to average over
        stream: Whether to use streaming mode

    Returns:
        Average tokens per second
    """
    url = "http://localhost:11434/api/generate"
    headers = {"Content-Type": "application/json"}
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": stream,
    }

    results = []

    print(f"\nTesting model: {model}")
    print(f"Prompt length: {len(prompt)} characters")
    print(f"Running {n_runs} iterations...")

    for _ in tqdm(range(n_runs), desc="Progress"):
        start_time = time.time()
        token_count = 0

        if stream:
            # Each streamed JSON line carries one response chunk (roughly one
            # token), so counting non-empty chunks approximates the token count.
            response = requests.post(url, headers=headers, json=payload, stream=True)
            for line in response.iter_lines():
                if line:
                    data = json.loads(line)
                    if data.get("response"):
                        token_count += 1
                    if data.get("done", False):
                        total_time = time.time() - start_time
                        results.append((token_count, total_time))
                        break
        else:
            response = requests.post(url, headers=headers, json=payload)
            data = response.json()
            total_time = time.time() - start_time
            # Rough token estimate based on word count (not 100% accurate)
            token_count = len(data["response"].split()) * 1.3
            results.append((token_count, total_time))

    # Average per-run TPS, token count, and wall-clock time
    tps_values = [tokens / duration for tokens, duration in results]
    avg_tps = sum(tps_values) / len(tps_values)
    avg_tokens = sum(tokens for tokens, _ in results) / len(results)
    avg_time = sum(duration for _, duration in results) / len(results)

    print("\nResults:")
    print(f"Average tokens generated: {avg_tokens:.1f}")
    print(f"Average time: {avg_time:.2f} seconds")
    print(f"Average tokens per second: {avg_tps:.2f}")

    return avg_tps


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Measure Ollama API performance in tokens per second"
    )
    parser.add_argument("--model", type=str, default="llama2", help="Model name to test")
    parser.add_argument("--prompt", type=str,
                        default="Explain quantum computing in simple terms",
                        help="Prompt to use")
    parser.add_argument("--runs", type=int, default=3, help="Number of runs to average over")
    parser.add_argument("--no-stream", action="store_true", help="Disable streaming mode")
    args = parser.parse_args()

    measure_tps(
        model=args.model,
        prompt=args.prompt,
        n_runs=args.runs,
        stream=not args.no_stream,
    )
```
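Counting streamed chunks is only an approximation. Ollama's final response message also reports `eval_count` and `eval_duration` (in nanoseconds), so TPS can be read directly from the server's own statistics; a minimal sketch, assuming the same local server on the default port:

```python
import requests

def tps_from_ollama_stats(model, prompt):
    """Compute TPS from the eval_count / eval_duration fields Ollama returns."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    data = resp.json()
    # eval_duration is in nanoseconds; eval_count is the number of generated tokens
    return data["eval_count"] / (data["eval_duration"] / 1e9)

if __name__ == "__main__":
    print(f'{tps_from_ollama_stats("llama2", "Explain quantum computing in simple terms"):.2f} TPS')
```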
## Usage

Basic usage with default parameters:
- `python measure_ollama_tps.py`

Test a specific model with a custom prompt:
- `python measure_ollama_tps.py --model "mistral" --prompt "Write a short story about a robot" --runs 5`

Use non-streaming mode (less accurate token counting):
- `python measure_ollama_tps.py --no-stream`
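The script can also be imported and driven from Python to compare several models in one run; a minimal sketch, assuming the file is saved as `measure_ollama_tps.py` and the listed models have already been pulled:

```python
# Compare a few locally pulled models using the measure_tps() helper above.
from measure_ollama_tps import measure_tps

models = ["llama2", "mistral"]  # example names; use whatever `ollama list` shows
prompt = "Explain quantum computing in simple terms"

scores = {m: measure_tps(m, prompt, n_runs=3) for m in models}
for name, tps in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {tps:.2f} TPS")
```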