# Comparing AI Models: Timings - rapmd73/Companion GitHub Wiki
## Overview
This analysis examines the relationship between cost and average response time across the AI models used in this project. The data reflects a cost-driven approach, emphasizing affordability and practicality for a privately funded project. The table is not intended to reflect model comprehensiveness or capability; it compares only average response time alongside how often each model was used and what it cost.
## Table of AI Models
| Provider | Model Name | Usage Count | Average Response Time (s) |
|---|---|---|---|
| OpenAI | gpt-4o-mini | 564 | 6.196728 |
| TogetherAI | meta-llama/Llama-Vision-Free | 548 | 2.947308 |
| Cohere | command-r-plus-08-2024 | 238 | 35.785773 |
| OpenRouter | meta-llama/llama-3.1-405b-instruct:free | 55 | 5.174699 |
| OpenRouter | meta-llama/llama-3.2-3b-instruct:free | 28 | 1.257336 |
| Anthropic | claude-3-haiku-20240307 | 28 | 0.567675 |
| Perplexity | llama-3.1-sonar-small-128k-online | 18 | 5.427681 |
| OpenRouter | nousresearch/hermes-3-llama-3.1-405b:free | 12 | 6.468109 |
| Ollama | tinyllama | 11 | 34.074528 |
| Anthropic | claude-3-5-haiku-20241022 | 11 | 5.215944 |
| OpenRouter | liquid/lfm-40b:free | 10 | 21.880264 |
| HuggingFace | Qwen/Qwen2-VL-7B-Instruct | 10 | 1.861794 |
| TogetherAI | mistralai/Mistral-7B-Instruct-v0.3 | 5 | 3.967040 |
| OpenRouter | gryphe/mythomax-l2-13b:free | 2 | 5.825755 |
| HuggingFace | AIDC-AI/Ovis1.6-Gemma2-9B | 2 | 3.204573 |
| Cohere | command-r-08-2024 | 2 | 1.859857 |
| OpenRouter | meta-llama/llama-3.1-8b-instruct:free | 1 | 2.686100 |
| OpenRouter | meta-llama/llama-3.1-70b-instruct:free | 1 | 0.835187 |
| HuggingFace | meta-llama/Llama-3.2-11B-Vision | 1 | 0.360707 |
| Anthropic | claude-3.5-haiku-20241022 | 1 | 0.199897 |
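As a rough sketch, averages like those in the table above could be produced from raw request logs. The log format here, a list of (provider, model, elapsed time) tuples, is an assumption for illustration, not the project's actual logging schema:

```python
# Aggregate per-request timing records into usage counts and average
# response times, as in the table above. The record format is assumed.
from collections import defaultdict

def summarize(records):
    """Turn (provider, model, elapsed) tuples into
    (provider, model, usage_count, average_elapsed) rows."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, total elapsed]
    for provider, model, elapsed in records:
        entry = totals[(provider, model)]
        entry[0] += 1
        entry[1] += elapsed
    rows = [(p, m, n, t / n) for (p, m), (n, t) in totals.items()]
    rows.sort(key=lambda r: r[2], reverse=True)  # most-used first, as in the table
    return rows

# Tiny illustrative log; the numbers are made up.
records = [
    ("OpenAI", "gpt-4o-mini", 5.9),
    ("OpenAI", "gpt-4o-mini", 6.5),
    ("Anthropic", "claude-3-haiku-20240307", 0.6),
]
for provider, model, count, avg in summarize(records):
    print(f"| {provider} | {model} | {count} | {avg:.6f} |")
```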
## Context and Comprehensiveness
This table is not intended to reflect the comprehensiveness of the models or their capabilities. For instance, OpenAI's GPT-4o family offers a 128K-token context window and Anthropic's Claude 3 family offers 200K tokens, far exceeding smaller models like TinyLlama (run via Ollama), whose context window is roughly 2K tokens. This comparison is based purely on cost and response time relative to usage frequency.
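The context-window differences above suggest a simple selection rule: use the smallest (and typically cheapest) model whose window can hold the request. A minimal sketch, with illustrative window sizes rather than authoritative specs:

```python
# Pick the smallest model whose context window fits the request.
# Window sizes below are illustrative placeholders, not official figures.
CONTEXT_WINDOWS = {            # tokens; ordered smallest/cheapest first
    "tinyllama": 2_048,
    "gpt-4o-mini": 128_000,
    "claude-3-haiku-20240307": 200_000,
}

def pick_model(prompt_tokens, reply_budget=1_000):
    """Return the first model whose window fits prompt plus reply budget."""
    needed = prompt_tokens + reply_budget
    for model, window in CONTEXT_WINDOWS.items():
        if window >= needed:
            return model
    raise ValueError("request exceeds every model's context window")

print(pick_model(500))      # a short prompt fits the smallest model
print(pick_model(50_000))   # a long prompt needs a large-window model
```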
## Cost vs. Usage
Usage trends reveal that the most frequently used models, such as OpenAI's gpt-4o-mini, tend to be the most cost-effective. Self-hosted models such as Ollama's TinyLlama, however, incur additional expenses in the form of electricity, maintenance, and infrastructure.
Rate limits, and how much functionality is available within them, were also a consideration. Some models impose strict usage caps or throttling that affect how efficiently they can be used over time: models with lower rate limits require more careful request management to avoid disruptions, while models with higher limits allow more continuous use without hitting performance bottlenecks.
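The "careful management" mentioned above can be as simple as client-side pacing that spaces requests to stay under a per-minute cap. A minimal sketch, where the cap value is hypothetical rather than any provider's published limit:

```python
# Space outgoing requests so a requests-per-minute cap is never exceeded.
# The 120 RPM cap below is hypothetical, not a real provider limit.
import time

class Throttle:
    """Enforce a simple requests-per-minute ceiling by sleeping."""
    def __init__(self, requests_per_minute):
        self.min_interval = 60.0 / requests_per_minute
        self.last_call = 0.0

    def wait(self):
        now = time.monotonic()
        delay = self.min_interval - (now - self.last_call)
        if delay > 0:
            time.sleep(delay)
        self.last_call = time.monotonic()

throttle = Throttle(requests_per_minute=120)
for prompt in ["hello", "world"]:
    throttle.wait()
    # send_request(prompt)  # placeholder for the actual API call
```

A real client would also need to handle HTTP 429 responses, since server-side limits can apply per token as well as per request.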
Cohere's models are among the most expensive per response; however, a credits grant has made extended usage possible, allowing broader experimentation within the constraints of this project.
## Project Context
This project is entirely self-funded, so cost efficiency is a constant consideration. Model choice is driven by specific needs, such as small context windows for lightweight tasks or larger models for extended-context work. The overall approach balances affordability, accessibility, and the practical application of AI models within personal budgetary constraints. If you would like more models tested, please consider sponsoring this project.