# Comparing AI Models: Timings - rapmd73/Companion GitHub Wiki
## Overview
This analysis examines the relationship between cost and average response time across the AI models used in this project. The data reflects a cost-driven approach, emphasizing affordability and practicality for a privately funded project. The table is not intended to reflect model comprehensiveness or capability; it compares only average response time alongside how often each model was used and what it cost.
## Table of AI Models
| Provider | Model Name | Usage Count | Average Response Time (s) |
|---|---|---|---|
| OpenAI | gpt-4o-mini | 564 | 6.196728 |
| TogetherAI | meta-llama/Llama-Vision-Free | 548 | 2.947308 |
| Cohere | command-r-plus-08-2024 | 238 | 35.785773 |
| OpenRouter | meta-llama/llama-3.1-405b-instruct:free | 55 | 5.174699 |
| OpenRouter | meta-llama/llama-3.2-3b-instruct:free | 28 | 1.257336 |
| Anthropic | claude-3-haiku-20240307 | 28 | 0.567675 |
| Perplexity | llama-3.1-sonar-small-128k-online | 18 | 5.427681 |
| OpenRouter | nousresearch/hermes-3-llama-3.1-405b:free | 12 | 6.468109 |
| Ollama | tinyllama | 11 | 34.074528 |
| Anthropic | claude-3-5-haiku-20241022 | 11 | 5.215944 |
| OpenRouter | liquid/lfm-40b:free | 10 | 21.880264 |
| HuggingFace | Qwen/Qwen2-VL-7B-Instruct | 10 | 1.861794 |
| TogetherAI | mistralai/Mistral-7B-Instruct-v0.3 | 5 | 3.967040 |
| OpenRouter | gryphe/mythomax-l2-13b:free | 2 | 5.825755 |
| HuggingFace | AIDC-AI/Ovis1.6-Gemma2-9B | 2 | 3.204573 |
| Cohere | command-r-08-2024 | 2 | 1.859857 |
| OpenRouter | meta-llama/llama-3.1-8b-instruct:free | 1 | 2.686100 |
| OpenRouter | meta-llama/llama-3.1-70b-instruct:free | 1 | 0.835187 |
| HuggingFace | meta-llama/Llama-3.2-11B-Vision | 1 | 0.360707 |
| Anthropic | claude-3.5-haiku-20241022 | 1 | 0.199897 |
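As a rough sketch, averages like those in the table above could be produced from raw request logs. The log format here, a list of (provider, model, elapsed time) tuples, is an assumption for illustration, not the project's actual logging schema:

```python
# Aggregate per-request timing records into usage counts and average
# response times, as in the table above. The record format is assumed.
from collections import defaultdict

def summarize(records):
    """Turn (provider, model, elapsed) tuples into
    (provider, model, usage_count, average_elapsed) rows."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, total elapsed]
    for provider, model, elapsed in records:
        entry = totals[(provider, model)]
        entry[0] += 1
        entry[1] += elapsed
    rows = [(p, m, n, t / n) for (p, m), (n, t) in totals.items()]
    rows.sort(key=lambda r: r[2], reverse=True)  # most-used first, as in the table
    return rows

# Tiny illustrative log; the numbers are made up.
records = [
    ("OpenAI", "gpt-4o-mini", 5.9),
    ("OpenAI", "gpt-4o-mini", 6.5),
    ("Anthropic", "claude-3-haiku-20240307", 0.6),
]
for provider, model, count, avg in summarize(records):
    print(f"| {provider} | {model} | {count} | {avg:.6f} |")
```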
## Context and Comprehensiveness
This table is not intended to reflect the comprehensiveness of the models or their capabilities. For instance, OpenAI's GPT-4o family offers a 128K-token context window and Anthropic's Claude 3 family offers 200K tokens, far exceeding smaller models like TinyLlama (run via Ollama), whose context window is roughly 2K tokens. This comparison is based purely on cost and response time relative to usage frequency.
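The context-window differences above suggest a simple selection rule: use the smallest (and typically cheapest) model whose window can hold the request. A minimal sketch, with illustrative window sizes rather than authoritative specs:

```python
# Pick the smallest model whose context window fits the request.
# Window sizes below are illustrative placeholders, not official figures.
CONTEXT_WINDOWS = {            # tokens; ordered smallest/cheapest first
    "tinyllama": 2_048,
    "gpt-4o-mini": 128_000,
    "claude-3-haiku-20240307": 200_000,
}

def pick_model(prompt_tokens, reply_budget=1_000):
    """Return the first model whose window fits prompt plus reply budget."""
    needed = prompt_tokens + reply_budget
    for model, window in CONTEXT_WINDOWS.items():
        if window >= needed:
            return model
    raise ValueError("request exceeds every model's context window")

print(pick_model(500))      # a short prompt fits the smallest model
print(pick_model(50_000))   # a long prompt needs a large-window model
```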
## Cost vs. Usage
Usage trends reveal that the most frequently used models, such as OpenAI's gpt-4o-mini, tend to be the most cost-effective. Self-hosted models such as Ollama's TinyLlama, however, incur additional expenses in the form of electricity, maintenance, and infrastructure.
Rate limits, and how much functionality is available within them, were also a consideration. Some models impose strict usage caps or throttling that affect how efficiently they can be used over time: models with lower rate limits require more careful request management to avoid disruptions, while models with higher limits allow more continuous use without hitting performance bottlenecks.
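The "careful management" mentioned above can be as simple as client-side pacing that spaces requests to stay under a per-minute cap. A minimal sketch, where the cap value is hypothetical rather than any provider's published limit:

```python
# Space outgoing requests so a requests-per-minute cap is never exceeded.
# The 120 RPM cap below is hypothetical, not a real provider limit.
import time

class Throttle:
    """Enforce a simple requests-per-minute ceiling by sleeping."""
    def __init__(self, requests_per_minute):
        self.min_interval = 60.0 / requests_per_minute
        self.last_call = 0.0

    def wait(self):
        now = time.monotonic()
        delay = self.min_interval - (now - self.last_call)
        if delay > 0:
            time.sleep(delay)
        self.last_call = time.monotonic()

throttle = Throttle(requests_per_minute=120)
for prompt in ["hello", "world"]:
    throttle.wait()
    # send_request(prompt)  # placeholder for the actual API call
```

A real client would also need to handle HTTP 429 responses, since server-side limits can apply per token as well as per request.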
Cohere's models are among the most expensive per response; however, a credits grant has made extended usage possible, allowing broader experimentation within the constraints of this project.
## Project Context
This project is entirely self-funded, so cost efficiency is a constant consideration. Model choice is driven by specific needs, such as small context windows for lightweight tasks or larger models for extended-context work. The overall approach balances affordability, accessibility, and the practical application of AI models within personal budgetary constraints. If you would like more models tested, please consider sponsoring this project.