# Top Open-Weight Large Language Models Supported by Ollama
Ollama is a tool that lets you run and interact with large language models (LLMs) locally on your machine. It supports a variety of open-weight LLMs: models whose weights are publicly available for download, so you can run, fine-tune, and redistribute them, subject to each model's license. Ollama bundles model weights, configuration, and data into a single package defined by a Modelfile [1], and it simplifies setting up and running these models even with limited computational resources. Using open-weight models with Ollama can be beneficial in terms of cost and security: you avoid the fees of proprietary LLM APIs and retain more control over your data [2].
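To make this concrete, here is a rough sketch of creating a custom model from a Modelfile; the base model tag, parameter values, and model name are illustrative placeholders, not recommendations:

```
# Write a minimal Modelfile: FROM picks the base model,
# PARAMETER sets runtime defaults, SYSTEM sets the system prompt.
cat > Modelfile <<'EOF'
FROM llama3.1
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM "You are a concise technical assistant."
EOF

ollama create my-assistant -f Modelfile   # register the derived model
ollama run my-assistant                   # chat with it interactively
```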
This article lists some of the top open-weight LLMs supported by Ollama and their context window sizes. The context window size is a crucial factor in LLMs, as it determines the amount of text the model can consider at once when generating responses. Think of it like a person's short-term memory: a larger context window is like having a better memory, allowing the model to "remember" more of the previous conversation and generate more coherent and contextually relevant responses [3].
## Supported Models and Context Window Sizes
The following table lists popular open-weight LLMs supported by Ollama, together with their parameter counts, their maximum input context windows, and the context window Ollama applies by default (its `num_ctx` setting):
| Model Name | Model Size | Maximum Input Context Window | Default Ollama Context Window |
|---|---|---|---|
| DeepSeek-R1 | 7B [4] | 128k tokens [5] | 2048 tokens [6] |
| Llama 3.3 | 70B [4] | 128k tokens [7] | 2048 tokens [6] |
| Phi 4 | 14B [4] | 16k tokens [8] | 2048 tokens [6] |
| Mistral | 7B [4] | 32k tokens [9] | 2048 tokens [6] |
| Gemma 2 | 2B [4] | 8k tokens (10 million reported in one experiment [10]) | 2048 tokens [6] |
| Llama 3.2 | 3B [4] | 128k tokens [11] | 2048 tokens [6] |
| Llama 3.1 | 8B [4] | 128k tokens [12] | 2048 tokens [6] |
| Neural Chat | 7B [4] | Not available | 2048 tokens [6] |
| Starling | 7B [4] | Not available | 2048 tokens [6] |
| LLaVA | 7B [4] | 32k tokens [13] | 2048 tokens [6] |
| Solar | 10.7B [4] | 4k tokens [14] | 2048 tokens [6] |
| Llama 2 | 7B [4] | 4k tokens [15] | 2048 tokens [6] |
| Moondream 2 | 1.4B [4] | Not available | 2048 tokens [6] |
Note that the 2048-token figure in the last column is Ollama's default `num_ctx` setting rather than a model limit: Ollama applies this default regardless of the model's maximum, so longer inputs are truncated unless you raise it. The context window can be adjusted for the current session using the `/set parameter num_ctx` command in the Ollama CLI, or by specifying the `num_ctx` parameter in API requests [6]. To change it permanently, create a new Modelfile with a `PARAMETER num_ctx` line, as explained in [1]. Keep in mind that a larger context window increases memory usage (the KV cache grows with context length) and can slow inference.
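For example, the two session-scoped approaches look like this; the model name and the 8192-token value are illustrative:

```
# Temporary: inside an interactive `ollama run llama3.1` session
>>> /set parameter num_ctx 8192

# Per-request: pass num_ctx in the "options" object of an API call
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Summarize the following document: ...",
  "options": { "num_ctx": 8192 }
}'
```

Port 11434 is Ollama's default local API port.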
## Interacting with Models
Ollama offers flexibility in how you interact with the models. You can talk to a model directly in the terminal using the `ollama run <name-of-model>` command [1], which starts an interactive conversation in a command-line environment.
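For example, assuming the model has already been pulled (here `llama3.1`, used as a placeholder for any model in the table above):

```
ollama pull llama3.1    # download the model weights once
ollama run llama3.1     # start an interactive session in the terminal
>>> Why is the sky blue?
```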
Alternatively, you can interact with the models through the Ollama API, which lets you integrate them into your own applications and workflows. You send HTTP requests with JSON bodies (`Content-Type: application/json`) to Ollama's local API endpoint [1].
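As a minimal sketch, a chat request against Ollama's default local endpoint looks like this (the model name is again a placeholder):

```
# POST a JSON body to the /api/chat endpoint on Ollama's default port
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    { "role": "user", "content": "Explain context windows in one paragraph." }
  ],
  "stream": false
}'
```

Setting `"stream": false` returns a single JSON object instead of a stream of partial responses, which is often easier to handle when integrating with other tools.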
## Factors to Consider When Choosing an LLM
When choosing an LLM for your specific needs, it's essential to consider several factors beyond the context window size:
- Performance: The performance of an LLM can vary significantly depending on its size and architecture. Larger models generally offer better performance but require more computational resources.
- Hardware Requirements: Different LLMs have different hardware requirements, particularly in terms of RAM and GPU memory. Ensure that your hardware can support the chosen model.
- Specific Task Requirements: The ideal LLM for a particular task depends on the nature of the task. For example, some models are better suited for code generation, while others excel at text summarization or translation.
- Quantization: Ollama supports different quantization levels for models, which affect a model's size, speed, and output quality. Consider the trade-off between footprint and quality when choosing a quantization level (see the example after this list).
- Community and Support: The level of community support and available resources can vary for different LLMs. Consider choosing a model with a strong community and readily available documentation.
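To illustrate the quantization trade-off, Ollama encodes quantization levels in model tags. The tags below follow the library's naming scheme but may not exist for every model, so treat them as hypothetical examples:

```
# 4-bit quantization: smallest download and memory footprint, some quality loss
ollama pull llama3.1:8b-instruct-q4_0

# 8-bit quantization: roughly twice the size, closer to full-precision quality
ollama pull llama3.1:8b-instruct-q8_0
```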
## Model Details and Use Cases
Here's a brief overview of the models listed in the table, including their strengths, weaknesses, and potential use cases:
- DeepSeek-R1: This model is known for its strong reasoning capabilities and long context window. It can be used for research tasks, question answering, and other applications that require a deep understanding of the input text.
- Llama 3.3: This is a state-of-the-art model from Meta that offers excellent performance across various tasks, including code generation, translation, and text summarization.
- Phi 4: This model from Microsoft is designed for reasoning-focused tasks and offers a good balance between performance and efficiency.
- Mistral: This model is known for its efficiency and strong performance in conversational tasks. It can be used for chatbots, dialogue systems, and other applications that require interactive communication.
- Gemma 2: This compact model from Google DeepMind offers strong general-purpose performance for its size. Its standard context window is 8k tokens, although one experiment reports extending a Gemma 2B variant to sequences of up to 10 million tokens [10]. It can be used for tasks such as document summarization and question answering.
- Llama 3.2: This model from Meta is designed for efficiency and supports a long context window. It can be used for various tasks, including text summarization, translation, and code generation.
- Llama 3.1: This model from Meta offers a good balance between performance and efficiency and supports a long context window. It can be used for various tasks, including dialogue generation, text summarization, and code generation.
- Neural Chat: This model is specifically designed for conversational tasks and can be used for building chatbots and other interactive applications.
- Starling: This model is designed for general-purpose language tasks and offers a good balance between performance and efficiency.
- LLaVA: This is a multimodal model that can process both text and images. It can be used for tasks such as image captioning, visual question answering, and visual dialogue.
- Solar: This model is known for its strong performance in reasoning and instruction-following tasks. It can be used for various applications, including question answering, text summarization, and code generation.
- Llama 2: This model from Meta is a versatile and efficient model that can be used for various tasks, including text generation, translation, and question answering.
- Moondream 2: This is a lightweight vision-language model optimized for speed and efficiency. It can answer questions about images and runs well on resource-constrained hardware such as mobile devices.
## Conclusion
Ollama offers a diverse selection of open-weight LLMs with varying context window sizes, catering to different needs and use cases. The trend in LLMs is towards larger context windows, which allows for more coherent and contextually relevant responses, especially in tasks involving long documents or complex conversations. This trend has significant implications for various applications, including chatbots, research assistants, and code generation tools. When choosing a model, consider the context window size alongside other factors like model size, performance, hardware requirements, and the specific task you intend to perform. By understanding the capabilities and limitations of each model, you can leverage Ollama to effectively run and interact with LLMs locally on your machine.
## Works cited
1. How to Increase Ollama Context Size: A Complete Step-by-Step Guide - DeepAI, accessed February 18, 2025, https://deepai.tn/glossary/ollama/how-increase-ollama-context-size/
2. Getting Started with Ollama and Web UI - YouTube, accessed February 18, 2025, https://www.youtube.com/watch?v=BzFafshQkWw
3. Taking Advantage of the Long Context of Llama 3.1 - Codesphere, accessed February 18, 2025, https://codesphere.com/articles/taking-advantage-of-the-long-context-of-llama-3-1-2
4. nomic-embed-text - Ollama, accessed February 18, 2025, https://ollama.com/search?c=embedding
5. shaktiwadekar.medium.com, accessed February 18, 2025, https://shaktiwadekar.medium.com/deepseek-r1-model-architecture-853fefac7050#:~:text=DeepSeek%2DR1's%20input%20context%20length,128K%2C%20utilizing%20the%20YaRN%20technique.
6. www.restack.io, accessed February 18, 2025, https://www.restack.io/p/ollama-answer-context-size-cat-ai
7. meta-llama/Llama-3.3-70B-Instruct · What Happens If the Prompt Exceeds 8,196 Tokens? And difference between input limit and context length limit? - Hugging Face, accessed February 18, 2025, https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/discussions/36
8. Phi-4 quantization and inference speedup | Microsoft Community Hub, accessed February 18, 2025, https://techcommunity.microsoft.com/blog/machinelearningblog/phi-4-quantization-and-inference-speedup/4360047
9. Au Large | Mistral AI, accessed February 18, 2025, https://mistral.ai/news/mistral-large
10. Expand the context window of Gemma 2B to 10 million - Brain Titan, accessed February 18, 2025, https://braintitan.medium.com/expand-the-context-window-of-gemma-2b-to-10-million-bc3f163938d8
11. Mozilla/Llama-3.2-3B-Instruct-llamafile - Hugging Face, accessed February 18, 2025, https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafile
12. Mozilla/Meta-Llama-3.1-8B-llamafile - Hugging Face, accessed February 18, 2025, https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-llamafile
13. Llava v1.6 Mistral 7B—LLM Evaluation by Telnyx, accessed February 18, 2025, https://telnyx.com/llm-library/llava-v1-6-mistral-7b-hf
14. SOLAR 10.7B Instruct V1.0 By upstage - LLM Explorer - EXTRACTUM, accessed February 18, 2025, https://llm.extractum.io/model/upstage%2FSOLAR-10.7B-Instruct-v1.0,5KwUWNTl8dKlCxQ8QeQtzZ
15. Context length in LLMs: All you need to know - AGI Sphere, accessed February 18, 2025, https://agi-sphere.com/context-length/