StarCoder2
StarCoder2 is a family of open-source language models specifically designed for code-related tasks. Developed by the BigCode project—a collaboration led by Hugging Face and ServiceNow—it was trained on "The Stack v2", a large, permissively licensed dataset of source code. The family is released in 3B, 7B, and 15B parameter sizes; the 3-billion-parameter variant offers a particularly good balance of performance and resource efficiency.
Model Architecture
- Base Architecture: Transformer decoder-only, optimized for code generation tasks.
- Attention Mechanism: Grouped Query Attention (GQA) for faster inference while maintaining output quality.
- Context Window: Supports a large context window of 16,384 tokens with a sliding window attention of 4,096 tokens, enabling it to process and understand larger codebases.
- Training Objective: Trained using the Fill-in-the-Middle (FIM) objective, which improves its ability to complete partially written code (see the example below).
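For FIM-style completion, the StarCoder family exposes dedicated special tokens (commonly `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>`). The sketch below shows how such a prompt could be assembled; the token names and the `build_fim_prompt` helper are illustrative assumptions and should be checked against the model's tokenizer before use.

```python
# Minimal sketch: assembling a Fill-in-the-Middle (FIM) prompt.
# Assumes StarCoder-style FIM special tokens (<fim_prefix>, <fim_suffix>,
# <fim_middle>); verify them against the tokenizer's special-token list.

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Ask the model to generate the code that belongs between prefix and suffix."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Example: complete the body of a function whose signature and return are known.
prompt = build_fim_prompt(
    prefix="def mean(values: list[float]) -> float:\n    ",
    suffix="\n    return total / len(values)\n",
)
print(prompt)  # the model's completion is inserted where <fim_middle> ends
```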
Performance Highlights
- State-of-the-Art for Size: The StarCoder2 models are among the top-performing code generation models for their respective sizes.
- Broad Language Support: The Stack v2 spans more than 600 programming languages; the 15B model is trained across this broad set, while the 3B variant focuses on a curated subset of the most widely used languages.
- Efficiency: The 3B model provides strong performance with a significantly smaller footprint than larger models, making it ideal for local deployment and rapid development cycles.
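As a rough illustration of local deployment, the sketch below loads the 3B checkpoint with Hugging Face transformers. The checkpoint ID `bigcode/starcoder2-3b`, the bfloat16 dtype, and `device_map="auto"` (which requires the accelerate package) are assumptions chosen for this example, not part of the project's documented setup.

```python
# Minimal sketch: running StarCoder2-3B locally via Hugging Face transformers.
# Assumes a transformers version recent enough to ship the Starcoder2 architecture;
# adjust dtype and device placement to your hardware.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # roughly halves memory vs. float32
    device_map="auto",           # places layers on GPU(s) if available
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```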
Training Details
- Training Data: Pre-trained on "The Stack v2", a corpus built on the Software Heritage source-code archive (roughly 67.5 TB of raw data before filtering), supplemented with material such as GitHub pull requests, Kaggle notebooks, and documentation from permissively licensed sources.
- Data Filtering: The training data underwent extensive filtering to remove personally identifiable information (PII) and sexually explicit content.
- Training Infrastructure: Trained on a cluster of NVIDIA A100 GPUs.
Usage and Deployment
StarCoder2 is designed for easy integration into developer tools and applications. It is released under the BigCode OpenRAIL-M license, which permits both research and commercial use subject to the license's use-based restrictions. The model is available through Ollama under the ID starcoder2:3b.
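As a minimal sketch of how the Ollama-hosted model could be queried from Python, the example below posts a prompt to Ollama's local HTTP API. The prompt text and timeout are illustrative, and the example assumes the server runs at the default address http://localhost:11434 with the model already pulled.

```python
# Minimal sketch: querying starcoder2:3b through a locally running Ollama server.
# Assumes the model has already been fetched with `ollama pull starcoder2:3b`.

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "starcoder2:3b",
        "prompt": "Write a Python function that reverses a string.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])
```

From the command line, `ollama pull starcoder2:3b` fetches the model and `ollama run starcoder2:3b` starts an interactive session.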
Citation
arXiv Paper: StarCoder 2 and The Stack v2: The Next Generation (arXiv:2402.19173)