StarCoder2 - amosproj/amos2025ss04-ai-driven-testing GitHub Wiki

StarCoder2

StarCoder2 is a family of openly available language models built for code-related tasks. It was developed by the BigCode project, a collaboration led by Hugging Face and ServiceNow, and trained on "The Stack v2," a large dataset of permissively licensed source code. The family comes in 3B, 7B, and 15B parameter sizes; the 3-billion-parameter variant balances strong performance against modest resource requirements.

Model Architecture

  • Base Architecture: Transformer decoder-only, optimized for code generation tasks.
  • Attention Mechanism: Grouped Query Attention (GQA), in which groups of query heads share key/value heads. This shrinks the key/value cache and speeds up inference with little loss in quality.
  • Context Window: Supports a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, so it can take in long files or several files of context at once.
  • Training Objective: Trained using the Fill-in-the-Middle (FIM) objective, which improves its ability to complete partially written code.
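
The Fill-in-the-Middle objective and the sliding window can be made concrete with a small sketch. The sentinel strings below are the ones commonly used with StarCoder-family tokenizers; treat them as an assumption and confirm them against the tokenizer you load. The helper names are hypothetical, and the window convention (the current token counts toward the 4,096) is also an assumption made for illustration.

```python
# Assumed FIM sentinel strings; verify against the model's tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

CONTEXT_WINDOW = 16_384  # tokens, as stated in the list above
SLIDING_WINDOW = 4_096   # tokens, as stated in the list above


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out the code before and after a gap in prefix-suffix-middle order.

    The model is then expected to generate the missing middle after the
    final sentinel.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def attended_positions(query_pos: int, window: int = SLIDING_WINDOW) -> range:
    """Token positions that position `query_pos` attends to directly in one
    layer under causal sliding-window attention (hypothetical helper).

    Assumes the window includes the current token, so at most `window`
    positions are visible.
    """
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)
```

For example, `build_fim_prompt("def add(a, b):\n    return ", "\n")` yields a prompt that asks the model to fill in the return expression, and `attended_positions(10_000)` covers the 4,096 most recent positions rather than the whole prefix.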

Performance Highlights

  • State-of-the-Art for Size: According to the StarCoder2 paper, the models were among the top-performing open code generation models for their respective sizes at release.
  • Broad Language Support: The Stack v2 covers over 600 programming languages. The 15B variant was trained on this full set, while the 3B and 7B variants were trained on a subset of 17 languages.
  • Efficiency: The 3B model provides strong performance with a much smaller memory and compute footprint than the 7B and 15B variants, making it well suited to local deployment and rapid development cycles.
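
To give the footprint claim a rough scale, the sketch below estimates the memory needed for the weights alone at several numeric precisions. It assumes nominal parameter counts (3e9, 7e9, 15e9) rather than the exact ones, and it ignores activations, the key/value cache, and runtime overhead, so real usage will be higher. The function and table names are hypothetical.

```python
# Nominal parameter counts (assumption: rounded, not the exact model sizes).
NOMINAL_PARAMS = {"3B": 3e9, "7B": 7e9, "15B": 15e9}

# Storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights only, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


def footprint_table() -> dict[str, dict[str, float]]:
    """Weights-only estimate for every size and precision, rounded to 0.01 GiB."""
    return {
        size: {
            dtype: round(weight_memory_gib(params, nbytes), 2)
            for dtype, nbytes in BYTES_PER_PARAM.items()
        }
        for size, params in NOMINAL_PARAMS.items()
    }
```

Under these assumptions the 3B weights need about a fifth of the memory of the 15B weights at the same precision, which is the main reason the smaller variant is practical on a single consumer GPU or a laptop.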

Training Details

  • Training Data: Pre-trained on data drawn from "The Stack v2," a dataset of roughly 67.5 TB in its full form, which includes permissively licensed code from GitHub, Kaggle, and other sources.
  • Data Filtering: The training data underwent extensive filtering to remove personally identifiable information (PII) and sexually explicit content.
  • Training Infrastructure: Trained on a cluster of NVIDIA A100 GPUs.

Usage and Deployment

StarCoder2 is designed for easy integration into developer tools and applications. The weights are released under the "BigCode OpenRAIL-M" license, an open Responsible AI License (RAIL) that permits both research and commercial use, subject to the use restrictions listed in the license. The model is available through Ollama with the ID starcoder2:3b.
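
As a minimal sketch of the Ollama route, the snippet below sends a completion request to a locally running Ollama server over its REST API using only the Python standard library. It assumes Ollama is listening on its default address (http://localhost:11434) and that the model has already been pulled; the function names and the token limit are illustrative choices, not part of this project.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address
MODEL_ID = "starcoder2:3b"


def build_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

After `ollama pull starcoder2:3b`, calling `complete("def fibonacci(n):")` should return a continuation of the function. StarCoder2 is a base model rather than an instruction-tuned one, so prompts generally work better as code to be continued than as natural-language requests.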

Citation

arXiv Paper: StarCoder2 and The Stack v2: The Next Generation

Hugging Face Model Card: bigcode/starcoder2-3b

Official Project Page: The BigCode Project