Choosing the Right Dataset for LLM Training on the University HPC
1. Why Hugging Face Beats Kaggle
When selecting a dataset platform for training Large Language Models (LLMs) on HPC infrastructure, HuggingFace emerges as the superior choice over Kaggle. This section outlines the key differences and explains why HuggingFace aligns better with professional ML development requirements.
Platform Comparison
Aspect | Kaggle | HuggingFace |
---|---|---|
Download Method | ZIP/CSV files | Streaming API |
Dataset Size | Usually < 10GB | Up to TB scale |
Licensing | Often unclear, optional | Explicitly filtered, mandatory |
Preprocessing | Optional, often missing | Mandatory field |
Updates | Rare, Snapshots | Regular, Continuous versioning |
Integration | Pandas/Manual | Native datasets library |
Memory Efficiency | Load entire dataset | Stream on-demand |
HPC Optimization
- Streaming: Process data without downloading entire dataset
- Memory Efficiency: Load only required batches into memory
- Checkpoint Support: Resume interrupted training seamlessly (see the sketch after this list)
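For resuming an interrupted training run, a streamed dataset can simply be fast-forwarded past the examples that were already consumed. The following is a minimal sketch, assuming the number of consumed examples is restored from the training job's own checkpoint (the dataset name and the count are placeholders):

```python
from datasets import load_dataset

# Stream instead of downloading the full dataset to the cluster's scratch space
stream = load_dataset(
    "code_search_net",
    "python",
    split="train",
    streaming=True,
    trust_remote_code=True,
)

# Resume after an interruption: skip the examples already consumed.
# `examples_seen` would normally come from your own training checkpoint.
examples_seen = 10_000
resumed = stream.skip(examples_seen)

# Only the current batch is ever held in memory
for example in resumed.take(32):
    pass  # feed `example` into the training loop
```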
Quick-Start
Dependencies
```bash
pip install datasets huggingface_hub
```
Authenticate
- Create a Hugging Face account at https://huggingface.co/
- Create a Hugging Face access token at https://huggingface.co/settings/tokens
- Run `huggingface-cli login` and paste your personal access token (for non-interactive HPC jobs, see the sketch below)
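On HPC compute nodes an interactive `huggingface-cli login` is often not possible. A common alternative, sketched below, is to export the token in the job script and log in programmatically; `HF_TOKEN` is the environment variable name recent versions of the `huggingface_hub` library also read on their own:

```python
import os

from huggingface_hub import login

# In the SLURM/job script: export HF_TOKEN=hf_xxx   (never hard-code the token)
login(token=os.environ["HF_TOKEN"])
```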
Code Example with a small dataset
```python
from itertools import islice

from datasets import load_dataset

dataset = load_dataset(
    "code_search_net",
    "python",              # language filter
    split="train",
    streaming=True,        # enable streaming for HPC
    trust_remote_code=True,
)

# Fetch a small batch to demonstrate that streaming works
batch = list(islice(iter(dataset), 4))
print(f"Loaded {len(batch)} examples")
print(f"Available fields: {list(batch[0].keys())}")
```
Conclusion
While Kaggle serves well for exploratory data analysis and competitions, HuggingFace provides the infrastructure, legal clarity, and technical capabilities required for training production-grade LLMs on HPC systems. The streaming capabilities alone make it indispensable for handling the massive datasets needed for effective code generation models.
2. Open-Source Python Datasets for LLM Training on HPC Infrastructure
When developing Large Language Models (LLMs) for code generation on High-Performance Computing (HPC) infrastructure, dataset selection represents a critical architectural decision. This section evaluates the current landscape of open-source Python training datasets, with particular emphasis on scalability, legal compliance, and technical integration requirements for production HPC environments.
The Dataset
The Stack v2 emerges as the superior dataset for training production-grade code LLMs, representing the next generation of curated code datasets. Developed as part of the BigCode Project, this dataset addresses the fundamental limitations of previous code collections through advanced deduplication techniques, comprehensive license verification, and unprecedented scale.
Technical Specifications
The Stack v2 encompasses 67.5TB of total data containing over 3.28B unique files from 104.2M GitHub repositories, collected through systematic traversal of the Software Heritage 2023-09-06 graph dataset. The dataset provides four distinct variants optimized for different training scenarios:
Variant | Size | Description | Use Case |
---|---|---|---|
`bigcode/the-stack-v2` | 67.5TB | Complete dataset with file IDs | Full-scale training |
`bigcode/the-stack-v2-dedup` | 32.1TB | Near-deduplicated version | Quality-focused training |
`bigcode/the-stack-v2-train-full-ids` | ~900B tokens | 600+ languages, repository-grouped | Production training |
`bigcode/the-stack-v2-train-smol-ids` | ~200B tokens | 17 core languages, repository-grouped | Efficient training |
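The training variants can be streamed just like the quick-start example in Section 1. The sketch below assumes the dataset's terms of use have been accepted on the Hub and a token is configured; note that, as described on the dataset card, the v2 training splits store file IDs and metadata, while the actual file contents are fetched separately from Software Heritage's S3 bucket:

```python
from datasets import load_dataset

# Gated dataset: accept the terms on the Hub and authenticate first (see Quick-Start)
stack_v2 = load_dataset(
    "bigcode/the-stack-v2-train-smol-ids",
    split="train",
    streaming=True,   # avoids materializing the multi-TB dataset on local storage
)

# Each record carries repository metadata and file IDs, not the file contents
first = next(iter(stack_v2))
print(list(first.keys()))
```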
Advanced Deduplication and Quality Assurance
The dataset implements near-deduplication on top of exact deduplication, removing approximately 40% of permissively licensed files identified as duplicates. This process significantly enhances training efficiency by reducing redundant patterns while preserving linguistic diversity across the 658 supported programming languages.
Comprehensive License Management
The Stack v2 implements a sophisticated multi-tier license detection system:
- Repository-level extraction from GitHub Archive data
- File-level analysis using the ScanCode Toolkit for the 96.93% of repositories without an explicit repository-level license
- Permissive license filtering based on Blue Oak Council standards and ScanCode categorization
- Propagation mechanisms for license inheritance within repository hierarchies
All included repositories comply with permissive open-source licenses (MIT, Apache, BSD, etc.), ensuring legal compliance for commercial applications.
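If a project-specific allow-list is required on top of this filtering, it can be applied on the fly while streaming. The sketch below assumes the `detected_licenses` field listed on the dataset card; both the field name and the allow-list are assumptions to be checked against the actual schema:

```python
from datasets import load_dataset

ALLOWED = {"MIT", "Apache-2.0", "BSD-3-Clause"}  # placeholder allow-list

stack_v2 = load_dataset("bigcode/the-stack-v2", split="train", streaming=True)

# Keep only records whose detected licenses all appear on the allow-list
filtered = stack_v2.filter(
    lambda ex: ex["detected_licenses"] and set(ex["detected_licenses"]) <= ALLOWED
)
```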
Ethical Considerations and Opt-Out Mechanisms
The Stack v2 implements comprehensive data governance through:
- "Am I In The Stack?" verification system
- Proactive opt-out mechanisms for developers
- Regular updates removing opted-out repositories
- PII minimization through deduplication processes
Alternative Datasets: Comparative Analysis
While The Stack v2 represents the state-of-the-art, specific use cases may benefit from alternative datasets. The following analysis provides a comprehensive comparison of available options:
Dataset | Size (Python) | License Compliance | Recommended Use Case |
---|---|---|---|
The Stack v2 | ~200GB | ✅ Comprehensive | Production LLM Training |
StarCoder2 Data | ~35GB | ✅ Apache 2.0 | Rapid Prototyping |
The Stack v1 | ~200GB | ✅ Permissive | Legacy compatibility |
CodeParrot Clean | ~50GB | ✅ Apache 2.0 | Educational/Research |
CodeSearchNet | ~5GB | ✅ MIT | Code-documentation pairs |
Python Code Instructions | ~1.2GB | ✅ CC BY 4.0 | Instruction-following |
GitHub Code (Raw) | ~1TB | ⚠️ Mixed | Not recommended |
Performance Benchmarks
Comparative analysis based on StarCoder2 training results demonstrates The Stack v2's superiority:
- Token diversity: 900B tokens vs. 200B tokens (Stack v1)
- Language coverage: 658 languages vs. 358 languages
- Deduplication efficiency: 40% reduction vs. 15% reduction
- License compliance: 100% permissive vs. 85% verified
Conclusion
The Stack v2 represents a paradigm shift in code dataset curation, addressing the fundamental challenges of scale, quality, and legal compliance that have historically limited LLM training effectiveness. Its sophisticated deduplication algorithms, comprehensive license management, and HPC-optimized architecture make it the definitive choice for training production-grade code generation models. While alternative datasets serve specific niche applications, The Stack v2's combination of scale, quality, and technical sophistication establishes it as the new standard for code LLM training datasets.
3. Open-Source UnitTest Datasets for LLM Training on HPC Infrastructure
While Section 2 addresses the acquisition of Python code datasets, developing LLMs for automated test generation requires specialized UnitTest datasets. This section evaluates available open-source UnitTest training data with focus on scalability, quality assurance, and HPC integration for production environments.
The Dataset
CodeRM-UnitTest establishes itself as the leading dataset for training LLMs for automated UnitTest generation. Developed as part of the "Dynamic Scaling of Unit Tests for Code Reward Modeling" research, this dataset addresses the fundamental challenges in generating robust and reliable Python UnitTests.
Technical Specifications
CodeRM-UnitTest encompasses ~77.2K high-quality synthetic Python UnitTests with a total size of 1.7GB in optimized Parquet format. The dataset is based on two prominent code instruction-tuning datasets: CodeFeedback-Filtered-Instruction and the training set of TACO.
Component | Value | Description |
---|---|---|
Total Size | 1.7GB | Parquet-optimized for HPC streaming |
Number of Tests | ~77.2K | Curated, synthetic UnitTests |
Train/Test Split | 17.6K / 59.6K | Predefined partitioning |
Base Datasets | CodeFeedback + TACO | Established code instruction datasets |
Generation Model | Llama3.1-70B-Instruct | State-of-the-art code generation |
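Because the dataset ships as Parquet files on the Hugging Face Hub, it streams the same way as the code corpora above. The repository identifier below is a placeholder; take the exact name from the dataset card linked in the Sources section:

```python
from datasets import load_dataset

# Placeholder identifier: replace with the exact Hub repository name
# from the CodeRM-UnitTest dataset card (see Sources).
unit_tests = load_dataset(
    "CodeRM-UnitTest",
    split="train",
    streaming=True,   # Parquet shards stream batch-by-batch on the HPC nodes
)

example = next(iter(unit_tests))
print(list(example.keys()))
```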
Advanced Quality Assurance and Metrics
The dataset implements a sophisticated quality assurance system with two critical evaluation metrics:
False Acceptance Rate (FAR): Measures the probability that UnitTests incorrectly accept invalid solutions.
False Rejection Rate (FRR): Evaluates the probability that UnitTests incorrectly reject valid solutions.
These metrics are calculated through systematic evaluation with Llama3.1-8B-Instruct generated solutions, enabling quantitative assessment of test robustness.
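As a minimal sketch of how these two metrics can be computed, assume each generated solution has already been labelled as correct or incorrect against the ground truth and then run against the unit test under evaluation (this input format is an illustrative assumption, not the dataset's actual schema):

```python
def far_frr(results):
    """Compute (FAR, FRR) for one unit test.

    `results` is a list of (solution_is_correct, test_accepts_solution)
    boolean pairs, one per candidate solution.
    """
    accepted_wrong = sum(1 for correct, accepted in results if not correct and accepted)
    rejected_right = sum(1 for correct, accepted in results if correct and not accepted)
    num_wrong = sum(1 for correct, _ in results if not correct)
    num_right = sum(1 for correct, _ in results if correct)

    far = accepted_wrong / num_wrong if num_wrong else 0.0
    frr = rejected_right / num_right if num_right else 0.0
    return far, frr


# Two incorrect solutions (one wrongly accepted) and two correct ones (one wrongly rejected)
print(far_frr([(False, True), (False, False), (True, True), (True, False)]))  # (0.5, 0.5)
```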
Licensing and Open-Source Compliance
CodeRM-UnitTest is released under Apache 2.0 License, enabling full commercial usage. Important legal aspects:
- Primary License: Apache 2.0 (fully permissive)
- Base Datasets: Partially MIT-licensed and CC BY 4.0
- Web-crawled Content: CC BY 4.0 compliant
- Commercial Usage: ✅ Fully permitted
- Modification/Redistribution: ✅ Without restrictions
Performance Benchmarks
Based on CodeRM-8B model evaluation, CodeRM-UnitTest demonstrates superior results:
- Test Quality: FAR < 0.15, FRR < 0.10 (average)
- Coverage: 100% Python-focused
- Diversity: Derived from 2 established code datasets
- Filtering: Rigorous quality assurance through ground-truth comparison
- Scalability: HPC-optimized with streaming support
Conclusion
CodeRM-UnitTest represents a paradigm shift in UnitTest dataset curation, focusing on quality, robustness, and practical applicability. The integration of advanced quality metrics (FAR/FRR), combined with HPC-optimized architecture and complete open-source compliance, establishes it as the reference standard for training UnitTest-generating LLMs. While alternative datasets may serve specific niche applications, CodeRM-UnitTest's combination of specialization, quality assurance, and technical sophistication makes it the definitive choice for production environments.
4. Open-Source English Text Datasets for LLM Training on HPC Infrastructure
Most code-centric LLMs still need a large amount of general-purpose English to produce high-quality natural-language responses (explanations, doc-strings, commit messages, …).
Below are three battle-tested, fully open-source corpora that integrate smoothly with Hugging Face’s streaming stack, and therefore with any HPC cluster.
Recommended Datasets
Dataset | Size (Uncompressed) | License | HF Identifier | Content Sources | Preprocessing | HPC Streaming |
---|---|---|---|---|---|---|
The Pile-Uncopyrighted | 825 GB | CC-BY-4.0 | `monology/pile-uncopyrighted` | PubMed, ArXiv, GitHub, Wikipedia | Aggressive copyright filtering, deduplication | ✅ Native |
OpenWebText v2 | 38 GB | MIT | `Skylion007/openwebtext` | Reddit submissions (≥3 karma) | URL filtering, near-deduplication | ✅ Native |
C4 (en) | 305 GB | CC-BY-SA-4.0 | `allenai/c4` | Common Crawl (April 2019) | Heuristic cleaning, language detection | ✅ Native |
All three sets are publicly redistributable, come with explicit SPDX-compatible licenses, and are hosted on the Hugging Face Hub.
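Since all three corpora stream natively, they can be mixed with each other (and with the code data) on the fly. The sketch below uses the `interleave_datasets` helper from the `datasets` library; the sampling probabilities are illustrative placeholders rather than tuned values:

```python
from datasets import interleave_datasets, load_dataset

# General-purpose English text, streamed on demand
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Mix the two corpora; probabilities are placeholders to be tuned per training run
mixed_text = interleave_datasets([pile, c4], probabilities=[0.5, 0.5], seed=42)

for example in mixed_text.take(3):
    print(example["text"][:80])
```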
5. Legal Compliance and Licensing
All recommended datasets comply with open-source requirements:
- Commercial Usage: ✅ All datasets permit commercial use
- Redistribution: ✅ Datasets can be redistributed with proper attribution
- Modification: ✅ Datasets can be modified and filtered
- Attribution Requirements: Varies by license (CC-BY requires attribution)
6. Sources
Code Datasets
- The Stack v2 (`bigcode/the-stack-v2` on the Hugging Face Hub)
- CodeSearchNet (`code_search_net` on the Hugging Face Hub)
UnitTest Datasets
- CodeRM-UnitTest
- CodeRM-8B Model
- Dynamic Scaling of Unit Tests for Code Reward Modeling (arXiv:2501.01054)
- CodeRM GitHub Repository