Choosing the Right Dataset for LLM Training on the University HPC

1. Why Hugging Face Beats Kaggle

When selecting a dataset platform for training Large Language Models (LLMs) on HPC infrastructure, HuggingFace emerges as the superior choice over Kaggle. This section outlines the key differences and explains why HuggingFace aligns better with professional ML development requirements.

Platform Comparison

| Aspect | Kaggle | HuggingFace |
| --- | --- | --- |
| Download Method | ZIP/CSV files | Streaming API |
| Dataset Size | Usually < 10 GB | Up to TB scale |
| Licensing | Often unclear, optional | Explicitly filtered, mandatory |
| Preprocessing | Optional, often missing | Mandatory field |
| Updates | Rare snapshots | Regular, continuous versioning |
| Integration | Pandas/manual | Native datasets library |
| Memory Efficiency | Load entire dataset | Stream on demand |

HPC Optimization

  • Streaming: Process data without downloading entire dataset
  • Memory Efficiency: Load only required batches into memory
  • Checkpoint Support: Resume interrupted training seamlessly (see the resume sketch below)
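
A minimal resume sketch, assuming only that the number of already-consumed examples was recorded at the last training checkpoint; the counter value and dataset choice below are illustrative:

from itertools import islice
from datasets import load_dataset

# Hypothetical counter restored from the last checkpoint
examples_already_seen = 10_000

stream = load_dataset(
    "code_search_net",
    "python",
    split="train",
    streaming=True,
    trust_remote_code=True,
)

# Skip the examples consumed before the interruption and continue iterating
resumed = stream.skip(examples_already_seen)
for i, example in enumerate(islice(resumed, 3)):
    print(f"resumed at example {examples_already_seen + i}")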

Quick-Start

Dependencies

pip install datasets huggingface_hub

Authenticate

  1. Create a Hugging Face account at https://huggingface.co/
  2. Create an access token at https://huggingface.co/settings/tokens
  3. Run huggingface-cli login and paste your personal access token (or authenticate from Python as sketched below)
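
On batch nodes without an interactive terminal, the same authentication can be done programmatically. This is a minimal sketch assuming the token is exposed through an environment variable; the variable name HF_TOKEN is illustrative:

import os
from huggingface_hub import login

# Read the personal access token from the environment instead of typing it interactively
login(token=os.environ["HF_TOKEN"])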

Code Example with a Small Dataset

from itertools import islice

from datasets import load_dataset

dataset = load_dataset(
    "code_search_net",
    "python",                  # Language filter
    split="train",
    streaming=True,            # Enable streaming for HPC
    trust_remote_code=True,    # The dataset ships its own loading script
)

# Fetch a small batch (5 examples) to demonstrate streaming access
batch = list(islice(dataset, 5))
print(f"Loaded {len(batch)} examples")
print(f"Available fields: {list(batch[0].keys())}")

Conclusion

While Kaggle serves well for exploratory data analysis and competitions, HuggingFace provides the infrastructure, legal clarity, and technical capabilities required for training production-grade LLMs on HPC systems. The streaming capabilities alone make it indispensable for handling the massive datasets needed for effective code generation models.


2. Open-Source Python Datasets for LLM Training on HPC Infrastructure

When developing Large Language Models (LLMs) for code generation on High-Performance Computing (HPC) infrastructure, dataset selection represents a critical architectural decision. This section evaluates the current landscape of open-source Python training datasets, with particular emphasis on scalability, legal compliance, and technical integration requirements for production HPC environments.

Recommended Dataset: The Stack v2

The Stack v2 emerges as the superior dataset for training production-grade code LLMs, representing the next generation of curated code datasets. Developed as part of the BigCode Project, this dataset addresses the fundamental limitations of previous code collections through advanced deduplication techniques, comprehensive license verification, and unprecedented scale.

Technical Specifications

The Stack v2 encompasses 67.5TB of total data containing over 3.28B unique files from 104.2M GitHub repositories, collected through systematic traversal of the Software Heritage 2023-09-06 graph dataset. The dataset provides four distinct variants optimized for different training scenarios:

| Variant | Size | Description | Use Case |
| --- | --- | --- | --- |
| bigcode/the-stack-v2 | 67.5 TB | Complete dataset with file IDs | Full-scale training |
| bigcode/the-stack-v2-dedup | 32.1 TB | Near-deduplicated version | Quality-focused training |
| bigcode/the-stack-v2-train-full-ids | ~900B tokens | 600+ languages, repository-grouped | Production training |
| bigcode/the-stack-v2-train-smol-ids | ~200B tokens | 17 core languages, repository-grouped | Efficient training |
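
A minimal loading sketch for the smallest training variant, assuming access to the gated dataset has already been granted on the Hub and the login from Section 1 is in place. The *-ids variants ship repository/file identifiers plus metadata rather than file contents, so the exact fields should be verified on the dataset card:

from itertools import islice
from datasets import load_dataset

# Stream the repository-grouped "smol" variant without materialising it on disk
stack_smol = load_dataset(
    "bigcode/the-stack-v2-train-smol-ids",
    split="train",
    streaming=True,
)

for record in islice(stack_smol, 2):
    print(sorted(record.keys()))   # Inspect which metadata fields are available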

Advanced Deduplication and Quality Assurance

The dataset implements near-deduplication on top of exact deduplication, removing approximately 40% of permissively licensed files identified as duplicates. This process significantly enhances training efficiency by reducing redundant patterns while preserving linguistic diversity across the 658 supported programming languages.
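
Consumers of the dataset do not need to rerun this step, but the idea behind near-deduplication is easy to sketch. The following uses the datasketch library with an assumed similarity threshold of 0.85, purely for illustration:

from datasketch import MinHash, MinHashLSH

def minhash_of(code, num_perm=128):
    # Build a MinHash signature from the whitespace tokens of a source file
    sig = MinHash(num_perm=num_perm)
    for token in code.split():
        sig.update(token.encode("utf-8"))
    return sig

files = {
    "a.py": "def add(a, b):\n    return a + b",
    "a_copy.py": "def add(a, b):\n    return a + b\n",   # identical up to whitespace
    "c.py": "print('hello world')",
}

lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = []
for name, code in files.items():
    sig = minhash_of(code)
    if lsh.query(sig):      # an already-kept file is near-identical
        continue            # -> drop this file as a near-duplicate
    lsh.insert(name, sig)
    kept.append(name)

print(kept)                 # ['a.py', 'c.py']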

Comprehensive License Management

The Stack v2 implements a sophisticated multi-tier license detection system:

  1. Repository-level extraction from GitHub Archive data
  2. File-level analysis using ScanCode Toolkit for 96.93% of repositories lacking explicit licenses
  3. Permissive license filtering based on Blue Oak Council standards and ScanCode categorization
  4. Propagation mechanisms for license inheritance within repository hierarchies

All included repositories comply with permissive open-source licenses (MIT, Apache, BSD, etc.), ensuring legal compliance for commercial applications.
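
Downstream users normally do not need to redo this filtering, but restricting a stream to specific licenses is straightforward with the datasets filter API. In the sketch below, the field name detected_licenses and the allow-list are assumptions to verify against the dataset card:

from datasets import load_dataset

ALLOWED = {"MIT", "Apache-2.0", "BSD-3-Clause"}   # illustrative allow-list

stream = load_dataset(
    "bigcode/the-stack-v2-train-smol-ids",
    split="train",
    streaming=True,
)

# "detected_licenses" is an assumed field name -- check it on the dataset card
permissive_only = stream.filter(
    lambda row: any(lic in ALLOWED for lic in row.get("detected_licenses", []))
)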

Ethical Considerations and Opt-Out Mechanisms

The Stack v2 implements comprehensive data governance through:

  • "Am I In The Stack?" verification system
  • Proactive opt-out mechanisms for developers
  • Regular updates removing opted-out repositories
  • PII minimization through deduplication processes

Alternative Datasets: Comparative Analysis

While The Stack v2 represents the state-of-the-art, specific use cases may benefit from alternative datasets. The following analysis provides a comprehensive comparison of available options:

| Dataset | Size (Python) | License Compliance | Recommended Use Case |
| --- | --- | --- | --- |
| The Stack v2 | ~200 GB | ✅ Comprehensive | Production LLM training |
| StarCoder2 Data | ~35 GB | ✅ Apache 2.0 | Rapid prototyping |
| The Stack v1 | ~200 GB | ✅ Permissive | Legacy compatibility |
| CodeParrot Clean | ~50 GB | ✅ Apache 2.0 | Educational/research |
| CodeSearchNet | ~5 GB | ✅ MIT | Code-documentation pairs |
| Python Code Instructions | ~1.2 GB | ✅ CC BY 4.0 | Instruction-following |
| GitHub Code (Raw) | ~1 TB | ⚠️ Mixed | Not recommended |

Performance Benchmarks

Comparative analysis based on StarCoder2 training results demonstrates The Stack v2's superiority:

  • Token diversity: 900B tokens vs. 200B tokens (Stack v1)
  • Language coverage: 658 languages vs. 358 languages
  • Deduplication efficiency: 40% reduction vs. 15% reduction
  • License compliance: 100% permissive vs. 85% verified

Conclusion

The Stack v2 represents a paradigm shift in code dataset curation, addressing the fundamental challenges of scale, quality, and legal compliance that have historically limited LLM training effectiveness. Its sophisticated deduplication algorithms, comprehensive license management, and HPC-optimized architecture make it the definitive choice for training production-grade code generation models. While alternative datasets serve specific niche applications, The Stack v2's combination of scale, quality, and technical sophistication establishes it as the new standard for code LLM training datasets.


3. Open-Source UnitTest Datasets for LLM Training on HPC Infrastructure

While Section 2 addresses the acquisition of Python code datasets, developing LLMs for automated test generation requires specialized UnitTest datasets. This section evaluates available open-source UnitTest training data with focus on scalability, quality assurance, and HPC integration for production environments.

Recommended Dataset: CodeRM-UnitTest

CodeRM-UnitTest establishes itself as the leading dataset for training LLMs for automated UnitTest generation. Developed as part of the "Dynamic Scaling of Unit Tests for Code Reward Modeling" research, this dataset addresses the fundamental challenges in generating robust and reliable Python UnitTests.

Technical Specifications

CodeRM-UnitTest encompasses ~77.2K high-quality synthetic Python UnitTests with a total size of 1.7GB in optimized Parquet format. The dataset is based on two prominent code instruction-tuning datasets: CodeFeedback-Filtered-Instruction and the training set of TACO.

| Component | Value | Description |
| --- | --- | --- |
| Total Size | 1.7 GB | Parquet-optimized for HPC streaming |
| Number of Tests | ~77.2K | Curated, synthetic UnitTests |
| Train/Test Split | 17.6K / 59.6K | Predefined partitioning |
| Base Datasets | CodeFeedback + TACO | Established code instruction datasets |
| Generation Model | Llama3.1-70B-Instruct | State-of-the-art code generation |
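
A minimal loading sketch for the dataset; the Hub identifier below is an assumption taken from the accompanying release and should be verified, and no field names are assumed beyond printing them:

from itertools import islice
from datasets import load_dataset

# Hub identifier is an assumption -- verify it against the official release
unit_tests = load_dataset(
    "KAKA22/CodeRM-UnitTest",
    split="train",
    streaming=True,       # Parquet files stream well on HPC nodes
)

for row in islice(unit_tests, 2):
    print(sorted(row.keys()))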

Advanced Quality Assurance and Metrics

The dataset implements a sophisticated quality assurance system with two critical evaluation metrics:

False Acceptance Rate (FAR): Measures the probability that UnitTests incorrectly accept invalid solutions.

False Rejection Rate (FRR): Evaluates the probability that UnitTests incorrectly reject valid solutions.

These metrics are calculated through systematic evaluation with Llama3.1-8B-Instruct generated solutions, enabling quantitative assessment of test robustness.
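
As a concrete illustration of the two metrics, the sketch below computes them from raw verdict counts; the counts are made up, only the definitions above are used:

# Outcomes of running one generated test suite against sampled solutions (illustrative numbers)
invalid_solutions_total = 200   # solutions known to be wrong
invalid_but_accepted    = 24    # wrong solutions that still passed the tests
valid_solutions_total   = 300   # solutions known to be correct
valid_but_rejected      = 21    # correct solutions that failed the tests

far = invalid_but_accepted / invalid_solutions_total   # False Acceptance Rate
frr = valid_but_rejected / valid_solutions_total       # False Rejection Rate

print(f"FAR = {far:.3f}, FRR = {frr:.3f}")             # FAR = 0.120, FRR = 0.070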

Licensing and Open-Source Compliance

CodeRM-UnitTest is released under the Apache 2.0 License, enabling full commercial usage. Key legal aspects:

  • Primary License: Apache 2.0 (fully permissive)
  • Base Datasets: Partially MIT-licensed and CC BY 4.0
  • Web-crawled Content: CC BY 4.0 compliant
  • Commercial Usage: ✅ Fully permitted
  • Modification/Redistribution: ✅ Without restrictions

Performance Benchmarks

Based on CodeRM-8B model evaluation, CodeRM-UnitTest demonstrates superior results:

  • Test Quality: FAR < 0.15, FRR < 0.10 (average)
  • Coverage: 100% Python-focused
  • Diversity: Derived from 2 established code datasets
  • Filtering: Rigorous quality assurance through ground-truth comparison
  • Scalability: HPC-optimized with streaming support

Conclusion

CodeRM-UnitTest represents a paradigm shift in UnitTest dataset curation, focusing on quality, robustness, and practical applicability. The integration of advanced quality metrics (FAR/FRR), combined with HPC-optimized architecture and complete open-source compliance, establishes it as the reference standard for training UnitTest-generating LLMs. While alternative datasets may serve specific niche applications, CodeRM-UnitTest's combination of specialization, quality assurance, and technical sophistication makes it the definitive choice for production environments.


4. Open-Source English Text Datasets for LLM Training on HPC Infrastructure

Most code-centric LLMs still need a large amount of general-purpose English to produce high-quality natural-language responses (explanations, doc-strings, commit messages, …).
Below are three battle-tested, fully open-source corpora that integrate smoothly with Hugging Face’s streaming stack and therefore with any HPC cluster.

Recommended Datasets

| Dataset | Size (Uncompressed) | License | HF Identifier | Content Sources | Preprocessing | HPC Streaming |
| --- | --- | --- | --- | --- | --- | --- |
| The Pile-Uncopyrighted | 825 GB | CC-BY-4.0 | monology/pile-uncopyrighted | PubMed, ArXiv, GitHub, Wikipedia | Aggressive copyright filtering, deduplication | ✅ Native |
| OpenWebText v2 | 38 GB | MIT | Skylion007/openwebtext | Reddit submissions (≥3 karma) | URL filtering, near-deduplication | ✅ Native |
| C4 (en) | 305 GB | CC-BY-SA-4.0 | allenai/c4 | Common Crawl (April 2019) | Heuristic cleaning, language detection | ✅ Native |

All three sets are publicly redistributable, come with explicit SPDX-compatible licenses, and are hosted on the Hugging Face Hub.
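
A minimal sketch for mixing the three corpora into a single training stream with interleave_datasets; the sampling probabilities are illustrative rather than tuned values, and only the shared text column is kept so the streams have matching features:

from itertools import islice
from datasets import load_dataset, interleave_datasets

# Keep only the shared "text" column so the three streams can be interleaved
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True).select_columns(["text"])
owt = load_dataset("Skylion007/openwebtext", split="train", streaming=True).select_columns(["text"])
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True).select_columns(["text"])

# Mix the corpora with illustrative (untuned) sampling weights
english_mix = interleave_datasets([pile, owt, c4], probabilities=[0.5, 0.2, 0.3], seed=42)

for doc in islice(english_mix, 3):
    print(doc["text"][:80])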


5. Legal Compliance and Licensing

All recommended datasets comply with open-source requirements:

  • Commercial Usage: ✅ All datasets permit commercial use
  • Redistribution: ✅ Datasets can be redistributed with proper attribution
  • Modification: ✅ Datasets can be modified and filtered
  • Attribution Requirements: Varies by license (CC-BY requires attribution)

6. Sources

Code Datasets

UnitTest Datasets

Alternative Code Datasets

English Text Datasets

Related Resources