Choosing the Right Dataset for LLM Training on the University HPC

1. Why Hugging Face Beats Kaggle

When selecting a dataset platform for training Large Language Models (LLMs) on HPC infrastructure, HuggingFace emerges as the superior choice over Kaggle. This section outlines the key differences and explains why HuggingFace aligns better with professional ML development requirements.

Platform Comparison

| Aspect | Kaggle | HuggingFace |
| --- | --- | --- |
| Download Method | ZIP/CSV files | Streaming API |
| Dataset Size | Usually < 10 GB | Up to TB scale |
| Licensing | Often unclear, optional | Explicitly filtered, mandatory |
| Preprocessing | Optional, often missing | Mandatory field |
| Updates | Rare snapshots | Regular, continuous versioning |
| Integration | Pandas/manual | Native datasets library |
| Memory Efficiency | Load entire dataset | Stream on demand |

HPC Optimization

  • Streaming: Process data without downloading entire dataset
  • Memory Efficiency: Load only required batches into memory
  • Checkpoint Support: Resume interrupted training seamlessly (see the resume sketch below)
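
A minimal resume sketch, assuming only that the number of already-consumed examples was recorded at the last training checkpoint; the counter value and dataset choice below are illustrative:

from itertools import islice
from datasets import load_dataset

# Hypothetical counter restored from the last checkpoint
examples_already_seen = 10_000

stream = load_dataset(
    "code_search_net",
    "python",
    split="train",
    streaming=True,
    trust_remote_code=True,
)

# Skip the examples consumed before the interruption and continue iterating
resumed = stream.skip(examples_already_seen)
for i, example in enumerate(islice(resumed, 3)):
    print(f"resumed at example {examples_already_seen + i}")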

Quick-Start

Dependencies

pip install datasets huggingface_hub

Authenticate

  1. Create a Hugging Face account at https://huggingface.co/
  2. Create an access token at https://huggingface.co/settings/tokens
  3. Run huggingface-cli login and paste your personal access token (or authenticate from Python as sketched below)
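
On batch nodes without an interactive terminal, the same authentication can be done programmatically. This is a minimal sketch assuming the token is exposed through an environment variable; the variable name HF_TOKEN is illustrative:

import os
from huggingface_hub import login

# Read the personal access token from the environment instead of typing it interactively
login(token=os.environ["HF_TOKEN"])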

Code Example with a Small Dataset

from itertools import islice

from datasets import load_dataset

dataset = load_dataset(
    "code_search_net",
    "python",                  # Language filter
    split="train",
    streaming=True,            # Enable streaming for HPC
    trust_remote_code=True,    # The dataset ships its own loading script
)

# Fetch a small batch (5 examples) to demonstrate streaming access
batch = list(islice(dataset, 5))
print(f"Loaded {len(batch)} examples")
print(f"Available fields: {list(batch[0].keys())}")

Conclusion

While Kaggle serves well for exploratory data analysis and competitions, HuggingFace provides the infrastructure, legal clarity, and technical capabilities required for training production-grade LLMs on HPC systems. The streaming capabilities alone make it indispensable for handling the massive datasets needed for effective code generation models.


2. Open-Source Python Datasets for LLM Training on HPC Infrastructure

When developing Large Language Models (LLMs) for code generation on High-Performance Computing (HPC) infrastructure, dataset selection represents a critical architectural decision. This section evaluates the current landscape of open-source Python training datasets, with particular emphasis on scalability, legal compliance, and technical integration requirements for production HPC environments.

Recommended Dataset: The Stack v2

The Stack v2 emerges as the superior dataset for training production-grade code LLMs, representing the next generation of curated code datasets. Developed as part of the BigCode Project, this dataset addresses the fundamental limitations of previous code collections through advanced deduplication techniques, comprehensive license verification, and unprecedented scale.

Technical Specifications

The Stack v2 encompasses 67.5TB of total data containing over 3.28B unique files from 104.2M GitHub repositories, collected through systematic traversal of the Software Heritage 2023-09-06 graph dataset. The dataset provides four distinct variants optimized for different training scenarios:

| Variant | Size | Description | Use Case |
| --- | --- | --- | --- |
| bigcode/the-stack-v2 | 67.5 TB | Complete dataset with file IDs | Full-scale training |
| bigcode/the-stack-v2-dedup | 32.1 TB | Near-deduplicated version | Quality-focused training |
| bigcode/the-stack-v2-train-full-ids | ~900B tokens | 600+ languages, repository-grouped | Production training |
| bigcode/the-stack-v2-train-smol-ids | ~200B tokens | 17 core languages, repository-grouped | Efficient training |
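
A minimal loading sketch for the smallest training variant, assuming access to the gated dataset has already been granted on the Hub and the login from Section 1 is in place. The *-ids variants ship repository/file identifiers plus metadata rather than file contents, so the exact fields should be verified on the dataset card:

from itertools import islice
from datasets import load_dataset

# Stream the repository-grouped "smol" variant without materialising it on disk
stack_smol = load_dataset(
    "bigcode/the-stack-v2-train-smol-ids",
    split="train",
    streaming=True,
)

for record in islice(stack_smol, 2):
    print(sorted(record.keys()))   # Inspect which metadata fields are available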

Advanced Deduplication and Quality Assurance

The dataset implements near-deduplication on top of exact deduplication, removing approximately 40% of permissively licensed files identified as duplicates. This process significantly enhances training efficiency by reducing redundant patterns while preserving linguistic diversity across the 658 supported programming languages.
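
Consumers of the dataset do not need to rerun this step, but the idea behind near-deduplication is easy to sketch. The following uses the datasketch library with an assumed similarity threshold of 0.85, purely for illustration:

from datasketch import MinHash, MinHashLSH

def minhash_of(code, num_perm=128):
    # Build a MinHash signature from the whitespace tokens of a source file
    sig = MinHash(num_perm=num_perm)
    for token in code.split():
        sig.update(token.encode("utf-8"))
    return sig

files = {
    "a.py": "def add(a, b):\n    return a + b",
    "a_copy.py": "def add(a, b):\n    return a + b\n",   # identical up to whitespace
    "c.py": "print('hello world')",
}

lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = []
for name, code in files.items():
    sig = minhash_of(code)
    if lsh.query(sig):      # an already-kept file is near-identical
        continue            # -> drop this file as a near-duplicate
    lsh.insert(name, sig)
    kept.append(name)

print(kept)                 # ['a.py', 'c.py']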

Comprehensive License Management

The Stack v2 implements a sophisticated multi-tier license detection system:

  1. Repository-level extraction from GitHub Archive data
  2. File-level analysis using ScanCode Toolkit for 96.93% of repositories lacking explicit licenses
  3. Permissive license filtering based on Blue Oak Council standards and ScanCode categorization
  4. Propagation mechanisms for license inheritance within repository hierarchies

All included repositories comply with permissive open-source licenses (MIT, Apache, BSD, etc.), ensuring legal compliance for commercial applications.
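
Downstream users normally do not need to redo this filtering, but restricting a stream to specific licenses is straightforward with the datasets filter API. In the sketch below, the field name detected_licenses and the allow-list are assumptions to verify against the dataset card:

from datasets import load_dataset

ALLOWED = {"MIT", "Apache-2.0", "BSD-3-Clause"}   # illustrative allow-list

stream = load_dataset(
    "bigcode/the-stack-v2-train-smol-ids",
    split="train",
    streaming=True,
)

# "detected_licenses" is an assumed field name -- check it on the dataset card
permissive_only = stream.filter(
    lambda row: any(lic in ALLOWED for lic in row.get("detected_licenses", []))
)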

Ethical Considerations and Opt-Out Mechanisms

The Stack v2 implements comprehensive data governance through:

  • "Am I In The Stack?" verification system
  • Proactive opt-out mechanisms for developers
  • Regular updates removing opted-out repositories
  • PII minimization through deduplication processes

Alternative Datasets: Comparative Analysis

While The Stack v2 represents the state-of-the-art, specific use cases may benefit from alternative datasets. The following analysis provides a comprehensive comparison of available options:

| Dataset | Size (Python) | License Compliance | Recommended Use Case |
| --- | --- | --- | --- |
| The Stack v2 | ~200 GB | ✅ Comprehensive | Production LLM training |
| StarCoder2 Data | ~35 GB | ✅ Apache 2.0 | Rapid prototyping |
| The Stack v1 | ~200 GB | ✅ Permissive | Legacy compatibility |
| CodeParrot Clean | ~50 GB | ✅ Apache 2.0 | Educational/research |
| CodeSearchNet | ~5 GB | ✅ MIT | Code-documentation pairs |
| Python Code Instructions | ~1.2 GB | ✅ CC BY 4.0 | Instruction-following |
| GitHub Code (Raw) | ~1 TB | ⚠️ Mixed | Not recommended |

Performance Benchmarks

Comparative analysis based on StarCoder2 training results demonstrates The Stack v2's superiority:

  • Token diversity: 900B tokens vs. 200B tokens (Stack v1)
  • Language coverage: 658 languages vs. 358 languages
  • Deduplication efficiency: 40% reduction vs. 15% reduction
  • License compliance: 100% permissive vs. 85% verified

Conclusion

The Stack v2 represents a paradigm shift in code dataset curation, addressing the fundamental challenges of scale, quality, and legal compliance that have historically limited LLM training effectiveness. Its sophisticated deduplication algorithms, comprehensive license management, and HPC-optimized architecture make it the definitive choice for training production-grade code generation models. While alternative datasets serve specific niche applications, The Stack v2's combination of scale, quality, and technical sophistication establishes it as the new standard for code LLM training datasets.


3. Open-Source UnitTest Datasets for LLM Training on HPC Infrastructure

While Section 2 addresses the acquisition of Python code datasets, developing LLMs for automated test generation requires specialized UnitTest datasets. This section evaluates available open-source UnitTest training data with focus on scalability, quality assurance, and HPC integration for production environments.

Recommended Dataset: CodeRM-UnitTest

CodeRM-UnitTest establishes itself as the leading dataset for training LLMs for automated UnitTest generation. Developed as part of the "Dynamic Scaling of Unit Tests for Code Reward Modeling" research, this dataset addresses the fundamental challenges in generating robust and reliable Python UnitTests.

Technical Specifications

CodeRM-UnitTest encompasses ~77.2K high-quality synthetic Python UnitTests with a total size of 1.7GB in optimized Parquet format. The dataset is based on two prominent code instruction-tuning datasets: CodeFeedback-Filtered-Instruction and the training set of TACO.

| Component | Value | Description |
| --- | --- | --- |
| Total Size | 1.7 GB | Parquet-optimized for HPC streaming |
| Number of Tests | ~77.2K | Curated, synthetic UnitTests |
| Train/Test Split | 17.6K / 59.6K | Predefined partitioning |
| Base Datasets | CodeFeedback + TACO | Established code instruction datasets |
| Generation Model | Llama3.1-70B-Instruct | State-of-the-art code generation |
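
A minimal loading sketch for the dataset; the Hub identifier below is an assumption taken from the accompanying release and should be verified, and no field names are assumed beyond printing them:

from itertools import islice
from datasets import load_dataset

# Hub identifier is an assumption -- verify it against the official release
unit_tests = load_dataset(
    "KAKA22/CodeRM-UnitTest",
    split="train",
    streaming=True,       # Parquet files stream well on HPC nodes
)

for row in islice(unit_tests, 2):
    print(sorted(row.keys()))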

Advanced Quality Assurance and Metrics

The dataset implements a sophisticated quality assurance system with two critical evaluation metrics:

False Acceptance Rate (FAR): Measures the probability that UnitTests incorrectly accept invalid solutions.

False Rejection Rate (FRR): Evaluates the probability that UnitTests incorrectly reject valid solutions.

These metrics are calculated through systematic evaluation with Llama3.1-8B-Instruct generated solutions, enabling quantitative assessment of test robustness.
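
As a concrete illustration of the two metrics, the sketch below computes them from raw verdict counts; the counts are made up, only the definitions above are used:

# Outcomes of running one generated test suite against sampled solutions (illustrative numbers)
invalid_solutions_total = 200   # solutions known to be wrong
invalid_but_accepted    = 24    # wrong solutions that still passed the tests
valid_solutions_total   = 300   # solutions known to be correct
valid_but_rejected      = 21    # correct solutions that failed the tests

far = invalid_but_accepted / invalid_solutions_total   # False Acceptance Rate
frr = valid_but_rejected / valid_solutions_total       # False Rejection Rate

print(f"FAR = {far:.3f}, FRR = {frr:.3f}")             # FAR = 0.120, FRR = 0.070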

Licensing and Open-Source Compliance

CodeRM-UnitTest is released under the Apache 2.0 License, enabling full commercial usage. Key legal aspects:

  • Primary License: Apache 2.0 (fully permissive)
  • Base Datasets: Partially MIT-licensed and CC BY 4.0
  • Web-crawled Content: CC BY 4.0 compliant
  • Commercial Usage: ✅ Fully permitted
  • Modification/Redistribution: ✅ Without restrictions

Performance Benchmarks

Based on CodeRM-8B model evaluation, CodeRM-UnitTest demonstrates superior results:

  • Test Quality: FAR < 0.15, FRR < 0.10 (average)
  • Coverage: 100% Python-focused
  • Diversity: Derived from 2 established code datasets
  • Filtering: Rigorous quality assurance through ground-truth comparison
  • Scalability: HPC-optimized with streaming support

Conclusion

CodeRM-UnitTest represents a paradigm shift in UnitTest dataset curation, focusing on quality, robustness, and practical applicability. The integration of advanced quality metrics (FAR/FRR), combined with HPC-optimized architecture and complete open-source compliance, establishes it as the reference standard for training UnitTest-generating LLMs. While alternative datasets may serve specific niche applications, CodeRM-UnitTest's combination of specialization, quality assurance, and technical sophistication makes it the definitive choice for production environments.


4. Open-Source English Text Datasets for LLM Training on HPC Infrastructure

Most code-centric LLMs still need a large amount of general-purpose English to produce high-quality natural-language responses (explanations, doc-strings, commit messages, …).
Below are three battle-tested, fully open-source corpora that integrate smoothly with Hugging Face’s streaming stack and therefore with any HPC cluster.

Recommended Datasets

| Dataset | Size (Uncompressed) | License | HF Identifier | Content Sources | Preprocessing | HPC Streaming |
| --- | --- | --- | --- | --- | --- | --- |
| The Pile-Uncopyrighted | 825 GB | CC-BY-4.0 | monology/pile-uncopyrighted | PubMed, ArXiv, GitHub, Wikipedia | Aggressive copyright filtering, deduplication | ✅ Native |
| OpenWebText v2 | 38 GB | MIT | Skylion007/openwebtext | Reddit submissions (≥3 karma) | URL filtering, near-deduplication | ✅ Native |
| C4 (en) | 305 GB | CC-BY-SA-4.0 | allenai/c4 | Common Crawl (April 2019) | Heuristic cleaning, language detection | ✅ Native |

All three sets are publicly redistributable, come with explicit SPDX-compatible licenses, and are hosted on the Hugging Face Hub.
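
A minimal sketch for mixing the three corpora into a single training stream with interleave_datasets; the sampling probabilities are illustrative rather than tuned values, and only the shared text column is kept so the streams have matching features:

from itertools import islice
from datasets import load_dataset, interleave_datasets

# Keep only the shared "text" column so the three streams can be interleaved
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True).select_columns(["text"])
owt = load_dataset("Skylion007/openwebtext", split="train", streaming=True).select_columns(["text"])
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True).select_columns(["text"])

# Mix the corpora with illustrative (untuned) sampling weights
english_mix = interleave_datasets([pile, owt, c4], probabilities=[0.5, 0.2, 0.3], seed=42)

for doc in islice(english_mix, 3):
    print(doc["text"][:80])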


5. Legal Compliance and Licensing

All recommended datasets comply with open-source requirements:

  • Commercial Usage: ✅ All datasets permit commercial use
  • Redistribution: ✅ Datasets can be redistributed with proper attribution
  • Modification: ✅ Datasets can be modified and filtered
  • Attribution Requirements: Varies by license (CC-BY requires attribution)

6. Sources

Code Datasets

UnitTest Datasets

Alternative Code Datasets

English Text Datasets

Related Resources