DeepSeek R1: distillation


In the context of the DeepSeek-R1 paper, the distillation process refers to the transfer of reasoning capabilities from the large, high-performing model (DeepSeek-R1) to smaller, more efficient models that are easier to use and deploy. Let me break this down step by step, including why third-party models like Qwen and Llama were involved:

How the Distillation Process Works

  1. Teacher Model: The process starts with the teacher model, DeepSeek-R1, which is a large, well-trained model with advanced reasoning capabilities.
  2. Data Generation: DeepSeek-R1 is used to generate a curated dataset of roughly 800,000 high-quality training samples, spanning reasoning tasks, factual QA, and other domains.
    • These outputs reflect the teacher model's reasoning patterns and problem-solving techniques.
  3. Student Models: The distilled models (referred to as "student models") are smaller, more efficient models trained on the data generated by DeepSeek-R1. These student models are not trained from scratch; they leverage pre-existing architectures such as Qwen and Llama.
  4. Supervised Fine-Tuning (SFT):
    • The student models are fine-tuned on the data generated by DeepSeek-R1, aligning their outputs with the reasoning and performance of the teacher model (a minimal training sketch follows this list).
    • No additional reinforcement learning (RL) is applied to the student models in this paper.
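
To make steps 2–4 concrete, here is a minimal sketch of the supervised fine-tuning step: a small student model is trained with an ordinary next-token objective on (prompt, response) pairs written by the teacher. The Qwen checkpoint name, the toy arithmetic sample, and the hyperparameters are illustrative assumptions, not details from the paper, which uses roughly 800,000 curated samples and its own training stack.

```python
# Minimal distillation-as-SFT sketch (assumed setup, not the paper's actual pipeline).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-1.5B"  # assumption: any small Qwen/Llama base checkpoint could serve as the student
tokenizer = AutoTokenizer.from_pretrained(student_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
student = AutoModelForCausalLM.from_pretrained(student_name)

# Teacher-written (prompt, response) pairs stand in for the ~800k curated samples.
teacher_samples = [
    {"prompt": "What is 17 * 23?",
     "response": "<think>17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391</think>\nThe answer is 391."},
]

def collate(batch):
    texts = [ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=1024)
    # Standard SFT objective: the student imitates the teacher's tokens; padding is masked from the loss.
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

loader = DataLoader(teacher_samples, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

student.train()
for batch in loader:
    loss = student(**batch).loss  # cross-entropy against the teacher-written tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```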

Why Use Third-Party Models (Qwen and Llama)?

  1. Starting Point: The authors used Qwen and Llama as pre-trained base models because they already have robust foundations for reasoning and general-purpose language tasks. This provides a strong starting point, saving significant computational resources and time compared to training a new model architecture from scratch.
  2. Efficiency: These third-party models have efficient and well-optimized architectures that are widely recognized in the research community. By fine-tuning them with the distilled knowledge, the authors can achieve strong reasoning performance while maintaining a smaller model size.
  3. Compatibility: Qwen and Llama offer flexibility and compatibility with existing open-source ecosystems, making the distilled models more accessible to the research community.

Key Insights

  • The distillation process uses the teacher model's outputs to train smaller models that are already pre-trained but lack the advanced reasoning capabilities of DeepSeek-R1. By fine-tuning Qwen and Llama models on data generated by DeepSeek-R1, the authors essentially "teach" these smaller models to reason in a manner similar to the teacher model (see the inference sketch after this list).
  • The choice of third-party models is practical: it avoids reinventing the wheel by leveraging already-strong architectures as a foundation for distillation.
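
For readers who just want to try one of the distilled models, the sketch below loads a released checkpoint and generates a response. The Hugging Face model name is assumed from the DeepSeek-R1 open-source release (the other distilled Qwen/Llama sizes follow the same pattern), and the prompt is an arbitrary example; `device_map="auto"` additionally requires the accelerate package.

```python
# Inference sketch with an assumed distilled checkpoint from the DeepSeek-R1 release.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumption: one of the released distilled variants
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user",
             "content": "A train travels 60 km in 45 minutes. What is its average speed in km/h?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
# Print only the newly generated tokens (the model's chain-of-thought plus final answer).
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```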

Why Not Use DeepSeek’s Own Architecture?

DeepSeek may not have smaller-scale model architectures that are as optimized as Qwen and Llama, which are well established in the field. Using these well-known third-party architectures allows the authors to:

  • Focus on transferring reasoning capabilities rather than re-developing efficient smaller models.
  • Offer their work as an open-source contribution compatible with widely adopted models.

This approach demonstrates a practical way to share advanced reasoning capabilities with the community while minimizing development overhead.