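DeepSeek R1 Reading List

Resources related to DeepSeek R1. I have personally read and curated every item; this is not an indiscriminate list of links.

> Last updated: March 1, 2025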

- Articles
	- [Reasoning best practices](https://platform.openai.com/docs/guides/reasoning-best-practices): **[Highlight]** OpenAI's best practices for reasoning models. A must-read.
	- Greg's prompt for reasoning models: ![Image](https://github.com/user-attachments/assets/9ea9e1ea-a1b0-4971-9a27-a6a70b19541b)
	- [Understanding Reasoning LLMs](https://magazine.sebastianraschka.com/p/understanding-reasoning-llms): A somewhat more academic article.
	- [A Visual Guide to Reasoning LLMs](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms): **[Highlight]** An excellent introduction with very well-made visualizations.
	- [DeepSeek-R1 Dissection: Understanding PPO & GRPO Without Any Prior Reinforcement Learning Knowledge](https://huggingface.co/blog/NormalUhr/grpo): A non-mathematical explanation of the GRPO algorithm, suited to readers who do not work on algorithms.
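- Papers
	- [DeepSeek R1](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf): **[Highlight]** The DeepSeek R1 paper itself. An engaging read.
	- [Kimi K1.5](https://arxiv.org/pdf/2501.12599v1): The Kimi K1.5 reasoning model follows an approach similar to R1's and gives more detail on data and reward functions.
	- [DeepSeek Math](https://arxiv.org/pdf/2402.03300): The paper that introduced GRPO. Compared with PPO, GRPO drops the value model, which lowers the GPU memory needed for training.
	- [SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training](https://arxiv.org/abs/2501.17161): A study of the effects and application areas of SFT versus RL. Treat the conclusions as provisional; they still need a lot of validation in practice.
	- Many reasoning-model papers have appeared recently, but none has yet stood the test of time. I will add more as I read them.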
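- Open-source GRPO implementations: the main requirement is support for reward functions.
	- [trl grpo trainer](https://huggingface.co/docs/trl/main/en/grpo_trainer): TRL's GRPOTrainer implementation.
	- [veRL](https://github.com/volcengine/verl): ByteDance's open-source RL implementation, which also supports GRPO reward functions.
	- [Unsloth](https://docs.unsloth.ai/basics/reasoning-grpo-and-rl): **[Highlight]** Unsloth's GRPO implementation, which can substantially reduce GPU memory usage.
	- [verifiers](https://github.com/willccbb/verifiers): A set of prepackaged verifiers.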
- R1 replication projects and datasets
	- [open-r1](https://github.com/huggingface/open-r1/): **[Highlight]** Includes code for data synthesis, SFT, and GRPO RL.
	- [TinyZero](https://github.com/Jiayi-Pan/TinyZero): Replicates the R1 RL paradigm on a simple 24-game-style puzzle.
	- [SkyT1](https://github.com/NovaSky-AI/SkyThought): An o1-like model trained on data distilled from QwQ.
	- [HuatuoGPT-o1](https://github.com/FreedomIntelligence/HuatuoGPT-o1): An o1 replication for the medical domain (code, data, paper, and models are all open). It still relies on a reward model, and the improvement is small. It would be worth trying the R1 RL paradigm to see whether it yields a clear gain.
	- [simpleRL-reason](https://github.com/hkust-nlp/simpleRL-reason): **[Highlight]** Replicates the R1-Zero paradigm on the 8k MATH dataset.
	- [open-r1-multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal): A multimodal R1 replication project.
	- [open-thoughts](https://github.com/open-thoughts/open-thoughts): **[Highlight]** The most mature R1 replication project. It has released the [Bespoke-Stratos-17k dataset](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k) and the [OpenThoughts-114k dataset](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k); with SFT alone, models can approach the R1-distill models.
	- [R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT): A dataset of 1.68M R1-distilled samples.
	- [Chinese-DeepSeek-R1-Distill-data-110k](https://huggingface.co/datasets/Congliu/Chinese-DeepSeek-R1-Distill-data-110k): **[Highlight]** A Chinese distillation dataset.
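
To make the two GRPO points in this list concrete (no value model, and a pluggable reward function), here is a minimal sketch in plain Python. It is not taken from any of the linked projects. The function names, the `<answer>` tag convention, and the use of the population standard deviation are all assumptions made for illustration; real trainers such as TRL or veRL define their own reward-function signatures and normalization details.

```python
import re
from statistics import mean, pstdev


def answer_reward(completion: str, gold: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the text inside
    <answer>...</answer> matches the gold answer, otherwise 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # missing or malformed answer tags earn nothing
    return 1.0 if match.group(1).strip() == gold.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled completion relative to its own group.

    The group mean acts as the baseline, which is why no learned
    value model is needed. Population std is an assumption here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt whose gold answer is "42".
completions = [
    "<think>6 * 7</think><answer>42</answer>",
    "<think>6 + 7</think><answer>13</answer>",
    "no tags at all, just 42",
    "<answer> 42 </answer>",
]
rewards = [answer_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)
```

In this example the two correct completions get an advantage of roughly +1 and the two incorrect ones roughly -1. If every completion in a group receives the same reward, all advantages are 0 and that prompt contributes no learning signal.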
