
Commit 0e434d4

[Examples] Kimi-K2-Thinking example (#7988)

* Add Kimi-K2-Thinking
* use bash codeblock
* Update title
* Update llm/kimi-k2-thinking/README.md

Co-authored-by: Seung Jin <[email protected]>

1 parent 5e0a44d · commit 0e434d4

6 files changed: +234 −2 lines changed

README.md

Lines changed: 3 additions & 2 deletions
@@ -39,6 +39,7 @@
 ----

 :fire: *News* :fire:
+- [Nov 2025] Serve **Kimi K2 Thinking** with reasoning capabilities on your Kubernetes or clouds: [**example**](./llm/kimi-k2-thinking/)
 - [Oct 2025] Run **RL training for LLMs** with SkyRL on your Kubernetes or clouds: [**example**](./llm/skyrl/)
 - [Oct 2025] Train and serve [Andrej Karpathy's](https://x.com/karpathy/status/1977755427569111362) **nanochat** - the best ChatGPT that $100 can buy: [**example**](./llm/nanochat)
 - [Oct 2025] Run large-scale **LLM training with TorchTitan** on any AI infra: [**example**](./examples/training/torchtitan)
@@ -49,7 +50,7 @@
 - [Jul 2025] Finetune **Llama4** on any distributed cluster/cloud: [**example**](./llm/llama-4-finetuning/)
 - [Jul 2025] Two-part blog series, `The Evolution of AI Job Orchestration`: (1) [Running AI jobs on GPU Neoclouds](https://blog.skypilot.co/ai-job-orchestration-pt1-gpu-neoclouds/), (2) [The AI-Native Control Plane & Orchestration that Finally Works for ML](https://blog.skypilot.co/ai-job-orchestration-pt2-ai-control-plane/)
 - [Apr 2025] Spin up **Qwen3** on your cluster/cloud: [**example**](./llm/qwen/)
-- [Feb 2025] Prepare and serve **Retrieval Augmented Generation (RAG) with DeepSeek-R1**: [**blog post**](https://blog.skypilot.co/deepseek-rag), [**example**](./llm/rag/)
+


 **LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)
@@ -183,7 +184,7 @@ Latest featured examples:
 |----------|----------|
 | Training | [Verl](https://docs.skypilot.co/en/latest/examples/training/verl.html), [Finetune Llama 4](https://docs.skypilot.co/en/latest/examples/training/llama-4-finetuning.html), [TorchTitan](https://docs.skypilot.co/en/latest/examples/training/torchtitan.html), [PyTorch](https://docs.skypilot.co/en/latest/getting-started/tutorial.html), [DeepSpeed](https://docs.skypilot.co/en/latest/examples/training/deepspeed.html), [NeMo](https://docs.skypilot.co/en/latest/examples/training/nemo.html), [Ray](https://docs.skypilot.co/en/latest/examples/training/ray.html), [Unsloth](https://docs.skypilot.co/en/latest/examples/training/unsloth.html), [Jax/TPU](https://docs.skypilot.co/en/latest/examples/training/tpu.html) |
 | Serving | [vLLM](https://docs.skypilot.co/en/latest/examples/serving/vllm.html), [SGLang](https://docs.skypilot.co/en/latest/examples/serving/sglang.html), [Ollama](https://docs.skypilot.co/en/latest/examples/serving/ollama.html) |
-| Models | [DeepSeek-R1](https://docs.skypilot.co/en/latest/examples/models/deepseek-r1.html), [Llama 4](https://docs.skypilot.co/en/latest/examples/models/llama-4.html), [Llama 3](https://docs.skypilot.co/en/latest/examples/models/llama-3.html), [CodeLlama](https://docs.skypilot.co/en/latest/examples/models/codellama.html), [Qwen](https://docs.skypilot.co/en/latest/examples/models/qwen.html), [Kimi-K2](https://docs.skypilot.co/en/latest/examples/models/kimi-k2.html), [Mixtral](https://docs.skypilot.co/en/latest/examples/models/mixtral.html) |
+| Models | [DeepSeek-R1](https://docs.skypilot.co/en/latest/examples/models/deepseek-r1.html), [Llama 4](https://docs.skypilot.co/en/latest/examples/models/llama-4.html), [Llama 3](https://docs.skypilot.co/en/latest/examples/models/llama-3.html), [CodeLlama](https://docs.skypilot.co/en/latest/examples/models/codellama.html), [Qwen](https://docs.skypilot.co/en/latest/examples/models/qwen.html), [Kimi-K2](https://docs.skypilot.co/en/latest/examples/models/kimi-k2.html), [Kimi-K2-Thinking](https://docs.skypilot.co/en/latest/examples/models/kimi-k2-thinking.html), [Mixtral](https://docs.skypilot.co/en/latest/examples/models/mixtral.html) |
 | AI apps | [RAG](https://docs.skypilot.co/en/latest/examples/applications/rag.html), [vector databases](https://docs.skypilot.co/en/latest/examples/applications/vector_database.html) (ChromaDB, CLIP) |
 | Common frameworks | [Airflow](https://docs.skypilot.co/en/latest/examples/frameworks/airflow.html), [Jupyter](https://docs.skypilot.co/en/latest/examples/frameworks/jupyter.html), [marimo](https://docs.skypilot.co/en/latest/examples/frameworks/marimo.html) |
docs/source/examples/models/index.rst

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ Models
 Mistral 7B <https://docs.mistral.ai/self-deployment/skypilot/>
 Qwen 3 <qwen>
 Kimi K2 <kimi-k2>
+Kimi K2 Thinking <kimi-k2-thinking>
 Yi <yi>
 Gemma <gemma>
 DBRX <dbrx>
docs/source/examples/models/kimi-k2-thinking.md

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+../../generated-examples/kimi-k2-thinking.md

llm/kimi-k2-thinking/README.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@

<!-- $REMOVE -->
# Run Kimi K2 Thinking on Kubernetes or Any Cloud
<!-- $END_REMOVE -->
<!-- $UNCOMMENT# Kimi K2 Thinking -->

[Kimi K2 Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) is an advanced large language model created by [Moonshot AI](https://www.moonshot.ai/).

This recipe shows how to run Kimi K2 Thinking with reasoning capabilities on your Kubernetes cluster or on any cloud. It includes two modes:

- **Low Latency (TP8)**: best for interactive applications that require quick responses
- **High Throughput (TP8+DCP8)**: best for batch processing and high-volume serving scenarios

## Prerequisites

- Check that you have installed SkyPilot ([docs](https://docs.skypilot.co/en/latest/getting-started/installation.html)).
- Check that `sky check` shows at least one cloud or Kubernetes is enabled.
- **Note**: This model requires 8x H200 or H20 GPUs; see the quick availability check below.
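
Before launching, you can confirm that your enabled infra can actually supply the required GPUs. A minimal sketch, assuming a recent SkyPilot version:

```bash
# Verify which clouds/Kubernetes are enabled, then list where 8x H200 GPUs are offered.
sky check
sky show-gpus H200:8
```
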
## Run Kimi K2 Thinking (Low Latency Mode)

For low-latency scenarios, use tensor parallelism:

```bash
sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
```

`kimi-k2-thinking.sky.yaml` uses **tensor parallelism** across 8 GPUs for optimal low-latency performance.
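
For reference, its `run` section (reproduced from `kimi-k2-thinking.sky.yaml`, included in full below) starts vLLM with TP8 only:

```yaml
run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (Low Latency Mode)...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```
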
🎉 **Congratulations!** 🎉 You have now launched the Kimi K2 Thinking LLM with reasoning capabilities on your infra.
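
The first launch can take a while, since the model weights are large. You can follow progress with standard SkyPilot commands (a quick sketch):

```bash
# Stream the logs until vLLM reports it is serving on port 8081.
sky logs kimi-k2-thinking

# Confirm the cluster is UP.
sky status kimi-k2-thinking
```
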
## Run Kimi K2 Thinking (High Throughput Mode)

For high-throughput scenarios, use Decode Context Parallel (DCP) for **43% faster token generation** and **26% higher request throughput**:

```bash
sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
```

`kimi-k2-thinking-high-throughput.sky.yaml` adds `--decode-context-parallel-size 8` to enable DCP:

```yaml
run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```
### DCP Performance Gains

From [vLLM's benchmark](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2-Think.html):

| Metric | TP8 (Low Latency) | TP8+DCP8 (High Throughput) | Improvement |
|--------|-------------------|----------------------------|-------------|
| Request throughput (req/s) | 1.25 | 1.57 | **+25.6%** |
| Output token throughput (tok/s) | 485.78 | 695.13 | **+43.1%** |
| Mean TTFT (s) | 271.2 | 227.8 | **16.0% lower** |
| KV cache size (tokens) | 715,072 | 5,721,088 | **8x** |

The 8x KV cache capacity comes from DCP sharding the KV cache across the GPUs rather than replicating it on every tensor-parallel rank; the larger cache is what enables the bigger batches behind the throughput gains.
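
If you launched the high-throughput variant, the OpenAI-compatible endpoint below works the same way; just fetch it with the `-ht` cluster name used at launch:

```bash
# Endpoint for the high-throughput cluster.
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking-ht)
```
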
## Chat with Kimi K2 Thinking via the OpenAI API

To curl `/v1/chat/completions`:

```bash
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Explain how to solve the traveling salesman problem for 10 cities."
      }
    ]
  }' | jq .
```

The model will include its reasoning process in the response, showing its chain-of-thought approach.
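
Because the YAML passes `--reasoning-parser kimi_k2`, vLLM separates the chain of thought from the final answer; in recent vLLM releases it lands in a `reasoning_content` field on the message (a sketch, assuming that field name):

```bash
# Save one response, then split out the reasoning trace and the final answer.
RESPONSE=$(curl -s http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}]
  }')

echo "$RESPONSE" | jq -r '.choices[0].message.reasoning_content'  # chain of thought
echo "$RESPONSE" | jq -r '.choices[0].message.content'            # final answer
```
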
93+
94+
## Clean up resources
95+
To shut down all resources:
96+
97+
```bash
98+
sky down kimi-k2-thinking
99+
```
100+
101+
## Serving Kimi-K2-Thinking: scaling up with SkyServe

With no change to the YAML, launch a fully managed service with autoscaling replicas and load-balancing on your infra:

```bash
sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking
```

Wait until the service is ready:

```bash
watch -n10 sky serve status kimi-k2-thinking
```

Get a single endpoint that load-balances across replicas:

```bash
ENDPOINT=$(sky serve status --endpoint kimi-k2-thinking)
```

> **Tip:** SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
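
To scale out later, one option is to increase `replicas` under the `service` section of the YAML and roll the running service to the new spec (a sketch using `sky serve update`; check your SkyPilot version's CLI reference for the exact arguments):

```bash
# After editing service.replicas in the YAML, push the new spec to the service.
sky serve update kimi-k2-thinking kimi-k2-thinking.sky.yaml
```
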
To curl the endpoint:

```bash
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Design a distributed system for real-time analytics."
      }
    ]
  }' | jq .
```

To shut down all resources:

```bash
sky serve down kimi-k2-thinking
```

See more details in the [SkyServe docs](https://docs.skypilot.co/en/latest/serving/sky-serve.html).
llm/kimi-k2-thinking/kimi-k2-thinking-high-throughput.sky.yaml

Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
# Serve Kimi-K2-Thinking with SkyPilot and vLLM (High Throughput Mode).
# Uses Decode Context Parallel (DCP) for 43% faster token generation and 26% higher throughput.
#
# Usage:
#   sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
#   sky serve up kimi-k2-thinking-high-throughput.sky.yaml -n kimi-k2-thinking-ht

envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  image_id: docker:vllm/vllm-openai:nightly-f849ee739cdb3d82fce1660a6fd91806e8ae9bff
  accelerators: H200:8
  cpus: 100+
  memory: 1000+
  ports: 8081

run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code

service:
  replicas: 1
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: What is 2+2?
      max_tokens: 10
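
Both YAMLs pin `accelerators: H200:8`. Since the prerequisites also allow 8x H20, you can try overriding the accelerator at launch time instead of editing the file (a sketch using `sky launch`'s `--gpus` flag):

```bash
# Launch the same YAML on H20 GPUs instead of the pinned H200s.
sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht --gpus H20:8
```
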
llm/kimi-k2-thinking/kimi-k2-thinking.sky.yaml

Lines changed: 39 additions & 0 deletions

@@ -0,0 +1,39 @@
# Serve Kimi-K2-Thinking with SkyPilot and vLLM (Low Latency Mode).
# This model supports deep thinking & tool orchestration with reasoning capabilities.
#
# Usage:
#   sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
#   sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking

envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  image_id: docker:vllm/vllm-openai:nightly-f849ee739cdb3d82fce1660a6fd91806e8ae9bff
  accelerators: H200:8
  cpus: 100+
  memory: 1000+
  ports: 8081

run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (Low Latency Mode)...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code

service:
  replicas: 1
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: What is 2+2?
      max_tokens: 10
