
Commit 0e434d4

[Examples] Kimi-K2-Thinking example (#7988)

* Add Kimi-K2-Thinking
* use bash codeblock
* Update title
* Update llm/kimi-k2-thinking/README.md

Co-authored-by: Seung Jin <[email protected]>

1 parent 5e0a44d · commit 0e434d4

6 files changed: +234 −2 lines changed

README.md

Lines changed: 3 additions & 2 deletions
@@ -39,6 +39,7 @@
 ----

 :fire: *News* :fire:
+- [Nov 2025] Serve **Kimi K2 Thinking** with reasoning capabilities on your Kubernetes or clouds: [**example**](./llm/kimi-k2-thinking/)
 - [Oct 2025] Run **RL training for LLMs** with SkyRL on your Kubernetes or clouds: [**example**](./llm/skyrl/)
 - [Oct 2025] Train and serve [Andrej Karpathy's](https://x.com/karpathy/status/1977755427569111362) **nanochat** - the best ChatGPT that $100 can buy: [**example**](./llm/nanochat)
 - [Oct 2025] Run large-scale **LLM training with TorchTitan** on any AI infra: [**example**](./examples/training/torchtitan)
@@ -49,7 +50,7 @@
 - [Jul 2025] Finetune **Llama4** on any distributed cluster/cloud: [**example**](./llm/llama-4-finetuning/)
 - [Jul 2025] Two-part blog series, `The Evolution of AI Job Orchestration`: (1) [Running AI jobs on GPU Neoclouds](https://blog.skypilot.co/ai-job-orchestration-pt1-gpu-neoclouds/), (2) [The AI-Native Control Plane & Orchestration that Finally Works for ML](https://blog.skypilot.co/ai-job-orchestration-pt2-ai-control-plane/)
 - [Apr 2025] Spin up **Qwen3** on your cluster/cloud: [**example**](./llm/qwen/)
-- [Feb 2025] Prepare and serve **Retrieval Augmented Generation (RAG) with DeepSeek-R1**: [**blog post**](https://blog.skypilot.co/deepseek-rag), [**example**](./llm/rag/)
+


 **LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)
@@ -183,7 +184,7 @@ Latest featured examples:
 |----------|----------|
 | Training | [Verl](https://docs.skypilot.co/en/latest/examples/training/verl.html), [Finetune Llama 4](https://docs.skypilot.co/en/latest/examples/training/llama-4-finetuning.html), [TorchTitan](https://docs.skypilot.co/en/latest/examples/training/torchtitan.html), [PyTorch](https://docs.skypilot.co/en/latest/getting-started/tutorial.html), [DeepSpeed](https://docs.skypilot.co/en/latest/examples/training/deepspeed.html), [NeMo](https://docs.skypilot.co/en/latest/examples/training/nemo.html), [Ray](https://docs.skypilot.co/en/latest/examples/training/ray.html), [Unsloth](https://docs.skypilot.co/en/latest/examples/training/unsloth.html), [Jax/TPU](https://docs.skypilot.co/en/latest/examples/training/tpu.html) |
 | Serving | [vLLM](https://docs.skypilot.co/en/latest/examples/serving/vllm.html), [SGLang](https://docs.skypilot.co/en/latest/examples/serving/sglang.html), [Ollama](https://docs.skypilot.co/en/latest/examples/serving/ollama.html) |
-| Models | [DeepSeek-R1](https://docs.skypilot.co/en/latest/examples/models/deepseek-r1.html), [Llama 4](https://docs.skypilot.co/en/latest/examples/models/llama-4.html), [Llama 3](https://docs.skypilot.co/en/latest/examples/models/llama-3.html), [CodeLlama](https://docs.skypilot.co/en/latest/examples/models/codellama.html), [Qwen](https://docs.skypilot.co/en/latest/examples/models/qwen.html), [Kimi-K2](https://docs.skypilot.co/en/latest/examples/models/kimi-k2.html), [Mixtral](https://docs.skypilot.co/en/latest/examples/models/mixtral.html) |
+| Models | [DeepSeek-R1](https://docs.skypilot.co/en/latest/examples/models/deepseek-r1.html), [Llama 4](https://docs.skypilot.co/en/latest/examples/models/llama-4.html), [Llama 3](https://docs.skypilot.co/en/latest/examples/models/llama-3.html), [CodeLlama](https://docs.skypilot.co/en/latest/examples/models/codellama.html), [Qwen](https://docs.skypilot.co/en/latest/examples/models/qwen.html), [Kimi-K2](https://docs.skypilot.co/en/latest/examples/models/kimi-k2.html), [Kimi-K2-Thinking](https://docs.skypilot.co/en/latest/examples/models/kimi-k2-thinking.html), [Mixtral](https://docs.skypilot.co/en/latest/examples/models/mixtral.html) |
 | AI apps | [RAG](https://docs.skypilot.co/en/latest/examples/applications/rag.html), [vector databases](https://docs.skypilot.co/en/latest/examples/applications/vector_database.html) (ChromaDB, CLIP) |
 | Common frameworks | [Airflow](https://docs.skypilot.co/en/latest/examples/frameworks/airflow.html), [Jupyter](https://docs.skypilot.co/en/latest/examples/frameworks/jupyter.html), [marimo](https://docs.skypilot.co/en/latest/examples/frameworks/marimo.html) |
docs/source/examples/models/index.rst

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ Models
 Mistral 7B <https://docs.mistral.ai/self-deployment/skypilot/>
 Qwen 3 <qwen>
 Kimi K2 <kimi-k2>
+Kimi K2 Thinking <kimi-k2-thinking>
 Yi <yi>
 Gemma <gemma>
 DBRX <dbrx>
docs/source/examples/models/kimi-k2-thinking.md

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+../../generated-examples/kimi-k2-thinking.md

llm/kimi-k2-thinking/README.md

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@

<!-- $REMOVE -->
# Run Kimi K2 Thinking on Kubernetes or Any Cloud
<!-- $END_REMOVE -->
<!-- $UNCOMMENT# Kimi K2 Thinking -->

[Kimi K2 Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) is an advanced large language model created by [Moonshot AI](https://www.moonshot.ai/).

This recipe shows how to run Kimi K2 Thinking with reasoning capabilities on your Kubernetes cluster or on any cloud. It includes two modes:

- **Low Latency (TP8)**: best for interactive applications that require quick responses
- **High Throughput (TP8+DCP8)**: best for batch processing and high-volume serving scenarios

## Prerequisites

- Check that you have installed SkyPilot ([docs](https://docs.skypilot.co/en/latest/getting-started/installation.html)).
- Check that `sky check` shows at least one cloud or Kubernetes is enabled.
- **Note**: This model requires 8x H200 or H20 GPUs; see the quick availability check below.
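
Before launching, you can confirm that your enabled infra can actually supply the required GPUs. A minimal sketch, assuming a recent SkyPilot version:

```bash
# Verify which clouds/Kubernetes are enabled, then list where 8x H200 GPUs are offered.
sky check
sky show-gpus H200:8
```
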
## Run Kimi K2 Thinking (Low Latency Mode)

For low-latency scenarios, use tensor parallelism:

```bash
sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
```

`kimi-k2-thinking.sky.yaml` uses **tensor parallelism** across 8 GPUs for optimal low-latency performance.
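
For reference, its `run` section (reproduced from `kimi-k2-thinking.sky.yaml`, included in full below) starts vLLM with TP8 only:

```yaml
run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (Low Latency Mode)...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```
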
🎉 **Congratulations!** 🎉 You have now launched the Kimi K2 Thinking LLM with reasoning capabilities on your infra.
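
The first launch can take a while, since the model weights are large. You can follow progress with standard SkyPilot commands (a quick sketch):

```bash
# Stream the logs until vLLM reports it is serving on port 8081.
sky logs kimi-k2-thinking

# Confirm the cluster is UP.
sky status kimi-k2-thinking
```
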
## Run Kimi K2 Thinking (High Throughput Mode)

For high-throughput scenarios, use Decode Context Parallel (DCP) for **43% faster token generation** and **26% higher request throughput**:

```bash
sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
```

`kimi-k2-thinking-high-throughput.sky.yaml` adds `--decode-context-parallel-size 8` to enable DCP:

```yaml
run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```
### DCP Performance Gains

From [vLLM's benchmark](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2-Think.html):

| Metric | TP8 (Low Latency) | TP8+DCP8 (High Throughput) | Improvement |
|--------|-------------------|----------------------------|-------------|
| Request throughput (req/s) | 1.25 | 1.57 | **+25.6%** |
| Output token throughput (tok/s) | 485.78 | 695.13 | **+43.1%** |
| Mean TTFT (s) | 271.2 | 227.8 | **16.0% lower** |
| KV cache size (tokens) | 715,072 | 5,721,088 | **8x** |

The 8x KV cache capacity comes from DCP sharding the KV cache across the GPUs rather than replicating it on every tensor-parallel rank; the larger cache is what enables the bigger batches behind the throughput gains.
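
If you launched the high-throughput variant, the OpenAI-compatible endpoint below works the same way; just fetch it with the `-ht` cluster name used at launch:

```bash
# Endpoint for the high-throughput cluster.
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking-ht)
```
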
## Chat with Kimi K2 Thinking via the OpenAI API

To curl `/v1/chat/completions`:

```bash
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Explain how to solve the traveling salesman problem for 10 cities."
      }
    ]
  }' | jq .
```

The model will include its reasoning process in the response, showing its chain-of-thought approach.
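
Because the YAML passes `--reasoning-parser kimi_k2`, vLLM separates the chain of thought from the final answer; in recent vLLM releases it lands in a `reasoning_content` field on the message (a sketch, assuming that field name):

```bash
# Save one response, then split out the reasoning trace and the final answer.
RESPONSE=$(curl -s http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}]
  }')

echo "$RESPONSE" | jq -r '.choices[0].message.reasoning_content'  # chain of thought
echo "$RESPONSE" | jq -r '.choices[0].message.content'            # final answer
```
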
93+
94+
## Clean up resources
95+
To shut down all resources:
96+
97+
```bash
98+
sky down kimi-k2-thinking
99+
```
100+
101+
## Serving Kimi-K2-Thinking: scaling up with SkyServe

With no change to the YAML, launch a fully managed service with autoscaling replicas and load-balancing on your infra:

```bash
sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking
```

Wait until the service is ready:

```bash
watch -n10 sky serve status kimi-k2-thinking
```

Get a single endpoint that load-balances across replicas:

```bash
ENDPOINT=$(sky serve status --endpoint kimi-k2-thinking)
```

> **Tip:** SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.
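
To scale out later, one option is to increase `replicas` under the `service` section of the YAML and roll the running service to the new spec (a sketch using `sky serve update`; check your SkyPilot version's CLI reference for the exact arguments):

```bash
# After editing service.replicas in the YAML, push the new spec to the service.
sky serve update kimi-k2-thinking kimi-k2-thinking.sky.yaml
```
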
To curl the endpoint:

```bash
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Design a distributed system for real-time analytics."
      }
    ]
  }' | jq .
```

To shut down all resources:

```bash
sky serve down kimi-k2-thinking
```

See more details in the [SkyServe docs](https://docs.skypilot.co/en/latest/serving/sky-serve.html).
llm/kimi-k2-thinking/kimi-k2-thinking-high-throughput.sky.yaml

Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
# Serve Kimi-K2-Thinking with SkyPilot and vLLM (High Throughput Mode).
# Uses Decode Context Parallel (DCP) for 43% faster token generation and 26% higher throughput.
#
# Usage:
#   sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
#   sky serve up kimi-k2-thinking-high-throughput.sky.yaml -n kimi-k2-thinking-ht

envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  image_id: docker:vllm/vllm-openai:nightly-f849ee739cdb3d82fce1660a6fd91806e8ae9bff
  accelerators: H200:8
  cpus: 100+
  memory: 1000+
  ports: 8081

run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code

service:
  replicas: 1
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: What is 2+2?
      max_tokens: 10
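
Both YAMLs pin `accelerators: H200:8`. Since the prerequisites also allow 8x H20, you can try overriding the accelerator at launch time instead of editing the file (a sketch using `sky launch`'s `--gpus` flag):

```bash
# Launch the same YAML on H20 GPUs instead of the pinned H200s.
sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht --gpus H20:8
```
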
llm/kimi-k2-thinking/kimi-k2-thinking.sky.yaml

Lines changed: 39 additions & 0 deletions

@@ -0,0 +1,39 @@
# Serve Kimi-K2-Thinking with SkyPilot and vLLM (Low Latency Mode).
# This model supports deep thinking & tool orchestration with reasoning capabilities.
#
# Usage:
#   sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
#   sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking

envs:
  MODEL_NAME: moonshotai/Kimi-K2-Thinking

resources:
  image_id: docker:vllm/vllm-openai:nightly-f849ee739cdb3d82fce1660a6fd91806e8ae9bff
  accelerators: H200:8
  cpus: 100+
  memory: 1000+
  ports: 8081

run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (Low Latency Mode)...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code

service:
  replicas: 1
  # An actual request for readiness probe.
  readiness_probe:
    path: /v1/chat/completions
    post_data:
      model: $MODEL_NAME
      messages:
        - role: user
          content: What is 2+2?
      max_tokens: 10
