<!-- $REMOVE -->
# Run Kimi K2 Thinking on Kubernetes or Any Cloud
<!-- $END_REMOVE -->
<!-- $UNCOMMENT# Kimi K2 Thinking -->

[Kimi K2 Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) is an advanced large language model created by [Moonshot AI](https://www.moonshot.ai/).

This recipe shows how to run Kimi K2 Thinking, with its reasoning capabilities, on your Kubernetes cluster or on any cloud. It includes two modes:

- **Low Latency (TP8)**: Best for interactive applications requiring quick responses
- **High Throughput (TP8+DCP8)**: Best for batch processing and high-volume serving scenarios

## Prerequisites

- Check that you have installed SkyPilot ([docs](https://docs.skypilot.co/en/latest/getting-started/installation.html)).
- Check that `sky check` shows that clouds or Kubernetes are enabled.
- **Note**: This model requires 8x H200 or H20 GPUs; see the sanity check below.
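
Before launching, you can verify your setup with SkyPilot's built-in commands (the GPU query is optional and only informational):

```bash
# Confirm that at least one cloud or Kubernetes context is enabled
sky check

# Optionally, check where H200 GPUs are available on your enabled infra
sky show-gpus H200
```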

## Run Kimi K2 Thinking (Low Latency Mode)

For low-latency scenarios, use tensor parallelism:

```bash
sky launch kimi-k2-thinking.sky.yaml -c kimi-k2-thinking
```

`kimi-k2-thinking.sky.yaml` uses **tensor parallelism** across 8 GPUs for optimal low-latency performance.
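
The full YAML is not reproduced here; based on the high-throughput variant shown below, its `run` section can be expected to invoke vLLM with tensor parallelism only, roughly like this sketch (refer to `kimi-k2-thinking.sky.yaml` in this directory for the authoritative version):

```bash
# Sketch of the low-latency serve command: identical to the high-throughput
# mode except that it omits --decode-context-parallel-size
vllm serve $MODEL_NAME \
  --port 8081 \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code
```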

🎉 **Congratulations!** 🎉 You have now launched the Kimi K2 Thinking LLM with reasoning capabilities on your infra.

## Run Kimi K2 Thinking (High Throughput Mode)

For high-throughput scenarios, use Decode Context Parallel (DCP), which yields roughly **43% faster token generation** and **26% higher request throughput** (see the benchmark table below):

```bash
sky launch kimi-k2-thinking-high-throughput.sky.yaml -c kimi-k2-thinking-ht
```

`kimi-k2-thinking-high-throughput.sky.yaml` adds `--decode-context-parallel-size 8` to enable DCP:

```yaml
run: |
  echo 'Starting vLLM API server for Kimi-K2-Thinking (High Throughput Mode with DCP)...'

  vllm serve $MODEL_NAME \
    --port 8081 \
    --tensor-parallel-size 8 \
    --decode-context-parallel-size 8 \
    --enable-auto-tool-choice \
    --tool-call-parser kimi_k2 \
    --reasoning-parser kimi_k2 \
    --trust-remote-code
```
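
Either launch takes a while, since the cluster must be provisioned, the model weights downloaded, and vLLM started. You can follow progress by streaming the cluster logs (the cluster names are the ones passed to `-c` above):

```bash
# Stream provisioning and serving logs for the low-latency cluster
sky logs kimi-k2-thinking

# Or for the high-throughput cluster
sky logs kimi-k2-thinking-ht
```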

### DCP Performance Gains

From [vLLM's benchmark](https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2-Think.html):

| Metric | TP8 (Low Latency) | TP8+DCP8 (High Throughput) | Improvement |
|--------|-------------------|----------------------------|-------------|
| Request Throughput (req/s) | 1.25 | 1.57 | **+25.6%** |
| Output Token Throughput (tok/s) | 485.78 | 695.13 | **+43.1%** |
| Mean TTFT (sec; lower is better) | 271.2 | 227.8 | **+16.0%** |
| KV Cache Size (tokens) | 715,072 | 5,721,088 | **8x** |

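To sanity-check these numbers against your own deployment, vLLM ships a serving benchmark CLI. The command below is only a sketch: the `vllm bench serve` flag names and the chosen input/output lengths are assumptions that may differ across vLLM versions, so consult `vllm bench serve --help` for your installed version.

```bash
# Illustrative benchmark run against the high-throughput cluster's endpoint
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking-ht)

vllm bench serve \
  --backend vllm \
  --base-url http://$ENDPOINT \
  --model moonshotai/Kimi-K2-Thinking \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 100
```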

## Chat with Kimi K2 Thinking via the OpenAI API

To curl `/v1/chat/completions` (replace the cluster name with `kimi-k2-thinking-ht` if you launched the high-throughput mode):

```bash
ENDPOINT=$(sky status --endpoint 8081 kimi-k2-thinking)

curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Explain how to solve the traveling salesman problem for 10 cities."
      }
    ]
  }' | jq .
```

The model will include its reasoning process in the response, showing its chain-of-thought approach.
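
With the `kimi_k2` reasoning parser enabled, vLLM typically returns the reasoning text in a separate field of the assistant message rather than mixing it into the final answer. The field name below (`reasoning_content`) is what recent vLLM versions use, but treat it as an assumption and inspect your actual response if it differs:

```bash
# Save a response, then split the reasoning trace from the final answer
curl -s http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}]
  }' > response.json

# Chain-of-thought (field name assumed; depends on vLLM version)
jq -r '.choices[0].message.reasoning_content' response.json
# Final answer
jq -r '.choices[0].message.content' response.json
```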

## Clean up resources

To shut down all resources:

```bash
sky down kimi-k2-thinking

# If you also launched the high-throughput cluster:
sky down kimi-k2-thinking-ht
```

## Serving Kimi K2 Thinking: scaling up with SkyServe

With no change to the YAML, launch a fully managed service with autoscaling replicas and load balancing on your infra:

```bash
sky serve up kimi-k2-thinking.sky.yaml -n kimi-k2-thinking
```

Wait until the service is ready:

```bash
watch -n10 sky serve status kimi-k2-thinking
```
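
While replicas are provisioning, you can also stream the logs of an individual replica (replica IDs start at 1):

```bash
# Stream the logs of replica 1 of the service
sky serve logs kimi-k2-thinking 1
```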

Get a single endpoint that load-balances across replicas:

```bash
ENDPOINT=$(sky serve status --endpoint kimi-k2-thinking)
```
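
A quick way to confirm the endpoint is up is to list the served models via vLLM's standard OpenAI-compatible route:

```bash
# Should list moonshotai/Kimi-K2-Thinking
curl http://$ENDPOINT/v1/models | jq .
```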

> **Tip:** SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs.

To curl the endpoint:

```bash
curl http://$ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "moonshotai/Kimi-K2-Thinking",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant with deep reasoning capabilities."
      },
      {
        "role": "user",
        "content": "Design a distributed system for real-time analytics."
      }
    ]
  }' | jq .
```

To shut down all resources:

```bash
sky serve down kimi-k2-thinking
```

See more details in the [SkyServe docs](https://docs.skypilot.co/en/latest/serving/sky-serve.html).