Commit b38b4f9

docs: update balance and fix cache bug (vllm-project#1634)

* docs: update balance and fix cache bug

  Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai>

* fix(extproc): satisfy semantic cache scope lint

  Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai>

---------

Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai>
1 parent 45bfd49 commit b38b4f9

12 files changed: +511 -50 lines changed

deploy/amd/README.md

Lines changed: 49 additions & 8 deletions
@@ -19,14 +19,15 @@ This playbook documents the AMD reference profile for a single real ROCm vLLM ba
1919
- `routing.decisions` uses tier-prefixed dual-layer families
2020
- `global.model_catalog.modules` only tightens learned-signal thresholds for conservative overlays
2121

22-
The active AMD profile contains 22 routing decisions:
22+
The active AMD profile contains 23 routing decisions:
2323

2424
- `simple_*` (3): lowest-cost FAQ and general fallback
2525
- `medium_*` (5): low-to-mid-cost domain/scenario refinement
2626
- `verified_*` (5): evidence-sensitive overlays layered just above their base routes
2727
- `feedback_*` (2): explicit correction and clarification recovery lanes
2828
- `complex_*` (3): hard technical, STEM, and agentic synthesis
2929
- `reasoning_*` (3): high-reasoning escalation
30+
- `engaged_general` (1): emotion-aware and urgency-aware general fallback above the cheap default lane
3031
- `premium_*` (1): one premium legal path only
3132

3233
## Installation
@@ -106,6 +107,7 @@ vLLM Semantic Router (:8899)
106107
+-- signal evaluation
107108
| - keyword
108109
| - embedding
110+
| - structure
109111
| - fact_check
110112
| - user_feedback
111113
| - preference
@@ -118,6 +120,8 @@ vLLM Semantic Router (:8899)
118120
| - domain partition winner
119121
| - intent partition winner
120122
| - difficulty band
123+
| - emotion band
124+
| - urgency band
121125
| - verification band
122126
|
123127
+-- tiered decision selection
@@ -162,27 +166,28 @@ Pricing is intentionally exaggerated for Insights demos so savings are easy to s
162166

163167
| Priority | Decision | Alias | What it is for | Match sketch |
164168
|---------:|----------|-------|----------------|--------------|
165-
| 260 | `premium_legal` | `anthropic/claude-opus-4.6` | Highest-risk legal and compliance analysis | `domain:law` + `projection:verification_required` + premium legal embedding or hard legal-risk / hard routing band |
169+
| 260 | `premium_legal` | `anthropic/claude-opus-4.6` | Highest-risk legal and compliance analysis | law or explicit legal-risk cues + premium legal embedding, verification overlay, or medium/hard `legal_risk` |
166170
| 250 | `reasoning_math` | `openai/gpt5.4` | Proofs, derivations, and hard math | `domain:math` + `projection:balance_reasoning` |
167171
| 245 | `reasoning_philosophy` | `openai/gpt5.4` | Philosophy prompts that need deep argumentation | `domain:philosophy` + `projection:balance_reasoning` |
168172
| 243 | `complex_agentic` | `google/gemini-3.1-pro` | High-structure execution plans, migrations, and workflow orchestration | agentic embedding / preference / markers + `projection:balance_complex` or `projection:balance_reasoning`, excluding architecture markers |
169173
| 240 | `complex_architecture` | `google/gemini-3.1-pro` | Complex systems and architecture design | CS or engineering + architecture embedding / markers + `projection:balance_complex` or `projection:balance_reasoning` |
170174
| 235 | `complex_stem` | `google/gemini-3.1-pro` | Complex STEM synthesis outside dedicated math | STEM domain + STEM or research embedding, or high routing band |
171175
| 232 | `feedback_wrong_answer_verified` | `google/gemini-3.1-pro` | Explicit correction on evidence-sensitive follow-ups | `user_feedback:wrong_answer` + correction markers + short/medium context + verification pressure or evidence-synthesis escalation |
172-
| 220 | `medium_code_general` | `qwen/qwen3.5-rocm` | Low-medium cost coding, debugging, and technical Q&A | code domain / markers / embedding / coding preference + `projection:balance_medium` or `projection:balance_complex`, excluding agentic, architecture, and creative cues |
176+
| 220 | `medium_code_general` | `qwen/qwen3.5-rocm` | Low-medium cost coding, debugging, and technical Q&A | code domain / markers / embedding + `projection:balance_medium` or `projection:balance_complex`, or short urgent code prompts with `projection:balance_simple` + `projection:urgency_elevated` |
173177
| 216 | `verified_business` | `google/gemini-2.5-flash-lite` | Evidence-sensitive business or economics requests | business/economics + `projection:verification_required` or hard evidence synthesis + business embedding or medium/complex routing band |
174178
| 215 | `medium_business` | `qwen/qwen3.5-rocm` | Mid-tier business and economics analysis | business/economics + `embedding:business_analysis` + `projection:balance_medium` or `projection:balance_complex`, excluding verification overlay |
175179
| 214 | `verified_health` | `google/gemini-3.1-pro` | Evidence-sensitive health and medical guidance | `domain:health` + `projection:verification_required` + health embedding or medium/complex/reasoning band |
176180
| 211 | `verified_history` | `google/gemini-2.5-flash-lite` | Source-sensitive history explanation | `domain:history` + `projection:verification_required` or hard evidence synthesis + history embedding or medium/complex routing band |
177181
| 210 | `medium_history` | `qwen/qwen3.5-rocm` | Mid-tier history explanation and comparison | `domain:history` + `embedding:history_explainer` + `projection:balance_medium` or `projection:balance_complex`, excluding verification overlay |
178182
| 205 | `medium_psychology` | `qwen/qwen3.5-rocm` | Psychology and behavior queries with nuanced explanation | `domain:psychology` + `embedding:psychology_support` + `projection:balance_medium` or `projection:balance_complex` |
183+
| 202 | `engaged_general` | `google/gemini-2.5-flash-lite` | General or psychology-adjacent prompts with visible emotion or urgency | `projection:emotion_positive` or `projection:emotion_negative` or `projection:urgency_elevated` + general/psychology cues, excluding specialist and verification-heavy lanes |
179184
| 200 | `medium_creative` | `google/gemini-2.5-flash-lite` | Creative writing, copywriting, and ideation | creative markers / embedding / collaboration preference + `projection:balance_simple` or `projection:balance_medium` |
180185
| 190 | `reasoning_general` | `openai/gpt5.4` | Non-specialist deep analysis and multi-step reasoning | reasoning / research / multi-step cues + `projection:balance_complex` or `projection:balance_reasoning`, excluding specialist embeddings and broad technical markers |
181186
| 185 | `feedback_need_clarification` | `qwen/qwen3.5-rocm` | Cheap clarification follow-up lane | `user_feedback:need_clarification` + clarification markers + short/medium context |
182187
| 181 | `verified_fast_qa_zh` | `qwen/qwen3.5-rocm` | Chinese short FAQ with explicit verification ask | `embedding:fast_qa_zh` + `language:zh` + `context:short_context` + simple/medium routing band + verification cue or fact-check pressure |
183-
| 180 | `simple_fast_qa_zh` | `qwen/qwen3.5-rocm` | Cheapest Chinese factual / definitional answers | `embedding:fast_qa_zh` + `language:zh` + `context:short_context` + `projection:balance_simple`, excluding verification overlay |
188+
| 180 | `simple_fast_qa_zh` | `qwen/qwen3.5-rocm` | Cheapest Chinese factual / definitional answers | `embedding:fast_qa_zh` + `language:zh` + `context:short_context` + `projection:balance_simple`, excluding verification, code, and urgency overlays |
184189
| 176 | `verified_fast_qa_en` | `qwen/qwen3.5-rocm` | English short FAQ with explicit verification ask | `embedding:fast_qa_en` + `language:en` + `context:short_context` + simple/medium routing band + verification cue or fact-check pressure |
185-
| 175 | `simple_fast_qa_en` | `qwen/qwen3.5-rocm` | Cheapest English factual / definitional answers | `embedding:fast_qa_en` + `language:en` + `context:short_context` + `projection:balance_simple`, excluding verification overlay |
190+
| 175 | `simple_fast_qa_en` | `qwen/qwen3.5-rocm` | Cheapest English factual / definitional answers | `embedding:fast_qa_en` + `language:en` + `context:short_context` + `projection:balance_simple`, excluding verification, code, and urgency overlays |
186191
| 170 | `simple_general` | `qwen/qwen3.5-rocm` | Lowest-cost fallback for non-specialized traffic | short simple traffic, or medium-context `domain:other` traffic with simple/medium band, excluding fast-QA embeddings |
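
The priority column can be read as a tie-breaker between decisions whose predicates all match the same request. The sketch below illustrates that reading in Python. It assumes the highest-priority matching decision wins, and the `Decision` class, `select_decision` helper, and the heavily simplified predicates are hypothetical stand-ins, not the router's actual rule engine.

```python
from dataclasses import dataclass
from typing import AbstractSet, Callable, Optional, Sequence


@dataclass(frozen=True)
class Decision:
    """Hypothetical, simplified view of one row in the table above."""
    name: str
    priority: int
    alias: str
    matches: Callable[[AbstractSet[str]], bool]


def select_decision(
    decisions: Sequence[Decision], active: AbstractSet[str]
) -> Optional[Decision]:
    """Return the highest-priority decision whose predicate matches.

    Assumption: priority alone breaks ties between matching decisions.
    """
    matched = [d for d in decisions if d.matches(active)]
    return max(matched, key=lambda d: d.priority, default=None)


# Simplified predicates; the real match rules carry more conditions and exclusions.
OVERLAYS = {
    "projection:emotion_positive",
    "projection:emotion_negative",
    "projection:urgency_elevated",
}

DECISIONS = [
    Decision(
        "medium_psychology", 205, "qwen/qwen3.5-rocm",
        lambda s: {"domain:psychology", "embedding:psychology_support"} <= s
        and bool({"projection:balance_medium", "projection:balance_complex"} & s),
    ),
    Decision(
        "engaged_general", 202, "google/gemini-2.5-flash-lite",
        lambda s: bool(OVERLAYS & s),
    ),
    # Fallback lane: modeled here as always matching.
    Decision("simple_general", 170, "qwen/qwen3.5-rocm", lambda s: True),
]

if __name__ == "__main__":
    winner = select_decision(DECISIONS, {"projection:urgency_elevated"})
    print(winner.name, winner.alias)
```

Under this reading, `engaged_general` at 202 sits above the `simple_*` lanes, so an emotional or urgent general prompt is claimed before the cheap fallback can take it.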
187192

188193
This ordering is intentional:
@@ -200,8 +205,9 @@ The profile uses the standard vSR signal families directly under `routing.signal
200205

201206
| Signal family | Role in this profile | Representative names |
202207
|---------------|----------------------|----------------------|
203-
| `keywords` | explicit lexical confirmation for route style, verification asks, feedback cues, and task shape | `verification_markers`, `agentic_request_markers`, `architecture_markers`, `clarification_feedback_markers` |
208+
| `keywords` | explicit lexical confirmation for route style, verification asks, emotion or urgency cues, feedback cues, and task shape | `verification_markers`, `emotion_negative_markers`, `urgency_markers`, `clarification_feedback_markers` |
204209
| `embeddings` | learned intent and specialist boundaries | `fast_qa_en`, `architecture_design`, `business_analysis`, `premium_legal_analysis`, `reasoning_general_en` |
210+
| `structure` | cheap structural overlays for workflow formatting and punctuation emphasis | `ordered_workflow`, `numbered_steps`, `exclamation_emphasis` |
205211
| `fact_check` | evidence-sensitive detection that feeds verification pressure | `needs_fact_check` |
206212
| `user_feedbacks` | explicit correction or clarification overlays | `wrong_answer`, `need_clarification` |
207213
| `preferences` | collaboration style and request framing | `coding_partner`, `creative_collaboration`, `agentic_execution` |
@@ -214,8 +220,9 @@ Notable profile-specific signal details:
214220

215221
- `context` bands are non-overlapping: `short_context` is `0-999`, `medium_context` is `1K-7999`, and `long_context` is `8K-256K`.
216222
- `complexity` signals are reusable across both route predicates and projection scores through sublevels such as `code_task:hard` or `evidence_synthesis:medium`.
223+
- the emotion and urgency overlays stay heuristic on purpose: lexical markers and repeated `!` / `!` are used as secondary coordination signals instead of replacing the learned primary-intent lanes.
217224
- short lexical verification and correction cues are intentionally literal in this profile, so examples that say `verify this`, `answer with citations`, or Chinese `给出处` are more reliable than looser paraphrases.
218-
- `jailbreak` and `pii` signals are still defined in the profile for safety surfaces, but they are not the primary routing predicates for the 22 active decisions.
225+
- `jailbreak` and `pii` signals are still defined in the profile for safety surfaces, but they are not the primary routing predicates for the 23 active decisions.
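
The non-overlapping `context` bands listed above can be sketched as a simple range lookup. This is an illustration only: it assumes the bounds are token counts, that `K` means 1000, and that the `256K` upper bound is inclusive. The `context_band` function name is hypothetical.

```python
def context_band(token_count: int) -> str:
    """Map a token count to one of the three non-overlapping context bands.

    Assumptions: units are tokens, 1K = 1000, and 256K is an inclusive cap.
    """
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    if token_count <= 999:
        return "short_context"
    if token_count <= 7_999:
        return "medium_context"
    if token_count <= 256_000:
        return "long_context"
    # The README does not say what happens above 256K; rejecting is an assumption.
    raise ValueError("token_count exceeds the documented 256K band")


if __name__ == "__main__":
    for n in (0, 999, 1_000, 7_999, 8_000):
        print(n, context_band(n))
```

Because each bound belongs to exactly one band, a request can never carry two `context` signals at once.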
219226

220227
## Projection Overview
221228

@@ -227,6 +234,10 @@ The profile uses `routing.projections` as the coordination layer between raw sig
227234
| `balance_intent_partition` | partition | resolves one learned-intent winner across the maintained embedding lanes | `agentic_workflows`, `architecture_design`, `code_general`, `creative_tasks`, `fast_qa_en`, `fast_qa_zh`, `general_chat_fallback`, and related specialist embeddings |
228235
| `difficulty_score` | score | blends context, keywords, embeddings, and complexity sublevels into one difficulty signal | source for the difficulty band mapping |
229236
| `difficulty_band` | mapping | converts `difficulty_score` into reusable routing bands | `balance_simple`, `balance_medium`, `balance_complex`, `balance_reasoning` |
237+
| `emotion_valence` | score | blends positive and negative affect markers into one lightweight emotional-overlay score | source for the emotion band mapping |
238+
| `emotion_band` | mapping | converts `emotion_valence` into reusable emotional overlays | `emotion_positive`, `emotion_negative` |
239+
| `urgency_pressure` | score | blends urgency markers with exclamation-count emphasis into one urgency overlay | source for the urgency band mapping |
240+
| `urgency_band` | mapping | converts `urgency_pressure` into reusable urgency overlays | `urgency_standard`, `urgency_elevated` |
230241
| `verification_pressure` | score | blends `fact_check`, verification cues, high-stakes domains, long-context pressure, and wrong-answer correction pressure | source for the verification mapping |
231242
| `verification_band` | mapping | converts `verification_pressure` into verification routing outputs | `verification_standard`, `verification_required` |
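
The score-then-mapping pattern in the table can be illustrated with the urgency pair. The sketch below is a guess at the shape, not the profile's actual configuration: the marker list, the weights, the exclamation cap, and the `0.5` threshold are all hypothetical values chosen for the example.

```python
# Hypothetical marker list; the profile's real `urgency_markers` are not shown here.
URGENCY_MARKERS = ("urgent", "asap", "right now", "马上")


def urgency_pressure(text: str) -> float:
    """Blend marker hits with exclamation emphasis into a score in [0, 1].

    The weights (0.6 / 0.4) and the cap of three exclamation marks are assumptions.
    """
    lowered = text.lower()
    has_marker = any(marker in lowered for marker in URGENCY_MARKERS)
    # Count both ASCII and full-width exclamation marks.
    exclamations = text.count("!") + text.count("!")
    marker_part = 0.6 if has_marker else 0.0
    emphasis_part = 0.4 * min(exclamations, 3) / 3
    return marker_part + emphasis_part


def urgency_band(score: float, threshold: float = 0.5) -> str:
    """Convert the score into one of the two mapping outputs."""
    return "urgency_elevated" if score >= threshold else "urgency_standard"


if __name__ == "__main__":
    for prompt in (
        "这太离谱了!!!马上告诉我该怎么处理这个 bug。",
        "What is the capital of France?",
    ):
        print(urgency_band(urgency_pressure(prompt)), "<-", prompt)
```

Keeping the score and the mapping separate lets several decisions reuse the same band name, such as `projection:urgency_elevated`, without each one repeating the scoring logic.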
232243

@@ -242,7 +253,7 @@ That lets the profile reuse one difficulty story and one verification story acro
242253

243254
Test these in the dashboard playground at `http://<your-server-ip>:8700`:
244255

245-
The same stable examples are also maintained as machine-readable probes in [`balance.probes.yaml`](./balance.probes.yaml) for live `POST /api/v1/eval` calibration loops. The maintained suite currently covers all 22 decisions with 54 probe variants, so routing changes are checked against a small robustness set instead of one crafted prompt per route.
256+
The same stable examples are also maintained as machine-readable probes in [`balance.probes.yaml`](./balance.probes.yaml) for live `POST /api/v1/eval` calibration loops. The maintained suite currently covers all 23 decisions with 58 probe variants, so routing changes are checked against a small robustness set instead of one crafted prompt per route.
246257

247258
Each decision below includes every maintained probe variant from the manifest, so the README stays copy-pasteable for playground checks and aligned with the executable eval suite.
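
A quick way to keep the README counts and the manifest in step is to audit the parsed manifest before running an eval loop. The sketch below works on an already-parsed structure (for example, the result of a YAML loader, which is not shown) and uses only the field names visible in `balance.probes.yaml`. The `audit_manifest` helper and its checks are hypothetical, not part of the repository.

```python
from typing import Any, Dict, List, Tuple

REQUIRED_DECISION_FIELDS = ("id", "expected_decision", "expected_alias", "variants")
REQUIRED_VARIANT_FIELDS = ("id", "query")


def audit_manifest(manifest: Dict[str, Any]) -> Tuple[int, int, List[str]]:
    """Return (decision count, variant count, problems) for a parsed manifest."""
    problems: List[str] = []
    decisions = manifest.get("decisions", [])
    variant_total = 0
    seen_decisions = set()
    for decision in decisions:
        decision_id = decision.get("id", "<missing id>")
        if decision_id in seen_decisions:
            problems.append(f"duplicate decision id: {decision_id}")
        seen_decisions.add(decision_id)
        for field in REQUIRED_DECISION_FIELDS:
            if field not in decision:
                problems.append(f"{decision_id}: missing field {field}")
        seen_variants = set()
        for variant in decision.get("variants", []):
            variant_total += 1
            variant_id = variant.get("id", "<missing id>")
            if variant_id in seen_variants:
                problems.append(f"{decision_id}: duplicate variant id {variant_id}")
            seen_variants.add(variant_id)
            for field in REQUIRED_VARIANT_FIELDS:
                if field not in variant:
                    problems.append(f"{decision_id}/{variant_id}: missing field {field}")
    return len(decisions), variant_total, problems


if __name__ == "__main__":
    sample = {
        "decisions": [
            {
                "id": "engaged_general",
                "expected_decision": "engaged_general",
                "expected_alias": "google/gemini-2.5-flash-lite",
                "variants": [
                    {"id": "celebratory_reply_zh", "query": "...", "tags": ["baseline", "emotion"]},
                    {"id": "roommate_text", "query": "...", "tags": ["paraphrase", "emotion"]},
                    {"id": "dinner_reschedule", "query": "...", "tags": ["robustness", "emotion"]},
                ],
            }
        ]
    }
    print(audit_manifest(sample))
```

Run against the full manifest, the first two values should line up with the decision and probe-variant counts stated in this section.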
248259

@@ -408,6 +419,12 @@ A Java unit test is failing after a refactor; explain the most likely cause and
408419
After a refactor, an integration test started failing in a Java codebase. Explain the most likely cause and the first code change to inspect.
409420
```
410421

422+
#### `urgent_bug_zh`
423+
424+
```text
425+
这太离谱了!!!马上告诉我该怎么处理这个 bug。
426+
```
427+
411428
### `verified_business`
412429

413430
Expected alias: `google/gemini-2.5-flash-lite`
@@ -540,6 +557,30 @@ Why do people fall into confirmation bias, and what strategies usually help redu
540557
Why do people procrastinate on important work, and what interventions usually help?
541558
```
542559

560+
### `engaged_general`
561+
562+
Expected alias: `google/gemini-2.5-flash-lite`
563+
564+
Emotion-aware and urgency-aware general lane for prompts that should avoid brittle specialist or fast-QA misroutes.
565+
566+
#### `celebratory_reply_zh`
567+
568+
```text
569+
太好了!!!我终于拿到 offer 了,帮我写一段兴奋但得体的回复。
570+
```
571+
572+
#### `roommate_text`
573+
574+
```text
575+
I am overwhelmed right now!! Help me write a calm text to my roommate and keep it supportive.
576+
```
577+
578+
#### `dinner_reschedule`
579+
580+
```text
581+
This is ridiculous!! Help me write a calm message to reschedule tonight's dinner.
582+
```
583+
543584
### `medium_creative`
544585

545586
Expected alias: `google/gemini-2.5-flash-lite`

deploy/amd/balance.probes.yaml

Lines changed: 18 additions & 1 deletion
@@ -97,7 +97,7 @@ decisions:
9797
- id: medium_code_general
9898
expected_decision: medium_code_general
9999
expected_alias: qwen/qwen3.5-rocm
100-
objective: Mid-tier coding help without architecture-heavy or agentic workflow cues.
100+
objective: Mid-tier coding help without architecture-heavy or agentic workflow cues, including short urgent bug triage.
101101
variants:
102102
- id: python_stack_trace
103103
query: Debug this Python stack trace and suggest the most likely fix.
@@ -108,6 +108,9 @@ decisions:
108108
- id: integration_test_refactor
109109
query: After a refactor, an integration test started failing in a Java codebase. Explain the most likely cause and the first code change to inspect.
110110
tags: [robustness, coding]
111+
- id: urgent_bug_zh
112+
query: 这太离谱了!!!马上告诉我该怎么处理这个 bug。
113+
tags: [robustness, coding, urgent]
111114
- id: verified_business
112115
expected_decision: verified_business
113116
expected_alias: google/gemini-2.5-flash-lite
@@ -186,6 +189,20 @@ decisions:
186189
- id: procrastination_important_work
187190
query: Why do people procrastinate on important work, and what interventions usually help?
188191
tags: [robustness, psychology]
192+
- id: engaged_general
193+
expected_decision: engaged_general
194+
expected_alias: google/gemini-2.5-flash-lite
195+
objective: General prompts with explicit emotion or urgency that should avoid brittle specialist misroutes.
196+
variants:
197+
- id: celebratory_reply_zh
198+
query: 太好了!!!我终于拿到 offer 了,帮我写一段兴奋但得体的回复。
199+
tags: [baseline, emotion]
200+
- id: roommate_text
201+
query: I am overwhelmed right now!! Help me write a calm text to my roommate and keep it supportive.
202+
tags: [paraphrase, emotion]
203+
- id: dinner_reschedule
204+
query: This is ridiculous!! Help me write a calm message to reschedule tonight's dinner.
205+
tags: [robustness, emotion]
189206
- id: medium_creative
190207
expected_decision: medium_creative
191208
expected_alias: google/gemini-2.5-flash-lite
