Running [vLLM Semantic Router](https://vllm-semantic-router.com) on AMD Developer Cloud is not just about bringing up one more inference endpoint. It is about turning it into a routed multi-tier system that can classify requests, choose a semantic lane, and make replay and Insights immediately useful.
This post walks through the practical path: start the ROCm backend on an AMD Developer Cloud instance, install vLLM-SR, import the reference profile, and validate the deployment end to end.
The most immediate opportunity is intelligent routing. A single ROCm backend on AMD Developer Cloud can serve as the physical execution layer for multiple logical lanes. That means teams can prototype a Mixture-of-Models experience, cost-aware routing, replay-driven debugging, and tiered product behavior without first standing up a large multi-backend fleet.
In the AMD reference profile, the cheapest, medium, complex, reasoning, and premium lanes all resolve onto different models. The router still gives you differentiated behavior because the policy lives in signals, projections, and decisions, not only in the number of containers you run.
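To make the lane-to-model relationship concrete, here is a minimal sketch of how logical lanes can resolve onto distinct backends. The lane names come from the profile described above; the model names and the selection helper are illustrative assumptions, not the actual profile contents.

```python
# Hypothetical lane-to-model mapping. The lane names match the reference
# profile; the model identifiers below are illustrative placeholders.
LANE_TO_MODEL = {
    "cheapest":  "qwen2.5-7b-instruct",
    "medium":    "qwen2.5-14b-instruct",
    "complex":   "qwen2.5-32b-instruct",
    "reasoning": "qwq-32b",
    "premium":   "deepseek-r1",
}

def resolve_lane(lane: str) -> str:
    """Return the backend model for a logical lane, defaulting to the cheapest."""
    return LANE_TO_MODEL.get(lane, LANE_TO_MODEL["cheapest"])
```

The point of the indirection is that the mapping is policy, not topology: swapping a lane to a different backend is a one-line change, and several lanes can even share a backend while keeping distinct routing behavior.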
### 2. Privacy Routing and Local-First Governance
The second opportunity is privacy routing, which keeps PII, private code, internal documents, and suspicious prompts on a local lane while escalating only clearly non-sensitive reasoning work when policy allows it. That pattern is especially meaningful on AMD because it supports a local-first deployment story: keep sensitive traffic on infrastructure you control, audit every decision, and make cloud escalation a governed exception instead of the default.
For enterprises, that means AMD-backed deployments can become the trusted default lane for internal copilots, regulated workloads, or hybrid private AI systems. For developers, it means privacy is not just a hosting choice; it becomes a routing policy.
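The "cloud escalation as a governed exception" policy can be sketched in a few lines. Everything here is an illustrative assumption (the PII patterns, lane names, and function shape are not the router's actual recipe); the point is only that sensitive traffic defaults to local and escalation requires an explicit policy allowance.

```python
import re

# Minimal privacy-routing sketch: anything that looks sensitive stays on the
# local lane; only clearly non-sensitive work may escalate, and only when
# policy allows it. Patterns and lane names are illustrative assumptions.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),    # US SSN-like number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email address
]

def route(prompt: str, allow_cloud: bool) -> str:
    if any(p.search(prompt) for p in PII_PATTERNS):
        return "local"  # sensitive content never escalates
    return "cloud-reasoning" if allow_cloud else "local"
```

Note the asymmetry: a PII match forces the local lane regardless of policy, while the cloud lane is reachable only through the explicit `allow_cloud` gate.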
### 3. Personal AI and Local Personal Agents
The third opportunity is personal AI: for example, deploying a personal model on AMD AI MAX+ and connecting to external models as needed. Once routing, privacy, and reasoning are expressed as policy, an AMD-hosted stack can support assistants that feel more personal and more controlled. A personal AI system can keep ordinary tasks, memory-aware follow-ups, and private context on a local lane, while only escalating special cases when explicitly permitted.
That makes AMD interesting not only for enterprise infrastructure, but also for self-hosted assistants, home-lab AI, and local-first personal workflows. The important point is that Semantic Router lets the system distinguish between “keep this local,” “this is cheap and routine,” and “this needs deeper reasoning,” instead of treating all personal AI traffic as one undifferentiated workload.
|**Projections**|`partitions`, `scores`, `mappings`| Coordinate competing matches and emit named routing bands |
|**Decisions**| AND/OR policy rules over signals and projections | Select the active route and model candidates |
**How it works**: Signals are extracted from requests, projections coordinate matched evidence, decision rules evaluate the resulting facts, and the chosen route drives plugins plus model dispatch.
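The AND/OR decision rules mentioned above can be sketched as a tiny recursive evaluator over extracted signal facts. The signal names and rule shape here are illustrative assumptions, not the router's actual schema.

```python
# Sketch of AND/OR decision rules evaluated over extracted signal facts.
# Leaf rules test whether a named signal was extracted; composite rules
# combine children with AND/OR. Names are illustrative, not the real schema.
def evaluate(rule: dict, facts: set[str]) -> bool:
    op = rule.get("op")
    if op == "AND":
        return all(evaluate(r, facts) for r in rule["rules"])
    if op == "OR":
        return any(evaluate(r, facts) for r in rule["rules"])
    return rule["signal"] in facts  # leaf: does this signal hold?

# "code domain AND (high complexity OR debug keyword)"
rule = {"op": "AND", "rules": [
    {"signal": "domain:code"},
    {"op": "OR", "rules": [{"signal": "complexity:high"},
                           {"signal": "keyword:debug"}]},
]}
```

A request whose extracted facts are `{"domain:code", "keyword:debug"}` satisfies this rule; one with only `{"domain:code"}` does not, so it would fall through to the next decision.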
### Plugin Chain Architecture
Extensible plugin system for request/response processing:
| Plugin Type | Description | Use Case |
|------------|-------------|----------|
|**semantic-cache**| Semantic similarity-based caching | Reduce latency and costs for similar queries |
|**pii**| Personally identifiable information detection | Protect sensitive data and ensure compliance|
|**system_prompt**| Dynamic system prompt injection | Add context-aware instructions per route|
|**header_mutation**| HTTP header manipulation | Control routing and backend behavior|
|**hallucination**| Token-level hallucination detection| Real-time fact verification during generation |
**How it works**: Plugins form a processing chain; each plugin can inspect and modify requests and responses, and each can be enabled or disabled per decision.
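That chain structure can be sketched in a few lines. The two plugins below mirror entries from the table above, but the hook shape (a function taking and returning a request dict) is an illustrative assumption, not the router's actual plugin API.

```python
# Sketch of a plugin chain: each enabled plugin may inspect or rewrite the
# request before dispatch. Plugin names mirror the table above; the function
# signature is an illustrative assumption, not the real API.
def system_prompt_plugin(req: dict) -> dict:
    req["system"] = "You are a routed assistant."  # context-aware injection
    return req

def header_mutation_plugin(req: dict) -> dict:
    req.setdefault("headers", {})["x-route"] = req.get("route", "default")
    return req

CHAIN = [("system_prompt", system_prompt_plugin),
         ("header_mutation", header_mutation_plugin)]

def process(req: dict, enabled: set[str]) -> dict:
    for name, plugin in CHAIN:
        if name in enabled:  # per-decision enable/disable
            req = plugin(req)
    return req
```

Because each decision carries its own enabled set, the same chain definition can behave differently per route: a privacy lane might enable `pii` and skip caching, while a cheap lane enables `semantic-cache` first.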
**This is collective intelligence**: No single component made the decision. The intelligence emerged from the collaboration of signals, projections, rules, models, and plugins.