Status: Open
Labels: feature request
Feature Description
The routing logic should return a ranked subset of backend vLLM endpoints as candidates, similar to a top-K selection, rather than a single endpoint. When the proxy forwards a request and the chosen worker fails, it should retry against the remaining candidates until it reaches a predefined attempt limit (suggested limit: 3 attempts).
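To make the proposal concrete, here is a minimal sketch of the two pieces: a router that returns up to K candidates and a proxy loop that tries them in order. The names `Endpoint`, `load`, `route_top_k`, and `send_with_retry` are hypothetical and not part of the existing router; the ranking by a single `load` score is only an assumption standing in for whatever routing policy is in use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Endpoint:
    url: str
    load: float  # hypothetical ranking score; lower is better


def route_top_k(endpoints: Sequence[Endpoint], k: int = 3) -> list[Endpoint]:
    """Return up to k candidates, best first, instead of a single endpoint."""
    return sorted(endpoints, key=lambda e: e.load)[:k]


def send_with_retry(
    candidates: Sequence[Endpoint],
    send: Callable[[Endpoint], T],
    max_attempts: int = 3,
) -> T:
    """Try candidates in order; stop at the first success or after max_attempts."""
    last_error = None
    for endpoint in candidates[:max_attempts]:
        try:
            return send(endpoint)
        except Exception as exc:  # a real proxy would narrow this to retryable errors
            last_error = exc
    raise RuntimeError("all candidate endpoints failed") from last_error
```

Here `send` stands for whatever function the proxy already uses to forward a request to one worker. Each candidate is tried at most once, so a worker that has just failed is not hit again for the same request.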
Why do you need this feature?
When a vLLM worker returns a common HTTP error or hits an unrecoverable runtime error, the router can promptly redirect the request to an alternative endpoint. This lets the system adapt quickly to failures and maintain service availability.
Additional context
- Exponential backoff could violate latency SLOs, such as a 2s TTFT (time to first token). Therefore, we should opt for an immediate retry strategy with a limited number of attempts.
- Workers can sometimes experience unrecoverable runtime errors. For example, I've encountered situations where the KV transfer buffer setting was too small, causing even a single prefill forward transfer to be blocked by the buffer availability condition. However, other endpoints may still be capable of handling the request.
- Implementing re-routing logic may blur the boundary between routing and proxy handling.
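The points above could be combined roughly as follows: retries happen immediately (no backoff), are capped by both an attempt count and a time budget, and only failures marked as retryable move on to the next worker. This is a sketch under assumptions, not the project's implementation: `RetryableError`, `quick_retry`, and the 2-second default budget are hypothetical, and `candidates` is assumed to come from the router so the proxy never makes routing decisions itself.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying on another worker."""


def quick_retry(
    candidates: Iterable[str],
    send: Callable[[str, float], T],
    max_attempts: int = 3,
    budget_s: float = 2.0,  # assumed to mirror the TTFT SLO
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Try each candidate once, with no sleep between attempts."""
    deadline = clock() + budget_s
    last_error = None
    for attempt, endpoint in enumerate(candidates):
        if attempt >= max_attempts or clock() >= deadline:
            break
        try:
            # Pass the remaining budget so the per-attempt timeout shrinks.
            return send(endpoint, deadline - clock())
        except RetryableError as exc:
            last_error = exc  # fall through to the next candidate immediately
    raise TimeoutError("retry budget exhausted") from last_error
```

Errors that are not `RetryableError` propagate to the caller unchanged, which keeps the decision about what counts as retryable in one place. Because the function only consumes an ordered list of candidates, the router and the proxy stay separate components.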