
feature: A proposal for proxy request retry. #382

@ggaaooppeenngg


Feature Description

The routing logic should return a subset of candidate vLLM backends rather than a single one, similar to a top-K selection. When the proxy forwards a request, it retries across those candidates until the request succeeds or a predefined limit is reached (suggested limit: 3 attempts).
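
To make the idea concrete, here is a minimal sketch of top-K candidate selection followed by an immediate retry loop. It is only an illustration under assumptions: `select_candidates`, `send_with_retry`, `BackendError`, and the per-endpoint `scores` mapping are hypothetical names, not part of the existing router or proxy code.

```python
from __future__ import annotations

from typing import Callable, Sequence

MAX_ATTEMPTS = 3  # suggested limit from this proposal


class BackendError(Exception):
    """Hypothetical error raised when a backend fails to serve a request."""


def select_candidates(scores: dict[str, float], k: int = MAX_ATTEMPTS) -> list[str]:
    """Return the k best-scoring endpoints, best first (a top-K selection)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def send_with_retry(
    candidates: Sequence[str],
    send: Callable[[str], str],
    max_attempts: int = MAX_ATTEMPTS,
) -> str:
    """Try each candidate in order, with no backoff, until one succeeds."""
    last_error: BackendError | None = None
    for endpoint in candidates[:max_attempts]:
        try:
            return send(endpoint)
        except BackendError as exc:
            # Move on to the next candidate immediately.
            last_error = exc
    raise BackendError(f"all attempts failed: {last_error}")
```

In this sketch the router only ranks endpoints, while the proxy owns the retry loop, which is one possible way to split the two responsibilities.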

Why do you need this feature?

When a vLLM backend returns a common HTTP error or hits an unrecoverable runtime error, the router can promptly redirect the request to an alternative endpoint. This lets the system adapt quickly to failures and maintain service availability.

Additional context

  1. An exponential backoff mechanism might violate the SLO, such as a 2s TTFT (time to first token). Therefore, we should opt for a quick retry strategy with a limited number of attempts.
  2. Workers can sometimes experience unrecoverable runtime errors. For example, I've encountered situations where the KV transfer buffer setting was too small, causing even a single prefill forward transfer to be blocked by the buffer availability condition. However, other endpoints may still be capable of handling the request.
  3. Implementing re-routing logic may blur the boundary between routing and proxy handling.
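
One way to keep quick retries inside a latency SLO is to give the whole request a fixed time budget and stop retrying once it is spent. The sketch below assumes a 2-second budget to mirror the TTFT example above. `retry_within_budget` and the `send(endpoint, timeout)` callable are hypothetical and do not reflect existing proxy code.

```python
import time
from typing import Callable, Optional, Sequence


def retry_within_budget(
    candidates: Sequence[str],
    send: Callable[[str, float], str],
    budget_s: float = 2.0,
    max_attempts: int = 3,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Retry across candidates without sleeping, bounded by a total time budget.

    Returns the first successful response, or None if the attempts or the
    budget run out.
    """
    deadline = clock() + budget_s
    for endpoint in candidates[:max_attempts]:
        remaining = deadline - clock()
        if remaining <= 0:
            break  # the budget is spent, so a further attempt would miss the SLO
        try:
            # Pass the remaining budget on as the per-attempt timeout.
            return send(endpoint, remaining)
        except Exception:
            # Assumption: any failure is retryable on a different endpoint.
            continue
    return None
```

Because there is no sleep between attempts, the only delay added by a retry is the time the failed attempt itself consumed.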
