[WIP] Add Latency predictor to EPP #1161
[APPROVALNOTIFIER] This PR is NOT APPROVED. It has been approved by: kaushikmitr, but still needs approval from an approver for each of the affected files. The PR also needs a rebase, and the patch is awaiting verification from a kubernetes-sigs member before automated testing runs.
```go
type ProfileRunResult struct {
	TargetPods []Pod
	// RawScores is a map of raw scores for each pod, keyed by scorer type.
	RawScores map[string]map[Pod]float64
}

// SchedulingResult captures the result of the scheduling cycle.
type SchedulingResult struct {
	ProfileResults       map[string]*ProfileRunResult
	AllProfileRunResults map[string]*ProfileRunResult
	PrimaryProfileName   string
}
```
We should probably discuss these changes (cc @kfswain).
The new scheduler extension points and the mechanism around them were designed very carefully, and these changes break some of the concepts of the new scheduler design. For example, one guiding principle was that scores are internal to the scheduler (not returned to the caller).
I understand that this task may introduce a new requirement we haven't considered, but we should be careful not to break the guiding principles.
++. While this is a WIP branch, I would expect the final state either to conform to how other plugins operate or to make a case for why the interface should be expanded.
Yes, this is a WIP PR and we are still at the POC stage; I created it for early alignment. Our next step is to demonstrate that the latency predictor can achieve SLO-aware routing before we finalize the design for how it fits inside EPP. Here is a doc I shared with the auto-scaling llm-d team: https://docs.google.com/document/d/1q56wr3N5XGx0B21MzHu5oBsCiGi9VrbZAvyhP2VFG_c/edit?tab=t.0
Converted the PR to a draft in addition to the WIP tag to dissuade review. This is still in progress and shouldn't be reviewed unless the authors explicitly ask.
Summary
This PR wires a complete latency‑prediction workflow into the EPP, which we will need for SLO‑aware scheduling decisions and continuous online retraining of TTFT/TPOT prediction models. It introduces two new Python sidecars (prediction + training), a Go async client, and hooks at every stage of the request lifecycle, from startup flags through scheduling, streaming, and response, for both forecasting and recording real‑world latencies. It also updates the EPP deployment manifest to include these sidecars.
What’s Added
Latency‑Predictor Sidecars (`latencypredictor-v1/`)
- `prediction_server.py`
  - FastAPI service exposing `/predict`, `/status`, `/reload`
  - Loads/scales XGBoost or Bayesian Ridge models; returns point estimates plus uncertainty bounds
  - Background `ModelSyncer` for pulling new artifacts from the training service
- `training_server.py`
  - FastAPI service exposing `/sample`, `/model/{name}/download`, `/model/{name}/info`, `/status`
  - Ingests observed TTFT/TPOT samples into bounded deques
  - Periodically retrains models (Bayesian Ridge or XGBoost), evaluates on held‑out data, and publishes updated artifacts
Go Async Client (`pkg/epp/latencypredictorasync/latencypredictor_async.go`)
- Buffers and bulk‑flushes training samples to the training sidecar
- Load‑balances and issues `/predict` requests for scheduling and streaming hooks
- Periodically refreshes model metadata and raw metrics
- Exposes `Predict(ctx, req)` and `AddTrainingDataBulk(entries)` interfaces

EPP Deployment Manifest (`config/manifests/inferencepool-resources-lp.yaml`)
- Defines the EPP Deployment with both the prediction and training sidecar containers alongside the main gateway container
- Configures container images, health checks, volumes, and environment variables for the latency‑predictor services
- Lives in `config/manifests/inferencepool-resources-lp.yaml` for operator reference

What’s Changed
EPP Bootstrap & Server Runner
- `cmd/epp/runner/runner.go`
  - New `--enable-latency-predictor` flag / ENV var
  - Initializes the predictor client and registers its background flush/refresh loop
  - Injects the predictor into `requestcontrol.NewDirectorWithConfig` and `ExtProcServerRunner`
- `pkg/epp/server/runserver.go`
  - Extends the `ExtProcServerRunner` struct to carry the `LatencyPredictor` interface

Director & Request Flow
- `director.go` & `latencypredictor_helper.go`
  - At scheduling time: calls the predictor with pod metrics + request features; stores `PredictedTTFTForScheduling` and `PredictedTPOTForScheduling` on `ReqCtx`
  - At response time: in `HandleResponseHeaders`/`HandleResponseBodyChunk`, records actual TTFT/TPOT samples, sends them to training, and issues further mid‑stream predictions
- `pkg/epp/handlers/server.go`
  - Expanded `RequestContext` with fields for actual vs. predicted TTFT/TPOT, sampling state, and timestamps
  - Emits Prometheus metrics for both observed and forecasted latencies on request completion
- `pkg/epp/handlers/response.go`
  - New metrics: `ttft_ms`, `predicted_ttft_ms`, `tpot_observations_ms`, `predicted_tpot_observations_ms`, `avg_tpot_ms`, `avg_predicted_tpot_ms`
Scheduler Enhancements
- `scheduler.go`
  - `SchedulingResult` now returns all `ProfileResult` entries (not just the single best)
  - This surfaces the “prefix cache score” on each profile so predictor logic and future heuristics can use raw prefix cache scores and other scheduler traces
Observability & Metrics
- Prometheus histograms for prediction‑service call latency and error rates
- Counters/gauges for sample ingestion, model‑retrain cycles, and model freshness
- End‑to‑end metrics comparing actual vs. predicted TTFT/TPOT