[WIP] Add Latency predictor to EPP #1161
[APPROVALNOTIFIER] This PR is NOT APPROVED. It has been approved by: kaushikmitr, but still needs approval from an approver for each of the affected files. The PR also needs a rebase, and the patch is awaiting verification from a kubernetes-sigs member before automated testing runs.
```go
type ProfileRunResult struct {
	TargetPods []Pod
	// RawScores is a map of raw scores for each pod, keyed by scorer type.
	RawScores map[string]map[Pod]float64
}

// SchedulingResult captures the result of the scheduling cycle.
type SchedulingResult struct {
	ProfileResults       map[string]*ProfileRunResult
	AllProfileRunResults map[string]*ProfileRunResult
	PrimaryProfileName   string
}
```
We should probably discuss these changes (cc @kfswain).
The new scheduler extension points and the mechanism around them were designed very carefully, and these changes break some of the concepts of the new scheduler design. For example, one guiding principle was that scores are internal to the scheduler (not returned to the caller).
I understand that this task may introduce a new requirement we haven't considered, but we should be careful not to break the guiding principles.
++. While this is a WIP branch, I would expect the final state either to conform to how other plugins operate or to make a case for why the interface should be expanded.
Yes, this is a WIP PR and we are still at the POC stage; I created it for early alignment. Our next step is to demonstrate that the latency predictor can achieve SLO-aware routing before we finalize the design for how it fits inside EPP. Here is a doc I shared with the auto-scaling llm-d team: https://docs.google.com/document/d/1q56wr3N5XGx0B21MzHu5oBsCiGi9VrbZAvyhP2VFG_c/edit?tab=t.0
Converted the PR to a draft in addition to the WIP tag to dissuade review. This is still in progress and shouldn't be reviewed unless the authors explicitly ask.
Summary
This PR wires a complete latency‑prediction workflow into the EPP, which we will need for SLO‑aware scheduling decisions and continuous online retraining of TTFT/TPOT prediction models. It introduces two new Python sidecars (prediction + training), a Go async client, and hooks at every stage of the request lifecycle, from startup flags through scheduling, streaming, and response, for both forecasting and recording real‑world latencies. It also updates the EPP deployment manifest to include these sidecars.
What’s Added
Latency‑Predictor Sidecars (`latencypredictor-v1/`)
- `prediction_server.py`
  - FastAPI service exposing `/predict`, `/status`, `/reload`
  - Loads/scales XGBoost or Bayesian Ridge models; returns point estimates plus uncertainty bounds
  - Background `ModelSyncer` for pulling new artifacts from the training service
- `training_server.py`
  - FastAPI service exposing `/sample`, `/model/{name}/download`, `/model/{name}/info`, `/status`
  - Ingests observed TTFT/TPOT samples into bounded deques
  - Periodically retrains models (Bayesian Ridge or XGBoost), evaluates on held‑out data, and publishes updated artifacts
Go Async Client (`pkg/epp/latencypredictorasync/latencypredictor_async.go`)
- Buffers and bulk‑flushes training samples to the training sidecar
- Load‑balances and issues `/predict` requests for scheduling and streaming hooks
- Periodically refreshes model metadata and raw metrics
- Exposes `Predict(ctx, req)` and `AddTrainingDataBulk(entries)` interfaces

EPP Deployment Manifest (`config/manifests/inferencepool-resources-lp.yaml`)
- Defines the EPP Deployment with both the prediction and training sidecar containers alongside the main gateway container
- Configures container images, health checks, volumes, and environment variables for the latency‑predictor services
- Lives in `config/manifests/inferencepool-resources-lp.yaml` for operator reference

What’s Changed
EPP Bootstrap & Server Runner
- `cmd/epp/runner/runner.go`
  - New `--enable-latency-predictor` flag / ENV var
  - Initializes the predictor client and registers its background flush/refresh loop
  - Injects the predictor into `requestcontrol.NewDirectorWithConfig` and `ExtProcServerRunner`
- `pkg/epp/server/runserver.go`
  - Extends the `ExtProcServerRunner` struct to carry the `LatencyPredictor` interface

Director & Request Flow
- `director.go` & `latencypredictor_helper.go`
  - At scheduling time: calls the predictor with pod metrics + request features; stores `PredictedTTFTForScheduling` and `PredictedTPOTForScheduling` on `ReqCtx`
  - At response time: in `HandleResponseHeaders`/`HandleResponseBodyChunk`, records actual TTFT/TPOT samples, sends them to training, and issues further mid‑stream predictions
- `pkg/epp/handlers/server.go`
  - Expanded `RequestContext` with fields for actual vs. predicted TTFT/TPOT, sampling state, and timestamps
  - Emits Prometheus metrics for both observed and forecasted latencies on request completion
- `pkg/epp/handlers/response.go`
  - New metrics: `ttft_ms`, `predicted_ttft_ms`, `tpot_observations_ms`, `predicted_tpot_observations_ms`, `avg_tpot_ms`, `avg_predicted_tpot_ms`
Scheduler Enhancements
- `scheduler.go`
  - `SchedulingResult` now returns all `ProfileResult` entries (not just the single best)
  - This surfaces the “prefix cache score” on each profile so predictor logic and future heuristics can use raw prefix cache scores and other scheduler traces
Observability & Metrics
- Prometheus histograms for prediction‑service call latency and error rates
- Counters/gauges for sample ingestion, model‑retrain cycles, and model freshness
- End‑to‑end metrics comparing actual vs. predicted TTFT/TPOT