[WIP] Add Latency predictor to EPP #1161


Draft · wants to merge 8 commits into base: main

Conversation

@kaushikmitr (Contributor) commented on Jul 15, 2025

Summary

This PR wires a complete latency-prediction workflow into the EPP, which we will need for SLO-aware scheduling decisions and continuous online retraining of TTFT/TPOT prediction models. It introduces two new Python sidecars (prediction + training), a Go async client, and hooks at every stage of the request lifecycle, from startup flags through scheduling, streaming, and response, for both forecasting and recording real-world latencies. It also updates the EPP deployment manifest to include these sidecars.


What’s Added

  1. Latency‑Predictor Sidecars (latencypredictor-v1/)

    • prediction_server.py

      • FastAPI service exposing /predict, /status, /reload

      • Loads XGBoost or Bayesian Ridge models plus their feature scalers, returns point estimates + uncertainty bounds

      • Background ModelSyncer for pulling new artifacts from the training service

    • training_server.py

      • FastAPI service exposing /sample, /model/{name}/download, /model/{name}/info, /status

      • Ingests observed TTFT/TPOT samples into bounded deques

      • Periodically retrains models (Bayesian Ridge or XGBoost), evaluates on held‑out data, publishes updated artifacts

  2. Go Async Client (pkg/epp/latencypredictorasync/latencypredictor_async.go)

    • Buffers and bulk‑flushes training samples to the training sidecar

    • Load‑balances and issues /predict requests for scheduling and streaming hooks

    • Periodically refreshes model metadata and raw metrics

    • Exposes Predict(ctx, req) and AddTrainingDataBulk(entries) interfaces (a hedged usage sketch follows this list)

  3. EPP Deployment Manifest (config/manifests/inferencepool-resources-lp.yaml)

    • Defines the EPP Deployment with both the prediction and training sidecar containers alongside the main gateway container

    • Configures container images, health‑checks, volumes, and environment variables for the latency‑predictor services

    • Lives in config/manifests/inferencepool-resources-lp.yaml for operator reference
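
For orientation, here is a minimal, hypothetical sketch of how the async client's two entry points named above (Predict and AddTrainingDataBulk) might be used. The type and field names below are illustrative assumptions, not the actual API in pkg/epp/latencypredictorasync.

```go
package sketch

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical request/response shapes, loosely based on the PR description.
// The real types in pkg/epp/latencypredictorasync may differ.
type PredictionRequest struct {
	KVCacheUtilization float64 // example pod metric
	InputTokenCount    int     // example request feature
	PrefixCacheScore   float64 // raw scheduler score surfaced by this PR
}

type PredictionResponse struct {
	TTFTMs float64 // predicted time-to-first-token (ms)
	TPOTMs float64 // predicted time-per-output-token (ms)
}

type TrainingEntry struct {
	Features   PredictionRequest
	ActualTTFT float64
	ActualTPOT float64
	Timestamp  time.Time
}

// LatencyPredictor mirrors the two calls the PR says the client exposes.
type LatencyPredictor interface {
	Predict(ctx context.Context, req PredictionRequest) (*PredictionResponse, error)
	AddTrainingDataBulk(entries []TrainingEntry) error
}

// scheduleWithPrediction shows the intended call pattern: predict at
// scheduling time, then report observed latencies back for retraining.
func scheduleWithPrediction(ctx context.Context, p LatencyPredictor) error {
	req := PredictionRequest{KVCacheUtilization: 0.42, InputTokenCount: 512, PrefixCacheScore: 0.8}

	pred, err := p.Predict(ctx, req)
	if err != nil {
		return fmt.Errorf("predict: %w", err)
	}
	fmt.Printf("predicted TTFT=%.1fms TPOT=%.1fms\n", pred.TTFTMs, pred.TPOTMs)

	// Later, once the response has streamed, send the observation back;
	// the client buffers and bulk-flushes these to the training sidecar.
	return p.AddTrainingDataBulk([]TrainingEntry{{
		Features:   req,
		ActualTTFT: 95.0,
		ActualTPOT: 11.5,
		Timestamp:  time.Now(),
	}})
}
```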


What’s Changed

  1. EPP Bootstrap & Server Runner

    • cmd/epp/runner/runner.go

      • New --enable-latency-predictor flag / ENV var

      • Initializes predictor client & registers its background flush/refresh loop

      • Injects predictor into requestcontrol.NewDirectorWithConfig and ExtProcServerRunner

    • pkg/epp/server/runserver.go

      • Extended ExtProcServerRunner struct to carry the LatencyPredictor interface
  2. Director & Request Flow

    • director.go & latencypredictor_helper.go

      • At scheduling time: call predictor with pod metrics + request features, store PredictedTTFTForScheduling and PredictedTPOTForScheduling on ReqCtx

      • At response time: in HandleResponseHeaders/HandleResponseBodyChunk, record actual TTFT/TPOT samples, send them to training, and issue further mid-stream predictions (a hedged sketch of this pattern follows this list)

    • pkg/epp/handlers/server.go

      • Expanded RequestContext with fields for actual vs. predicted TTFT/TPOT, sampling state, and timestamps

      • Emits Prometheus metrics for both observed and forecasted latencies on request completion

    • pkg/epp/handlers/response.go

      • Augments streaming SSE payloads with six new fields:
        ttft_ms, predicted_ttft_ms, tpot_observations_ms, predicted_tpot_observations_ms, avg_tpot_ms, avg_predicted_tpot_ms
  3. Scheduler Enhancements

    • scheduler.go

      • SchedulingResult now returns all ProfileResult entries (not just the single best)

      • This is to surface the “prefix cache score” on each profile so predictor logic and future heuristics can use raw prefix cache scores and other scheduler traces
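
To make the director-side flow in item 2 concrete, here is a hedged sketch of the scheduling-time and response-time hooks. It reuses the hypothetical LatencyPredictor types from the earlier client sketch; the field and function names are placeholders, not the PR's actual director.go or handlers code.

```go
package sketch // continues the client sketch above (same hypothetical types)

import (
	"context"
	"time"
)

// Placeholder request context, loosely mirroring the fields this PR adds
// to RequestContext (predicted vs. observed TTFT/TPOT and timestamps).
type reqCtx struct {
	PredictedTTFTForScheduling float64
	PredictedTPOTForScheduling float64
	ActualTTFTMs               float64
	ActualTPOTMs               []float64
	requestStart               time.Time
	firstTokenAt               time.Time
	lastTokenAt                time.Time
}

// onSchedule: call the predictor with pod metrics + request features and
// stash the estimates on the request context for later comparison.
func onSchedule(ctx context.Context, p LatencyPredictor, rc *reqCtx, feat PredictionRequest) error {
	pred, err := p.Predict(ctx, feat)
	if err != nil {
		return err // the real EPP would likely degrade gracefully here
	}
	rc.PredictedTTFTForScheduling = pred.TTFTMs
	rc.PredictedTPOTForScheduling = pred.TPOTMs
	return nil
}

// onResponseChunk: record observed TTFT on the first streamed token and a
// TPOT sample per subsequent token, then feed the observation to training,
// roughly the role the PR gives HandleResponseHeaders/HandleResponseBodyChunk.
func onResponseChunk(p LatencyPredictor, rc *reqCtx, feat PredictionRequest) {
	now := time.Now()
	if rc.firstTokenAt.IsZero() {
		rc.firstTokenAt = now
		rc.ActualTTFTMs = float64(now.Sub(rc.requestStart).Milliseconds())
	} else {
		rc.ActualTPOTMs = append(rc.ActualTPOTMs, float64(now.Sub(rc.lastTokenAt).Milliseconds()))
	}
	rc.lastTokenAt = now

	entry := TrainingEntry{Features: feat, ActualTTFT: rc.ActualTTFTMs, Timestamp: now}
	if n := len(rc.ActualTPOTMs); n > 0 {
		entry.ActualTPOT = rc.ActualTPOTMs[n-1]
	}
	// The async client buffers these and bulk-flushes to the training sidecar.
	_ = p.AddTrainingDataBulk([]TrainingEntry{entry})
}
```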


Observability & Metrics

  • Prometheus histograms for prediction‑service call latency and error rates

  • Counters/gauges for sample ingestion, model‑retrain cycles, and model freshness

  • End‑to‑end metrics comparing actual vs. predicted TTFT/TPOT
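
As an illustration only, here is a small sketch of how actual-vs-predicted histograms and a predictor error counter could be registered with the standard Prometheus Go client. The metric names below are invented for the example; they are not the names the PR actually registers.

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric names; the PR's real metrics live in the EPP's
// existing metrics package and may be named differently.
var (
	actualTTFT = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "epp_request_ttft_seconds",
		Help:    "Observed time to first token.",
		Buckets: prometheus.ExponentialBuckets(0.01, 2, 12),
	})
	predictedTTFT = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "epp_request_predicted_ttft_seconds",
		Help:    "Predicted time to first token at scheduling time.",
		Buckets: prometheus.ExponentialBuckets(0.01, 2, 12),
	})
	predictorErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "epp_latency_predictor_errors_total",
		Help: "Failed calls to the prediction sidecar.",
	})
)

func init() {
	prometheus.MustRegister(actualTTFT, predictedTTFT, predictorErrors)
}

// recordCompletion compares what was forecast with what actually happened.
func recordCompletion(observedTTFTSeconds, predictedTTFTSeconds float64) {
	actualTTFT.Observe(observedTTFTSeconds)
	predictedTTFT.Observe(predictedTTFTSeconds)
}
```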

@k8s-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels on Jul 15, 2025

netlify bot commented Jul 15, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: a618d85
🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/687c629195bab600087d3681
😎 Deploy Preview: https://deploy-preview-1161--gateway-api-inference-extension.netlify.app

@k8s-ci-robot requested a review from ahg-g on July 15, 2025 02:26
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kaushikmitr
Once this PR has been reviewed and has the lgtm label, please assign danehans for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested a review from robscott on July 15, 2025 02:26
@k8s-ci-robot added the needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD) label on Jul 15, 2025
@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor)

Hi @kaushikmitr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) and size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) labels on Jul 15, 2025
Comment on lines 75 to 86
type ProfileRunResult struct {
	TargetPods []Pod
	// RawScores is a map of raw scores for each pod, keyed by scorer type.
	RawScores map[string]map[Pod]float64
}

// SchedulingResult captures the result of the scheduling cycle.
type SchedulingResult struct {
	ProfileResults map[string]*ProfileRunResult
	// AllProfileRunResults is added by this PR to return every profile's result
	// (not just the single best), so raw scores such as the prefix cache score
	// are visible to callers like the predictor logic.
	AllProfileRunResults map[string]*ProfileRunResult
	PrimaryProfileName   string
}
Contributor

We should probably discuss these changes (cc @kfswain).
The new scheduler extension points and the mechanism around them were designed very carefully, while these changes break some of the concepts of that design. For example, one guiding principle was that scores are internal to the scheduler (not returned to the caller).

I understand that this task may add a new requirement that we haven't considered, but we should be careful not to break the guiding principles.

Collaborator

++. While this is a WIP branch, I would expect the final state to conform to how other plugins operate, or to make a case for why the interface should be expanded.

Contributor Author

Yes, this is a WIP PR and we are still at the POC stage; I created it for early alignment. Our next step is to demonstrate that the latency predictor can achieve SLO-aware routing before we finalize the design for how it fits inside the EPP. Here is a doc I shared with the auto-scaling llm-d team: https://docs.google.com/document/d/1q56wr3N5XGx0B21MzHu5oBsCiGi9VrbZAvyhP2VFG_c/edit?tab=t.0

@kfswain marked this pull request as draft on July 17, 2025 22:38
@kfswain (Collaborator) commented on Jul 17, 2025

Converted the PR to draft, in addition to the WIP tag, to dissuade review. This is still in progress and shouldn't be considered for review unless the authors explicitly ask for it.
