feat(flowcontrol): Implement registry shard #1187


Conversation

LukeAVanDrie
Contributor

This work tracks #674

This pull request introduces the operational layer of the new sharded Flow Registry architecture. It delivers the concurrent, high-performance "data plane" that each FlowController worker will interact with, paving the way for the top-level administrative control plane (FlowRegistry) to be built in a subsequent PR.

The core architectural principle here is the separation of concerns between the global registry's administrative functions and the per-worker operational view. This PR focuses exclusively on the latter by implementing the registryShard.

The key components introduced are:

  • registry.Config: A new configuration object that defines the master layout for the entire registry. It includes robust validation of compatibility between policies and queues, defaulting logic, and a partition() method that distributes capacity allocations across all shards (see the first sketch after this list).

  • registry.registryShard: This is the heart of the PR. It is a concrete, concurrency-safe implementation of the contracts.RegistryShard interface. It provides a FlowController worker with its partitioned view of the system state, managing all flow queues and policies within its boundary. Concurrency is managed via an RWMutex for structural map changes and lock-free atomic operations for all statistics, ensuring high performance on the hot path.

  • registry.managedQueue: A critical decorator that wraps a framework.SafeQueue. It serves two essential functions in the architecture:

    1. Atomic Statistics: It maintains its own counters and uses a callback to atomically reconcile its state with its parent shard. This is the key to enabling lock-free, aggregated statistics at the shard and global levels (see the second sketch after this list).
    2. Lifecycle Enforcement: It tracks the queue's state (active vs. draining). This is a forward-looking feature that is essential for enabling graceful administrative actions in the future, such as changing a flow's priority or decommissioning a shard without losing requests.
  • Contracts and Errors: New sentinel errors (ErrFlowInstanceNotFound, ErrPriorityBandNotFound, ErrPolicyQueueIncompatible) have been added to the contracts package to create a stable and predictable API for consumers of the registry.
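
To make the partitioning idea concrete, here is a minimal sketch of how a partition() method can split a global capacity limit across shards. The Config shape shown here (a single MaxBytes field) and the remainder handling are illustrative assumptions, not the actual configuration:

```go
// Sketch only: illustrates the capacity-partitioning idea described
// above. The MaxBytes field is an assumed stand-in for the real
// capacity configuration.
package registry

// Config is a simplified stand-in for the registry's master layout.
type Config struct {
	// MaxBytes is the global capacity limit to split across shards.
	MaxBytes uint64
}

// partition returns the slice of configuration owned by shard `index`
// out of `total` shards, spreading any remainder over the first few
// shards so the per-shard limits still sum to the global limit.
func (c *Config) partition(index, total int) *Config {
	base := c.MaxBytes / uint64(total)
	remainder := c.MaxBytes % uint64(total)

	shardBytes := base
	if uint64(index) < remainder {
		shardBytes++ // earlier shards absorb the remainder
	}
	return &Config{MaxBytes: shardBytes}
}
```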
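
And a minimal sketch of the managedQueue statistics reconciliation. The statsCallback signature, field names, and onAdd/onRemove hooks are illustrative assumptions; only the decorator-plus-callback pattern reflects the design described above.

```go
// Sketch only: a simplified view of the managedQueue decorator.
// Field names and the callback signature are assumptions.
package registry

import "sync/atomic"

// statsCallback lets a queue report length/byte deltas to its parent
// shard, which aggregates them with its own atomic counters, so no
// lock is taken on the hot path.
type statsCallback func(lenDelta, byteDelta int64)

type managedQueue struct {
	// queue framework.SafeQueue // the wrapped queue, omitted here
	length   atomic.Int64
	byteSize atomic.Int64
	draining atomic.Bool // lifecycle state: active vs. draining
	onStats  statsCallback
}

// onAdd reconciles local counters and notifies the parent shard after
// an item of `bytes` size is enqueued.
func (mq *managedQueue) onAdd(bytes int64) {
	mq.length.Add(1)
	mq.byteSize.Add(bytes)
	mq.onStats(1, bytes)
}

// onRemove mirrors onAdd after a dequeue.
func (mq *managedQueue) onRemove(bytes int64) {
	mq.length.Add(-1)
	mq.byteSize.Add(-bytes)
	mq.onStats(-1, -bytes)
}
```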

Testing

All new components are accompanied by comprehensive unit tests, which can be found in pkg/epp/flowcontrol/registry/. The tests cover:

  • Configuration validation and partitioning logic.
  • managedQueue behavior, including statistics reconciliation and concurrency safety.
  • registryShard accessors, error paths, and statistics aggregation.

This commit introduces the core operational layer of the Flow Registry's
sharded architecture. It deliberately focuses on implementing the
`registryShard`—the concurrent, high-performance data plane for a single
worker—as opposed to the top-level `FlowRegistry` administrative
control plane, which will be built upon this foundation.

The `registryShard` provides the concrete implementation of the
`contracts.RegistryShard` port, giving each `FlowController` worker a
safe, partitioned view of the system's state. This design is
fundamental to achieving scalability by minimizing cross-worker
contention on the hot path.

The key components are:

- **`registry.Config`**: The master configuration blueprint for the
  entire `FlowRegistry`. It is validated once and then partitioned,
  with each shard receiving its own slice of the configuration,
  notably for capacity limits.

- **`registry.registryShard`**: The operational heart of this commit. It
  manages the lifecycle of queues and policies within a single shard,
  providing the read-oriented access needed by a `FlowController`
  worker. It ensures concurrency safety through a combination of mutexes
  for structural changes and lock-free atomics for statistics.

- **`registry.managedQueue`**: A stateful decorator that wraps a raw
  `framework.SafeQueue`. Its two primary responsibilities are to enable
  the sharded model by providing atomic, upwardly reconciled statistics,
  and to enforce lifecycle state (active vs. draining), which is
  essential for the graceful draining of flows during future
  administrative updates.

- **Contracts and Errors**: New sentinel errors are added to the
  `contracts` package to create a clear, stable API boundary between
  the registry and its consumers.

This work establishes the robust, scalable, and concurrent foundation
upon which the top-level `FlowRegistry` administrative interface will be
built.

netlify bot commented Jul 17, 2025

Deploy Preview for gateway-api-inference-extension ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | a58d7b3 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/gateway-api-inference-extension/deploys/687987c743a27b0008c60c2d |
| 😎 Deploy Preview | https://deploy-preview-1187--gateway-api-inference-extension.netlify.app |

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 17, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 17, 2025
@k8s-ci-robot
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 17, 2025
@LukeAVanDrie
Contributor Author

/assign @ahg-g

/assign @kfswain

@ahg-g
Contributor

ahg-g commented Jul 18, 2025

/ok-to-test

How are we sharding exactly?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 18, 2025
@LukeAVanDrie
Contributor Author

LukeAVanDrie commented Jul 18, 2025

> /ok-to-test
>
> How are we sharding exactly?

Great question. Here’s the high-level overview of the sharding strategy.

The Short Answer

We shard by running multiple parallel worker loops (shardProcessors) within a single Flow Controller instance. Each logical flow (e.g., for a tenant) gets a dedicated queue on every shard.

When a request arrives, a top-level distributor sends it to the shard with the lightest load for that specific flow. This prevents a central bottleneck and allows us to process requests in parallel.

Think of it as having multiple, independent dispatchers instead of just one.

How it Works: Join-the-Shortest-Queue-by-Bytes (JSQ-Bytes)

The distributor uses a Join-the-Shortest-Queue-by-Bytes (JSQ-Bytes) algorithm.

  1. A request arrives for Flow F.
  2. The distributor inspects the queue for Flow F on every shard (q_F_1, q_F_2, ... q_F_n).
  3. It sends the request to the shard whose queue for F currently has the fewest total bytes.

This balances the load (in terms of memory pressure and queuing capacity) across the shards in real-time.
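
A minimal sketch of that selection step, assuming a placeholder shard interface with a flowQueueBytes accessor (neither is the actual contract):

```go
// Sketch only: the JSQ-Bytes choice for a single request. The shard
// interface and flowQueueBytes method are illustrative placeholders.
package flowcontrol

type shard interface {
	// flowQueueBytes returns the current byte size of this shard's
	// queue for the given flow.
	flowQueueBytes(flowID string) uint64
}

// selectShard returns the shard whose queue for flowID currently
// holds the fewest bytes (nil only if shards is empty).
func selectShard(shards []shard, flowID string) shard {
	var best shard
	var bestBytes uint64
	for _, s := range shards {
		b := s.flowQueueBytes(flowID)
		if best == nil || b < bestBytes {
			best, bestBytes = s, b
		}
	}
	return best
}
```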

Current Status & Next Steps

This PR implements the registryShard structure, which makes this parallel processing possible. Subsequent PRs will introduce the shardProcessor (worker) and the top-level distributor that implements the JSQ-Bytes logic (FlowController). For now, the system is configured to run with a single shard, but the architecture is ready for expansion.


Deeper Dive on Design Rationale

The design of the Flow Controller is built on this sharded architecture to enable parallel processing and prevent the central dispatch loop from becoming a bottleneck at high request rates.

Critical Assumption: Workload Homogeneity Within Flows

The effectiveness of the sharded model hinges on a critical assumption: while the system as a whole manages a heterogeneous set of flows, the traffic within a single logical flow is assumed to be roughly homogeneous. A logical flow represents a single workload or tenant, so we expect request characteristics to be statistically similar within that flow.

The JSQ-Bytes distribution makes this assumption robust by:

  1. Reflecting True Load: It distributes work based on each shard's current queue size in bytes—a direct measure of its memory and capacity congestion.
  2. Adapting to Real-Time Congestion: It adaptively steers new work away from momentarily struggling workers.
  3. Hedging Against Assumption Violations: This adaptive, self-correcting nature makes it a powerful hedge. It doesn't just distribute; it actively load balances based on the most relevant feedback available.

Stateful Policies in a Sharded Registry

Sharding means that when the critical assumption holds, the independent actions of policies on each shard result in emergent, approximate global fairness. Achieving true, deterministic global fairness would require policies to depend on an external, globally-consistent state store, which adds significant complexity. The shard_count will be a key operational lever to tune this behavior, especially for low-QPS flows.

@ahg-g
Contributor

ahg-g commented Jul 18, 2025

> The shard_count will be a key operational lever to tune this behavior, especially for low-QPS flows.

worth considering a waterfall algorithm, meaning create multiple shards, but the requests should be added to one shard until it is at capacity and then spill to the next one.

@ahg-g
Contributor

ahg-g commented Jul 18, 2025

This PR is fine, but again, hard to reason about without the full picture.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 18, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, LukeAVanDrie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 18, 2025
@LukeAVanDrie
Contributor Author

> The shard_count will be a key operational lever to tune this behavior, especially for low-QPS flows.
>
> worth considering a waterfall algorithm, meaning create multiple shards, but the requests should be added to one shard until it is at capacity and then spill to the next one.

That's a great point, and it's a valid architectural pattern for certain types of systems. I considered a "fill-and-spill" (waterfall) approach during the design phase, but I concluded that a "spread" approach using JSQ-Bytes is a better fit for the specific goals of the Flow Controller.

The core trade-off comes down to what we are optimizing for.

A waterfall algorithm is excellent for systems that need to optimize for resource packing or locality. However, the primary goals of our Flow Controller are high throughput, low latency, and fairness between concurrent flows. The waterfall model presents challenges for all three of these goals in our specific context:

  1. It Re-introduces a Single-Threaded Bottleneck: The main motivation for sharding is to parallelize the dispatch loop. A waterfall model effectively serializes all requests through a single "hot" shard until it reaches capacity. This would re-introduce the very bottleneck we're trying to eliminate, capping our throughput at the speed of a single worker.

  2. It Makes Emergent Fairness Impossible: This is the most critical point. Our design relies on emergent, approximate global fairness, where fair policies running on each of the N shards produce a globally fair outcome over time. This only works if each shard receives a statistically representative sample of the overall traffic. The waterfall model breaks this prerequisite. If Shard 1 receives 100% of the traffic, the other N-1 shards have a completely stale and unrepresentative view of the traffic mix. When traffic finally spills over, their policies—especially stateful ones—will make decisions based on a completely outdated context, undermining any global fairness objective.

  3. It Can Create Instability: The waterfall model can lead to oscillating behavior. The system would operate with one hot shard and N-1 cold shards. When the hot shard finally drains, all traffic would suddenly "thunder" over to the next shard in line, making it the new hot spot. This is less stable and predictable than the JSQ-Bytes approach, which gracefully and continuously balances load across all available workers.
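
For contrast, here is a minimal sketch of what the suggested fill-and-spill selection would look like, using the same placeholder style as the JSQ-Bytes sketch earlier in the thread (capacityShard and hasCapacity are illustrative assumptions):

```go
// Sketch only: the fill-and-spill (waterfall) alternative. Types and
// method names are illustrative placeholders.
package flowcontrol

type capacityShard interface {
	// hasCapacity reports whether this shard can accept another
	// request of the given size without exceeding its limit.
	hasCapacity(bytes uint64) bool
}

// selectShardWaterfall walks shards in a fixed order and returns the
// first one with room, so all traffic funnels through shard 0 until
// it saturates and spills over to shard 1, and so on.
func selectShardWaterfall(shards []capacityShard, bytes uint64) capacityShard {
	for _, s := range shards {
		if s.hasCapacity(bytes) {
			return s
		}
	}
	return nil // every shard is at capacity
}
```

The fixed iteration order is exactly what serializes traffic through one hot shard and leaves the other shards with a stale view of the traffic mix, which is the problem the points above describe.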

To summarize the trade-off:

  • Waterfall (Fill-and-Spill) Model:

    • Optimizes For: Resource Packing, Locality.
    • Behavior: Creates a single "hot" shard, serializing requests.
    • Impact on Fairness: Undermines it. Creates skewed, unrepresentative views for policies, rendering stateful policies ineffective.
    • Best For: Batch systems, systems with expensive, stateful workers.
  • JSQ-Bytes (Spread) Model:

    • Optimizes For: Parallelism, Throughput, Fairness.
    • Behavior: Spreads load across all shards based on real-time congestion.
    • Impact on Fairness: Enables it. Ensures all shards see a representative traffic mix, allowing local policies to produce an emergent, globally fair outcome.
    • Best For: High-throughput, low-latency, fairness-sensitive systems like ours.

For these reasons, I strongly believe that our current JSQ-Bytes "spread" approach is the correct choice to meet the core requirements of this component.

@k8s-ci-robot k8s-ci-robot merged commit 6cf8f31 into kubernetes-sigs:main Jul 18, 2025
9 checks passed
@LukeAVanDrie
Contributor Author

> This PR is fine, but again, hard to reason about without the full picture.
>
> /lgtm /approve

Thanks for the LGTM and approval, Abdullah!

I agree, this registryShard component is abstract on its own. To give you that "full picture," this PR establishes the core data plane for the FlowController. It's the high-performance, concurrency-safe container that holds all the flow state.

The next PR I'm preparing introduces the execution engine that will consume this state. The reason for the registryShard's specific, read-oriented API is to safely and efficiently serve those parallel flow control workers.

Once you see how a worker uses the contracts.RegistryShard interface in the next review, the design choices here will feel much more concrete.
