Conversation

@bsurber (Contributor) commented Jul 31, 2025

Commit Message:
Persist RLQS client state & bucket assignment+usage caches across filter config updates (e.g. via LDS). This is done by aggregating rate_limit_quota filters to share RLQS global clients & bucket caches based on their configured RLQS server destination & domain, instead of creating a new global client + cache per filter factory.

Now, the TlsStores (global client + tls slots) are referenced via weak_ptrs in a static map & owned by filter factories. If all filter factories drop and stop owning a TlsStore, its shared_ptr will trigger destruction of the global resources & index.

This persistence has a positive side-effect of centralizing usage reporting + assignment generation from the RLQS server's perspective, while still allowing for separation between filter configs via the domain field if needed, e.g. if 2 filter chains send traffic to different upstreams & want separate rate limit assignments for each.

Additional Description:

  • The design for the global, static map came from the registration model for filter factory callbacks (//envoy/registry/registry.h, FactoryRegistry); a sketch of the resulting pattern follows this list.
  • Updates to the global map (removals or additions of indices) are not thread-safe, and so are always done from the main thread.
    • Additions occur during filter factory creation, if needed, which is handled by the main thread (e.g. during startup or when triggered by an LDS update).
    • A garbage collection timer (every 10s) on the main thread handles removal of any map indices that are no longer referenced by any active filter factories.
  • Map indexing does not account for differences in gRPC client configurations (excluding RLQS server destination), such as differences in timeouts. The global RLQS client will be created according to the first-seen configuration.
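
For illustration only, a minimal sketch of the sharing pattern described above; the names (TlsStore, TlsStoreKey, tlsStoreIndex, getOrCreateTlsStore) are hypothetical stand-ins, not the filter's actual classes:

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in for the shared per-(server, domain) state: in the real filter this
// would hold the global RLQS client and the TLS slots.
struct TlsStore {};

// Map key: {RLQS server target, domain}.
using TlsStoreKey = std::pair<std::string, std::string>;

// Static index of shared state. Entries are weak_ptrs, so the map never
// extends a TlsStore's lifetime; ownership stays with the filter factories.
std::map<TlsStoreKey, std::weak_ptr<TlsStore>>& tlsStoreIndex() {
  static auto* index = new std::map<TlsStoreKey, std::weak_ptr<TlsStore>>();
  return *index;
}

// Called during filter factory creation. Main-thread only, so no locking.
std::shared_ptr<TlsStore> getOrCreateTlsStore(const std::string& rlqs_server,
                                              const std::string& domain) {
  auto& index = tlsStoreIndex();
  const TlsStoreKey key{rlqs_server, domain};
  auto it = index.find(key);
  if (it != index.end()) {
    if (std::shared_ptr<TlsStore> existing = it->second.lock()) {
      return existing;  // Reuse the store created by an earlier factory.
    }
  }
  auto store = std::make_shared<TlsStore>();
  index[key] = store;  // Overwrites an expired entry if one was left behind.
  return store;
}
```

Two filter configs pointing at the same RLQS server and domain would then resolve to the same TlsStore, while a different domain yields a separate entry.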

Risk Level: Minimal to moderate (thread-safety warrants scrutiny but changes are to a WIP filter)
Testing: integration & manual testing (config changes + filter replacements shown to not interrupt rate limiting).
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

…nts & bucket caches based on their configured RLQS server destination & domain.

Signed-off-by: Brian Surber <[email protected]>
@repokitteh-read-only commented

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #40497 was opened by bsurber.


@bsurber bsurber marked this pull request as ready for review July 31, 2025 17:25
@bsurber bsurber requested review from tyxia and yanavlasov as code owners July 31, 2025 17:25
@bsurber (Contributor, Author) commented Jul 31, 2025

@sergiitk can review the implementation to ensure it matches the design consensus.

@bsurber (Contributor, Author) commented Jul 31, 2025

/retest

@bsurber (Contributor, Author) commented Jul 31, 2025

/assign sergiitk

@repokitteh-read-only commented

bsurber is not allowed to assign users.

🐱

Caused by: a #40497 (comment) was created by @bsurber.


@tyxia (Member) commented Aug 4, 2025

@sergiitk could you please take a first pass? Thanks!

@phlax (Member) commented Aug 7, 2025

@bsurber failures look real

@bsurber (Contributor, Author) commented Aug 12, 2025

Ah, I see the leaked mocks complaint from //test/config_test:example_configs_test under the various oauth errors now. I'll have to take a look.

@bsurber (Contributor, Author) commented Aug 13, 2025

It's complaining about a MockTimer getting left around. If the TestUsingSimulatedTime framework creates Timers as MockTimers, that'd make sense.
I'll try adding a deleteIsPending() func to TlsStore to make sure that its underlying global client & cleanup timer are also added for deferred deletion.

@bsurber (Contributor, Author) commented Aug 13, 2025

Actually no, that wouldn't make sense: traffic would have to be hitting RLQS buckets to initialize the TlsStores map, create a bucket's TLS store, and start the garbage_collector timer, and none of that happens in ExampleConfigsTest.All, so I'm back to being confused.
Correction: the TLS store cleanup timer is created as soon as an RLQS filter factory is initialized, unlike buckets in the cache, which require actual traffic.

@bsurber (Contributor, Author) commented Aug 14, 2025

I'm experimenting with removing the garbage collection timer and instead keeping a weak_ptr in the global map. The TlsStore will then remove its own index from the global map once all the owning filter factories have stopped using it.
This is an experiment because I want to confirm that only the main thread (or a test thread) performs the shared_ptr cleanup; any other thread wouldn't be safe. I expect this to hold, since only the main thread creates & deletes filter factories and so is the only thread that changes TlsStore ownership.
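
A rough sketch of that destructor-driven cleanup, building on the hypothetical map sketch in the description above (same invented names; assumes, as noted, that the final shared_ptr release only ever happens on the main thread):

```cpp
// Builds on the earlier TlsStoreKey / tlsStoreIndex() sketch. Instead of a
// periodic GC timer, the TlsStore erases its own index entry when the last
// owning filter factory drops its shared_ptr.
struct TlsStore {
  explicit TlsStore(TlsStoreKey k) : key(std::move(k)) {}

  ~TlsStore() {
    auto& index = tlsStoreIndex();
    auto it = index.find(key);
    // Only erase if the entry still points at this (now expired) store, so a
    // store re-created under the same key is left untouched.
    if (it != index.end() && it->second.expired()) {
      index.erase(it);
    }
  }

  TlsStoreKey key;
};
```

With this variant, getOrCreateTlsStore would construct the store as std::make_shared<TlsStore>(key) so the destructor knows which entry to erase.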

…ly saving weak_ptrs to the map & setting TlsStore destructor logic to handle cleanup.

Signed-off-by: Brian Surber <[email protected]>
@bsurber (Contributor, Author) commented Aug 15, 2025

Ah, client_test is hitting a strict mock's complaint about the new resetStream() call during shutdown.

…s any ongoing stream when tearing down at the end of tests.

Signed-off-by: Brian Surber <[email protected]>
@bsurber (Contributor, Author) commented Aug 15, 2025

/coverage

@repokitteh-read-only commented

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-cncf-pr/40497/coverage/index.html

For comparison, current coverage on main branch is here:

https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html

The coverage results are (re-)rendered each time the CI Envoy/Checks (coverage) job completes.

🐱

Caused by: a #40497 (comment) was created by @bsurber.


@sergiitk (Contributor) left a comment

LGTM

bucket_matchers:
  matcher_list:
    matchers:
    # Assign requests with header['env'] set to 'staging' to the bucket { name: 'staging' }
A reviewer (Contributor) commented on this snippet:

This comment doesn't seem relevant for this case - should it be updated to indicate the filter contains an invalid matcher (because of the type mismatch)?

@tyxia (Member) left a comment

Thanks!

@tyxia tyxia changed the title Persist rate_limit_quota filter state & centralize filters to share state rlqs: Persist rate_limit_quota filter state & centralize filters to share state Aug 21, 2025
@tyxia tyxia merged commit f4ae9d9 into envoyproxy:main Aug 21, 2025
24 checks passed
melginaldi pushed a commit to melginaldi/envoy that referenced this pull request Aug 21, 2025
…hare state (envoyproxy#40497)

melginaldi pushed a commit to melginaldi/envoy that referenced this pull request Aug 26, 2025
…hare state (envoyproxy#40497)

wtzhang23 pushed a commit to wtzhang23/envoy that referenced this pull request Aug 27, 2025
…hare state (envoyproxy#40497)

paul-r-gall pushed a commit that referenced this pull request Sep 20, 2025
…Stream (#41053)

Commit Message: The RLQS async stream in `GlobalRateLimitClientImpl`
(`stream_`) doesn't actually own the underlying raw stream ptr. This was
causing a race condition during shutdown, with the cluster-manager's
deferred stream reset+deletion racing against the global client's
deferred deletion. If the deferred global client deletion triggered
first, without resetting the stream, then the cluster-manager would fail
in its own stream reset attempt (the stream's callbacks having been
deleted with the global client). If the global client guarantees stream
reset + deletion, and the cluster manager wins the race, then the global
client's reset + deletion fails with heap-use-after-free.

To get around this race condition, the `GlobalRateLimitClientImpl` can
instead own its `RawAsyncClient` & delete it to guarantee that any of
its active streams are cleaned up.

-------- 
Additional Description: With the owned RawAsyncClient, integration
testing saw a new flake where sometimes the first connection to a fake
upstream failed immediately with an empty-message internal error. This
was addressed by adding `waitForRlqsStream()` to check all fake upstream
connections for new streams, not just the first.

--------
Risk Level:
Testing: Unit & integration. integration_test & filter_persistence_test
run 500 times to check for flakes.
Docs Changes:
Release Notes:
Platform Specific Features:
Fixes ASAN flake from PR #40497

---------

Signed-off-by: Brian Surber <[email protected]>
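
For readers following along, a purely illustrative sketch of the ownership change that commit message describes (the types below are stand-ins, not Envoy's actual gRPC interfaces):

```cpp
#include <memory>

// Stand-in for an RLQS stream; resetting it tears down the raw stream.
struct RlqsStream {
  void reset() { /* tear down the underlying raw stream */ }
};

// Stand-in for an owned async client that resets its streams on destruction.
struct OwnedAsyncClient {
  ~OwnedAsyncClient() {
    if (active_stream_ != nullptr) {
      active_stream_->reset();  // Streams never outlive their client.
    }
  }
  std::unique_ptr<RlqsStream> active_stream_;
};

// With the global client owning the async client, deferred deletion of the
// global client deterministically resets any remaining stream, so there is no
// separate, racing deletion path left for the stream callbacks.
class GlobalRateLimitClientSketch {
public:
  std::unique_ptr<OwnedAsyncClient> async_client_;
};
```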
mbadov pushed a commit to mbadov/envoy that referenced this pull request Sep 22, 2025
…Stream (envoyproxy#41053)

lucaschimweg pushed a commit to lucaschimweg/envoy that referenced this pull request Sep 23, 2025
…Stream (envoyproxy#41053)
