Conversation

@bsurber (Contributor) commented Jul 31, 2025

Commit Message:
Persist RLQS client state & bucket assignment+usage caches across filter config updates (e.g. via LDS). This is done by aggregating rate_limit_quota filters to share RLQS global clients & bucket caches based on their configured RLQS server destination & domain, instead of creating a new global client + cache per filter factory.

Now, the TlsStores (global client + tls slots) are referenced via weak_ptrs in a static map & owned by filter factories. If all filter factories drop and stop owning a TlsStore, its shared_ptr will trigger destruction of the global resources & index.

This persistence has a positive side-effect of centralizing usage reporting + assignment generation from the RLQS server's perspective, while still allowing for separation between filter configs via the domain field if needed, e.g. if 2 filter chains send traffic to different upstreams & want separate rate limit assignments for each.

Additional Description:

  • The design for the global, static map came from the registration model for filter factory callbacks (//envoy/registry/registry.h, FactoryRegistry); a sketch of the resulting pattern follows this list.
  • Updates to the global map (removals or additions of indices) are not thread-safe, and so are always done from the main thread.
    • Additions occur during filter factory creation, if needed, which is handled by the main thread (e.g. during startup or when triggered by an LDS update).
    • A garbage collection timer (every 10s) on the main thread handles removal of any map indices that are no longer referenced by any active filter factories.
  • Map indexing does not account for differences in gRPC client configurations (excluding RLQS server destination), such as differences in timeouts. The global RLQS client will be created according to the first-seen configuration.
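
For illustration only, a minimal sketch of the sharing pattern described above; the names (TlsStore, TlsStoreKey, tlsStoreIndex, getOrCreateTlsStore) are hypothetical stand-ins, not the filter's actual classes:

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in for the shared per-(server, domain) state: in the real filter this
// would hold the global RLQS client and the TLS slots.
struct TlsStore {};

// Map key: {RLQS server target, domain}.
using TlsStoreKey = std::pair<std::string, std::string>;

// Static index of shared state. Entries are weak_ptrs, so the map never
// extends a TlsStore's lifetime; ownership stays with the filter factories.
std::map<TlsStoreKey, std::weak_ptr<TlsStore>>& tlsStoreIndex() {
  static auto* index = new std::map<TlsStoreKey, std::weak_ptr<TlsStore>>();
  return *index;
}

// Called during filter factory creation. Main-thread only, so no locking.
std::shared_ptr<TlsStore> getOrCreateTlsStore(const std::string& rlqs_server,
                                              const std::string& domain) {
  auto& index = tlsStoreIndex();
  const TlsStoreKey key{rlqs_server, domain};
  auto it = index.find(key);
  if (it != index.end()) {
    if (std::shared_ptr<TlsStore> existing = it->second.lock()) {
      return existing;  // Reuse the store created by an earlier factory.
    }
  }
  auto store = std::make_shared<TlsStore>();
  index[key] = store;  // Overwrites an expired entry if one was left behind.
  return store;
}
```

Two filter configs pointing at the same RLQS server and domain would then resolve to the same TlsStore, while a different domain yields a separate entry.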

Risk Level: Minimal to moderate (thread-safety warrants scrutiny but changes are to a WIP filter)
Testing: integration & manual testing (config changes + filter replacements shown to not interrupt rate limiting).
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

…nts & bucket caches based on their configured RLQS server destination & domain.

Signed-off-by: Brian Surber <[email protected]>
@repokitteh-read-only commented

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #40497 was opened by bsurber.


@bsurber bsurber marked this pull request as ready for review July 31, 2025 17:25
@bsurber bsurber requested review from tyxia and yanavlasov as code owners July 31, 2025 17:25
@bsurber (Contributor, Author) commented Jul 31, 2025

@sergiitk can review the implementation to ensure it matches the design consensus.

@bsurber (Contributor, Author) commented Jul 31, 2025

/retest

@bsurber (Contributor, Author) commented Jul 31, 2025

/assign sergiitk

@repokitteh-read-only commented

bsurber is not allowed to assign users.

🐱

Caused by: a #40497 (comment) was created by @bsurber.


@tyxia (Member) commented Aug 4, 2025

@sergiitk could you please take a first pass? Thanks!

@phlax (Member) commented Aug 7, 2025

@bsurber failures look real

@bsurber (Contributor, Author) commented Aug 12, 2025

Ah, I see the leaked mocks complaint from //test/config_test:example_configs_test under the various oauth errors now. I'll have to take a look.

@bsurber (Contributor, Author) commented Aug 13, 2025

It's complaining about a MockTimer getting left around. If the TestUsingSimulatedTime framework creates Timers as MockTimers, that'd make sense.
I'll try adding a deleteIsPending() func to TlsStore to make sure that its underlying global client & cleanup timer are also added for deferred deletion.

@bsurber (Contributor, Author) commented Aug 13, 2025

Actually no, that wouldn't make sense: traffic would have to be hitting RLQS buckets to initialize the TlsStores map, create a bucket's TLS store, and start the garbage_collector timer, and none of that happens in ExampleConfigsTest.All, so I'm back to being confused.
Correction: the TLS store cleanup timer is created as soon as an RLQS filter factory is initialized, unlike buckets in the cache, which require actual traffic.

@bsurber (Contributor, Author) commented Aug 14, 2025

I'm experimenting with removing the garbage collection timer and instead keeping a weak_ptr in the global map. The TlsStore will then remove its own index from the global map once all the owning filter factories have stopped using it.
This is an experiment because I want to confirm that only the main thread (or a test thread) performs the shared_ptr cleanup; any other thread wouldn't be safe. I expect this to hold, since only the main thread creates & deletes filter factories and so is the only thread that changes TlsStore ownership.
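
A rough sketch of that destructor-driven cleanup, building on the hypothetical map sketch in the description above (same invented names; assumes, as noted, that the final shared_ptr release only ever happens on the main thread):

```cpp
// Builds on the earlier TlsStoreKey / tlsStoreIndex() sketch. Instead of a
// periodic GC timer, the TlsStore erases its own index entry when the last
// owning filter factory drops its shared_ptr.
struct TlsStore {
  explicit TlsStore(TlsStoreKey k) : key(std::move(k)) {}

  ~TlsStore() {
    auto& index = tlsStoreIndex();
    auto it = index.find(key);
    // Only erase if the entry still points at this (now expired) store, so a
    // store re-created under the same key is left untouched.
    if (it != index.end() && it->second.expired()) {
      index.erase(it);
    }
  }

  TlsStoreKey key;
};
```

With this variant, getOrCreateTlsStore would construct the store as std::make_shared<TlsStore>(key) so the destructor knows which entry to erase.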

…ly saving weak_ptrs to the map & setting TlsStore destructor logic to handle cleanup.

Signed-off-by: Brian Surber <[email protected]>
@bsurber (Contributor, Author) commented Aug 15, 2025

Ah, client_test is hitting a strict mock's complaint about the new resetStream() call during shutdown.

…s any ongoing stream when tearing down at the end of tests.

Signed-off-by: Brian Surber <[email protected]>
@bsurber (Contributor, Author) commented Aug 15, 2025

/coverage

@repokitteh-read-only commented

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-cncf-pr/40497/coverage/index.html

For comparison, current coverage on main branch is here:

https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html

The coverage results are (re-)rendered each time the CI Envoy/Checks (coverage) job completes.

🐱

Caused by: a #40497 (comment) was created by @bsurber.


@sergiitk (Contributor) left a comment

LGTM

bucket_matchers:
  matcher_list:
    matchers:
    # Assign requests with header['env'] set to 'staging' to the bucket { name: 'staging' }
A reviewer (Contributor) commented on this snippet:

This comment doesn't seem relevant for this case - should it be updated to indicate the filter contains an invalid matcher (because of the type mismatch)?

@tyxia (Member) left a comment

Thanks!

@tyxia tyxia changed the title Persist rate_limit_quota filter state & centralize filters to share state rlqs: Persist rate_limit_quota filter state & centralize filters to share state Aug 21, 2025
@tyxia tyxia merged commit f4ae9d9 into envoyproxy:main Aug 21, 2025
24 checks passed
melginaldi pushed a commit to melginaldi/envoy that referenced this pull request Aug 21, 2025
…hare state (envoyproxy#40497)

melginaldi pushed a commit to melginaldi/envoy that referenced this pull request Aug 26, 2025
…hare state (envoyproxy#40497)

wtzhang23 pushed a commit to wtzhang23/envoy that referenced this pull request Aug 27, 2025
…hare state (envoyproxy#40497)

paul-r-gall pushed a commit that referenced this pull request Sep 20, 2025
…Stream (#41053)

Commit Message: The RLQS async stream in `GlobalRateLimitClientImpl`
(`stream_`) doesn't actually own the underlying raw stream ptr. This was
causing a race condition during shutdown, with the cluster-manager's
deferred stream reset+deletion racing against the global client's
deferred deletion. If the deferred global client deletion triggered
first, without resetting the stream, then the cluster-manager would fail
in its own stream reset attempt (the stream's callbacks having been
deleted with the global client). If the global client guarantees stream
reset + deletion, and the cluster manager wins the race, then the global
client's reset + deletion fails with heap-use-after-free.

To get around this race condition, the `GlobalRateLimitClientImpl` can
instead own its `RawAsyncClient` & delete it to guarantee that any of
its active streams are cleaned up.

-------- 
Additional Description: With the owned RawAsyncClient, integration
testing saw a new flake where sometimes the first connection to a fake
upstream failed immediately with an empty-message internal error. This
was addressed by adding `waitForRlqsStream()` to check all fake upstream
connections for new streams, not just the first.

--------
Risk Level:
Testing: Unit & integration. integration_test & filter_persistence_test
run 500 times to check for flakes.
Docs Changes:
Release Notes:
Platform Specific Features:
Fixes ASAN flake from PR #40497

---------

Signed-off-by: Brian Surber <[email protected]>
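
For readers following along, a purely illustrative sketch of the ownership change that commit message describes (the types below are stand-ins, not Envoy's actual gRPC interfaces):

```cpp
#include <memory>

// Stand-in for an RLQS stream; resetting it tears down the raw stream.
struct RlqsStream {
  void reset() { /* tear down the underlying raw stream */ }
};

// Stand-in for an owned async client that resets its streams on destruction.
struct OwnedAsyncClient {
  ~OwnedAsyncClient() {
    if (active_stream_ != nullptr) {
      active_stream_->reset();  // Streams never outlive their client.
    }
  }
  std::unique_ptr<RlqsStream> active_stream_;
};

// With the global client owning the async client, deferred deletion of the
// global client deterministically resets any remaining stream, so there is no
// separate, racing deletion path left for the stream callbacks.
class GlobalRateLimitClientSketch {
public:
  std::unique_ptr<OwnedAsyncClient> async_client_;
};
```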
mbadov pushed a commit to mbadov/envoy that referenced this pull request Sep 22, 2025
…Stream (envoyproxy#41053)

lucaschimweg pushed a commit to lucaschimweg/envoy that referenced this pull request Sep 23, 2025
…Stream (envoyproxy#41053)
