rlqs: Persist rate_limit_quota filter state & centralize filters to share state #40497
Conversation
…nts & bucket caches based on their configured RLQS server destination & domain. Signed-off-by: Brian Surber <[email protected]>
@sergiitk can review the implementation to ensure it matches the design consensus.
/retest
/assign sergiitk
bsurber is not allowed to assign users.
@sergiitk could you please take a first pass? Thanks!
@bsurber failures look real.
Ah, I see the leaked mocks complaint.
Signed-off-by: Brian Surber <[email protected]>
It's complaining about a MockTimer getting left around. If the TestUsingSimulatedTime framework creates Timers as MockTimers, that'd make sense.
I'm experimenting with removing the garbage collection timer & instead keeping a weak_ptr in the global map. The TLS Store will then be set to remove its index from the global map when all the owner filter factories have stopped using it. |
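A minimal sketch of that shape, assuming a plain string key and illustrative type names rather than the filter's actual code: the filter factories hold the `shared_ptr`, the static map only holds a `weak_ptr`, and the index is cleaned up once the last owner goes away.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Placeholder for the per-config global state (RLQS client + TLS slots).
struct TlsStore {};

using TlsStoreMap = std::unordered_map<std::string, std::weak_ptr<TlsStore>>;

// Assumed to run on the main thread only, matching the note that map
// mutations are not otherwise thread-safe.
std::shared_ptr<TlsStore> getOrCreateTlsStore(TlsStoreMap& map, const std::string& key) {
  if (std::shared_ptr<TlsStore> existing = map[key].lock()) {
    return existing; // Another filter factory already owns a live store.
  }
  // The custom deleter stands in for the TlsStore-destructor cleanup described
  // above: when the last owning filter factory releases its shared_ptr, the
  // store is destroyed and its index is erased from the global map.
  std::shared_ptr<TlsStore> store(new TlsStore(), [&map, key](TlsStore* ptr) {
    delete ptr;
    map.erase(key);
  });
  map[key] = store;
  return store;
}
```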
…ly saving weak_ptrs to the map & setting TlsStore destructor logic to handle cleanup. Signed-off-by: Brian Surber <[email protected]>
Ah, client_test is hitting a strict mock's complaints about the new resetStream() call during shutdown.
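For context, the usual gmock-level shape of that failure and fix looks roughly like the sketch below (`MockRlqsStream` is a hypothetical stand-in, not the real test's mock): a `StrictMock` flags any unexpected call, so the `resetStream()` now issued at shutdown needs an explicit expectation.

```cpp
#include "gmock/gmock.h"
#include "gtest/gtest.h"

// Hypothetical stand-in for the mocked RLQS async stream used by client_test.
class MockRlqsStream {
public:
  MOCK_METHOD(void, resetStream, ());
};

TEST(ClientShutdownSketch, ExpectsResetDuringTeardown) {
  testing::StrictMock<MockRlqsStream> stream;
  // Allow (but don't require) the resetStream() issued while the client shuts
  // down, so the strict mock no longer reports it as an unexpected call.
  EXPECT_CALL(stream, resetStream()).Times(testing::AtMost(1));
}
```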
…s any ongoing stream when tearing down at the end of tests. Signed-off-by: Brian Surber <[email protected]>
/coverage
Coverage for this Pull Request will be rendered here: https://storage.googleapis.com/envoy-cncf-pr/40497/coverage/index.html
For comparison, current coverage on main: https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html
The coverage results are (re-)rendered each time the CI coverage job completes.
…erage changes Signed-off-by: Brian Surber <[email protected]>
…n untriggerable failure. Signed-off-by: Brian Surber <[email protected]>
Signed-off-by: Brian Surber <[email protected]>
sergiitk left a comment:
LGTM
```yaml
bucket_matchers:
  matcher_list:
    matchers:
      # Assign requests with header['env'] set to 'staging' to the bucket { name: 'staging' }
```
This comment doesn't seem relevant for this case - should it be updated to indicate the filter contains an invalid matcher (because of the type mismatch)?
tyxia left a comment:
Thanks!
…hare state (envoyproxy#40497)
Commit Message: Persist RLQS client state & bucket assignment+usage caches across filter config updates (e.g. via LDS). This is done by aggregating rate_limit_quota filters to share RLQS global clients & bucket caches based on their configured RLQS server destination & domain, instead of creating a new global client + cache per filter factory.
Now, the TlsStores (global client + tls slots) are referenced via weak_ptrs in a static map & owned by filter factories. If all filter factories drop and stop owning a TlsStore, its shared_ptr will trigger destruction of the global resources & index.
This persistence has a positive side-effect of centralizing usage reporting + assignment generation from the RLQS server's perspective, while still allowing for separation between filter configs via the domain field if needed, e.g. if 2 filter chains send traffic to different upstreams & want separate rate limit assignments for each.
Additional Description:
- The design for the global, static map came from the registration model for filter factory cbs (//envoy/registry/registry.h, FactoryRegistry).
- Updates to the global map (removals or additions of indices) are not thread-safe, and so are always done from the main thread.
- Additions occur during filter factory creation, if needed, which is handled by the main thread (e.g. during startup or when triggered by an LDS update).
- A garbage collection timer (every 10s) on the main thread handles removal of any map indices that are no longer referenced by any active filter factories.
- Map indexing does not account for differences in gRPC client configurations (excluding RLQS server destination), such as differences in timeouts. The global RLQS client will be created according to the first-seen configuration.
Risk Level: Minimal to moderate (thread-safety warrants scrutiny but changes are to a WIP filter)
Testing: integration & manual testing (config changes + filter replacements shown to not interrupt rate limiting).
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional [API Considerations](https://github.com/envoyproxy/envoy/blob/main/api/review_checklist.md):]
Signed-off-by: Brian Surber <[email protected]>
Signed-off-by: Melissa Ginaldi <[email protected]>
…Stream (#41053)
Commit Message: The RLQS async stream in `GlobalRateLimitClientImpl` (`stream_`) doesn't actually own the underlying raw stream ptr. This was causing a race condition during shutdown, with the cluster-manager's deferred stream reset+deletion racing against the global client's deferred deletion. If the deferred global client deletion triggered first, without resetting the stream, then the cluster-manager would fail in its own stream reset attempt (the stream's callbacks having been deleted with the global client). If the global client guarantees stream reset + deletion, and the cluster manager wins the race, then the global client's reset + deletion fails with heap-use-after-free.
To get around this race condition, the `GlobalRateLimitClientImpl` can instead own its `RawAsyncClient` & delete it to guarantee that any of its active streams are cleaned up.
Additional Description: With the owned RawAsyncClient, integration testing saw a new flake where sometimes the first connection to a fake upstream failed immediately with an empty-message internal error. This was addressed by adding `waitForRlqsStream()` to check all fake upstream connections for new streams, not just the first.
Risk Level:
Testing: Unit & integration. integration_test & filter_persistence_test run 500 times to check for flakes.
Docs Changes:
Release Notes:
Platform Specific Features:
Fixes ASAN flake from PR #40497
Signed-off-by: Brian Surber <[email protected]>
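As a rough, simplified illustration of that ownership change (hypothetical types, not Envoy's real interfaces): the owned client member is declared first so it is destroyed last, after the client has already reset its non-owning stream handle.

```cpp
#include <memory>

struct FakeRawAsyncClient {};  // stands in for the now-owned gRPC client
struct FakeAsyncStream {
  void resetStream() {}        // stands in for resetting the active stream
};

class GlobalClientSketch {
public:
  ~GlobalClientSketch() {
    // Reset the non-owning stream handle first; the owned client is destroyed
    // afterwards (reverse declaration order), so no other component can race
    // to reset a stream whose callbacks are already gone.
    if (stream_ != nullptr) {
      stream_->resetStream();
      stream_ = nullptr;
    }
  }

private:
  // Declared first => destroyed last during ~GlobalClientSketch().
  std::unique_ptr<FakeRawAsyncClient> async_client_{std::make_unique<FakeRawAsyncClient>()};
  FakeAsyncStream* stream_ = nullptr;  // non-owning handle to the active RLQS stream
};
```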
Commit Message:
Persist RLQS client state & bucket assignment+usage caches across filter config updates (e.g. via LDS). This is done by aggregating rate_limit_quota filters to share RLQS global clients & bucket caches based on their configured RLQS server destination & domain, instead of creating a new global client + cache per filter factory.
Now, the TlsStores (global client + tls slots) are referenced via weak_ptrs in a static map & owned by filter factories. If all filter factories drop and stop owning a TlsStore, its shared_ptr will trigger destruction of the global resources & index.
This persistence has a positive side-effect of centralizing usage reporting + assignment generation from the RLQS server's perspective, while still allowing for separation between filter configs via the domain field if needed, e.g. if 2 filter chains send traffic to different upstreams & want separate rate limit assignments for each.
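As a small illustration of the sharing key described above (illustrative only; the real key construction may differ): two filter configs pointing at the same RLQS server and domain resolve to the same store, while a different domain yields a separate client + cache.

```cpp
#include <string>

// Build the index used to look up a shared TlsStore. Everything else about the
// gRPC client config (timeouts, etc.) is intentionally ignored, matching the
// first-seen-configuration behavior noted in the commit message.
std::string tlsStoreKey(const std::string& rlqs_server_target, const std::string& domain) {
  return rlqs_server_target + "|" + domain;
}
```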
Additional Description:
Risk Level: Minimal to moderate (thread-safety warrants scrutiny but changes are to a WIP filter)
Testing: integration & manual testing (config changes + filter replacements shown to not interrupt rate limiting).
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]