-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Dynamic enablement of Per-Partition Automatic Failover #46477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…afDynamicEnablement # Conflicts: # sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/PerPartitionAutomaticFailoverE2ETests.java # sdk/cosmos/azure-cosmos/CHANGELOG.md
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
…afDynamicEnablement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements dynamic enablement of Per-Partition Automatic Failover (PPAF) in the Azure Cosmos DB Java SDK. The key enhancement allows PPAF to be enabled/disabled at runtime based on service-side configuration, moving away from static client-side configuration.
Key changes include:
- Enhanced GlobalEndpointManager to monitor and respond to database account configuration changes
- Added dynamic PPAF configuration capability through a callback mechanism
- Extended user agent feature flags to include ThinClient and Http2 support
- Improved thread safety in UserAgentContainer with read-write locks
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
File | Description |
---|---|
GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker.java |
Added clear() method and synchronized resetCircuitBreakerConfig |
GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java |
Added synchronized resetPerPartitionAutomaticFailoverEnabled with state clearing |
UserAgentFeatureFlags.java |
Added ThinClient and Http2 feature flags |
UserAgentContainer.java |
Enhanced thread safety with read-write locks and improved flag handling |
RxGatewayStoreModel.java |
Added gateway response recording for cancelled requests |
RxDocumentClientImpl.java |
Refactored PPAF initialization to support dynamic enablement |
GlobalEndpointManager.java |
Added dynamic PPAF monitoring and callback mechanism |
DiagnosticsClientContext.java |
Updated diagnostics to track dynamic PPAF state |
ReflectionUtils.java |
Added reflection utilities for testing GlobalEndpointManager owner |
PerPartitionAutomaticFailoverE2ETests.java |
Added comprehensive end-to-end tests for dynamic PPAF enablement |
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/UserAgentContainer.java
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/UserAgentContainer.java
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RxDocumentClientImpl.java
Show resolved
Hide resolved
...cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/GlobalEndpointManager.java
Outdated
Show resolved
Hide resolved
...azure-cosmos-tests/src/test/java/com/azure/cosmos/PerPartitionAutomaticFailoverE2ETests.java
Show resolved
Hide resolved
…afDynamicEnablement
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
…afDynamicEnablement
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
...os-tests/src/test/java/com/azure/cosmos/FaultInjectionWithAvailabilityStrategyTestsBase.java
Show resolved
Hide resolved
/azp run java - cosmos - tests |
/azp run java - cosmos - spark |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run java - cosmos - kafka |
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
Azure Pipelines successfully started running 1 pipeline(s). |
|
||
logger.warn("Availability strategy for reads, queries, read all and read many" + | ||
" is enabled when PerPartitionAutomaticFailover is enabled."); | ||
logger.warn("As Per-Partition Automatic Failover (PPAF) is enabled a default End-to-End Operation Latency Policy will be applied for read, query, readAll and readyMany operation types."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this really have to be in warn-level (I know it was before - just asking)?
} | ||
if (this.globalPartitionEndpointManagerForPerPartitionAutomaticFailover.isPerPartitionAutomaticFailoverEnabled()) { | ||
// Override custom config to enabled if PPAF is enabled | ||
logger.warn("Per-Partition Circuit Breaker is enabled because Per-Partition Automatic Failover is enabled."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question - is warning reallly needed?
logger.warn("Per-Partition Circuit Breaker is enabled because Per-Partition Automatic Failover is enabled."); | ||
partitionLevelCircuitBreakerConfig = PartitionLevelCircuitBreakerConfig.fromExplicitArgs(Boolean.TRUE); | ||
} else { | ||
logger.warn("As Per-Partition Automatic Failover is disabled, Per-Partition Circuit Breaker will be enabled or disabled based on client configuration."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here?
); | ||
|
||
if (this.ppafEnforcedE2ELatencyPolicyConfigForReads != null) { | ||
logger.warn("Per-Partition Automatic Failover (PPAF) enforced E2E Latency Policy for reads is enabled."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my 2 cents - we have the debated warn level log when rxDocCxlient is initialized already. If we have any other configs we want to log as warning for sure - let's include it in that one log line - that way we avoid the discussions where customers don't like the warning level log for things that are not concerning but just debuggability improvements in one place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking through - I do want this to stand out since the client will start region fanout. I can combine it into a single log line yet keep it warn
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM excdept for the log-level comments.
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
private List<CosmosOperationPolicy> operationPolicies; | ||
private final AtomicReference<CosmosAsyncClient> cachedCosmosAsyncClientSnapshot; | ||
private CosmosEndToEndOperationLatencyPolicyConfig ppafEnforcedE2ELatencyPolicyConfigForReads; | ||
private Function<DatabaseAccount, Void> perPartitionFailoverConfigModifier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consumer should be enough then here?
} | ||
|
||
if (!Configs.isHttp2Enabled()) { | ||
userAgentFeatureFlags.remove(UserAgentFeatureFlags.Http2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we only remove the http2 flag if both conditions match:
- !Configs.ishttp2Enabled
- !Http2ConnectionConfig.isEnabled
final AtomicBoolean isQueryCancelledOnTimeout) { | ||
|
||
// reevaluate e2e policy config on cosmosQueryRequestOptions | ||
if (options != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change for queryPlan? because there is a e2e evaluation around line 1355
private volatile boolean isClosed; | ||
private volatile DatabaseAccount latestDatabaseAccount; | ||
private final AtomicBoolean hasThinClientReadLocations = new AtomicBoolean(false); | ||
private final AtomicBoolean lastRecordedPerPartitionAutomaticFailoverEnabledOnClient = new AtomicBoolean(Configs.isPerPartitionAutomaticFailoverEnabled().equalsIgnoreCase("true")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: if we already wire up with service for the ppaf flag, why do we still need to check the flag from Configs?
} | ||
|
||
private boolean hasPerPartitionAutomaticFailoverConfigChanged(DatabaseAccount databaseAccount) { | ||
Boolean currentPerPartitionAutomaticFailoverEnabledFromService = databaseAccount.isPerPartitionFailoverBehaviorEnabled(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method can be replaced with this.lastRecordedPerPartitionAutomaticFailoverEnabledOnClient.compareAndSet
?
public DiagnosticsClientConfig withIsPerPartitionAutomaticFailoverEnabled(boolean isPpafEnabled) { | ||
|
||
if (isPpafEnabled) { | ||
this.isPerPartitionAutomaticFailoverEnabledAsString = "true"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this.isPerPartitionAutomaticFailoverEnabledAsString = isPpafEnabled.toString()?
&& !sessionCapturingOverrideEnabled); | ||
this.sessionContainer.setDisableSessionCapturing(updatedDisableSessionCapturing); | ||
this.initializePerPartitionFailover(databaseAccountSnapshot); | ||
this.addUserAgentSuffix(this.userAgentContainer, EnumSet.allOf(UserAgentFeatureFlags.class)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ - for the first few requests - the ppaf feature flag here maybe not be accurate. But I guess it should be fine as it indicates this is from client initialization
public void init() { | ||
if (this.consecutiveExceptionBasedCircuitBreaker.isPartitionLevelCircuitBreakerEnabled()) { | ||
this.updateStaleLocationInfo().subscribeOn(this.partitionRecoveryScheduler).subscribe(); | ||
if (this.consecutiveExceptionBasedCircuitBreaker.isPartitionLevelCircuitBreakerEnabled() && !this.isPartitionRecoveryTaskRunning.get()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: do we want to stop the background task if ppcb disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks
Description
This pull request introduces a way to dynamically allow a
Cosmos(Async)Client
to be per-partition automatic failover capable when the service is per-partition automatic failover capable without the requirement of app-side client restarts.Approach taken
A callback is passed from
RxDocumentClientImpl
, which is responsible for:GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker
GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover
Invocation Context
This callback is invoked within
GlobalEndpointManager
, which:DatabaseAccount
payloadenablePerPartitionFailoverBehavior
boolean flagBehavior on Change
If a change in
enablePerPartitionFailoverBehavior
is detected:true
tofalse
, both PPAF and PPCB are disabled, and any cached failover information is cleared. Cross region availability strategy is disabled.Other changes
QueryPlan
calls.Testing done
Relevant Integration Tests
PerPartitionAutomaticFailoverE2ETests#testPpafWithWriteFailoverWithEligibleErrorStatusCodesWithPpafDynamicEnablement
PerPartitionAutomaticFailoverE2ETests#testFailoverBehaviorForNonWriteOperationsWithPpafDynamicEnablement
Summary
Both tests validate the following:
Initial State:
Failover does not occur when
enablePerPartitionFailoverBehavior
is reported asfalse
.Dynamic Enablement:
Without restarting or reinitializing the
CosmosAsyncClient
instance, aDatabaseAccount
refresh is triggered.Post-Refresh Behavior:
After the refresh,
enablePerPartitionFailoverBehavior
is reported astrue
.At this point:
Relevant end-to-end tests done
A workload (
Query
,Read
andCreate
) was run against a non-PPAF enabled account. At roughly 18:21 EST, a complete quorum loss was triggered inNorth Central US
. Post this, the account was enabled with PPAF (at roughly 18:28 was when theCosmosClient
instance got theDatabaseAccount
payload withenablePerPartitionFailoverBehavior
set totrue
.Recovery logs through PPAF and PPCB
Failover transition example
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines