Skip to content

Conversation

jeet1995
Copy link
Member

@jeet1995 jeet1995 commented Aug 21, 2025

Description

This pull request introduces a way to dynamically allow a Cosmos(Async)Client to be per-partition automatic failover capable when the service is per-partition automatic failover capable without the requirement of app-side client restarts.

Approach taken

A callback is passed from RxDocumentClientImpl, which is responsible for:

  • Setting the opt-in flags on:
    • GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker
    • GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover
  • Adding the user agent suffix
  • Determining whether the cross-region availability strategy should be enabled for non-write operations

Invocation Context

This callback is invoked within GlobalEndpointManager, which:

  • Makes calls to the Gateway to retrieve the DatabaseAccount payload
  • Monitors for changes in the enablePerPartitionFailoverBehavior boolean flag

Behavior on Change

If a change in enablePerPartitionFailoverBehavior is detected:

  • The client-side opt-in for failover strategies is updated accordingly. If the transition is from true to false, both PPAF and PPCB are disabled, and any cached failover information is cleared. Cross region availability strategy is disabled.

Other changes

  • PPAF-enforced hedging is wired for the QueryPlan calls.

Testing done

Relevant Integration Tests

  • PerPartitionAutomaticFailoverE2ETests#testPpafWithWriteFailoverWithEligibleErrorStatusCodesWithPpafDynamicEnablement
  • PerPartitionAutomaticFailoverE2ETests#testFailoverBehaviorForNonWriteOperationsWithPpafDynamicEnablement

Summary

Both tests validate the following:

  • Initial State:
    Failover does not occur when enablePerPartitionFailoverBehavior is reported as false.

  • Dynamic Enablement:
    Without restarting or reinitializing the CosmosAsyncClient instance, a DatabaseAccount refresh is triggered.

  • Post-Refresh Behavior:
    After the refresh, enablePerPartitionFailoverBehavior is reported as true.
    At this point:

    • Failover behavior for writes and reads is validated.
    • Hedging behavior for reads is also validated.

Relevant end-to-end tests done

A workload (Query, Read and Create) was run against a non-PPAF enabled account. At roughly 18:21 EST, a complete quorum loss was triggered in North Central US. Post this, the account was enabled with PPAF (at roughly 18:28 was when the CosmosClient instance got the DatabaseAccount payload with enablePerPartitionFailoverBehavior set to true.

2025-09-21 18:28:16 INFO  ClientRetryPolicy:423 - shouldRetryOnServiceUnavailable() Retrying. Received on endpoint https://northcentralus.documents-test.windows-int.net:443/, IsReadRequest = true
2025-09-21 18:28:16 INFO  WorkloadUtils:640 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:16.368229700Z, operationType=query, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=1082, failureCountUntilNow=106, threadId=2, statusCode=200, subStatusCode=0, commaSeparatedContactedRegions=north central us,west us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H50M0.6616921S, latestRecordedSessionToken=null]
2025-09-21 18:28:16 WARN  GlobalEndpointManager:377 - Per partition automatic failover enabled: true, applying modifier
2025-09-21 18:28:16 WARN  RxDocumentClientImpl:7902 - Per-Partition Circuit Breaker is enabled because Per-Partition Automatic Failover is enabled.
2025-09-21 18:28:16 WARN  RxDocumentClientImpl:7439 - As Per-Partition Automatic Failover (PPAF) is enabled a default End-to-End Operation Latency Policy will be applied for read, query, readAll and readyMany operation types.
2025-09-21 18:28:16 WARN  RxDocumentClientImpl:7920 - Per-Partition Automatic Failover (PPAF) enforced E2E Latency Policy for reads is enabled.
2025-09-21 18:28:16 INFO  GlobalEndpointManager:342 - db account retrieved {"_self":"","id":"abhm-test14-strong-3r-01","_rid":"abhm-test14-strong-3r-01.documents-test.windows-int.net","media":"//media/","addresses":"//addresses/","_dbs":"//dbs/","writableLocations":[{"name":"North Central US","databaseAccountEndpoint":"https://.documents-test.windows-int.net:443/"}],"readableLocations":[{"name":"North Central US","databaseAccountEndpoint":"https://abhm-test14-strong-3r-01-northcentralus.documents-test.windows-int.net:443/"},{"name":"West US","databaseAccountEndpoint":"https://.documents-test.windows-int.net:443/"},{"name":"East Asia","databaseAccountEndpoint":"https://abhm-test14-strong-3r-01-eastasia.documents-test.windows-int.net:443/"}],"enableMultipleWriteLocations":false,"continuousBackupEnabled":true,"enableNRegionSynchronousCommit":false,"enablePerPartitionFailoverBehavior":true,"userReplicationPolicy":{"asyncReplication":false,"minReplicaSetSize":2,"maxReplicasetSize":4},"userConsistencyPolicy":{"defaultConsistencyLevel":"Strong"},"systemReplicationPolicy":{"minReplicaSetSize":3,"maxReplicasetSize":4},"readPolicy":{"primaryReadCoefficient":1,"secondaryReadCoefficient":1},"queryEngineConfiguration":"{\"allowNewKeywords\":true,\"maxJoinsPerSqlQuery\":10,\"maxQueryRequestTimeoutFraction\":0.9,\"maxSqlQueryInputLength\":524288,\"maxUdfRefPerSqlQuery\":10,\"queryMaxInMemorySortDocumentCount\":-1000,\"spatialMaxGeometryPointCount\":256,\"sqlAllowNonFiniteNumbers\":false,\"sqlDisableOptimizationFlags\":0,\"sqlQueryILDisableOptimizationFlags\":0,\"clientDisableOptimisticDirectExecution\":false,\"queryEnableFullText\":true,\"queryEnableFullTextPreviewFeatures\":false,\"queryMaxFullTextScoreSearchTerms\":5,\"queryMaxRrfArgumentCount\":100,\"enableSpatialIndexing\":true,\"maxInExpressionItemsCount\":2147483647,\"maxLogicalAndPerSqlQuery\":2147483647,\"maxLogicalOrPerSqlQuery\":2147483647,\"maxSpatialQueryCells\":2147483647,\"sqlAllowAggregateFunctions\":true,\"sqlAllowGroupByClause\":true,\"sqlAllowLike\":true,\"sqlAllowSubQuery\":true,\"sqlAllowScalarSubQuery\":true,\"sqlAllowTop\":true}"}

Recovery logs through PPAF and PPCB

2025-09-21 18:28:19 INFO  WorkloadUtils:391 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:19.646311500Z, operationType=read, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=1282, failureCountUntilNow=99, threadId=1, statusCode=200, subStatusCode=0, commaSeparatedContactedRegions=west us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://a.documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H49M57.3836103S, latestRecordedSessionToken=null]
2025-09-21 18:28:19 INFO  WorkloadUtils:146 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:19.723726700Z, operationType=create, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=361, failureCountUntilNow=1093, threadId=0, statusCode=201, subStatusCode=0, commaSeparatedContactedRegions=west us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://.documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H49M57.3061951S, latestRecordedSessionToken=null]
2025-09-21 18:28:19 INFO  WorkloadUtils:146 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:19.723726700Z, operationType=create, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=359, failureCountUntilNow=1093, threadId=6, statusCode=201, subStatusCode=0, commaSeparatedContactedRegions=west us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://.documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H49M57.3061951S, latestRecordedSessionToken=null]
2025-09-21 18:28:19 INFO  WorkloadUtils:391 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:19.723726700Z, operationType=read, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=1283, failureCountUntilNow=99, threadId=4, statusCode=200, subStatusCode=0, commaSeparatedContactedRegions=west us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://.documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H49M57.3061951S, latestRecordedSessionToken=null]
2025-09-21 18:28:19 INFO  WorkloadUtils:146 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:19.723726700Z, operationType=create, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=360, failureCountUntilNow=1093, threadId=3, statusCode=201, subStatusCode=0, commaSeparatedContactedRegions=west us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://.documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H49M57.3061951S, latestRecordedSessionToken=null]
2025-09-21 18:28:19 INFO  WorkloadUtils:640 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:19.793743600Z, operationType=query, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=1088, failureCountUntilNow=106, threadId=5, statusCode=200, subStatusCode=0, commaSeparatedContactedRegions=north central us,west us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://abhm-test14-strong-3r-01.documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H49M57.2361782S, latestRecordedSessionToken=null]
2025-09-21 18:28:19 INFO  WorkloadUtils:640 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:19.954611700Z, operationType=query, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=1089, failureCountUntilNow=106, threadId=2, statusCode=200, subStatusCode=0, commaSeparatedContactedRegions=east asia,north central us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://abhm-test14-strong-3r-01.documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H49M57.0753101S, latestRecordedSessionToken=null]
2025-09-21 18:28:20 INFO  WorkloadUtils:640 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:20.107600900Z, operationType=query, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=1090, failureCountUntilNow=106, threadId=2, statusCode=200, subStatusCode=0, commaSeparatedContactedRegions=north central us,west us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://.documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H49M56.9223209S, latestRecordedSessionToken=null]
2025-09-21 18:28:20 INFO  WorkloadUtils:391 - RequestResponseInfo [timeOfResponse=2025-09-21T22:28:20.435655100Z, operationType=read, drillId=ppaf-dynamic-enablement-test, successCountUntilNow=1284, failureCountUntilNow=99, threadId=7, statusCode=200, subStatusCode=0, commaSeparatedContactedRegions=west us, cosmosDiagnosticsContext=, errorMessage=, connectionModeAsStr=DIRECT, containerName=TestContainer, accountName=https://.documents-test.windows-int.net:443/, possiblyColdStartClient=false, databaseName=TestDatabase, runTimeRemaining=PT1H49M56.5942667S, latestRecordedSessionToken=null]

Failover transition example

image

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@jeet1995
Copy link
Member Author

jeet1995 commented Sep 8, 2025

/azp run java - cosmos - tests

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995 jeet1995 marked this pull request as ready for review September 8, 2025 17:10
@Copilot Copilot AI review requested due to automatic review settings September 8, 2025 17:10
@jeet1995 jeet1995 requested review from kirankumarkolli and a team as code owners September 8, 2025 17:10
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements dynamic enablement of Per-Partition Automatic Failover (PPAF) in the Azure Cosmos DB Java SDK. The key enhancement allows PPAF to be enabled/disabled at runtime based on service-side configuration, moving away from static client-side configuration.

Key changes include:

  • Enhanced GlobalEndpointManager to monitor and respond to database account configuration changes
  • Added dynamic PPAF configuration capability through a callback mechanism
  • Extended user agent feature flags to include ThinClient and Http2 support
  • Improved thread safety in UserAgentContainer with read-write locks

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
GlobalPartitionEndpointManagerForPerPartitionCircuitBreaker.java Added clear() method and synchronized resetCircuitBreakerConfig
GlobalPartitionEndpointManagerForPerPartitionAutomaticFailover.java Added synchronized resetPerPartitionAutomaticFailoverEnabled with state clearing
UserAgentFeatureFlags.java Added ThinClient and Http2 feature flags
UserAgentContainer.java Enhanced thread safety with read-write locks and improved flag handling
RxGatewayStoreModel.java Added gateway response recording for cancelled requests
RxDocumentClientImpl.java Refactored PPAF initialization to support dynamic enablement
GlobalEndpointManager.java Added dynamic PPAF monitoring and callback mechanism
DiagnosticsClientContext.java Updated diagnostics to track dynamic PPAF state
ReflectionUtils.java Added reflection utilities for testing GlobalEndpointManager owner
PerPartitionAutomaticFailoverE2ETests.java Added comprehensive end-to-end tests for dynamic PPAF enablement

@jeet1995 jeet1995 changed the title [No Review]: Dynamic enablement of Per-Partition Automatic Failover Dynamic enablement of Per-Partition Automatic Failover Sep 8, 2025
@jeet1995
Copy link
Member Author

jeet1995 commented Sep 9, 2025

/azp run java - cosmos - tests

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

@jeet1995
Copy link
Member Author

/azp run java - cosmos - spark

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Copy link
Member Author

/azp run java - cosmos - kafka

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
Copy link

Azure Pipelines successfully started running 1 pipeline(s).


logger.warn("Availability strategy for reads, queries, read all and read many" +
" is enabled when PerPartitionAutomaticFailover is enabled.");
logger.warn("As Per-Partition Automatic Failover (PPAF) is enabled a default End-to-End Operation Latency Policy will be applied for read, query, readAll and readyMany operation types.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this really have to be in warn-level (I know it was before - just asking)?

}
if (this.globalPartitionEndpointManagerForPerPartitionAutomaticFailover.isPerPartitionAutomaticFailoverEnabled()) {
// Override custom config to enabled if PPAF is enabled
logger.warn("Per-Partition Circuit Breaker is enabled because Per-Partition Automatic Failover is enabled.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question - is warning reallly needed?

logger.warn("Per-Partition Circuit Breaker is enabled because Per-Partition Automatic Failover is enabled.");
partitionLevelCircuitBreakerConfig = PartitionLevelCircuitBreakerConfig.fromExplicitArgs(Boolean.TRUE);
} else {
logger.warn("As Per-Partition Automatic Failover is disabled, Per-Partition Circuit Breaker will be enabled or disabled based on client configuration.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here?

);

if (this.ppafEnforcedE2ELatencyPolicyConfigForReads != null) {
logger.warn("Per-Partition Automatic Failover (PPAF) enforced E2E Latency Policy for reads is enabled.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my 2 cents - we have the debated warn level log when rxDocCxlient is initialized already. If we have any other configs we want to log as warning for sure - let's include it in that one log line - that way we avoid the discussions where customers don't like the warning level log for things that are not concerning but just debuggability improvements in one place?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking through - I do want this to stand out since the client will start region fanout. I can combine it into a single log line yet keep it warn.

Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM excdept for the log-level comments.

@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Copy link
Member Author

/azp run java - cosmos - tests

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

private List<CosmosOperationPolicy> operationPolicies;
private final AtomicReference<CosmosAsyncClient> cachedCosmosAsyncClientSnapshot;
private CosmosEndToEndOperationLatencyPolicyConfig ppafEnforcedE2ELatencyPolicyConfigForReads;
private Function<DatabaseAccount, Void> perPartitionFailoverConfigModifier;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consumer should be enough then here?

}

if (!Configs.isHttp2Enabled()) {
userAgentFeatureFlags.remove(UserAgentFeatureFlags.Http2);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we only remove the http2 flag if both conditions match:

  • !Configs.ishttp2Enabled
  • !Http2ConnectionConfig.isEnabled

final AtomicBoolean isQueryCancelledOnTimeout) {

// reevaluate e2e policy config on cosmosQueryRequestOptions
if (options != null) {
Copy link
Member

@xinlian12 xinlian12 Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change for queryPlan? because there is a e2e evaluation around line 1355

private volatile boolean isClosed;
private volatile DatabaseAccount latestDatabaseAccount;
private final AtomicBoolean hasThinClientReadLocations = new AtomicBoolean(false);
private final AtomicBoolean lastRecordedPerPartitionAutomaticFailoverEnabledOnClient = new AtomicBoolean(Configs.isPerPartitionAutomaticFailoverEnabled().equalsIgnoreCase("true"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: if we already wire up with service for the ppaf flag, why do we still need to check the flag from Configs?

}

private boolean hasPerPartitionAutomaticFailoverConfigChanged(DatabaseAccount databaseAccount) {
Boolean currentPerPartitionAutomaticFailoverEnabledFromService = databaseAccount.isPerPartitionFailoverBehaviorEnabled();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method can be replaced with this.lastRecordedPerPartitionAutomaticFailoverEnabledOnClient.compareAndSet?

public DiagnosticsClientConfig withIsPerPartitionAutomaticFailoverEnabled(boolean isPpafEnabled) {

if (isPpafEnabled) {
this.isPerPartitionAutomaticFailoverEnabledAsString = "true";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this.isPerPartitionAutomaticFailoverEnabledAsString = isPpafEnabled.toString()?

&& !sessionCapturingOverrideEnabled);
this.sessionContainer.setDisableSessionCapturing(updatedDisableSessionCapturing);
this.initializePerPartitionFailover(databaseAccountSnapshot);
this.addUserAgentSuffix(this.userAgentContainer, EnumSet.allOf(UserAgentFeatureFlags.class));
Copy link
Member

@xinlian12 xinlian12 Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ - for the first few requests - the ppaf feature flag here maybe not be accurate. But I guess it should be fine as it indicates this is from client initialization

public void init() {
if (this.consecutiveExceptionBasedCircuitBreaker.isPartitionLevelCircuitBreakerEnabled()) {
this.updateStaleLocationInfo().subscribeOn(this.partitionRecoveryScheduler).subscribe();
if (this.consecutiveExceptionBasedCircuitBreaker.isPartitionLevelCircuitBreakerEnabled() && !this.isPartitionRecoveryTaskRunning.get()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: do we want to stop the background task if ppcb disabled?

Copy link
Member

@xinlian12 xinlian12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants