
@shijiesheng commented Nov 7, 2025

What changed?

Cross-DC calls to the frontend rely on DNSChooser (rather than the direct chooser used for standard frontend calls). A user reported a stale peer list in the frontend service after a Disaster Recovery test in DNS. They pointed out that the error came from the domain replication processor, which uses the cross-DC client. The reported error:

{"level":"error","ts":"2025-10-27T15:17:00.285Z","msg":"Failed to get replication tasks","service":"cadence-worker","component":"replicator","component":"replication-task-processor","xdc-source-cluster":"primary","error":"code:unavailable message:\"round-robin\" peer list has 2 peers but none are responsive, timed out waiting for a connection to open (fail-fast is not enabled): context deadline exceeded","logging-call-at":"domain_replication_processor.go:171","stacktrace":"github.com/uber/cadence/service/worker/replicator.(*domainReplicationProcessor).fetchDomainReplicationTasks\n\t/cadence/service/worker/replicator/domain_replication_processor.go:171\ngithub.com/uber/cadence/service/worker/replicator.(*domainReplicationProcessor).processorLoop\n\t/cadence/service/worker/replicator/domain_replication_processor.go:136
  • Do not update dnsUpdater's current peer list if updating the transport's peer list fails.
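The change in the bullet above can be sketched as follows. This is a minimal, hypothetical model and not the actual Cadence code. The names `peerListUpdater`, `dnsUpdater`, `diffPeers`, `refresh` and the `flakyTransport` test double are all illustrative, and the addresses are placeholders. The updater diffs its current peer set against the freshly resolved DNS answer and pushes that diff to the transport. It records the resolved set as current only when the transport accepts the update.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// peerListUpdater is a hypothetical stand-in for the transport-side peer list
// fed by the DNS-based chooser; it is not the real Cadence/YARPC interface.
type peerListUpdater interface {
	Update(added, removed []string) error
}

// dnsUpdater remembers the peer set it last pushed successfully to the transport.
type dnsUpdater struct {
	current   map[string]struct{}
	transport peerListUpdater
}

// diffPeers returns the addresses to add and to remove to go from current to resolved.
func diffPeers(current map[string]struct{}, resolved []string) (added, removed []string) {
	next := make(map[string]struct{}, len(resolved))
	for _, addr := range resolved {
		if _, seen := next[addr]; seen {
			continue // ignore duplicate DNS answers
		}
		next[addr] = struct{}{}
		if _, ok := current[addr]; !ok {
			added = append(added, addr)
		}
	}
	for addr := range current {
		if _, ok := next[addr]; !ok {
			removed = append(removed, addr)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

// refresh applies the diff to the transport and records the resolved set as
// current only if the transport accepted it. After a failure, current is left
// as-is, so the next refresh computes the same diff and tries again.
func (u *dnsUpdater) refresh(resolved []string) error {
	added, removed := diffPeers(u.current, resolved)
	if len(added) == 0 && len(removed) == 0 {
		return nil // no change since the last successful update
	}
	if err := u.transport.Update(added, removed); err != nil {
		return fmt.Errorf("transport peer list update failed: %w", err)
	}
	u.current = setOf(resolved...)
	return nil
}

// flakyTransport is a test double that rejects its first `failures` updates.
type flakyTransport struct {
	failures int
	peers    map[string]struct{}
}

func (t *flakyTransport) Update(added, removed []string) error {
	if t.failures > 0 {
		t.failures--
		return errors.New("simulated transport error")
	}
	for _, a := range added {
		t.peers[a] = struct{}{}
	}
	for _, r := range removed {
		delete(t.peers, r)
	}
	return nil
}

// setOf builds a set from a list of addresses.
func setOf(addrs ...string) map[string]struct{} {
	s := make(map[string]struct{}, len(addrs))
	for _, a := range addrs {
		s[a] = struct{}{}
	}
	return s
}

func main() {
	// Placeholder addresses: peers before and after the DNS failover.
	oldPeers := []string{"10.0.0.1:7833", "10.0.0.2:7833"}
	newPeers := []string{"10.1.0.1:7833", "10.1.0.2:7833"}
	tr := &flakyTransport{failures: 1, peers: setOf(oldPeers...)}
	u := &dnsUpdater{current: setOf(oldPeers...), transport: tr}

	err := u.refresh(newPeers) // rejected: current stays on the old peers
	fmt.Println(err, u.current)
	err = u.refresh(newPeers) // the same diff is retried and now succeeds
	fmt.Println(err, tr.peers)
}
```

Leaving `current` untouched on failure makes the retry implicit. No separate retry queue is needed, because the next DNS refresh recomputes the same non-empty diff.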

Why?

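If the transport fails to apply the update, dnsUpdater must not record the new peer list as its current list. Otherwise dnsUpdater and the transport disagree about the peer list, and the transport can be left holding stale peers.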
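The toy simulation below shows why committing the peer list before the transport accepts it is harmful. It rests on one assumption: the updater skips the transport whenever the resolved set equals its current set. Peer sets are reduced to single strings, and `simulate` is a hypothetical helper, not Cadence code. With an unconditional commit, a single transport failure makes every later refresh look like a no-op, so the transport keeps the old peers. With commit-on-success, the next refresh retries and the transport converges.

```go
package main

import (
	"errors"
	"fmt"
)

// simulate runs `ticks` DNS refreshes that all resolve to the same new peer set,
// with a transport that rejects only its first update. If commitOnError is true,
// the updater records the new set as current even when the transport rejected it.
// It reports whether the transport ended up holding the new peers.
func simulate(commitOnError bool, ticks int) bool {
	const oldPeers, newPeers = "old", "new"
	current, transport := oldPeers, oldPeers
	failuresLeft := 1
	apply := func(p string) error {
		if failuresLeft > 0 {
			failuresLeft--
			return errors.New("simulated transport error")
		}
		transport = p
		return nil
	}
	for i := 0; i < ticks; i++ {
		resolved := newPeers
		if resolved == current {
			continue // assumed: no diff means the transport is not touched
		}
		if err := apply(resolved); err == nil || commitOnError {
			current = resolved
		}
	}
	return transport == newPeers
}

func main() {
	fmt.Println("commit even on error:  ", simulate(true, 5))
	fmt.Println("commit only on success:", simulate(false, 5))
}
```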

How did you test it?

Unit Test

Potential risks

Release notes

Documentation Changes

Signed-off-by: Shijie Sheng <[email protected]>