Description
Describe the bug
CachedSupplier.maxStaleFailureJitter() computes exponential backoff as (1L << numFailures - 1) * 100, which overflows long when
numFailures reaches 58: 2^57 * 100 is about 1.44e19, above Long.MAX_VALUE (about 9.22e18). The overflowed product is negative, so it
bypasses the ComparableUtils.minimum(..., Duration.ofSeconds(10)) cap (a negative duration compares as less than 10s). This negative
duration flows into jitterTime(), which ends up setting the cached value's stale time millions of years in the future. From then on,
the SDK never attempts to refresh credentials again for the lifetime of the process.
In practice, after 58 consecutive failures the credentials never recover automatically, and the system has to be restarted.
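The arithmetic can be checked in isolation, without the SDK. The following is a minimal sketch, not SDK code: the class name BackoffOverflowDemo and the helper rawBackoffMillis are hypothetical, and the ternary is only a stand-in for ComparableUtils.minimum. It evaluates the same expression for a few failure counts and applies a 10-second cap.

```java
import java.time.Duration;

// Hypothetical demo class; it only mirrors the expression quoted in this report.
final class BackoffOverflowDemo {

    // 100 ms doubled for every consecutive failure, with no overflow protection.
    static long rawBackoffMillis(int numFailures) {
        return (1L << (numFailures - 1)) * 100;
    }

    // Stand-in for ComparableUtils.minimum(backoff, 10s): returns the smaller duration.
    static Duration capped(long millis) {
        Duration backoff = Duration.ofMillis(millis);
        Duration cap = Duration.ofSeconds(10);
        return backoff.compareTo(cap) < 0 ? backoff : cap;
    }

    public static void main(String[] args) {
        for (int numFailures : new int[] {1, 8, 57, 58}) {
            long millis = rawBackoffMillis(numFailures);
            System.out.printf("failures=%d raw=%d ms capped=%s%n",
                              numFailures, millis, capped(millis));
        }
    }
}
```

For 57 failures the raw value is still positive (7205759403792793600 ms) and is capped to 10 seconds. For 58 failures the product wraps to -4035225266123964416 ms, and because that is smaller than the cap, the negative duration is what gets returned.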
Regression Issue
- Select this option if this issue appears to be a regression.
Expected Behavior
The 10-second jitter cap should always be respected regardless of the consecutive failure count. After a transient IMDS outage
resolves, credential refresh should resume normally.
Current Behavior
At consecutive failure 58, the stale time is set to a date millions of years in the future, permanently preventing credential refresh:
...
2026-03-18 04:45:47,085 [WARN] CachedSupplier: (InstanceProfileCredentialsProvider()) Cached value expiration has been extended to 2026-03-18T04:45:56.399919661Z because calling the downstream service failed (consecutive failures: 57).
2026-03-18 04:45:58,105 [WARN] CachedSupplier: (InstanceProfileCredentialsProvider()) Cached value expiration has been extended to +75191085-06-13T12:51:54.930203792Z because calling the downstream service failed (consecutive failures: 58).
Reproduction Steps
import software.amazon.awssdk.utils.cache.CachedSupplier;
import software.amazon.awssdk.utils.cache.RefreshResult;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicBoolean;

public class CachedSupplierOverflowRepro {
    public static void main(String[] args) {
        // Supplier that always fails after the initial seed value
        AtomicBoolean first = new AtomicBoolean(true);
        CachedSupplier<String> supplier = CachedSupplier.builder(
                () -> {
                    if (first.compareAndSet(true, false)) {
                        // Return an already-stale value to trigger the failure path
                        return RefreshResult.builder("initial")
                                            .staleTime(Instant.now().minusSeconds(1))
                                            .build();
                    }
                    throw new RuntimeException("Simulated IMDS failure");
                })
                .staleValueBehavior(CachedSupplier.StaleValueBehavior.ALLOW)
                .cachedValueName("ReproTest")
                .build();

        // First call seeds the cache with a stale value
        supplier.get();

        // Subsequent calls trigger failures with an incrementing counter
        // After 58 iterations the overflow occurs
        for (int i = 0; i < 60; i++) {
            supplier.get();
            // Observe log output: at failure 58 the expiration jumps to year 75M+
        }
    }
}
Possible Solution
This changes maxStaleFailureJitter to return only positive durations, since we never want to wait for a negative duration:
private Duration maxStaleFailureJitter(int numFailures) {
    long exponentialBackoffMillis = (1L << (numFailures - 1)) * 100;
    // Zero or negative means the multiplication overflowed; treat it as "very large" so the cap applies
    if (exponentialBackoffMillis <= 0) {
        exponentialBackoffMillis = Long.MAX_VALUE;
    }
    return ComparableUtils.minimum(Duration.ofMillis(exponentialBackoffMillis), Duration.ofSeconds(10));
}
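One caveat with the sign check above: Java masks the shift distance of a long to its low six bits, so at numFailures = 65 the expression 1L << 64 evaluates to 1 and the backoff silently drops back to 100 ms. That result is positive and within the cap, so it is not harmful, but it is probably not intended. A possible alternative, sketched below under the assumption that the 100 ms base and 10 s cap stay as they are, is to clamp the shift instead of inspecting the sign. The class ClampedBackoffSketch and its constants are hypothetical names, and the ternary again stands in for ComparableUtils.minimum.

```java
import java.time.Duration;

// Hypothetical sketch, not the SDK's implementation.
final class ClampedBackoffSketch {
    private static final long BASE_MILLIS = 100;
    private static final Duration MAX_JITTER = Duration.ofSeconds(10);
    // 100 ms << 7 = 12,800 ms already exceeds the 10 s cap, so larger shifts add nothing.
    private static final int MAX_SHIFT = 7;

    // Clamping the exponent keeps the multiplication far away from overflow.
    static Duration maxStaleFailureJitter(int numFailures) {
        int shift = Math.max(0, Math.min(numFailures - 1, MAX_SHIFT));
        Duration backoff = Duration.ofMillis(BASE_MILLIS << shift);
        return backoff.compareTo(MAX_JITTER) < 0 ? backoff : MAX_JITTER;
    }

    // The sign-check variant from this report, reproduced for comparison.
    static Duration signCheckedJitter(int numFailures) {
        long millis = (1L << (numFailures - 1)) * 100;
        if (millis <= 0) {
            millis = Long.MAX_VALUE;
        }
        Duration backoff = Duration.ofMillis(millis);
        return backoff.compareTo(MAX_JITTER) < 0 ? backoff : MAX_JITTER;
    }

    public static void main(String[] args) {
        for (int numFailures : new int[] {1, 8, 58, 65}) {
            System.out.printf("failures=%d clamped=%s signChecked=%s%n", numFailures,
                              maxStaleFailureJitter(numFailures), signCheckedJitter(numFailures));
        }
    }
}
```

Both variants return 10 seconds at 58 failures. They differ at 65, where the clamped version still returns 10 seconds and the sign-check version returns 100 ms.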
Additional Information/Context
No response
AWS Java SDK version used
All versions (bug exists since CachedSupplier was introduced; verified on current master)
JDK version used
openjdk version "11.0.15" 2022-04-19
Operating System and version
Ubuntu 20.04.4 LTS (Focal Fossa)