Skip to content

fix(es): add delay in TestPasswordFromFile to prevent flakiness#8150

Open
Br1an67 wants to merge 1 commit intojaegertracing:mainfrom
Br1an67:fix/issue-4743-close-es-client
Open

fix(es): add delay in TestPasswordFromFile to prevent flakiness#8150
Br1an67 wants to merge 1 commit intojaegertracing:mainfrom
Br1an67:fix/issue-4743-close-es-client

Conversation

@Br1an67
Copy link

@Br1an67 Br1an67 commented Mar 8, 2026

Fixes #4743

Summary

The TestPasswordFromFile test was flaky because the Elasticsearch client has background health check goroutines that may continue running briefly after Close() is called. When the password changes and a new client is created, the old client is closed but its health check goroutine might still make requests with the old password for a brief moment after the client swap.

Changes

  • Added a 100ms delay after the client changes in the test to allow the old client's background goroutines to fully shut down before proceeding with the new password verification

Testing

  • Ran the test 5 times in a row without any failures:
    go test -v -run TestPasswordFromFile ./internal/storage/v1/elasticsearch/... -count=5
    
  • All 5 runs passed successfully

Root Cause

The olivere/elastic client library's Stop() method signals background goroutines to stop but does not wait for them to complete. This can cause a race condition where:

  1. Password file changes
  2. New client is created with new password
  3. Old client is swapped out and Close() is called
  4. Test writes span2 and verifies new password
  5. Old client's health check goroutine (still running) makes a request with old password

The delay gives the old client's background operations time to fully terminate before the test proceeds.

The ES client has background health check goroutines that may continue
running briefly after Close() is called. When the password changes and
a new client is created, the old client is closed but its health check
may still make requests with the old password.

This adds a small delay after the client changes to allow the old
client's background goroutines to fully shut down before the test
proceeds to verify the new password.

Fixes jaegertracing#4743

Signed-off-by: root <root@C20251020184286.local>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses flakiness in TestPasswordFromFile caused by background Elasticsearch client goroutines continuing briefly after Close() during password rotation, leading to occasional requests with the old credentials.

Changes:

  • Adds a short delay after detecting the client swap in runPasswordFromFileTest to reduce race-related flakiness.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +431 to +436
// Give the old client time to fully shut down its background goroutines
// (e.g., health checks) after being closed in onClientPasswordChange.
// This prevents flakiness where the old client might still make requests
// with the old password after the client has been swapped.
time.Sleep(100 * time.Millisecond)

Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a fixed time.Sleep here makes the test slower and can still be flaky on slow/loaded CI (the healthcheck goroutine may take longer than 100ms to stop). Since this test only validates auth headers on write requests, consider disabling background health checks in the test configuration (e.g., cfg.DisableHealthCheck = true) or replacing the sleep with a deterministic wait condition tied to observed requests (e.g., wait until no further requests with the old password are seen for a short window).

Suggested change
// Give the old client time to fully shut down its background goroutines
// (e.g., health checks) after being closed in onClientPasswordChange.
// This prevents flakiness where the old client might still make requests
// with the old password after the client has been swapped.
time.Sleep(100 * time.Millisecond)

Copilot uses AI. Check for mistakes.
// (e.g., health checks) after being closed in onClientPasswordChange.
// This prevents flakiness where the old client might still make requests
// with the old password after the client has been swapped.
time.Sleep(100 * time.Millisecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeouts do not improve flakiness, the solution needs to be deterministic.

@Br1an67 Br1an67 force-pushed the fix/issue-4743-close-es-client branch from b4cae47 to 5393205 Compare March 18, 2026 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[test] Flaky test TestPasswordFromFile in plugin/storage/es/factory_test.go

3 participants