
chore(ci): Improve reliability of retries in TracingE2ET #2018


Merged
merged 8 commits into main from phipag/fix-tracing-e2e-timeouts on Aug 7, 2025

Conversation

phipag
Contributor

@phipag phipag commented Aug 6, 2025

Summary

This PR addresses two reliability issues of the TracingE2E tests which were failing occasionally:

  1. The search horizon was too narrow. When the trace started at the end of minute x and some subsegments were populated in minute y, they were not found. I added a one-minute padding to the search horizon.
  2. After finding trace ids for the search, we immediately attempted to fetch the (sub-)segments of that trace without accounting for the fact that they are populated asynchronously with a delay. I added retries for the second query for the sub-segments as well (see the sketch below). See comment Maintenance: Fix TracingE2E test to avoid occasional timeouts #1846 (comment)

This PR also adds more debug logs and more useful logging statements to make future debugging easier.
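A minimal sketch of the resulting structure, reusing the RetryUtils and DataNotReadyException helpers shown later in this thread. The traceFetcher, fetchTraceIds, fetchTrace, Trace, and getSubsegments names are hypothetical stand-ins for the project's actual X-Ray fetcher, and the invocation start/end timestamps are assumed to be java.time.Instant values:

        // Sketch only: pad the search horizon by one minute on each side so that
        // segments indexed in the minute after the trace started are still found.
        Instant searchStart = invocationResult.getStart().minus(1, ChronoUnit.MINUTES);
        Instant searchEnd = invocationResult.getEnd().plus(1, ChronoUnit.MINUTES);

        // First retry loop: wait until X-Ray returns trace ids for the padded window.
        List<String> traceIds = RetryUtils.withRetry(() -> {
            List<String> ids = traceFetcher.fetchTraceIds(searchStart, searchEnd);
            if (ids.isEmpty()) {
                throw new DataNotReadyException("No trace ids found yet");
            }
            return ids;
        }, "traceIdRetry", DataNotReadyException.class).get();

        // Second retry loop: sub-segments are populated asynchronously after the trace id
        // becomes visible, so keep polling until they are present.
        Trace trace = RetryUtils.withRetry(() -> {
            Trace t = traceFetcher.fetchTrace(traceIds.get(0));
            if (t.getSubsegments().isEmpty()) {
                throw new DataNotReadyException("Sub-segments not populated yet");
            }
            return t;
        }, "subsegmentRetry", DataNotReadyException.class).get();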

Changes

Issue number: #1846


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Disclaimer: We value your time and bandwidth. As such, any pull requests created on non-triaged issues might not be successful.

@phipag phipag self-assigned this Aug 6, 2025
@phipag
Contributor Author

phipag commented Aug 7, 2025

I added another retry loop to the flaky MetricsE2ET, which now seems to pass reliably. I will run some more E2E tests sequentially now.

The issue was that the metricsFetcher.fetchMetrics call completed as soon as the metric was found and did not wait long enough for all datapoints to be populated in CloudWatch. In this case we expect 2 datapoints, and sometimes the call returned after finding only the first one.

        // Retry until CloudWatch has published all expected datapoints for the "orders" metric;
        // fetchMetrics returns as soon as the metric exists, which may be before the data is complete.
        List<Double> orderMetrics = RetryUtils.withRetry(() -> {
            List<Double> metrics = metricsFetcher.fetchMetrics(invocationResult.getStart(), invocationResult.getEnd(),
                    60, NAMESPACE, "orders", Collections.singletonMap("Environment", "test"));
            if (metrics.get(0) != 2.0) {
                throw new DataNotReadyException("Expected 2.0 orders but got " + metrics.get(0));
            }
            return metrics;
        }, "orderMetricsRetry", DataNotReadyException.class).get();

@phipag
Contributor Author

phipag commented Aug 7, 2025

The last 6 consecutive runs of E2E tests succeeded. It looks like we have resolved all flaky tests with appropriate retry logic for now.

@phipag phipag requested a review from dreamorosi August 7, 2025 15:30
Contributor

@dreamorosi dreamorosi left a comment


Great work with these tests!

@phipag
Contributor Author

phipag commented Aug 7, 2025

> Great work with these tests!

Thanks. This is the third time now that I think they are fixed. Let's see if the tests prove me wrong in the next couple of weeks 😁

@phipag phipag merged commit 2b2e96f into main Aug 7, 2025
49 checks passed
@phipag phipag deleted the phipag/fix-tracing-e2e-timeouts branch August 7, 2025 16:14
@github-project-automation github-project-automation bot moved this from Pending review to Coming soon in Powertools for AWS Lambda (Java) Aug 7, 2025
Development

Successfully merging this pull request may close these issues.

Maintenance: Fix TracingE2E test to avoid occasional timeouts