
chore(ci): Improve reliability of retries in TracingE2ET #2018


Merged
merged 8 commits into main from phipag/fix-tracing-e2e-timeouts on Aug 7, 2025

Conversation

phipag
Contributor

@phipag phipag commented Aug 6, 2025

Summary

This PR addresses two reliability issues of the TracingE2E tests which were failing occasionally:

  1. The search horizon was too narrow. When the trace started at the end of minute x and some subsegments were populated in minute y, they were not found. I added a one-minute padding to the search horizon.
  2. After finding trace ids for the search, we immediately attempted to fetch the (sub-)segments of that trace without accounting for the fact that they are populated asynchronously with a delay. I added retries for the second query for the sub-segments as well (see the sketch below). See comment Maintenance: Fix TracingE2E test to avoid occasional timeouts #1846 (comment)

This PR also adds more debug logs and more useful logging statements to make future debugging easier.
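A minimal sketch of the resulting structure, reusing the RetryUtils and DataNotReadyException helpers shown later in this thread. The traceFetcher, fetchTraceIds, fetchTrace, Trace, and getSubsegments names are hypothetical stand-ins for the project's actual X-Ray fetcher, and the invocation start/end timestamps are assumed to be java.time.Instant values:

        // Sketch only: pad the search horizon by one minute on each side so that
        // segments indexed in the minute after the trace started are still found.
        Instant searchStart = invocationResult.getStart().minus(1, ChronoUnit.MINUTES);
        Instant searchEnd = invocationResult.getEnd().plus(1, ChronoUnit.MINUTES);

        // First retry loop: wait until X-Ray returns trace ids for the padded window.
        List<String> traceIds = RetryUtils.withRetry(() -> {
            List<String> ids = traceFetcher.fetchTraceIds(searchStart, searchEnd);
            if (ids.isEmpty()) {
                throw new DataNotReadyException("No trace ids found yet");
            }
            return ids;
        }, "traceIdRetry", DataNotReadyException.class).get();

        // Second retry loop: sub-segments are populated asynchronously after the trace id
        // becomes visible, so keep polling until they are present.
        Trace trace = RetryUtils.withRetry(() -> {
            Trace t = traceFetcher.fetchTrace(traceIds.get(0));
            if (t.getSubsegments().isEmpty()) {
                throw new DataNotReadyException("Sub-segments not populated yet");
            }
            return t;
        }, "subsegmentRetry", DataNotReadyException.class).get();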

Changes

Issue number: #1846


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Disclaimer: We value your time and bandwidth. As such, any pull requests created on non-triaged issues might not be successful.

@phipag phipag self-assigned this Aug 6, 2025
@phipag
Contributor Author

phipag commented Aug 7, 2025

I added another retry loop to the flaky MetricsE2ET, which now seems to pass reliably. I will run some more E2E tests sequentially now.

The issue was that the metricsFetcher.fetchMetrics call completed as soon as the metric was found and did not wait long enough for all datapoints to be populated in CloudWatch. In this case we expect 2 datapoints, and sometimes the call returned after finding only the first one.

        // Retry until CloudWatch has published all expected datapoints for the "orders" metric;
        // fetchMetrics returns as soon as the metric exists, which may be before the data is complete.
        List<Double> orderMetrics = RetryUtils.withRetry(() -> {
            List<Double> metrics = metricsFetcher.fetchMetrics(invocationResult.getStart(), invocationResult.getEnd(),
                    60, NAMESPACE, "orders", Collections.singletonMap("Environment", "test"));
            if (metrics.get(0) != 2.0) {
                throw new DataNotReadyException("Expected 2.0 orders but got " + metrics.get(0));
            }
            return metrics;
        }, "orderMetricsRetry", DataNotReadyException.class).get();

@phipag
Contributor Author

phipag commented Aug 7, 2025

The last 6 consecutive runs of E2E tests succeeded. It looks like we have resolved all flaky tests with appropriate retry logic for now.

@phipag phipag requested a review from dreamorosi August 7, 2025 15:30
Contributor

@dreamorosi dreamorosi left a comment


Great work with these tests!

@phipag
Contributor Author

phipag commented Aug 7, 2025

> Great work with these tests!

Thanks. This is the third time now that I think they are fixed. Let's see if the tests prove me wrong in the next couple of weeks 😁

@phipag phipag merged commit 2b2e96f into main Aug 7, 2025
49 checks passed
@phipag phipag deleted the phipag/fix-tracing-e2e-timeouts branch August 7, 2025 16:14
@github-project-automation github-project-automation bot moved this from Pending review to Coming soon in Powertools for AWS Lambda (Java) Aug 7, 2025
Development

Successfully merging this pull request may close these issues.

Maintenance: Fix TracingE2E test to avoid occasional timeouts