br/pkg/restore/snap_client: stabilize flaky TestConcurrency by flaky-claw · Pull Request #67887 · pingcap/tidb

flaky-claw · 2026-04-18T20:33:03Z

What problem does this PR solve?

Issue Number: close #66975

Problem Summary:
Flaky test TestConcurrency in br/pkg/restore/snap_client intermittently fails, so this PR stabilizes that path.

What changed and how does it work?

Root Cause

TEST_ISSUE: TestConcurrency depended on a failpoint-based barrier that is not guaranteed on the required test surface, so the concurrency fence was never engaged.

Fix

Replace the failpoint gate with a test-local interceptor at the storage copy boundary. The test keeps the same concurrency assertion but no longer depends on failpoints being active on the execution surface, which is what caused the failures.

Verification

Spec:

target: br/pkg/restore/snap_client :: TestConcurrency
strategy: tidb.go_flaky.default
plan mode: BASELINE_ONLY
requirements: required case must execute; no skip; repeat count = 1
baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate

Observed result:

status: passed
required case executed: yes
submission decision: ALLOWED
scope debt present: yes

Gate checklist:

Required flaky gate: PASS
Build safety gate: PASS
Intent guard gate: PASS
Repo-wide advisory gate: SKIPPED
Feedback specific gate: SKIPPED

Commands:

go test -json ./br/pkg/restore/snap_client -run '^TestConcurrency$' -count=1
go test -json ./br/pkg/restore/snap_client -count=1
make build

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Fixes #66975

Summary by CodeRabbit

Tests
- Improved internal test infrastructure and reliability for snapshot operations.

pantheon-ai · 2026-04-18T20:33:09Z

@flaky-claw I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

tiprow · 2026-04-18T20:33:23Z

Hi @flaky-claw. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-04-18T20:33:27Z

📝 Walkthrough

Walkthrough

The PR stabilizes a flaky test by replacing failpoint-based synchronization with a storage-layer interceptor wrapper that captures CopyFrom calls, shifting the synchronization point from the snap-client put-sst operation to the underlying storage copy operation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test Synchronization Refactoring `br/pkg/restore/snap_client/pitr_collector_test.go` | Removed failpoint synchronization around snap-client operations and introduced `copyInterceptorStorage` test wrapper that intercepts `CopyFrom` calls to provide more reliable test synchronization, reducing flakiness in `TestConcurrency`. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

br: stabilize TestConcurrency in pitr_collector_test #66962 — Also modifies br/pkg/restore/snap_client/pitr_collector_test.go to stabilize TestConcurrency, addressing the same flaky test through different synchronization mechanisms.

Suggested labels

component/br, component/test, size/S, ok-to-test, approved, lgtm

Suggested reviewers

YuJuncen
D3Hunter
Leavrth

Poem

🐰 Beneath the storage layer deep,
A copier makes promises to keep,
No failpoints needed, just a spy—
Now tests won't flake before they fly! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: stabilizing a flaky test in br/pkg/restore/snap_client by fixing TestConcurrency. |
| Description check | ✅ Passed | The description includes issue number (close `#66975`), problem summary, detailed explanation of root cause and fix, verification steps with test commands and gate results, and completed checklist with unit test marked. |
| Linked Issues check | ✅ Passed | The PR directly addresses issue `#66975` by identifying and fixing the root cause of the flaky TestConcurrency test through replacing failpoint-based barriers with a test-local copy-boundary interceptor. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to stabilizing TestConcurrency: removing failpoint usage, adding a test wrapper interceptor, and updating task storage configuration for the test. |


coderabbitai

🧹 Nitpick comments (1)

br/pkg/restore/snap_client/pitr_collector_test.go (1)
311-367: Optional: tighten the cleanup ordering.

t.Cleanup at Line 334 already calls closeFence() + wg.Wait(), so the explicit closeFence(); wg.Wait() at Lines 354-355 is redundant (safe due to sync.Once, just noise). If you want to keep the fence release explicit for readability, consider dropping the duplicate wg.Wait() or removing the cleanup's wg.Wait() to avoid doing it twice. Non-blocking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@br/pkg/restore/snap_client/pitr_collector_test.go` around lines 311 - 367, In
TestConcurrency remove the duplicated wait by either (A) keeping the explicit
closeFence(); wg.Wait() before collecting results and remove wg.Wait() from the
t.Cleanup closure, or (B) keep wg.Wait() in t.Cleanup and simply call
closeFence() (no wg.Wait()) in the main test body; update the t.Cleanup closure
or the main teardown accordingly so wg.Wait() is invoked exactly once
(references: TestConcurrency, closeFence, wg, t.Cleanup).


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a1aa3e2c-a487-47c7-99cc-ae212027d7c9

📥 Commits

Reviewing files that changed from the base of the PR and between ce92298 and a706449.

📒 Files selected for processing (1)

br/pkg/restore/snap_client/pitr_collector_test.go

codecov · 2026-04-18T20:51:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.6380%. Comparing base (e3f45e4) to head (a706449).
⚠️ Report is 39 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #67887        +/-   ##
================================================
+ Coverage   77.7969%   78.6380%   +0.8410%     
================================================
  Files          1983       1984         +1     
  Lines        548948     549125       +177     
================================================
+ Hits         427065     431821      +4756     
+ Misses       120962     116285      -4677     
- Partials        921       1019        +98

| Flag | Coverage Δ | |
|---|---|---|
| integration | `44.2880% <ø> (+4.4908%)` | ⬆️ |
| unit | `76.6562% <ø> (+0.3066%)` | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ | |
|---|---|---|
| dumpling | `61.5065% <ø> (ø)` | |
| parser | `∅ <ø> (∅)` | |
| br | `65.9884% <ø> (+2.8768%)` | ⬆️ |


yinsustart · 2026-04-21T05:28:58Z

/retest

tiprow · 2026-04-21T05:29:21Z

@yinsustart: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

YuJuncen

Without t.Parallel(), I'm not sure whether there is actually a race on failpoints... Anyway, this isn't harmful.

YuJuncen · 2026-04-23T04:14:40Z

+}
+
+func (s *copyInterceptorStorage) CopyFrom(ctx context.Context, from storeapi.Storage, spec storeapi.CopySpec) error {
+	if s.onCopy != nil {


Remove this.

ti-chi-bot · 2026-04-24T02:01:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Leavrth, YuJuncen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~br/OWNERS~~ [Leavrth,YuJuncen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-04-24T02:01:11Z

[LGTM Timeline notifier]

Timeline:

2026-04-23 04:14:56.487710307 +0000 UTC m=+2225701.693070364: ☑️ agreed by YuJuncen.
2026-04-24 02:01:10.515955017 +0000 UTC m=+2304075.721315073: ☑️ agreed by Leavrth.

fix: stabilize flaky issue pingcap#66975

a706449

ti-chi-bot Bot added the do-not-merge/needs-triage-completed and release-note-none (Denotes a PR that doesn't merit a release note.) labels Apr 18, 2026

ti-chi-bot Bot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) label Apr 18, 2026

coderabbitai Bot reviewed Apr 18, 2026

yinsustart requested review from GMHDBJD and YuJuncen April 22, 2026 12:07

YuJuncen approved these changes Apr 23, 2026

ti-chi-bot Bot added the approved and needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.) labels Apr 23, 2026

Leavrth approved these changes Apr 24, 2026

ti-chi-bot Bot added the lgtm label and removed the needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.) label Apr 24, 2026


4 participants
