-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Hybrid Search: Adding fixes for schedulingStopWatch start/stop for a race condition #46485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hybrid Search: Adding fixes for schedulingStopWatch start/stop for a race condition #46485
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a race condition in the SchedulingStopwatch class that was causing intermittent IllegalStateException errors during hybrid search operations. The issue occurred when multiple parallel component queries tried to start or stop the same stopwatch simultaneously.
- Made SchedulingStopwatch start() and stop() methods atomic and idempotent using synchronized blocks
- Added early return checks to prevent redundant operations on already started/stopped stopwatches
- Enhanced helper methods startStopWatch() and stopStopWatch() to be idempotent
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
SchedulingStopwatch.java | Added synchronization and idempotent checks to start/stop methods to prevent race conditions |
CHANGELOG.md | Added entry documenting the race condition fix for hybrid search queries |
...-cosmos/src/main/java/com/azure/cosmos/implementation/query/metrics/SchedulingStopwatch.java
Show resolved
Hide resolved
/azp run java - cosmos |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
...-cosmos/src/main/java/com/azure/cosmos/implementation/query/metrics/SchedulingStopwatch.java
Show resolved
Hide resolved
/check-enforcer override |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Description
Intermittent IllegalStateException: “Stopwatch already started / not running” caused by race conditions in SchedulingStopwatch start()/stop() when reactive retries or overlapping emissions occur.
Changes made:
SchedulingStopwatch:
Reason for the Stop Watch Change
Hybrid search breaks one user query into several internal “component” queries (and one extra global statistics query first). Those component queries run in parallel. Each one triggers its own fetch cycle, and retries can also overlap. The stopwatch code assumed a simple pattern: start → (work) → stop, with no overlap. In hybrid search, two overlapping fetch attempts could both try to start (or stop) the same runtime stopwatch at almost the same moment. Because the old code checked isStarted() outside the synchronized block, both threads could slip through and the second thread would hit “Stopwatch already started” (or “not running” on stop). Other query types (non‑hybrid) usually drive a single sequential fetch path per partition, so the timing windows almost never lined up to trigger the race.
Makes SchedulingStopwatch start/stop idempotent and atomic. This is a safe improvement for every query type: it doesn’t change timings or metrics values, it just prevents exceptions if future parallelism increases. Non-hybrid queries continue to behave the same; they simply gain protection against a race they almost never hit.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines