
aayush3011 (Member)

Description

Intermittent IllegalStateException (“Stopwatch already started” / “not running”) caused by a race condition in SchedulingStopwatch start()/stop() when reactive retries or overlapping emissions occur.

Changes made:

SchedulingStopwatch:

  • start() and stop() made atomic and idempotent (synchronized on runTimeStopwatch; early return if the stopwatch is already in the desired state).
  • Helper start/stop methods (startStopWatch() and stopStopWatch()) made idempotent to prevent IllegalStateException from concurrent or redundant calls.
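The atomic, idempotent start()/stop() pattern listed above can be sketched roughly as follows. This is a minimal illustration, not the actual SDK code: only the field name runTimeStopwatch comes from this PR. StrictStopwatch is a hypothetical stand-in that, like the stopwatch in the error messages, throws IllegalStateException on a redundant start() or stop(); its accumulate-across-cycles behavior and the class/method names are assumptions.

```java
import java.util.concurrent.TimeUnit;

public class IdempotentStopwatchSketch {

    // Hypothetical strict stopwatch: throws on a redundant start() or stop(),
    // and accumulates elapsed time across start/stop cycles. Not thread-safe itself.
    static final class StrictStopwatch {
        private boolean running;
        private long startNanos;
        private long accumulatedNanos;

        void start() {
            if (running) {
                throw new IllegalStateException("Stopwatch already started");
            }
            running = true;
            startNanos = System.nanoTime();
        }

        void stop() {
            if (!running) {
                throw new IllegalStateException("Stopwatch is not running");
            }
            accumulatedNanos += System.nanoTime() - startNanos;
            running = false;
        }

        boolean isStarted() {
            return running;
        }

        long elapsedNanos() {
            return running ? accumulatedNanos + (System.nanoTime() - startNanos) : accumulatedNanos;
        }
    }

    // Field name taken from the PR description; every access below locks on it.
    private final StrictStopwatch runTimeStopwatch = new StrictStopwatch();

    // Atomic: the isStarted() check and the start() happen under one lock.
    // Idempotent: calling start() on a running stopwatch is a no-op.
    public void start() {
        synchronized (runTimeStopwatch) {
            if (runTimeStopwatch.isStarted()) {
                return; // already in the desired state
            }
            runTimeStopwatch.start();
        }
    }

    // Same pattern for stop(): stopping a stopped stopwatch is a no-op.
    public void stop() {
        synchronized (runTimeStopwatch) {
            if (!runTimeStopwatch.isStarted()) {
                return;
            }
            runTimeStopwatch.stop();
        }
    }

    public boolean isRunning() {
        synchronized (runTimeStopwatch) {
            return runTimeStopwatch.isStarted();
        }
    }

    public long elapsedMillis() {
        synchronized (runTimeStopwatch) {
            return TimeUnit.NANOSECONDS.toMillis(runTimeStopwatch.elapsedNanos());
        }
    }

    public static void main(String[] args) {
        IdempotentStopwatchSketch watch = new IdempotentStopwatchSketch();
        watch.start();
        watch.start(); // redundant: returns early instead of throwing
        watch.stop();
        watch.stop();  // redundant: returns early instead of throwing
        System.out.println("running after stops: " + watch.isRunning()); // false
    }
}
```

Holding the same monitor for both the isStarted() check and the state change means another caller cannot slip in between them; the early return is what turns a redundant call into a no-op rather than an exception.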

Reason for the Stopwatch Change

Hybrid search breaks one user query into several internal “component” queries, preceded by one extra global statistics query. The component queries run in parallel; each triggers its own fetch cycle, and retries can also overlap. The stopwatch code assumed a simple, non-overlapping pattern: start → (work) → stop. In hybrid search, two overlapping fetch attempts could try to start (or stop) the same runtime stopwatch at almost the same moment. Because the old code checked isStarted() outside the synchronized block, both threads could pass the check, and the second one would then hit “Stopwatch already started” (or “not running” on stop). Other (non‑hybrid) query types usually drive a single sequential fetch path per partition, so the timing windows almost never lined up to trigger the race.
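The check-then-act race described above can be reproduced deterministically. The sketch below uses hypothetical classes (StrictStopwatch, IdempotentStarter), not the SDK's own. In the racy version, a CountDownLatch forces two threads past an isStarted() check made outside the lock that guards start(), so exactly one of them throws. The fixed version performs the check and the start under one monitor, and 32 threads released at once never throw.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class CheckThenActRaceDemo {
    // Hypothetical stopwatch: each method is thread-safe on its own, but start() is strict.
    static final class StrictStopwatch {
        private boolean running;
        synchronized boolean isStarted() { return running; }
        synchronized void start() {
            if (running) {
                throw new IllegalStateException("Stopwatch already started");
            }
            running = true;
        }
    }

    // Fixed pattern: the check and the start happen while holding the same monitor.
    static final class IdempotentStarter {
        private final StrictStopwatch inner = new StrictStopwatch();
        void start() {
            synchronized (inner) {
                if (inner.isStarted()) {
                    return; // another caller already started it; nothing to do
                }
                inner.start();
            }
        }
    }

    private static void joinAll(Thread... threads) {
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
    }

    // Racy pattern: the latch forces both threads past the unlocked check before either starts.
    public static String runForcedInterleaving() {
        StrictStopwatch watch = new StrictStopwatch();
        CountDownLatch bothChecked = new CountDownLatch(2);
        AtomicReference<String> failure = new AtomicReference<>("none");
        Runnable racyStart = () -> {
            if (!watch.isStarted()) {
                bothChecked.countDown();
                try {
                    bothChecked.await();
                    watch.start();
                } catch (IllegalStateException e) {
                    failure.set(e.getMessage());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        Thread a = new Thread(racyStart), b = new Thread(racyStart);
        a.start(); b.start();
        joinAll(a, b);
        return failure.get();
    }

    // Releases many threads at once against the fixed pattern and counts exceptions.
    public static int countFailuresWithFix(int threadCount) {
        IdempotentStarter watch = new IdempotentStarter();
        CountDownLatch go = new CountDownLatch(1);
        AtomicInteger failures = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> {
                try {
                    go.await();
                    watch.start();
                } catch (IllegalStateException e) {
                    failures.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        go.countDown();
        joinAll(threads);
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("racy pattern failure: " + runForcedInterleaving());   // Stopwatch already started
        System.out.println("fixed pattern failures: " + countFailuresWithFix(32)); // 0
    }
}
```

In the first scenario both threads observe isStarted() == false, which is exactly the window the description refers to. Moving the check inside the same synchronized block closes that window, so the second result is 0 regardless of thread scheduling.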

This change makes SchedulingStopwatch start/stop atomic and idempotent. It is a safe improvement for every query type: it doesn’t change timings or metric values; it only prevents these exceptions, including if parallelism increases in the future. Non-hybrid queries behave as before; they simply gain protection against a race they almost never hit.

All SDK Contribution checklist:

  • The pull request does not introduce breaking changes.
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings, August 22, 2025 18:31
aayush3011 requested review from kirankumarkolli and a team as code owners, August 22, 2025 18:31
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR fixes a race condition in the SchedulingStopwatch class that was causing intermittent IllegalStateException errors during hybrid search operations. The issue occurred when multiple parallel component queries tried to start or stop the same stopwatch simultaneously.

  • Made SchedulingStopwatch start() and stop() methods atomic and idempotent using synchronized blocks
  • Added early return checks to prevent redundant operations on already started/stopped stopwatches
  • Enhanced helper methods startStopWatch() and stopStopWatch() to be idempotent

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Reviewed files:
  • SchedulingStopwatch.java: added synchronization and idempotent checks to start/stop methods to prevent race conditions
  • CHANGELOG.md: added entry documenting the race condition fix for hybrid search queries

aayush3011 (Member, Author)

/azp run java - cosmos

Azure Pipelines successfully started running 1 pipeline(s).

aayush3011 (Member, Author)

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

aayush3011 (Member, Author)

/check-enforcer override

FabianMeiswinkel (Member) left a comment

LGTM

jeet1995 merged commit be82227 into Azure:main on Sep 4, 2025
30 checks passed