@valeriy42 (Contributor) commented Oct 21, 2025

Fixes an issue where calendar event updates failed to reach some jobs when a calendar was associated with a large number of jobs (>1000), caused by queue capacity limits and sequential processing.

Problem: UpdateJobProcessNotifier has a 1000-item queue and processes updates sequentially. It uses offer() on the queue, which silently drops updates when the queue is full.
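To illustrate the failure mode, here is a minimal standalone sketch of how `offer()` behaves on a bounded queue. It is not the `UpdateJobProcessNotifier` code: the class name `QueueOfferDemo`, the helper `countDropped`, and the choice of `LinkedBlockingQueue` are assumptions made for the example. Only the 1000-item capacity and the use of `offer()` come from the description above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class, not part of Elasticsearch.
class QueueOfferDemo {

    /**
     * Offers `updates` items to a bounded queue that nobody drains
     * and returns how many of them were rejected.
     */
    static int countDropped(int capacity, int updates) {
        // Assumption: a bounded LinkedBlockingQueue stands in for the notifier's queue.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(capacity);
        int dropped = 0;
        for (int i = 0; i < updates; i++) {
            // offer() neither blocks nor throws when the queue is full;
            // it only returns false, so an unchecked result loses the update.
            if (queue.offer("job-" + i) == false) {
                dropped++;
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // 1500 updates against a 1000-item queue: 500 are rejected.
        System.out.println("dropped = " + countDropped(1000, 1500));
    }
}
```

With a consumer draining the queue the number of rejected items would be lower, but any burst larger than the remaining capacity is still lost unless the return value of `offer()` is checked.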

However, calendar/filter updates don't need ordering guarantees. Hence, JobManager.submitJobEventUpdate() can bypass the queue and avoid the bottleneck of the queue size.

Another problem is the "fire-and-forget" pattern: submitJobEventUpdate() returns immediately without waiting for the updates to complete. I introduce a RefCountingListener to track the calendar updates. A background thread updates the jobs and counts succeeded, failed, and skipped jobs, while the API request still returns immediately to prevent a timeout.
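The tracking idea can be sketched with JDK classes only. This is not the code in this PR and it does not use Elasticsearch's RefCountingListener; `ParallelUpdateSketch`, `JobUpdater`, `Outcome`, `Summary`, `updateAll`, and `runDemo` are hypothetical names, and a plain `AtomicInteger` stands in for the reference count. The caller returns as soon as the tasks are submitted, and the completion callback fires once, after the last job has reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch, not part of Elasticsearch.
class ParallelUpdateSketch {

    enum Outcome { SUCCEEDED, FAILED, SKIPPED }

    /** Hypothetical stand-in for a single per-job process update. */
    interface JobUpdater {
        Outcome update(String jobId) throws Exception;
    }

    static final class Summary {
        final int succeeded, failed, skipped;
        Summary(int succeeded, int failed, int skipped) {
            this.succeeded = succeeded;
            this.failed = failed;
            this.skipped = skipped;
        }
    }

    /** Submits one task per job and returns immediately; onDone runs after the last task finishes. */
    static void updateAll(List<String> jobIds, JobUpdater updater, ExecutorService executor, Consumer<Summary> onDone) {
        AtomicInteger succeeded = new AtomicInteger();
        AtomicInteger failed = new AtomicInteger();
        AtomicInteger skipped = new AtomicInteger();
        // One reference per job, plus one held by this method so that onDone
        // cannot fire before every task has been submitted.
        AtomicInteger refs = new AtomicInteger(jobIds.size() + 1);
        Runnable release = () -> {
            if (refs.decrementAndGet() == 0) {
                onDone.accept(new Summary(succeeded.get(), failed.get(), skipped.get()));
            }
        };
        for (String jobId : jobIds) {
            executor.execute(() -> {
                try {
                    Outcome outcome = updater.update(jobId);
                    if (outcome == Outcome.SUCCEEDED) {
                        succeeded.incrementAndGet();
                    } else if (outcome == Outcome.SKIPPED) {
                        skipped.incrementAndGet();
                    } else {
                        failed.incrementAndGet();
                    }
                } catch (Exception e) {
                    failed.incrementAndGet(); // an exception counts as a failed update
                } finally {
                    release.run(); // always release this job's reference
                }
            });
        }
        release.run(); // drop the reference held by this method
    }

    /** Runs the sketch with simulated outcomes and waits for the summary (demo only). */
    static Summary runDemo(int jobCount) {
        List<String> jobIds = new ArrayList<>();
        for (int i = 0; i < jobCount; i++) {
            jobIds.add("job-" + i);
        }
        ExecutorService executor = Executors.newFixedThreadPool(8);
        try {
            CompletableFuture<Summary> done = new CompletableFuture<>();
            updateAll(jobIds, jobId -> {
                int n = Integer.parseInt(jobId.substring("job-".length()));
                if (n % 100 == 0) return Outcome.SKIPPED;                        // simulated skipped job
                if (n % 100 == 1) throw new IllegalStateException("simulated"); // simulated failure
                return Outcome.SUCCEEDED;
            }, executor, done::complete);
            return done.join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        Summary s = runDemo(2000);
        System.out.println("succeeded=" + s.succeeded + " failed=" + s.failed + " skipped=" + s.skipped);
    }
}
```

Because no bounded queue sits between the caller and the workers in this sketch, 2000 jobs produce 2000 reported outcomes; `runDemo` blocks on the summary only so the example can print it, whereas a request handler would return right after `updateAll`.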

Finally, in case job updates still fail, I enhanced the logging throughout the system to leave a trace for future diagnostics.

  • Refactor JobManager.submitJobEventUpdate() to bypass UpdateJobProcessNotifier queue
  • Use RefCountingListener for parallel calendar/filter updates
  • Add comprehensive logging throughout the system
  • Create CalendarScalabilityIT integration tests
  • Add helper methods to base test class

Fixes #129777

@valeriy42 added the >bug, v9.3.0, auto-backport, v8.19.6, v9.1.6, v9.2.1, v8.19.7, v9.1.7, and :ml labels and removed the v9.1.6 and v8.19.6 labels on Oct 21, 2025
@elasticsearchmachine (Collaborator)

Hi @valeriy42, I've created a changelog YAML for you.

valeriy42 and others added 10 commits October 21, 2025 17:17
…g to API calls and processing job updates asynchronously in the background.
…e handling in JobManager to include skipped updates. Update logging to reflect skipped updates during background calendar processing.
…hods and updating job creation visibility. Enhance ScheduledEventsIT to verify asynchronous calendar updates and add a plugin for tracking UpdateProcessAction calls.
…the updated logging package. This change improves consistency and aligns with recent codebase updates.
@valeriy42 marked this pull request as ready for review October 22, 2025 13:54
@elasticsearchmachine added the Team:ML label Oct 22, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@davidkyle self-requested a review October 22, 2025 14:18
Labels

auto-backport, >bug, :ml, Team:ML, v8.19.7, v9.1.7, v9.2.1, v9.3.0

Development

Successfully merging this pull request may close these issues.

[ML] Update calendar events for anomaly detection jobs fails sometimes