localGroupConcurrency excess restore corrupts retry_count #759

Open
bcomnes wants to merge 3 commits into timgit:master from bcomnes:restore-jobs-started-on

Conversation

@bcomnes bcomnes commented Apr 9, 2026

I spotted another bug when working on #756

When localConcurrency exceeds localGroupConcurrency (the typical setup), multiple workers fetch jobs concurrently. Because the fetch query sets all fetched rows to active in the database before the application layer checks local concurrency limits, some workers will end up with jobs they cannot process. These excess jobs are restored to created state via restoreJobs. The problem is that restoreJobs only resets state and leaves started_on set from the initial activation.

The fetch UPDATE in fetchNextJob determines whether to increment retry_count based on the pre-update value of started_on:

```sql
retry_count = CASE WHEN started_on IS NOT NULL THEN retry_count + 1 ELSE retry_count END
```

The intent is to track genuine retries: if started_on is already set when a job is being activated, it must have been run before. But after an excess restore, started_on is still set even though the handler never ran. The next time the job is fetched, started_on IS NOT NULL triggers the increment and the job silently burns a retry credit it never used.
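The broken cycle can be sketched in a few lines of plain JavaScript. This is an illustrative in-memory model, not the actual pg-boss code (the real logic is SQL inside `fetchNextJob` and `restoreJobs`), but it shows how one excess restore is enough to make the next fetch fire the increment:

```javascript
// Mirrors the fetch UPDATE: increment retry_count only when started_on is set.
function fetchJob (job) {
  if (job.startedOn !== null) job.retryCount++
  job.state = 'active'
  job.startedOn = Date.now()
}

// Current (buggy) restoreJobs behavior: only the state is reset.
function restoreJobBuggy (job) {
  job.state = 'created' // startedOn survives the restore
}

const job = { state: 'created', startedOn: null, retryCount: 0 }

fetchJob(job)        // activation 1: startedOn was null, no increment
restoreJobBuggy(job) // excess restore: the handler never ran
fetchJob(job)        // activation 2: startedOn is set, increment fires

console.log(job.retryCount) // 1, even though the handler never executed
```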

How it manifests

This is difficult to notice because there is no error, no log line and nothing on the error event. Jobs simply move to failed state sooner than their retryLimit suggests they should. A job configured with retryLimit: 2 could exhaust its entire retry budget before its handler runs even once if it gets excess-restored enough times.

The problem is proportional to the gap between localConcurrency and localGroupConcurrency. The more workers competing for the same group, the more excess restores happen per cycle and the faster retry budgets are consumed. The same job configuration will appear to behave differently under light versus heavy load, with no obvious correlation to anything in the application logs.

Groups with many queued jobs are the most affected because the excess restore cycle runs continuously for any group where active jobs are at the local limit. In a multi-tenant system this means a busy tenant can have all of their queued jobs cycling through fetch, restore and re-fetch simultaneously, burning retry credits across the board.

The failing test

The test uses batchSize: 2 with a single worker and localGroupConcurrency: 1 to make the excess path trigger deterministically. With two group jobs in the queue, the single fetch grabs both and sets both to active. #trackLocalGroupStart allows the first and restores the second. The test then waits for both jobs to complete (their handlers always succeed) and queries the database to assert that both have retry_count = 0. With the bug present, the restored job has retry_count = 1 despite having completed successfully on its first real execution.
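The scenario the test sets up can be modeled the same way (illustrative names, not pg-boss internals): one fetch of `batchSize: 2` activates both group jobs, the group limit of 1 admits only the first, and the second is excess-restored by the unpatched `restoreJobs`:

```javascript
// Fetch activates a job and conditionally increments retry_count,
// mirroring the CASE WHEN started_on IS NOT NULL logic in the SQL.
function fetch (job) {
  if (job.startedOn !== null) job.retryCount++
  job.state = 'active'
  job.startedOn = Date.now()
}

// Unpatched restore: state only, startedOn is left set.
function restoreBuggy (job) {
  job.state = 'created'
}

const jobs = [
  { id: 'a', state: 'created', startedOn: null, retryCount: 0 },
  { id: 'b', state: 'created', startedOn: null, retryCount: 0 }
]

jobs.forEach(fetch)         // batchSize 2: both rows go active in one query
restoreBuggy(jobs[1])       // group limit 1: 'b' is excess and is restored
jobs[0].state = 'completed' // handler for 'a' succeeds

fetch(jobs[1])              // 'b' is fetched again and finally runs
jobs[1].state = 'completed' // handler for 'b' succeeds on its first real run

console.log(jobs.map(j => j.retryCount)) // [ 0, 1 ] with the bug present
```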

The fix

restoreJobs now also resets started_on and heartbeat_on to NULL:

```sql
SET state = 'created',
    started_on = NULL,
    heartbeat_on = NULL
```

Clearing started_on means the next fetch sees started_on IS NULL and does not increment retry_count. The job's retry budget is fully intact when the handler finally runs. Clearing heartbeat_on is a housekeeping measure so that the heartbeat expiry check, which computes heartbeat_on + heartbeat_seconds, cannot produce a spurious timeout against a stale timestamp from a prior activation that was immediately undone.
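A minimal sketch of the patched behavior, again in illustrative JavaScript rather than the actual SQL: once the restore clears `started_on`, re-fetching the job no longer fires the increment.

```javascript
function fetch (job) {
  if (job.startedOn !== null) job.retryCount++
  job.state = 'active'
  job.startedOn = Date.now()
  job.heartbeatOn = Date.now()
}

// Patched restoreJobs: undo the activation completely.
function restoreFixed (job) {
  job.state = 'created'
  job.startedOn = null
  job.heartbeatOn = null
}

const job = { state: 'created', startedOn: null, heartbeatOn: null, retryCount: 0 }

fetch(job)        // activation sets startedOn and heartbeatOn
restoreFixed(job) // excess restore clears both
fetch(job)        // startedOn is null again, so no increment

console.log(job.retryCount) // 0: the retry budget is intact
```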

A reasonable concern is whether clearing started_on could discard a legitimate retry signal for a job that was already in retry state when it was fetched. It does not. When a retry-state job is fetched, the retry_count increment fires immediately as part of that fetch UPDATE, because started_on was already set from the previous genuine run. That increment is committed to the database before restoreJobs is ever called. Clearing started_on afterwards only prevents a second increment from firing on the next fetch. The count that reflects the real history of the job is already written and we do not touch it. retry_count is the canonical record of how many times a job has been activated; started_on is just the mechanism the fetch UPDATE uses to decide whether to fire the increment. Once it has fired for a given activation cycle, clearing it is safe. (I believe this is correct, but please double-check my reasoning.)
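The retry-state scenario can be sketched the same way (illustrative, not the real code): the genuine increment fires at fetch time and survives the restore, and clearing `started_on` only suppresses the would-be duplicate.

```javascript
function fetch (job) {
  if (job.startedOn !== null) job.retryCount++
  job.state = 'active'
  job.startedOn = Date.now()
}

// Patched restore: clears startedOn along with resetting state.
function restoreFixed (job) {
  job.state = 'created'
  job.startedOn = null
}

// A job in retry state: startedOn is still set from its one genuine run.
const job = { state: 'retry', startedOn: Date.now(), retryCount: 0 }

fetch(job)        // the real increment fires here, as part of the fetch
restoreFixed(job) // excess restore cannot undo it; it only clears startedOn
fetch(job)        // no second increment for the same activation cycle

console.log(job.retryCount) // 1: exactly the one genuine retry is recorded
```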

bcomnes added 2 commits April 8, 2026 14:27
…store

When batchSize > 1 causes a job to be excess-restored, restoreJobs only
resets state to created and leaves started_on set. The next fetch sees
started_on IS NOT NULL and increments retry_count despite the handler
never having run.
When localGroupConcurrency marks a job as excess and restores it to
created state, restoreJobs only resets state, leaving started_on set from
the initial activation. On re-fetch the UPDATE increments retry_count
via CASE WHEN started_on IS NOT NULL, burning a retry credit without the
handler ever having run.
Copilot AI review requested due to automatic review settings April 9, 2026 00:30

bcomnes commented Apr 9, 2026

First commit is the failing test


Copilot AI left a comment


Pull request overview

This PR targets a concurrency edge case where jobs that are briefly activated (then “excess-restored” due to localGroupConcurrency) can have retry_count incorrectly incremented on a later fetch because started_on remains set.

Changes:

  • Adds a regression test asserting that retry_count remains 0 for jobs that were excess-restored and later successfully completed.



bcomnes commented Apr 9, 2026

Second commit is the fix


bcomnes commented Apr 9, 2026

Would love review/feedback @kibertoad and @timgit


coveralls commented Apr 9, 2026

Coverage Status

coverage: 100.0%. remained the same — bcomnes:restore-jobs-started-on into timgit:master
