localGroupConcurrency excess restore corrupts retry_count #759
Open
bcomnes wants to merge 3 commits into timgit:master from
Conversation
…store

When `batchSize > 1` causes a job to be excess-restored, `restoreJobs` only resets `state` to `created` and leaves `started_on` set. The next fetch sees `started_on IS NOT NULL` and increments `retry_count` despite the handler never having run.
When `localGroupConcurrency` marks a job as excess and restores it to `created` state, `restoreJobs` only resets `state`, leaving `started_on` set from the initial activation. On re-fetch, the UPDATE increments `retry_count` via `CASE WHEN started_on IS NOT NULL`, burning a retry credit without the handler ever having run.
Contributor
Author
First commit is the failing test
Pull request overview
This PR targets a concurrency edge case where jobs that are briefly activated (then "excess-restored" due to `localGroupConcurrency`) can have `retry_count` incorrectly incremented on a later fetch because `started_on` remains set.
Changes:
- Adds a regression test asserting that `retry_count` remains `0` for jobs that were excess-restored and later successfully completed.
Contributor
Author
Second commit is the fix
Contributor
Author
Would love review/feedback @kibertoad and @timgit |
I spotted another bug when working on #756
When `localConcurrency` exceeds `localGroupConcurrency` (the typical setup), multiple workers fetch jobs concurrently. Because the fetch query sets all fetched rows to `active` in the database before the application layer checks local concurrency limits, some workers end up with jobs they cannot process. These excess jobs are restored to `created` state via `restoreJobs`. The problem is that `restoreJobs` only resets `state` and leaves `started_on` set from the initial activation.

The fetch UPDATE in `fetchNextJob` determines whether to increment `retry_count` based on the pre-update value of `started_on`. The intent is to track genuine retries: if `started_on` is already set when a job is being activated, it must have been run before. But after an excess restore, `started_on` is still set even though the handler never ran. The next time the job is fetched, `started_on IS NOT NULL` triggers the increment and the job silently burns a retry credit it never used.

How it manifests
This is difficult to notice because there is no error, no log line and nothing on the `error` event. Jobs simply move to `failed` state sooner than their `retryLimit` suggests they should. A job configured with `retryLimit: 2` could exhaust its entire retry budget before its handler runs even once if it gets excess-restored enough times.

The problem is proportional to the gap between `localConcurrency` and `localGroupConcurrency`. The more workers competing for the same group, the more excess restores happen per cycle and the faster retry budgets are consumed. The same job configuration will appear to behave differently under light versus heavy load, with no obvious correlation to anything in the application logs.

Groups with many queued jobs are the most affected because the excess restore cycle runs continuously for any group whose active jobs are at the local limit. In a multi-tenant system this means a busy tenant can have all of their queued jobs cycling through fetch, restore and re-fetch simultaneously, burning retry credits across the board.
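The failure mode above can be reduced to a tiny in-memory model. This is an illustrative sketch of the described logic only, not pg-boss's actual SQL or code, and the function names are hypothetical:

```javascript
// Hypothetical in-memory model of the relevant job columns.
const makeJob = () => ({ state: 'created', startedOn: null, retryCount: 0 })

// Mirrors the fetch UPDATE as described: retry_count increments only
// when started_on was already set before this activation.
function fetchJob (job) {
  if (job.startedOn !== null) job.retryCount += 1 // CASE WHEN started_on IS NOT NULL
  job.state = 'active'
  job.startedOn = Date.now()
}

// Mirrors the pre-fix restoreJobs: resets state, leaves started_on set.
function restoreExcessBuggy (job) {
  job.state = 'created'
}

const job = makeJob()
fetchJob(job)            // activated; started_on was NULL, so no increment
restoreExcessBuggy(job)  // excess restore: the handler never ran
fetchJob(job)            // started_on is still set -> retry_count becomes 1
console.log(job.retryCount) // 1, despite zero handler executions
```

Each excess-restore/re-fetch cycle repeats the last two steps, so a job restored `retryLimit` times exhausts its budget without running once.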
The failing test
The test uses `batchSize: 2` with a single worker and `localGroupConcurrency: 1` to make the excess path trigger deterministically. With two group jobs in the queue, the single fetch grabs both and sets both to active. `#trackLocalGroupStart` allows the first and restores the second. The test then waits for both jobs to complete (their handlers always succeed) and queries the database to assert that both have `retry_count = 0`. With the bug present, the restored job has `retry_count = 1` despite having completed successfully on its first real execution.

The fix
`restoreJobs` now also resets `started_on` and `heartbeat_on` to `NULL`. Clearing `started_on` means the next fetch sees `started_on IS NULL` and does not increment `retry_count`; the job's retry budget is fully intact when the handler finally runs. Clearing `heartbeat_on` is a housekeeping measure so that the heartbeat expiry check, which computes `heartbeat_on + heartbeat_seconds`, cannot produce a spurious timeout against a stale timestamp from a prior activation that was immediately undone.

A reasonable concern is whether clearing `started_on` could discard a legitimate retry signal for a job that was already in `retry` state when it was fetched. It does not. When a retry-state job is fetched, the `retry_count` increment fires immediately as part of that fetch UPDATE, because `started_on` was already set from the previous genuine run. That increment is committed to the database before `restoreJobs` is ever called; clearing `started_on` afterwards only prevents a second increment from firing on the next fetch. The count that reflects the real history of the job is already written and is not touched. `retry_count` is the canonical record of how many times a job has been activated, and `started_on` is just the mechanism the fetch UPDATE uses to decide whether to fire the increment. Once it has fired for a given activation cycle, clearing it is safe. (I think, double check me maybe.)
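That preservation argument can be checked with the same kind of in-memory sketch, extended with the fixed restore. Again, the names are hypothetical and this models only the behaviour described above, not pg-boss internals:

```javascript
// Hypothetical model of one job's columns through a genuine-retry cycle.
const job = { state: 'created', startedOn: null, retryCount: 0 }

// Mirrors the fetch UPDATE: the increment fires at fetch time, before
// any later restore, whenever started_on was already set.
function fetchJob (j) {
  if (j.startedOn !== null) j.retryCount += 1
  j.state = 'active'
  j.startedOn = Date.now()
}

// Mirrors the fixed restoreJobs: clears started_on (and heartbeat_on).
function restoreExcessFixed (j) {
  j.state = 'created'
  j.startedOn = null
}

fetchJob(job)            // first run: no increment (started_on was NULL)
job.state = 'retry'      // handler fails -> genuine retry state

fetchJob(job)            // retry fetch: increment fires here -> retry_count 1
restoreExcessFixed(job)  // excess-restored before the handler runs

fetchJob(job)            // re-fetch: started_on is NULL, no second increment
console.log(job.retryCount) // 1 -- the one genuine retry is preserved
```

The genuine increment survives because it was applied at the second fetch, while the cleared `started_on` only suppresses the spurious second increment at the third.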