localGroupConcurrency excess restore corrupts retry_count #759

Open
bcomnes wants to merge 3 commits into timgit:master from bcomnes:restore-jobs-started-on

Conversation

@bcomnes bcomnes commented Apr 9, 2026

I spotted another bug when working on #756

When localConcurrency exceeds localGroupConcurrency (the typical setup), multiple workers fetch jobs concurrently. Because the fetch query sets all fetched rows to active in the database before the application layer checks local concurrency limits, some workers will end up with jobs they cannot process. These excess jobs are restored to created state via restoreJobs. The problem is that restoreJobs only resets state and leaves started_on set from the initial activation.

The fetch UPDATE in fetchNextJob determines whether to increment retry_count based on the pre-update value of started_on:

```sql
retry_count = CASE WHEN started_on IS NOT NULL THEN retry_count + 1 ELSE retry_count END
```

The intent is to track genuine retries: if started_on is already set when a job is being activated, it must have been run before. But after an excess restore, started_on is still set even though the handler never ran. The next time the job is fetched, started_on IS NOT NULL triggers the increment and the job silently burns a retry credit it never used.
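The broken cycle can be sketched in a few lines of plain JavaScript. This is an illustrative in-memory model, not the actual pg-boss code (the real logic is SQL inside `fetchNextJob` and `restoreJobs`), but it shows how one excess restore is enough to make the next fetch fire the increment:

```javascript
// Mirrors the fetch UPDATE: increment retry_count only when started_on is set.
function fetchJob (job) {
  if (job.startedOn !== null) job.retryCount++
  job.state = 'active'
  job.startedOn = Date.now()
}

// Current (buggy) restoreJobs behavior: only the state is reset.
function restoreJobBuggy (job) {
  job.state = 'created' // startedOn survives the restore
}

const job = { state: 'created', startedOn: null, retryCount: 0 }

fetchJob(job)        // activation 1: startedOn was null, no increment
restoreJobBuggy(job) // excess restore: the handler never ran
fetchJob(job)        // activation 2: startedOn is set, increment fires

console.log(job.retryCount) // 1, even though the handler never executed
```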

How it manifests

This is difficult to notice because there is no error, no log line and nothing on the error event. Jobs simply move to failed state sooner than their retryLimit suggests they should. A job configured with retryLimit: 2 could exhaust its entire retry budget before its handler runs even once if it gets excess-restored enough times.

The problem is proportional to the gap between localConcurrency and localGroupConcurrency. The more workers competing for the same group, the more excess restores happen per cycle and the faster retry budgets are consumed. The same job configuration will appear to behave differently under light versus heavy load, with no obvious correlation to anything in the application logs.

Groups with many queued jobs are the most affected because the excess restore cycle runs continuously for any group where active jobs are at the local limit. In a multi-tenant system this means a busy tenant can have all of their queued jobs cycling through fetch, restore and re-fetch simultaneously, burning retry credits across the board.

The failing test

The test uses batchSize: 2 with a single worker and localGroupConcurrency: 1 to make the excess path trigger deterministically. With two group jobs in the queue, the single fetch grabs both and sets both to active. #trackLocalGroupStart allows the first and restores the second. The test then waits for both jobs to complete (their handlers always succeed) and queries the database to assert that both have retry_count = 0. With the bug present, the restored job has retry_count = 1 despite having completed successfully on its first real execution.
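The scenario the test sets up can be modeled the same way (illustrative names, not pg-boss internals): one fetch of `batchSize: 2` activates both group jobs, the group limit of 1 admits only the first, and the second is excess-restored by the unpatched `restoreJobs`:

```javascript
// Fetch activates a job and conditionally increments retry_count,
// mirroring the CASE WHEN started_on IS NOT NULL logic in the SQL.
function fetch (job) {
  if (job.startedOn !== null) job.retryCount++
  job.state = 'active'
  job.startedOn = Date.now()
}

// Unpatched restore: state only, startedOn is left set.
function restoreBuggy (job) {
  job.state = 'created'
}

const jobs = [
  { id: 'a', state: 'created', startedOn: null, retryCount: 0 },
  { id: 'b', state: 'created', startedOn: null, retryCount: 0 }
]

jobs.forEach(fetch)         // batchSize 2: both rows go active in one query
restoreBuggy(jobs[1])       // group limit 1: 'b' is excess and is restored
jobs[0].state = 'completed' // handler for 'a' succeeds

fetch(jobs[1])              // 'b' is fetched again and finally runs
jobs[1].state = 'completed' // handler for 'b' succeeds on its first real run

console.log(jobs.map(j => j.retryCount)) // [ 0, 1 ] with the bug present
```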

The fix

restoreJobs now also resets started_on and heartbeat_on to NULL:

```sql
SET state = 'created',
    started_on = NULL,
    heartbeat_on = NULL
```

Clearing started_on means the next fetch sees started_on IS NULL and does not increment retry_count. The job's retry budget is fully intact when the handler finally runs. Clearing heartbeat_on is a housekeeping measure so that the heartbeat expiry check, which computes heartbeat_on + heartbeat_seconds, cannot produce a spurious timeout against a stale timestamp from a prior activation that was immediately undone.
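A minimal sketch of the patched behavior, again in illustrative JavaScript rather than the actual SQL: once the restore clears `started_on`, re-fetching the job no longer fires the increment.

```javascript
function fetch (job) {
  if (job.startedOn !== null) job.retryCount++
  job.state = 'active'
  job.startedOn = Date.now()
  job.heartbeatOn = Date.now()
}

// Patched restoreJobs: undo the activation completely.
function restoreFixed (job) {
  job.state = 'created'
  job.startedOn = null
  job.heartbeatOn = null
}

const job = { state: 'created', startedOn: null, heartbeatOn: null, retryCount: 0 }

fetch(job)        // activation sets startedOn and heartbeatOn
restoreFixed(job) // excess restore clears both
fetch(job)        // startedOn is null again, so no increment

console.log(job.retryCount) // 0: the retry budget is intact
```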

A reasonable concern is whether clearing started_on could discard a legitimate retry signal for a job that was already in retry state when it was fetched. It does not. When a retry-state job is fetched, the retry_count increment fires immediately as part of that fetch UPDATE, because started_on was already set from the previous genuine run. That increment is committed to the database before restoreJobs is ever called. Clearing started_on afterwards only prevents a second increment from firing on the next fetch. The count that reflects the real history of the job is already written and we do not touch it. retry_count is the canonical record of how many times a job has been activated; started_on is just the mechanism the fetch UPDATE uses to decide whether to fire the increment. Once it has fired for a given activation cycle, clearing it is safe. (I believe this is correct, but please double-check my reasoning.)
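The retry-state scenario can be sketched the same way (illustrative, not the real code): the genuine increment fires at fetch time and survives the restore, and clearing `started_on` only suppresses the would-be duplicate.

```javascript
function fetch (job) {
  if (job.startedOn !== null) job.retryCount++
  job.state = 'active'
  job.startedOn = Date.now()
}

// Patched restore: clears startedOn along with resetting state.
function restoreFixed (job) {
  job.state = 'created'
  job.startedOn = null
}

// A job in retry state: startedOn is still set from its one genuine run.
const job = { state: 'retry', startedOn: Date.now(), retryCount: 0 }

fetch(job)        // the real increment fires here, as part of the fetch
restoreFixed(job) // excess restore cannot undo it; it only clears startedOn
fetch(job)        // no second increment for the same activation cycle

console.log(job.retryCount) // 1: exactly the one genuine retry is recorded
```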

bcomnes added 2 commits April 8, 2026 14:27
…store

When batchSize > 1 causes a job to be excess-restored, restoreJobs only
resets state to created and leaves started_on set. The next fetch sees
started_on IS NOT NULL and increments retry_count despite the handler
never having run.
When localGroupConcurrency marks a job as excess and restores it to
created state, restoreJobs only resets state, leaving started_on set from
the initial activation. On re-fetch the UPDATE increments retry_count
via CASE WHEN started_on IS NOT NULL, burning a retry credit without the
handler ever having run.
Copilot AI review requested due to automatic review settings April 9, 2026 00:30

bcomnes commented Apr 9, 2026

First commit is the failing test


Copilot AI left a comment


Pull request overview

This PR targets a concurrency edge case where jobs that are briefly activated (then “excess-restored” due to localGroupConcurrency) can have retry_count incorrectly incremented on a later fetch because started_on remains set.

Changes:

  • Adds a regression test asserting that retry_count remains 0 for jobs that were excess-restored and later successfully completed.



bcomnes commented Apr 9, 2026

Second commit is the fix


bcomnes commented Apr 9, 2026

Would love review/feedback @kibertoad and @timgit


coveralls commented Apr 9, 2026

Coverage Status

coverage: 100.0%. remained the same — bcomnes:restore-jobs-started-on into timgit:master
