Conversation

@TheNeedForSleep commented Nov 20, 2024

fixes #1242

New fetch job function:

  • handles update conflicts
  • works when tasks with the same lock have different priorities
  • the inner SELECT uses the index properly

Does not support the aborting status!

Successful PR Checklist:

  • Tests
    • (not applicable?)
  • Documentation
    • (not applicable?)

@TheNeedForSleep TheNeedForSleep requested a review from a team as a code owner November 20, 2024 14:20
@github-actions github-actions bot added the PR type: bugfix 🕵️ Contains bug fix label Nov 20, 2024
Commit: "Fetch job with retry: after max retries, get any doable job instead"
@TheNeedForSleep (Author)
The code is not ready to be merged!

@TkTech commented Dec 3, 2024

> The code is not ready to be merged!

You can mark the PR as a draft to make sure it doesn't get accidentally merged :)

@ewjoachim ewjoachim marked this pull request as draft December 4, 2024 13:34
@ewjoachim (Member)

When the PR is ready, press the big "Ready for review" button

[Screenshot: the "Ready for review" button]

@joaquimds
Thanks for this PR, @TheNeedForSleep ! I have just come up against this issue in a project I'm working on. I have temporarily "fixed" it for myself with pretty much the same SQL change:

CREATE OR REPLACE FUNCTION procrastinate_fetch_job(
    target_queue_names character varying[]
)
    RETURNS procrastinate_jobs
    LANGUAGE plpgsql
AS $$
DECLARE
    found_jobs procrastinate_jobs;
BEGIN
    WITH candidate AS (
        SELECT jobs.*
            FROM procrastinate_jobs AS jobs
            WHERE
                (
                    jobs.lock IS NULL OR
                    -- reject the job if its lock has earlier jobs
                    NOT EXISTS (
                        SELECT 1
                            FROM procrastinate_jobs AS earlier_jobs
                            WHERE
                                earlier_jobs.lock = jobs.lock
                                AND earlier_jobs.status IN ('todo', 'doing', 'aborting')
                                AND earlier_jobs.id < jobs.id)
                )
                AND jobs.status = 'todo'
                AND (target_queue_names IS NULL OR jobs.queue_name = ANY( target_queue_names ))
                AND (jobs.scheduled_at IS NULL OR jobs.scheduled_at <= now())
            ORDER BY jobs.priority DESC, jobs.id ASC LIMIT 1
            FOR UPDATE OF jobs SKIP LOCKED
    )
    UPDATE procrastinate_jobs
        SET status = 'doing'
        FROM candidate
        WHERE procrastinate_jobs.id = candidate.id
        RETURNING procrastinate_jobs.* INTO found_jobs;

    RETURN found_jobs;
END;
$$;

Question: why is it necessary to make the change in manager.py? It looks to me that the existing query will already fall back to a job where jobs.lock IS NULL.
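To reason about that fallback, here is a rough in-memory Python model of the candidate selection in the SQL above (a hypothetical helper for illustration, not part of procrastinate):

```python
def fetch_candidate(jobs):
    """Model of the candidate CTE: pick the first 'todo' job whose lock
    has no earlier unfinished job, ordered by priority DESC, id ASC."""

    def blocked(job):
        # A NULL lock never blocks; otherwise reject the job if an
        # earlier job (lower id) on the same lock is still unfinished.
        if job["lock"] is None:
            return False
        return any(
            other["lock"] == job["lock"]
            and other["status"] in ("todo", "doing", "aborting")
            and other["id"] < job["id"]
            for other in jobs
        )

    doable = [j for j in jobs if j["status"] == "todo" and not blocked(j)]
    doable.sort(key=lambda j: (-j["priority"], j["id"]))
    return doable[0] if doable else None


jobs = [
    {"id": 1, "lock": "A", "status": "doing", "priority": 5},
    {"id": 2, "lock": "A", "status": "todo", "priority": 9},  # blocked by job 1
    {"id": 3, "lock": None, "status": "todo", "priority": 1},
]
# Job 2 has the highest priority but shares lock "A" with the running
# job 1, so the selection falls back to the lockless job 3.
```

In this model the query does fall back to a NULL-lock job on its own, which matches the intuition behind the question above.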

Commit: "Add 'aborting' status back in to keep the old behaviour."
Commit: "Adding 'aborting' will slow down the query, as the lock index does not include the aborting status."
@TheNeedForSleep (Author) commented Jan 16, 2025

> Question: why is it necessary to make the change in manager.py? It looks to me that the existing query will already fall back to a job where jobs.lock IS NULL.

If multiple workers try to fetch tasks with the same lock, they keep running into the same conflicts. To de-escalate, a worker that fails to fetch a task simply grabs the next task with no lock, instead of returning None, going to sleep, retrying, and escalating the problem further.

I don't know whether it is actually necessary to add this code for everyone, but in some situations the problem will occur. Two factors lead to it:

  • a high number of workers
  • low job runtimes

I will let you decide whether you actually want that change in manager.py. Now that I have had a month to think about it, I don't believe it is necessary, because the cost is acceptable: a worker that fails to fetch a job gets a null job (the SQL fetch function returns NULL on failure), goes to sleep, and fetches again after either the poll sleep time or a listen event.

If any of my ideas are hard to follow, please let me know :)

@TheNeedForSleep TheNeedForSleep marked this pull request as ready for review January 16, 2025 17:01
@github-actions github-actions bot added the PR type: feature ⭐️ Contains new features label Jan 16, 2025
@TheNeedForSleep (Author) commented Jan 17, 2025

Motivation for this change:

A worker will crash when trying to add a job whose lock is already in the lock index.

[Screenshot of the error]

Test setup:
Using the current main fetch job function.

The following tasks, where fast_job_go_brr is called once:

import asyncio
import logging
import time
from random import choice, random

from pydantic import validate_call

from worker import PRIORITY_QUEUE, IntegrityErrorRetryStrategy, worker

logger = logging.getLogger(__name__)


@worker.task(
    queue=PRIORITY_QUEUE,
    retry=IntegrityErrorRetryStrategy(),
)
@validate_call
async def fast_job_go_brr(
    n_jobs: int = 10_000,
    locks: list | None = None,
):
    if locks is None:
        locks = [None, "A", "B", "C"]

    async with asyncio.TaskGroup() as group:
        for _i in range(n_jobs):
            group.create_task(
                fast_job.configure(
                    lock=choice(locks),  # noqa: S311
                    priority=choice([1, 2, 3]),  # noqa: S311
                ).defer_async()
            )


@worker.task(
    queue=PRIORITY_QUEUE,
    retry=IntegrityErrorRetryStrategy(),
)
@validate_call
async def fast_job(min_wait: float = 0.001, max_wait: float = 0.01, sync_wait=False):
    sleep_time = random() * (max_wait - min_wait) + min_wait  # noqa: S311
    logger.info("Waiting %s", sleep_time)
    if sync_wait:
        time.sleep(sleep_time)
    else:
        await asyncio.sleep(sleep_time)
    logger.info("Done")



Merging this pull request may close #1242: "procrastinate_fetch_job can end up being slow".
