dashboard: restart abandoned AI jobs #6889

Merged
a-nogikh merged 4 commits into google:master from a-nogikh:features/ai-job-restart-prepare
Mar 10, 2026

Conversation

@a-nogikh
Collaborator

@a-nogikh a-nogikh commented Mar 6, 2026

Cc #6781.

The changeset prepares the AI dashboard for the automatic restart of aborted jobs. When suggesting a restarted job, we need to give preference to the agent that was already executing it (to reuse its cache), so we need to store more information in the database.

The agent name is currently equal to the dashboard client name, but later we would take this name from elsewhere.
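
The same-agent preference described above could be sketched roughly as follows. This is only an illustration: the `Job` type, its `LastAgent` field, and the `pickJob` function are hypothetical names, not the dashboard's actual schema or API, and the fallback to the first pending job is an assumption.

```go
package main

import "fmt"

// Job is a hypothetical, simplified view of a pending AI job.
type Job struct {
	ID        string
	LastAgent string // agent that previously executed the job; empty if never started
}

// pickJob returns the job to suggest to the polling agent.
// A job that the same agent already worked on is preferred, since that
// agent may still hold a usable cache. Otherwise the first pending job
// is returned (an assumed fallback, chosen for simplicity).
func pickJob(pending []Job, agent string) (Job, bool) {
	for _, j := range pending {
		if j.LastAgent == agent {
			return j, true
		}
	}
	if len(pending) > 0 {
		return pending[0], true
	}
	return Job{}, false
}

func main() {
	pending := []Job{
		{ID: "job-1", LastAgent: "agent-a"},
		{ID: "job-2", LastAgent: "agent-b"},
	}
	job, ok := pickJob(pending, "agent-b")
	fmt.Println(job.ID, ok) // prints "job-2 true"
}
```

Storing the last agent per job is what makes this lookup possible, which is why the extra database information is needed.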

@a-nogikh a-nogikh requested a review from dvyukov March 6, 2026 12:33
@dvyukov
Collaborator

dvyukov commented Mar 6, 2026

Please show how these new tables/columns will be used in the end. Without that, it's hard to understand if it's the right approach or not.
Btw we should already be restarting failed automatically-created jobs.

@a-nogikh a-nogikh marked this pull request as draft March 6, 2026 14:02
@a-nogikh a-nogikh force-pushed the features/ai-job-restart-prepare branch from 8ef6499 to da271e5 Compare March 6, 2026 14:10
@a-nogikh a-nogikh marked this pull request as ready for review March 6, 2026 14:10
@a-nogikh a-nogikh changed the title from "dashboard: prepare for the failed AI jobs auto-restart" to "dashboard: restart abandoned AI jobs" Mar 6, 2026
@a-nogikh
Collaborator Author

a-nogikh commented Mar 6, 2026

Added the actual logic. We did indeed already recreate automatic jobs in autoCreateAIJob, but the new approach is more flexible, at least for same-agent restarts: we can now safely kill syz-agent mid-job and let it restart (this will be very useful for rollouts). The new logic also covers manually created jobs.

@a-nogikh a-nogikh force-pushed the features/ai-job-restart-prepare branch from da271e5 to 08987dc Compare March 9, 2026 11:57
For now, use a dashboard client name. In a prod deployment, that would
be a persistent name of the machine where the syz-agent is hosted.

This information will be used later for restarting the aborted jobs.
@a-nogikh a-nogikh force-pushed the features/ai-job-restart-prepare branch from 08987dc to b6e1288 Compare March 9, 2026 23:47
Update the timestamp on each job poll or each trajectory update.
If an agent has taken a job and then, without finishing it, comes for
the next one, recreate that job.

If an agent has started working on a job, not finished it and completely
disappeared for more than 8 hours, give the job to another agent.

At the same time, don't look at the stale jobs in the automatic job
creation logic - let it focus on recreating explicitly failed jobs.
Make Workflows be interleaved into Agents. This gives a more complete
data representation and allows more precise tracking of the active
workflow types.
@a-nogikh a-nogikh force-pushed the features/ai-job-restart-prepare branch from b6e1288 to 23df962 Compare March 10, 2026 00:53
@a-nogikh a-nogikh added this pull request to the merge queue Mar 10, 2026
Merged via the queue into google:master with commit 3adee96 Mar 10, 2026
20 checks passed
@a-nogikh
Collaborator Author

Deployed
