Skip to content

fix(scheduler): persist descriptive error on dispatch timeout (#1022)#1202

Merged
vybe merged 10 commits into
devfrom
AndriiPasternak31/plan-issue-1022
Jun 16, 2026
Merged

fix(scheduler): persist descriptive error on dispatch timeout (#1022)#1202
vybe merged 10 commits into
devfrom
AndriiPasternak31/plan-issue-1022

Conversation

@AndriiPasternak31

Copy link
Copy Markdown
Contributor

Summary

Cron-triggered executions could land as schedule_executions.status='failed' with a blank error and no response, giving operators zero diagnostics. A scheduler→backend dispatch (or pre-check) httpx.TimeoutException stringifies to '', which the generic failure handler persisted verbatim.

The fix:

  • _describe_exception() normalizer — never persists a blank/whitespace error; falls back to the exception type name. Applied at every cron / retry / process-schedule error path.
  • Narrow try/except httpx.TimeoutException around the dispatch POST that re-raises a named, non-blank, outcome-neutral message ("dispatch ... timed out after {N}s ({TimeoutType}) — outcome unknown"). The outcome is genuinely UNKNOWN (the backend spawns the bg task before replying → may already be running → orphan recovered by cleanup).
  • Two hard-coded timeout literals lifted to operator-tunable config/env: DISPATCH_TIMEOUT (default 30s) and PRE_CHECK_TIMEOUT (default 70s), declared on the scheduler service in both compose files.

Scheduler-only (src/scheduler/). The backend event-loop stall that triggers the timeout, and the dispatch-timeout control-flow gap (orphan/no-retry/read-then-write race), are explicitly out of scope and tracked as a follow-up.

What's in this branch

Commit What
fix(scheduler): persist descriptive error on dispatch timeout The fix: config.py, service.py, both compose files + core tests
test(scheduler): regression for retry-path dispatch-timeout error TestRetryDispatchTimeoutRegression — retry path persists non-empty error
test(scheduler): pre-check timeout is config-driven + fails open on blank TestRunPreCheckTimeout — config deadline + blank-timeout fail-open
docs(scheduler): document dispatch-timeout error persistence scheduler feature flows + Recent Updates index row
docs(security): CSO diff audit for #1022 — CLEAR CSO diff-mode audit, no findings

Test evidence

$ pytest tests/scheduler_tests/test_async_dispatch.py tests/scheduler_tests/test_pre_check.py -q
37 passed

The 3 net-new tests (TestRetryDispatchTimeoutRegression, TestRunPreCheckTimeout::*) pass; together they assert the persisted error is never blank and the pre-check still fails open on a blank-stringifying timeout.

Security

CSO diff audit (docs/security-reports/cso-2026-06-13-1022-diff.md): CLEAR — richer error strings are written only via parameterized SQL; the dispatch raise interpolates only internal values (timeout float, static path, type name); no secret-leak via exception text.

Closes #1022

🤖 Generated with Claude Code

AndriiPasternak31 and others added 5 commits June 13, 2026 15:58
Cron-triggered executions could land as schedule_executions.status='failed'
with a blank `error` and no `response`, giving operators zero diagnostics. Two
scheduler defects combined: the scheduler->backend execute-task POST had a
hardcoded 30s deadline with no try/except, so an httpx.TimeoutException
propagated raw; and the generic failure handler recorded str(e), which is the
empty string for httpx timeouts -- persisted verbatim.

- config.py: add tunable dispatch_timeout (DISPATCH_TIMEOUT, default 30) and
  pre_check_timeout (PRE_CHECK_TIMEOUT, default 70) float fields, following the
  agent_timeout default_factory idiom.
- service.py: _describe_exception() normalizer (never persist a blank/whitespace
  error -- falls back to the type name); narrow try/except httpx.TimeoutException
  around the dispatch POST that re-raises an outcome-neutral, descriptive message
  ("... timed out after Ns (ReadTimeout) -- outcome unknown") with `from e`;
  source the deadline from config; apply the normalizer at the cron, retry, and
  process-schedule generic handlers; _run_pre_check uses config.pre_check_timeout
  + a typed fail-open log. Stale 30s/70s comments updated to reference config.
- docker-compose{,.prod}.yml: declare DISPATCH_TIMEOUT/PRE_CHECK_TIMEOUT as
  host-overridable env vars on the scheduler service.

The backend event-loop stall that triggers the timeout, and the dispatch-timeout
control-flow gap (orphan task / no retry / read-then-write race), are out of
scope and tracked as a follow-up.

Tests (tests/scheduler_tests/): config asserts (both fields default + override),
_describe_exception both branches, descriptive-raise on ReadTimeout(''),
config-driven dispatch timeout, and a regression proving the persisted error is
non-empty end-to-end (the #1022 signature, inverted). 189 passed.

Closes #1022

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
)

`_execute_retry` is the second persistence site the #1022 fix touched. Add
TestRetryDispatchTimeoutRegression proving a blank-stringifying
httpx.ReadTimeout('') raised during a retry dispatch lands a non-empty
`error` on the retry's FAILED row (the #1022 signature, inverted).

Refs #1022

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lank (#1022)

Add TestRunPreCheckTimeout: the pre-check POST deadline now comes from
config.pre_check_timeout (was a 70.0 literal), and a blank-stringifying
httpx.ReadTimeout('') still fails open (returns None) rather than propagating.

Refs #1022

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update scheduler feature flows for the #1022 fix: new DISPATCH_TIMEOUT /
PRE_CHECK_TIMEOUT config rows, the named non-blank dispatch-timeout raise plus
the _describe_exception() normalizer, the error-handling matrix entries, the
test catalog, the change history, and a Recent Updates index row.

Refs #1022

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the CSO diff-mode audit for the #1022 scheduler fix: no secrets,
dependency, auth-boundary, injection, or configuration findings. CLEAR verdict,
safe to merge from a security standpoint.

Refs #1022

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AndriiPasternak31 and others added 4 commits June 14, 2026 01:14
BACKEND_URL is read by config.py to build OAuth redirect URIs
({BACKEND_URL}/api/oauth/{provider}/callback) and is documented in
.env.example, but neither compose file listed it in the backend
service's environment block and there is no env_file: directive — so
setting it in .env had no effect and callbacks silently fell back to
the localhost default (the #1039 packaging-gap class).

Wire it into the backend env in both compose files with a
behaviour-preserving :-http://localhost:8000 default (matches the
config.py default and the .env.example value). Deliberately NOT wired
into the scheduler, which reads BACKEND_URL with its own internal
default http://backend:8000 — injecting the backend's localhost default
there would break the scheduler->backend hop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…container

The Slack Socket Mode connection count (#244) is read by
adapters/transports/slack_socket.py and documented in .env.example, but
was absent from both compose files' backend environment block — so an
operator tuning it in .env had no effect (the #1039 packaging-gap
class). Wire it through with a :-2 default matching the code default
(value is clamped to [1,10]).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TELEMETRY_CONTAINER_STATS_TTL and TELEMETRY_DOCKER_POOL_SIZE are read by
routers/telemetry.py and documented in .env.example, but were absent
from both compose files' backend environment block — so tuning them in
.env had no effect (the #1039 packaging-gap class). Wire both through
with :-10 / :-16 defaults matching the code defaults.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Output of the config-validation pass over .env.example,
docker-compose*.yml and the backend/scheduler/mcp sources. Records the
one functional finding (four documented .env vars not forwarded to the
backend container — now resolved) plus the informational compose- and
code-only knobs absent from .env.example. Follows the existing
docs/reports/ convention.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

⚠️ Nightly unit-suite check skipped — merge conflict against dev.

Resolve by running git merge dev locally and pushing the result. The next nightly run will re-test once the conflict is gone.

@vybe vybe left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validated & approved (/validate-pr)

P1 reliability bug fix (Closes #1022). 12 files, +544/-18.

Core fix (correct & complete):

  • _describe_exception() (.strip() → type-name fallback) ensures no blank/whitespace error is ever persisted, applied at every cron/retry/process-schedule path.
  • Narrow try/except httpx.TimeoutException around the dispatch POST re-raises a named, non-blank, outcome-neutral message — correctly framed as UNKNOWN (the backend spawns the bg task before replying → may already be running → orphan recovered by cleanup), not 'rejected'.
  • DISPATCH_TIMEOUT (30s) / PRE_CHECK_TIMEOUT (70s) lifted from literals to config/env; pre-check stays fail-open.
  • 37 scheduler tests pass (incl. retry-path blank-error + pre-check config-deadline regressions).

Packaging: both new scheduler vars wired into the scheduler block of both docker-compose.yml and docker-compose.prod.yml → no #1039/#1056-class gap; feature reachable on deploy.

Security: CSO diff audit CLEAR (richer error strings via parameterized SQL only; dispatch raise interpolates only internal values) + independent scan clean.

  • Base → dev ✓ · conventional commits ✓ · CI all green · feature flows updated

Accepted with notes (reviewed, not blocking):

  • Scope: this PR also bundles 4 unrelated backend config-packaging fixes (BACKEND_URL, SLACK_SOCKET_CONNECTION_COUNT, TELEMETRY_CONTAINER_STATS_TTL, TELEMETRY_DOCKER_POOL_SIZE) + a config-drift report. All additive ${VAR:-default} lines (behavior unchanged when unset), isolated in their own commits, docker compose config -q validated, and they close real deploy-time packaging gaps — accepted as net-positive.
  • Follow-up nit: DISPATCH_TIMEOUT / PRE_CHECK_TIMEOUT are not yet in .env.example (the env levers work regardless via the compose defaults). Worth a 2-line doc add in a later PR.

# Conflicts:
#	docs/memory/feature-flows.md
@vybe vybe merged commit 9f12050 into dev Jun 16, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants