fix(scheduler): persist descriptive error on dispatch timeout (#1022)#1202
Merged
Conversation
Cron-triggered executions could land as schedule_executions.status='failed'
with a blank `error` and no `response`, giving operators zero diagnostics. Two
scheduler defects combined: the scheduler->backend execute-task POST had a
hardcoded 30s deadline with no try/except, so an httpx.TimeoutException
propagated raw; and the generic failure handler recorded str(e), which is the
empty string for httpx timeouts -- persisted verbatim.
- config.py: add tunable dispatch_timeout (DISPATCH_TIMEOUT, default 30) and
pre_check_timeout (PRE_CHECK_TIMEOUT, default 70) float fields, following the
agent_timeout default_factory idiom.
- service.py: _describe_exception() normalizer (never persist a blank/whitespace
error -- falls back to the type name); narrow try/except httpx.TimeoutException
around the dispatch POST that re-raises an outcome-neutral, descriptive message
("... timed out after Ns (ReadTimeout) -- outcome unknown") with `from e`;
source the deadline from config; apply the normalizer at the cron, retry, and
process-schedule generic handlers; _run_pre_check uses config.pre_check_timeout
+ a typed fail-open log. Stale 30s/70s comments updated to reference config.
- docker-compose{,.prod}.yml: declare DISPATCH_TIMEOUT/PRE_CHECK_TIMEOUT as
host-overridable env vars on the scheduler service.
The backend event-loop stall that triggers the timeout, and the dispatch-timeout
control-flow gap (orphan task / no retry / read-then-write race), are out of
scope and tracked as a follow-up.
Tests (tests/scheduler_tests/): config asserts (both fields default + override),
_describe_exception both branches, descriptive-raise on ReadTimeout(''),
config-driven dispatch timeout, and a regression proving the persisted error is
non-empty end-to-end (the #1022 signature, inverted). 189 passed.
Closes #1022
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
) `_execute_retry` is the second persistence site the #1022 fix touched. Add TestRetryDispatchTimeoutRegression proving a blank-stringifying httpx.ReadTimeout('') raised during a retry dispatch lands a non-empty `error` on the retry's FAILED row (the #1022 signature, inverted). Refs #1022 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lank (#1022) Add TestRunPreCheckTimeout: the pre-check POST deadline now comes from config.pre_check_timeout (was a 70.0 literal), and a blank-stringifying httpx.ReadTimeout('') still fails open (returns None) rather than propagating. Refs #1022 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update scheduler feature flows for the #1022 fix: new DISPATCH_TIMEOUT / PRE_CHECK_TIMEOUT config rows, the named non-blank dispatch-timeout raise plus the _describe_exception() normalizer, the error-handling matrix entries, the test catalog, the change history, and a Recent Updates index row. Refs #1022 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BACKEND_URL is read by config.py to build OAuth redirect URIs
({BACKEND_URL}/api/oauth/{provider}/callback) and is documented in
.env.example, but neither compose file listed it in the backend
service's environment block and there is no env_file: directive — so
setting it in .env had no effect and callbacks silently fell back to
the localhost default (the #1039 packaging-gap class).
Wire it into the backend env in both compose files with a
behaviour-preserving :-http://localhost:8000 default (matches the
config.py default and the .env.example value). Deliberately NOT wired
into the scheduler, which reads BACKEND_URL with its own internal
default http://backend:8000 — injecting the backend's localhost default
there would break the scheduler->backend hop.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…container The Slack Socket Mode connection count (#244) is read by adapters/transports/slack_socket.py and documented in .env.example, but was absent from both compose files' backend environment block — so an operator tuning it in .env had no effect (the #1039 packaging-gap class). Wire it through with a :-2 default matching the code default (value is clamped to [1,10]). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TELEMETRY_CONTAINER_STATS_TTL and TELEMETRY_DOCKER_POOL_SIZE are read by routers/telemetry.py and documented in .env.example, but were absent from both compose files' backend environment block — so tuning them in .env had no effect (the #1039 packaging-gap class). Wire both through with :-10 / :-16 defaults matching the code defaults. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Output of the config-validation pass over .env.example, docker-compose*.yml and the backend/scheduler/mcp sources. Records the one functional finding (four documented .env vars not forwarded to the backend container — now resolved) plus the informational compose- and code-only knobs absent from .env.example. Follows the existing docs/reports/ convention. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Resolve by running |
vybe
approved these changes
Jun 15, 2026
vybe
left a comment
Contributor
There was a problem hiding this comment.
✅ Validated & approved (/validate-pr)
P1 reliability bug fix (Closes #1022). 12 files, +544/-18.
Core fix (correct & complete):
_describe_exception()(.strip()→ type-name fallback) ensures no blank/whitespaceerroris ever persisted, applied at every cron/retry/process-schedule path.- Narrow
try/except httpx.TimeoutExceptionaround the dispatch POST re-raises a named, non-blank, outcome-neutral message — correctly framed as UNKNOWN (the backend spawns the bg task before replying → may already be running → orphan recovered by cleanup), not 'rejected'. DISPATCH_TIMEOUT(30s) /PRE_CHECK_TIMEOUT(70s) lifted from literals to config/env; pre-check stays fail-open.- 37 scheduler tests pass (incl. retry-path blank-error + pre-check config-deadline regressions).
Packaging: both new scheduler vars wired into the scheduler block of both docker-compose.yml and docker-compose.prod.yml → no #1039/#1056-class gap; feature reachable on deploy.
Security: CSO diff audit CLEAR (richer error strings via parameterized SQL only; dispatch raise interpolates only internal values) + independent scan clean.
- Base →
dev✓ · conventional commits ✓ · CI all green · feature flows updated
Accepted with notes (reviewed, not blocking):
- Scope: this PR also bundles 4 unrelated backend config-packaging fixes (
BACKEND_URL,SLACK_SOCKET_CONNECTION_COUNT,TELEMETRY_CONTAINER_STATS_TTL,TELEMETRY_DOCKER_POOL_SIZE) + a config-drift report. All additive${VAR:-default}lines (behavior unchanged when unset), isolated in their own commits,docker compose config -qvalidated, and they close real deploy-time packaging gaps — accepted as net-positive. - Follow-up nit:
DISPATCH_TIMEOUT/PRE_CHECK_TIMEOUTare not yet in.env.example(the env levers work regardless via the compose defaults). Worth a 2-line doc add in a later PR.
# Conflicts: # docs/memory/feature-flows.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cron-triggered executions could land as
schedule_executions.status='failed'with a blankerrorand noresponse, giving operators zero diagnostics. Ascheduler→backenddispatch (or pre-check)httpx.TimeoutExceptionstringifies to'', which the generic failure handler persisted verbatim.The fix:
_describe_exception()normalizer — never persists a blank/whitespaceerror; falls back to the exception type name. Applied at every cron / retry / process-schedule error path.try/except httpx.TimeoutExceptionaround the dispatch POST that re-raises a named, non-blank, outcome-neutral message ("dispatch ... timed out after {N}s ({TimeoutType}) — outcome unknown"). The outcome is genuinely UNKNOWN (the backend spawns the bg task before replying → may already be running → orphan recovered by cleanup).DISPATCH_TIMEOUT(default 30s) andPRE_CHECK_TIMEOUT(default 70s), declared on the scheduler service in both compose files.Scheduler-only (
src/scheduler/). The backend event-loop stall that triggers the timeout, and the dispatch-timeout control-flow gap (orphan/no-retry/read-then-write race), are explicitly out of scope and tracked as a follow-up.What's in this branch
fix(scheduler): persist descriptive error on dispatch timeoutconfig.py,service.py, both compose files + core teststest(scheduler): regression for retry-path dispatch-timeout errorTestRetryDispatchTimeoutRegression— retry path persists non-empty errortest(scheduler): pre-check timeout is config-driven + fails open on blankTestRunPreCheckTimeout— config deadline + blank-timeout fail-opendocs(scheduler): document dispatch-timeout error persistencedocs(security): CSO diff audit for #1022 — CLEARTest evidence
The 3 net-new tests (
TestRetryDispatchTimeoutRegression,TestRunPreCheckTimeout::*) pass; together they assert the persistederroris never blank and the pre-check still fails open on a blank-stringifying timeout.Security
CSO diff audit (
docs/security-reports/cso-2026-06-13-1022-diff.md): CLEAR — richererrorstrings are written only via parameterized SQL; the dispatch raise interpolates only internal values (timeout float, static path, type name); no secret-leak via exception text.Closes #1022
🤖 Generated with Claude Code