Description
The process_watcher_task cron task in the arqworker leaks SQLAlchemy sessions on every execution cycle (~60s). The process_event_worker_0 and process_event_worker_1 threads execute SELECT queries but never COMMIT or ROLLBACK, leaving connections in PostgreSQL's idle in transaction state until they accumulate and exhaust the connection pool.
This breaks all workflow execution (both alert-triggered and cron-scheduled) because the WorkflowScheduler thread shares the same pool and can no longer acquire connections.
This occurs at near-zero alert volume – our instance processes <1 alert/minute. It is not a scaling issue.
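For context, a minimal sketch of the leaky pattern versus a properly closed one, assuming SQLModel sessions roughly like Keep's db helpers use (the DSN, model, and function names below are illustrative, not the actual Keep code):

```python
from typing import Optional

from sqlmodel import Field, Session, SQLModel, create_engine, select

engine = create_engine("postgresql+psycopg2://keep:keep@localhost/keep")  # illustrative DSN


class WatchedAlert(SQLModel, table=True):  # minimal stand-in, not Keep's real Alert model
    id: Optional[int] = Field(default=None, primary_key=True)
    fingerprint: str


# Leaky pattern: the SELECT opens an implicit transaction; with no
# commit/rollback/close the connection stays "idle in transaction"
# and is never returned to the pool.
def get_alerts_leaky():
    session = Session(engine)
    return session.exec(select(WatchedAlert)).all()


# Closed pattern: the context manager closes the session on exit,
# which releases the connection back to the pool.
def get_alerts_closed():
    with Session(engine) as session:
        return session.exec(select(WatchedAlert)).all()
```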
Environment
- Keep version: 0.48.1 (Helm chart 0.1.94)
- PostgreSQL: 16.11 (CloudNativePG)
- Deployment: 4 gunicorn/uvicorn workers + 2 ARQ workers (REDIS=true, SCHEDULER=true, CONSUMER=true)
- Pool config: defaults (pool_size=5, max_overflow=10)
Reproduction
No special steps are required. After a fresh pod start, the pool exhausts on its own within 30-60 minutes.
Evidence
1. pg_stat_activity – leaked sessions
Connections accumulate in pairs every ~60s (one per check in process_watcher_task):
pid | state | xact_age | query
--------+---------------------+-----------------+-------
565037 | idle in transaction | 00:04:06 | SELECT alert... WHERE status='maintenance'
565038 | idle in transaction | 00:04:06 | SELECT alertenrichment... WHERE dismissed='true'
565039 | idle in transaction | 00:03:30 | SELECT alert... WHERE status='maintenance'
565040 | idle in transaction | 00:03:30 | SELECT alertenrichment... WHERE dismissed='true'
565044 | idle in transaction | 00:02:30 | SELECT alert... WHERE status='maintenance'
565041 | idle in transaction | 00:02:30 | SELECT alertenrichment... WHERE dismissed='true'
Full queries with traceparent:
-- From maintenance_windows_bl.py:184 (recover_strategy)
SELECT alert.id, alert.tenant_id, alert.timestamp, alert.provider_type,
alert.provider_id, alert.event, alert.fingerprint, alert.alert_hash
FROM alert
WHERE json_extract_path_text(alert.event, 'status') = 'maintenance'
-- From dismissal_expiry_bl.py:65 (get_alerts_with_expired_dismissals)
SELECT alertenrichment.id, alertenrichment.tenant_id, alertenrichment.timestamp,
alertenrichment.alert_fingerprint, alertenrichment.enrichments
FROM alertenrichment
WHERE (json_extract_path_text(alertenrichment.enrichments, 'dismissed') = 'true'
OR json_extract_path_text(alertenrichment.enrichments, 'dismissed') = 'True')
AND json_extract_path_text(alertenrichment.enrichments, 'dismissUntil') IS NOT NULL
AND json_extract_path_text(alertenrichment.enrichments, 'dismissUntil') != 'forever'
These queries execute successfully, but the session is never closed or committed.
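To confirm the same state on another deployment, here is a small monitoring sketch that reproduces the pg_stat_activity table above (psycopg2 and the DSN are assumptions; any PostgreSQL client works):

```python
import psycopg2  # assumption: psycopg2 is available; any PostgreSQL client works

# Lists sessions stuck "idle in transaction" and the age of their open
# transaction, ordered by transaction start (oldest first).
MONITOR_SQL = """
    SELECT pid,
           state,
           now() - xact_start AS xact_age,
           left(query, 60)    AS query
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    ORDER BY xact_start
"""

conn = psycopg2.connect("dbname=keep user=keep host=localhost")  # illustrative DSN
try:
    with conn.cursor() as cur:
        cur.execute(MONITOR_SQL)
        for pid, state, xact_age, query in cur.fetchall():
            print(pid, state, xact_age, query)
finally:
    conn.close()
```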
2. Pool exhaustion – only affects the arqworker scheduler process
"worker_type": "arqworker"
"process": 684
"threadName": "WorkflowScheduler_0"
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,
connection timed out, timeout 30.00
All pool errors originate from a single process (the arqworker running SCHEDULER=true). The 4 uvicorn workers and the second arqworker (consumer-only) are completely unaffected.
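For reference, the numbers in the error match SQLAlchemy's QueuePool defaults; a sketch of the equivalent explicit engine configuration (the DSN is illustrative):

```python
from sqlalchemy import create_engine

# Matches the defaults listed in the environment section and the figures in
# the error: at most 5 pooled + 10 overflow connections, and a 30s wait
# before QueuePool raises TimeoutError.
engine = create_engine(
    "postgresql+psycopg2://keep:keep@localhost/keep",  # illustrative DSN
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
)
```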
3. Arq job health – 28% failure rate
j_complete=781 j_failed=304 j_retried=56 j_ongoing=0
4. Call chain
arq cron (every 60s)
  → async_process_watcher (process_watcher_task.py:30)
    → run_in_executor → MaintenanceWindowsBl.recover_strategy (maintenance_windows_bl.py:184)
      → get_maintenance_windows_started (db.py:5952)
        → session.exec(query) → connection acquired, never returned
    → run_in_executor → DismissalExpiryBl.check_dismissal_expiry (dismissal_expiry_bl.py:144)
      → get_alerts_with_expired_dismissals (dismissal_expiry_bl.py:65)
        → session.exec(query) → connection acquired, never returned
The process_event_worker_0 and process_event_worker_1 threads execute their queries and return, but the session's transaction remains open. The connection is not returned to the pool.
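A rough reconstruction of the offloading pattern, to show why the leak surfaces under the process_event_worker_* thread names (the executor wiring and function bodies below are assumptions for illustration, not Keep's actual code):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Executor threads end up named process_event_worker_0 / _1, matching the
# thread names seen in the logs (illustrative reconstruction).
executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="process_event_worker")


def check_maintenance_windows():
    # Hypothetical stand-in for MaintenanceWindowsBl.recover_strategy:
    # opens a session, runs its SELECT, and returns without closing it.
    ...


def check_dismissal_expiry():
    # Hypothetical stand-in for DismissalExpiryBl.check_dismissal_expiry.
    ...


async def async_process_watcher():
    loop = asyncio.get_running_loop()
    # Each ~60s cron cycle offloads both blocking checks to the executor, so
    # every cycle leaves two more connections stuck "idle in transaction".
    await loop.run_in_executor(executor, check_maintenance_windows)
    await loop.run_in_executor(executor, check_dismissal_expiry)
```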
Impact
| Component | Status |
|---|---|
| API / UI | Working (uvicorn workers have separate pools) |
| Alert ingestion | Working |
| WorkflowScheduler | Broken (can't query Workflow table) |
| Alert-triggered workflows | Broken |
| Cron workflows | Broken |
Workaround
Setting PostgreSQL's idle_in_transaction_session_timeout to 60s (the default, 0, disables the timeout) causes the server to kill the leaked connections before the next cron cycle creates new ones. Combine this with KEEP_DB_PRE_PING_ENABLED=true so the pool detects the killed connections.
This is a server-side bandaid – the application should properly close its sessions.
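Assuming KEEP_DB_PRE_PING_ENABLED maps to SQLAlchemy's pool_pre_ping (not verified against Keep's engine setup), the client-side half of the workaround looks roughly like this:

```python
from sqlalchemy import create_engine

# pool_pre_ping issues a lightweight liveness check on connection checkout,
# so connections killed by idle_in_transaction_session_timeout are detected
# and replaced instead of surfacing as errors inside the scheduler.
engine = create_engine(
    "postgresql+psycopg2://keep:keep@localhost/keep",  # illustrative DSN
    pool_pre_ping=True,
)
```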
Related
- [🐛 Bug]: Critical database connection issues under high alert load #5496 – reports pool exhaustion under high load (same root cause, different framing)
- [🐛 Bug]: postgresql connection tune #5480 – reports stale connection errors on process_event_worker_1 (same thread)
- PR feat: allow to config pool_recycle and pool_timeout #5474 – attempted to add pool_recycle/pool_timeout config (closed, not merged)