[πŸ› Bug]: process_watcher_task leaks SQLAlchemy sessions β€” pool exhaustion in arqworkerΒ #5610

Description

The process_watcher_task cron task in the arqworker leaks SQLAlchemy sessions on every execution cycle (~60s). The process_event_worker_0 and process_event_worker_1 threads execute SELECT queries but never COMMIT or ROLLBACK, leaving connections in PostgreSQL's idle in transaction state until they accumulate and exhaust the connection pool.
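
For illustration, the leak pattern amounts to something like the following minimal sketch (not Keep's actual code; the DSN and function are placeholders). The first statement implicitly opens a transaction and checks a connection out of the pool; because the session is never committed, rolled back, or closed, the connection stays checked out and appears server-side as idle in transaction:

# Minimal sketch of the leak pattern (illustrative only, not Keep's code)
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

engine = create_engine("postgresql+psycopg2://keep:***@db:5432/keep")  # placeholder DSN

def leaky_check():
    session = Session(engine)
    # The first execute() implicitly BEGINs a transaction and checks a
    # connection out of the pool.
    rows = session.execute(text("SELECT 1")).all()
    return rows
    # Missing: session.commit() / rollback() / close(). The transaction stays
    # open, the server reports "idle in transaction", and the pool never gets
    # the connection back.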

This breaks all workflow execution (both alert-triggered and cron-scheduled) because the WorkflowScheduler thread shares the same pool and can no longer acquire connections.

This occurs at near-zero alert volume: our instance processes <1 alert/minute. It is not a scaling issue.

Environment

  • Keep version: 0.48.1 (Helm chart 0.1.94)
  • PostgreSQL: 16.11 (CloudNativePG)
  • Deployment: 4 gunicorn/uvicorn workers + 2 ARQ workers (REDIS=true, SCHEDULER=true, CONSUMER=true)
  • Pool config: defaults (pool_size=5, max_overflow=10)
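
For reference, these defaults correspond to SQLAlchemy's QueuePool settings, which also explain the 30-second checkout timeout in the error shown below. A hedged sketch of the effective configuration (Keep's actual engine construction may differ; the DSN is a placeholder):

# Effective pool settings implied by the defaults above (sketch only).
# QueuePool defaults: pool_size=5, max_overflow=10, pool_timeout=30.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://keep:***@db:5432/keep",  # placeholder DSN
    pool_size=5,      # connections kept open in the pool
    max_overflow=10,  # extra connections allowed beyond pool_size (15 total)
    pool_timeout=30,  # seconds to wait for a free connection before raising
                      # sqlalchemy.exc.TimeoutError ("QueuePool limit ... reached")
)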

Reproduction

No special steps are required: after a fresh pod start, the pool exhausts on its own within 30-60 minutes.

Evidence

1. pg_stat_activity: leaked sessions

Connections accumulate in pairs every ~60s (one per check in process_watcher_task):

  pid   |        state        |    xact_age     | query
--------+---------------------+-----------------+-------
 565037 | idle in transaction | 00:04:06        | SELECT alert... WHERE status='maintenance'
 565038 | idle in transaction | 00:04:06        | SELECT alertenrichment... WHERE dismissed='true'
 565039 | idle in transaction | 00:03:30        | SELECT alert... WHERE status='maintenance'
 565040 | idle in transaction | 00:03:30        | SELECT alertenrichment... WHERE dismissed='true'
 565044 | idle in transaction | 00:02:30        | SELECT alert... WHERE status='maintenance'
 565041 | idle in transaction | 00:02:30        | SELECT alertenrichment... WHERE dismissed='true'

Full queries with traceparent:

-- From maintenance_windows_bl.py:184 (recover_strategy)
SELECT alert.id, alert.tenant_id, alert.timestamp, alert.provider_type,
       alert.provider_id, alert.event, alert.fingerprint, alert.alert_hash
FROM alert
WHERE json_extract_path_text(alert.event, 'status') = 'maintenance'

-- From dismissal_expiry_bl.py:65 (get_alerts_with_expired_dismissals)
SELECT alertenrichment.id, alertenrichment.tenant_id, alertenrichment.timestamp,
       alertenrichment.alert_fingerprint, alertenrichment.enrichments
FROM alertenrichment
WHERE (json_extract_path_text(alertenrichment.enrichments, 'dismissed') = 'true'
       OR json_extract_path_text(alertenrichment.enrichments, 'dismissed') = 'True')
  AND json_extract_path_text(alertenrichment.enrichments, 'dismissUntil') IS NOT NULL
  AND json_extract_path_text(alertenrichment.enrichments, 'dismissUntil') != 'forever'

These queries execute successfully but the session is never closed/committed.
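
The pg_stat_activity view above can be reproduced with a query along these lines (a monitoring sketch, not part of Keep; the DSN is a placeholder):

# Monitoring sketch: list connections stuck in "idle in transaction",
# as shown in the table above.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://keep:***@db:5432/keep")  # placeholder DSN

with engine.connect() as conn:
    rows = conn.execute(text(
        "SELECT pid, state, now() - xact_start AS xact_age, left(query, 60) AS query "
        "FROM pg_stat_activity "
        "WHERE state = 'idle in transaction' "
        "ORDER BY xact_start"
    ))
    for pid, state, xact_age, query in rows:
        print(pid, state, xact_age, query)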

2. Pool exhaustion: only the arqworker scheduler process is affected

"worker_type": "arqworker"
"process": 684
"threadName": "WorkflowScheduler_0"

sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,
connection timed out, timeout 30.00

All pool errors originate from a single process (the arqworker running SCHEDULER=true). The 4 uvicorn workers and the second arqworker (consumer-only) are completely unaffected.

3. Arq job health: 28% failure rate

j_complete=781 j_failed=304 j_retried=56 j_ongoing=0

4. Call chain

arq cron (every 60s)
  → async_process_watcher (process_watcher_task.py:30)
    → run_in_executor → MaintenanceWindowsBl.recover_strategy (maintenance_windows_bl.py:184)
      → get_maintenance_windows_started (db.py:5952)
        → session.exec(query)  ← connection acquired, never returned
    → run_in_executor → DismissalExpiryBl.check_dismissal_expiry (dismissal_expiry_bl.py:144)
      → get_alerts_with_expired_dismissals (dismissal_expiry_bl.py:65)
        → session.exec(query)  ← connection acquired, never returned

The process_event_worker_0 and process_event_worker_1 threads execute their queries and return, but the session's transaction remains open. The connection is not returned to the pool.
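
A minimal sketch of the kind of fix this implies (assumed, not Keep's actual code; the helper names in the call chain are real, the body below is illustrative): use the session as a context manager so it is always closed, which ends the read transaction and returns the connection to the pool.

# Sketch of the missing cleanup (illustrative only; the real fix belongs in
# get_maintenance_windows_started / get_alerts_with_expired_dismissals).
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

engine = create_engine("postgresql+psycopg2://keep:***@db:5432/keep")  # placeholder DSN

def get_alerts_in_maintenance():
    # Context-managed session: on exit it is closed, the implicit read
    # transaction is rolled back, and the connection returns to the pool.
    with Session(engine) as session:
        result = session.execute(text(
            "SELECT id FROM alert "
            "WHERE json_extract_path_text(event, 'status') = 'maintenance'"
        ))
        return result.all()

An equivalent approach is an explicit session.commit() or session.close() in a finally block; either way the transaction must end before the function returns.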

Impact

Component                  | Status
---------------------------+----------------------------------------------
API / UI                   | Working (uvicorn workers have separate pools)
Alert ingestion            | Working
WorkflowScheduler          | Broken (can't query the Workflow table)
Alert-triggered workflows  | Broken
Cron workflows             | Broken

Workaround

Setting PostgreSQL's idle_in_transaction_session_timeout to 60s (the default is 0, i.e. disabled) makes the server kill the leaked connections before the next cron cycle creates new ones. Combine this with KEEP_DB_PRE_PING_ENABLED=true so the pool detects the killed connections on checkout.

This is a server-side band-aid; the application should properly close its sessions.
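
Expressed client-side, the same behavior would look roughly like the sketch below (assumptions: the workaround above applies the timeout server-side rather than per connection, and KEEP_DB_PRE_PING_ENABLED=true is assumed to map to SQLAlchemy's pool_pre_ping; the DSN is a placeholder):

# Client-side sketch of the workaround (assumptions noted in the comments).
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://keep:***@db:5432/keep",  # placeholder DSN
    # Assumed mapping of KEEP_DB_PRE_PING_ENABLED=true: test each connection on
    # checkout so connections killed by the server are replaced instead of erroring.
    pool_pre_ping=True,
    # Per-connection GUC via libpq options: kill sessions idle in a transaction
    # for more than 60s (value in milliseconds).
    connect_args={"options": "-c idle_in_transaction_session_timeout=60000"},
)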
