Description
The process_watcher_task cron task in the arqworker leaks SQLAlchemy sessions on every execution cycle (~60s). The process_event_worker_0 and process_event_worker_1 threads execute SELECT queries but never COMMIT or ROLLBACK, leaving connections in PostgreSQL's idle in transaction state until they accumulate and exhaust the connection pool.
This breaks all workflow execution (both alert-triggered and cron-scheduled) because the WorkflowScheduler thread shares the same pool and can no longer acquire connections.
This occurs at near-zero alert volume – our instance processes <1 alert/minute. It is not a scaling issue.
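For context, a minimal sketch of the leaky pattern versus a properly closed one, assuming SQLModel sessions roughly like Keep's db helpers use (the DSN, model, and function names below are illustrative, not the actual Keep code):

```python
from typing import Optional

from sqlmodel import Field, Session, SQLModel, create_engine, select

engine = create_engine("postgresql+psycopg2://keep:keep@localhost/keep")  # illustrative DSN


class WatchedAlert(SQLModel, table=True):  # minimal stand-in, not Keep's real Alert model
    id: Optional[int] = Field(default=None, primary_key=True)
    fingerprint: str


# Leaky pattern: the SELECT opens an implicit transaction; with no
# commit/rollback/close the connection stays "idle in transaction"
# and is never returned to the pool.
def get_alerts_leaky():
    session = Session(engine)
    return session.exec(select(WatchedAlert)).all()


# Closed pattern: the context manager closes the session on exit,
# which releases the connection back to the pool.
def get_alerts_closed():
    with Session(engine) as session:
        return session.exec(select(WatchedAlert)).all()
```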
Environment
- Keep version: 0.48.1 (Helm chart 0.1.94)
- PostgreSQL: 16.11 (CloudNativePG)
- Deployment: 4 gunicorn/uvicorn workers + 2 ARQ workers (REDIS=true, SCHEDULER=true, CONSUMER=true)
- Pool config: defaults (pool_size=5, max_overflow=10)
Reproduction
No special steps are required. After a fresh pod start, the pool exhausts on its own within 30-60 minutes.
Evidence
1. pg_stat_activity – leaked sessions
Connections accumulate in pairs every ~60s (one per check in process_watcher_task):
pid | state | xact_age | query
--------+---------------------+-----------------+-------
565037 | idle in transaction | 00:04:06 | SELECT alert... WHERE status='maintenance'
565038 | idle in transaction | 00:04:06 | SELECT alertenrichment... WHERE dismissed='true'
565039 | idle in transaction | 00:03:30 | SELECT alert... WHERE status='maintenance'
565040 | idle in transaction | 00:03:30 | SELECT alertenrichment... WHERE dismissed='true'
565044 | idle in transaction | 00:02:30 | SELECT alert... WHERE status='maintenance'
565041 | idle in transaction | 00:02:30 | SELECT alertenrichment... WHERE dismissed='true'
Full queries with traceparent:
-- From maintenance_windows_bl.py:184 (recover_strategy)
SELECT alert.id, alert.tenant_id, alert.timestamp, alert.provider_type,
alert.provider_id, alert.event, alert.fingerprint, alert.alert_hash
FROM alert
WHERE json_extract_path_text(alert.event, 'status') = 'maintenance'
-- From dismissal_expiry_bl.py:65 (get_alerts_with_expired_dismissals)
SELECT alertenrichment.id, alertenrichment.tenant_id, alertenrichment.timestamp,
alertenrichment.alert_fingerprint, alertenrichment.enrichments
FROM alertenrichment
WHERE (json_extract_path_text(alertenrichment.enrichments, 'dismissed') = 'true'
OR json_extract_path_text(alertenrichment.enrichments, 'dismissed') = 'True')
AND json_extract_path_text(alertenrichment.enrichments, 'dismissUntil') IS NOT NULL
AND json_extract_path_text(alertenrichment.enrichments, 'dismissUntil') != 'forever'
These queries execute successfully, but the session is never closed or committed.
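To confirm the same state on another deployment, here is a small monitoring sketch that reproduces the pg_stat_activity table above (psycopg2 and the DSN are assumptions; any PostgreSQL client works):

```python
import psycopg2  # assumption: psycopg2 is available; any PostgreSQL client works

# Lists sessions stuck "idle in transaction" and the age of their open
# transaction, ordered by transaction start (oldest first).
MONITOR_SQL = """
    SELECT pid,
           state,
           now() - xact_start AS xact_age,
           left(query, 60)    AS query
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    ORDER BY xact_start
"""

conn = psycopg2.connect("dbname=keep user=keep host=localhost")  # illustrative DSN
try:
    with conn.cursor() as cur:
        cur.execute(MONITOR_SQL)
        for pid, state, xact_age, query in cur.fetchall():
            print(pid, state, xact_age, query)
finally:
    conn.close()
```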
2. Pool exhaustion – only affects the arqworker scheduler process
"worker_type": "arqworker"
"process": 684
"threadName": "WorkflowScheduler_0"
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached,
connection timed out, timeout 30.00
All pool errors originate from a single process (the arqworker running SCHEDULER=true). The 4 uvicorn workers and the second arqworker (consumer-only) are completely unaffected.
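For reference, the numbers in the error match SQLAlchemy's QueuePool defaults; a sketch of the equivalent explicit engine configuration (the DSN is illustrative):

```python
from sqlalchemy import create_engine

# Matches the defaults listed in the environment section and the figures in
# the error: at most 5 pooled + 10 overflow connections, and a 30s wait
# before QueuePool raises TimeoutError.
engine = create_engine(
    "postgresql+psycopg2://keep:keep@localhost/keep",  # illustrative DSN
    pool_size=5,
    max_overflow=10,
    pool_timeout=30,
)
```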
3. Arq job health – 28% failure rate
j_complete=781 j_failed=304 j_retried=56 j_ongoing=0
4. Call chain
arq cron (every 60s)
  → async_process_watcher (process_watcher_task.py:30)
    → run_in_executor → MaintenanceWindowsBl.recover_strategy (maintenance_windows_bl.py:184)
      → get_maintenance_windows_started (db.py:5952)
        → session.exec(query) → connection acquired, never returned
    → run_in_executor → DismissalExpiryBl.check_dismissal_expiry (dismissal_expiry_bl.py:144)
      → get_alerts_with_expired_dismissals (dismissal_expiry_bl.py:65)
        → session.exec(query) → connection acquired, never returned
The process_event_worker_0 and process_event_worker_1 threads execute their queries and return, but the session's transaction remains open. The connection is not returned to the pool.
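A rough reconstruction of the offloading pattern, to show why the leak surfaces under the process_event_worker_* thread names (the executor wiring and function bodies below are assumptions for illustration, not Keep's actual code):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Executor threads end up named process_event_worker_0 / _1, matching the
# thread names seen in the logs (illustrative reconstruction).
executor = ThreadPoolExecutor(max_workers=2, thread_name_prefix="process_event_worker")


def check_maintenance_windows():
    # Hypothetical stand-in for MaintenanceWindowsBl.recover_strategy:
    # opens a session, runs its SELECT, and returns without closing it.
    ...


def check_dismissal_expiry():
    # Hypothetical stand-in for DismissalExpiryBl.check_dismissal_expiry.
    ...


async def async_process_watcher():
    loop = asyncio.get_running_loop()
    # Each ~60s cron cycle offloads both blocking checks to the executor, so
    # every cycle leaves two more connections stuck "idle in transaction".
    await loop.run_in_executor(executor, check_maintenance_windows)
    await loop.run_in_executor(executor, check_dismissal_expiry)
```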
Impact
| Component | Status |
|---|---|
| API / UI | Working (uvicorn workers have separate pools) |
| Alert ingestion | Working |
| WorkflowScheduler | Broken (can't query Workflow table) |
| Alert-triggered workflows | Broken |
| Cron workflows | Broken |
Workaround
Setting PostgreSQL's idle_in_transaction_session_timeout to 60s (the default, 0, disables the timeout) causes the server to kill the leaked connections before the next cron cycle creates new ones. Combine this with KEEP_DB_PRE_PING_ENABLED=true so the pool detects the killed connections.
This is a server-side bandaid – the application should properly close its sessions.
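Assuming KEEP_DB_PRE_PING_ENABLED maps to SQLAlchemy's pool_pre_ping (not verified against Keep's engine setup), the client-side half of the workaround looks roughly like this:

```python
from sqlalchemy import create_engine

# pool_pre_ping issues a lightweight liveness check on connection checkout,
# so connections killed by idle_in_transaction_session_timeout are detected
# and replaced instead of surfacing as errors inside the scheduler.
engine = create_engine(
    "postgresql+psycopg2://keep:keep@localhost/keep",  # illustrative DSN
    pool_pre_ping=True,
)
```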
Related
- [🐛 Bug]: Critical database connection issues under high alert load #5496 – reports pool exhaustion under high load (same root cause, different framing)
- [🐛 Bug]: postgresql connection tune #5480 – reports stale connection errors on process_event_worker_1 (same thread)
- PR feat: allow to config pool_recycle and pool_timeout #5474 – attempted to add pool_recycle/pool_timeout config (closed, not merged)