Fix SQS Listener Reconnection & EventBus Capacity Checks on NATS connection drops #14

Merged
abdulazillow merged 1 commit into feature/zg from dhanashrit/AIP-10025/fix-sqs-publish-errors-on-eventbus-leader-pod-restart
Jan 20, 2026

@dhanashritidke11
Collaborator

Fix: SQS Listener Reconnection & EventBus Capacity Checks

Summary

Fix SQS publish errors that occur when the EventBus (NATS) connection drops during pod restarts. The fix refreshes each SQS listener's EventBus connection after a reconnect and simplifies the capacity check logic.


Context & Problem

Before

  • Stale Connections: SQS listeners held stale EventBus connection references after NATS restarts, leading to dispatch failures.
  • Strict Capacity Checks: Capacity validation required a connection reference that was not closed, so the check failed during transient disconnects.
  • Impact: Messages were redirected to the DLQ (Dead Letter Queue) after reaching maximum retries.

Error Log (Old Behavior)

{
   "cna": "aianalytics-sandbox-k8s-1",
   "ctr": "main/aip-synthetic-eventing-sqs-test-eventsource-j2g9h-5c7598cfs2c7s",
   "env": "aianalytics_sandbox",
   "lvl": "INFO",
   "msg": "2026-01-13T07:35:34.686Z ERROR argo-events.eventsource awssqs/start.go:288 failed to dispatch SQS event",
   "error": "failed after retries: nats: maximum messages exceeded",
   "pod": "aip-synthetic-eventing-sqs-test-eventsource-j2g9h-5c7598cfs2c7s"
}

Changes

  • Connection Refresh: Updated eventsources/eventing.go to track SQS listeners and refresh their EventBus connection upon reconnection.
  • Logic Simplification: Capacity checks in eventsources/sources/awssqs/start.go now only require a non-nil connection.
  • Log Level Update: Downgraded "MaxMsgs limit reached" from ERROR to DEBUG to reduce log noise.
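
The first two changes can be illustrated with a minimal Go sketch. This is not the code from eventsources/eventing.go or awssqs/start.go; every type and function name below (Conn, Listener, Registry, Refresh, CanCheckCapacity) is a hypothetical stand-in chosen for illustration. It assumes a registry that tracks listeners and swaps in the new connection after a reconnect, plus a capacity precondition that rejects only a nil connection.

```go
package main

import "sync"

// Conn is a hypothetical stand-in for the EventBus (NATS) connection handle.
type Conn struct {
	ID     int
	Closed bool
}

// Listener is a hypothetical SQS listener holding a swappable connection.
type Listener struct {
	mu   sync.RWMutex
	conn *Conn
}

// SetConn replaces the listener's connection reference.
func (l *Listener) SetConn(c *Conn) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.conn = c
}

// Conn returns the connection the listener currently holds.
func (l *Listener) Conn() *Conn {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.conn
}

// CanCheckCapacity applies the relaxed rule: only a nil connection is
// rejected. A closed connection is tolerated, on the assumption that a
// refresh will follow the reconnect.
func (l *Listener) CanCheckCapacity() bool {
	return l.Conn() != nil
}

// Registry tracks active listeners so a reconnect can reach all of them.
type Registry struct {
	mu        sync.Mutex
	listeners map[*Listener]struct{}
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{listeners: make(map[*Listener]struct{})}
}

// Register starts tracking a listener.
func (r *Registry) Register(l *Listener) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.listeners[l] = struct{}{}
}

// Unregister stops tracking a listener, for example when it shuts down.
func (r *Registry) Unregister(l *Listener) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.listeners, l)
}

// Refresh hands the new connection to every tracked listener and returns
// how many listeners were updated.
func (r *Registry) Refresh(c *Conn) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	for l := range r.listeners {
		l.SetConn(c)
	}
	return len(r.listeners)
}
```

In this sketch a reconnect handler would call Refresh with the new connection, so no listener keeps dispatching through the stale reference.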

Log Snippet (New Behavior)

{
   "env": "aianalytics_sandbox",
   "lvl": "INFO",
   "msg": "2026-01-20T19:44:12.102Z INFO argo-events.eventsource eventing.go reconnected to eventbus successfully",
   "eventSource": "aip-synthetic-eventing-sqs-test"
}
{
   "env": "aianalytics_sandbox",
   "lvl": "DEBUG",
   "msg": "2026-01-20T19:44:12.445Z DEBUG argo-events.eventsource awssqs/start.go EventBus at capacity - MaxMsgs limit reached",
   "eventSource": "aip-synthetic-eventing-sqs-test"
}

Testing Plan

  1. Initial State: Establish steady-state message flow from SQS to the EventBus.
  2. Trigger Failover: Restart the EventBus leader pod (NATS) to force a connection drop.
  3. Verify Reconnect: Monitor logs for successful reconnection in eventing.go.
  4. Validation:
     • Ensure SQS listeners pick up the new connection.
     • Confirm that capacity checks succeed once the connection is re-established after the EventBus leader pod restart.
     • Verify that messages resume flowing without being redirected to the DLQ.

@dhanashritidke11 dhanashritidke11 self-assigned this Jan 20, 2026
@abdulazillow abdulazillow merged commit 381a9bf into feature/zg Jan 20, 2026
2 of 3 checks passed
