ilidemi
Contributor

@ilidemi ilidemi commented Jul 25, 2025

Track snapshot, sync, normalize, and slot lag status granularly: sync and normalize can fail and recover independently, and so can QRep runs and partitions. These statuses will eventually be rolled up into the mirror-level status:degraded.
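A minimal sketch of how per-component statuses could roll up into a mirror-level value. This is not the code in this PR: `ComponentStatus`, `MirrorHealth`, `RollUp`, and the `"healthy"` result are hypothetical names chosen for illustration; only `degraded` comes from the description above.

```go
package main

import "fmt"

// ComponentStatus is a hypothetical per-component health value.
type ComponentStatus string

const (
	StatusHealthy ComponentStatus = "healthy"
	StatusFailing ComponentStatus = "failing"
)

// MirrorHealth is a hypothetical container for the granular statuses.
// Each field can change independently of the others.
type MirrorHealth struct {
	Snapshot       ComponentStatus
	Sync           ComponentStatus
	Normalize      ComponentStatus
	SlotLag        ComponentStatus
	QRepPartitions map[string]ComponentStatus // keyed by partition ID (assumed)
}

// RollUp returns "degraded" if any component is currently failing,
// and "healthy" otherwise. The "healthy" label is an assumption.
func (h MirrorHealth) RollUp() string {
	for _, s := range []ComponentStatus{h.Snapshot, h.Sync, h.Normalize, h.SlotLag} {
		if s == StatusFailing {
			return "degraded"
		}
	}
	for _, s := range h.QRepPartitions {
		if s == StatusFailing {
			return "degraded"
		}
	}
	return "healthy"
}

func main() {
	h := MirrorHealth{
		Snapshot:  StatusHealthy,
		Sync:      StatusFailing, // sync fails while normalize keeps working
		Normalize: StatusHealthy,
		SlotLag:   StatusHealthy,
	}
	fmt.Println(h.RollUp())

	h.Sync = StatusHealthy // sync recovers on its own
	fmt.Println(h.RollUp())
}
```

Because the roll-up is derived from the current component values, a component that recovers clears the degraded state without any separate reset step.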

Every error in the snapshot and running CDC flows is now reported to flow_errors. Alerter logging calls are consolidated at the top-level activity rather than being sprinkled throughout the code.

Todo:

  • Add the extra lookup fields and indices into flow_errors
  • Double check PG writes are reliable
  • Report the new status and errors in MirrorStatus
  • Double check the lifecycles of status values
  • What happens when something is erroring out and a signal to switch then arrives? Will there be a blip of stale status in the future?
  • Why are we treating ApplicationErrors and replState changed/slot is already active in a special way?
  • Either log internal errors as internal errors and remove _is_internal_error, or filter them out and use the field
  • Slot lag threshold should come from the individual overrides in the catalog
  • Thresholding like IMR does - either move it over or integrate
  • Testing

@ilidemi ilidemi requested review from iamKunalGupta and serprex July 25, 2025 10:02
return flowStatus, nil
}

func UpdateFlowStatusInCatalog(ctx context.Context, pool shared.CatalogPool,
Member

this can be a local activity
