Skip to content

Improve storage efficiency by deleting unnecessary source data #203

@witash

Description

@witash

Background

CHT Sync currently requires a full copy of all synced CouchDB databases in PostgreSQL. This leads to high storage usage, consuming a significant fraction of the live database size.

Once data is moved from the source table into dbt model tables, some of it is no longer needed in the source table. Deleting this unnecessary data could free up substantial storage space.

However, ensuring accuracy and avoiding disruptions to model development while deleting data poses challenges.

Key Questions

  1. How can we determine which data is safe to delete?
    Once data_records and contacts are moved to the form and contact models, all downstream models (dashboards, aggregates) are expected to query from these models rather than the source table.

    • If this behavior can be guaranteed and form and contact models change infrequently, records in the source table could be deleted once these models are populated.
    • A similar approach might work for tasks, where only a small amount of metadata is needed while most source data is operational and unnecessary in PostgreSQL.
  2. How do we handle changes to models that depend on the source table?
    couch2pg lacks a mechanism to selectively re-add deleted records. If models that rely on the source table need to be rebuilt, but the original data has been deleted, there must be a way to restore it.

  3. How can we prevent users from querying the source table directly?
    While aggregates and dashboard models are expected to query from form and contact models, nothing technically prevents developers from querying the source table directly.

    • If source data is deleted, such queries would produce incorrect results silently.
    • How can we enforce best practices and prevent accidental reliance on deleted source data?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions