
[Feature] Safeguard to never allow a duplicate dbt_scd_id row for snapshots #1759

@dbeatty10

Description


Originally posted by @ataft in #11235:

@dbeatty10 This issue continues to pop up for us, but how the duplicate rows gets in there cannot be reproduced. For us, we started noticing it on CI/CD jobs, possibly a race condition because CI/CD jobs do not use a specific PR schema for snapshots like everything else does. However, we removed snapshots from our CI process and continue to see this issue. Lengthy back and forth with dbt support through an Enterprise account yielded no solutions either.

At the end of the day, if you search for duplicate row issues with snapshots, there are more than a dozen separate issues, most of them ending in non-reproducible. Is there enough community messaging to know that there are bugs in dbt snapshots that allow duplicate rows to be inserted? From my perspective, dbt needs to put in a safeguard to never allow a duplicate dbt_scd_id row. Is that a possibility?

Known causes

There are two known causes that can lead to duplicate rows in snapshots:

  1. when the source data contains duplicate data
  2. concurrent executions of dbt snapshot

There are other known causes of additional (but not truly duplicate) rows:

  1. new_record mode + custom dbt_valid_to_current (fixed in #1241)
  2. duplicate column names in the check_cols configuration (dbt-core #9656).

Recommendations to prevent

  1. add a uniqueness data test to confirm the source data does not contain duplicate data
  2. avoid concurrent executions of dbt snapshot
  3. when possible, use the timestamp strategy with an updated_at column. This makes dbt_scd_id deterministic (dbt-labs/dbt-core#4661 (comment))
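
The first recommendation can be expressed as a dbt data test. A minimal sketch, assuming a hypothetical source `source('app', 'orders')` with unique key `id` (substitute your own source and key):

```sql
-- tests/assert_orders_unique.sql (hypothetical singular data test)
-- fails the build if any unique_key value appears more than once in the source
select
    id,
    count(*) as n_rows
from {{ source('app', 'orders') }}
group by id
having count(*) > 1
```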

Workarounds

  1. add a deduplication step in a post-hook (dbt-labs/dbt-core#11235 (comment))
  2. for Snowflake: override snapshot_merge_sql to add a post-MERGE duplicate check with rollback (dbt-labs/dbt-core#4661 (comment))
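
A Snowflake-flavored sketch of workaround 1, assuming the snapshot table is addressed via `{{ this }}` and that the earliest row per dbt_scd_id should be kept (a judgment call for illustration, not prescribed dbt behavior):

```sql
-- post-hook sketch: rebuild the snapshot keeping one row per dbt_scd_id
create or replace table {{ this }} as
select *
from {{ this }}
qualify row_number() over (
    partition by dbt_scd_id
    order by dbt_valid_from
) = 1
```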

Feature request

There is a feature request in dbt-labs/dbt-core#10236 to validate uniqueness before merge (configurable on/off). I believe this would prevent duplicates when the unique_key is not actually unique (by failing the build). But it would not prevent duplicates during concurrent runs of dbt snapshot.

This is a separate feature request: put a mechanism in place that makes duplicate dbt_scd_id rows impossible, even if there are concurrent executions of dbt snapshot.

Potential implementation

Make dbt_scd_id deterministic for the check strategy (e.g., hash unique_key + check_cols instead of unique_key + {{ current_timestamp() }} (dbt-adapters #1224 (comment))).
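
To illustrate, a sketch of what a deterministic dbt_scd_id could look like for the check strategy, assuming a hypothetical snapshot with unique_key `order_id` and check_cols `(status, amount)`; this is not the current implementation:

```sql
-- deterministic: the same (key, checked values) always yields the same id,
-- regardless of when the snapshot runs
md5(
    coalesce(cast(order_id as varchar), '')
    || '|' || coalesce(cast(status as varchar), '')
    || '|' || coalesce(cast(amount as varchar), '')
) as dbt_scd_id
```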

We need to check how it would behave in the following situation:

  1. 1st snapshot with row A present in the data set
  2. 2nd snapshot with row A removed from the data set
  3. 3rd snapshot with row A returned to the data set
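
To make the concern concrete: with a purely deterministic hash of the key and checked values, a row that leaves and returns unchanged would reproduce an id that already exists in the snapshot history. A sketch, assuming a hypothetical row with key 1 and a single checked column status = 'open':

```sql
-- 1st snapshot: row A present       -> dbt_scd_id = md5('1|open'); record inserted
-- 2nd snapshot: row A removed       -> the existing record is closed out
-- 3rd snapshot: row A returns as-is -> dbt_scd_id = md5('1|open') again, matching the
--                                      1st record; the strategy would need an extra
--                                      input (e.g. dbt_valid_from) to disambiguate
```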

Metadata

Labels

feature:snapshots (Issues related to the snapshot materialization), triage:product (In Product's queue)
