Originally posted by @ataft in #11235:
@dbeatty10 This issue continues to pop up for us, but how the duplicate rows get in there cannot be reproduced. For us, we started noticing it on CI/CD jobs, possibly a race condition because CI/CD jobs do not use a specific PR schema for snapshots like everything else does. However, we removed snapshots from our CI process and continue to see this issue. Lengthy back and forth with dbt support through an Enterprise account yielded no solutions either.
At the end of the day, if you search for duplicate row issues with snapshots, there are more than a dozen separate issues, most of them ending in non-reproducible. Is there enough community messaging to know that there are bugs in dbt snapshots that allow duplicate rows to be inserted? From my perspective, dbt needs to put in a safeguard to never allow a duplicate `dbt_scd_id` row. Is that a possibility?
Known causes
There are two known causes of duplicate rows in snapshots:
- the source data itself contains duplicate rows
- concurrent executions of `dbt snapshot`
There are other known causes of additional (but not truly duplicate) rows:
- `new_record` mode + custom `dbt_valid_to_current` (fixed in #1241)
- duplicate column names in the `check_cols` configuration (dbt-core #9656)
Recommendations for prevention
- add a uniqueness data test to confirm the source data does not contain duplicate data
- avoid concurrent executions of `dbt snapshot`
- when possible, use the `timestamp` strategy with an `updated_at` column. This makes `dbt_scd_id` deterministic (dbt-labs/dbt-core#4661 (comment))
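As a sketch of the first and last recommendations (all source, table, and column names below are hypothetical), a snapshot using the `timestamp` strategy plus a `unique` test on the source key might look like:

```sql
-- snapshots/orders_snapshot.sql — sketch only, names are placeholders
{% snapshot orders_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}
select * from {{ source('app', 'orders') }}
{% endsnapshot %}
```

```yaml
# schema.yml — sketch only; guards the snapshot's key at the source
sources:
  - name: app
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
```

With the `unique` test in place, a build can fail fast on duplicate keys instead of silently inserting duplicate snapshot rows.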
Workarounds
- add a deduplication step in a post-hook (dbt-labs/dbt-core#11235 (comment))
- for Snowflake: override `snapshot_merge_sql` to add a post-MERGE duplicate check with rollback (dbt-labs/dbt-core#4661 (comment))
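To illustrate the deduplication workaround (this is not the exact SQL from the linked comments; the table name is hypothetical), a one-off cleanup on Snowflake could keep a single row per `dbt_scd_id`:

```sql
-- Snowflake sketch: rebuild the snapshot, keeping one row per dbt_scd_id.
-- Assumes the snapshot table is analytics.snapshots.orders_snapshot.
create or replace table analytics.snapshots.orders_snapshot as
select *
from analytics.snapshots.orders_snapshot
qualify row_number() over (
    partition by dbt_scd_id
    order by dbt_updated_at desc
) = 1;
```

The same filter can be wrapped in a post-hook so it runs after every `dbt snapshot`, at the cost of masking (rather than surfacing) the underlying bug.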
Feature request
There is a feature request in dbt-labs/dbt-core#10236 to validate uniqueness before merge (configurable on/off). I believe this would prevent duplicates when the `unique_key` is not actually unique (by failing the build). But it would not prevent duplicates during concurrent runs of `dbt snapshot`.
This is a separate feature request to put something in place to make it impossible for there to be duplicate `dbt_scd_id` rows, even if there are concurrent executions of `dbt snapshot`.
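A pre-merge check along the lines of #10236 could be as simple as failing the run whenever the staged snapshot data contains more than one row per key (column and relation names below are placeholders, not dbt internals):

```sql
-- Sketch: run against the staged snapshot data before the MERGE.
-- Any row returned indicates a unique_key violation that would
-- produce duplicate snapshot rows downstream.
select
    order_id,            -- placeholder for the configured unique_key
    count(*) as n_rows
from staged_snapshot_data  -- placeholder for the snapshot's staging relation
group by order_id
having count(*) > 1
```

Note that this guards against non-unique source keys within a single run; it does not by itself serialize two concurrent `dbt snapshot` invocations.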
Potential implementation
Make `dbt_scd_id` deterministic for the `check` strategy (e.g., hash `unique_key` + `check_cols` instead of `unique_key` + `{{ current_timestamp() }}` (dbt-adapters #1224 (comment))).
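A minimal sketch of the idea (this is not the actual dbt-adapters implementation; `id`, `status`, and `amount` are hypothetical key and `check_cols` columns):

```sql
-- Today (simplified): hash input includes the run timestamp,
-- so the id differs on every run — non-deterministic.
md5(coalesce(cast(id as varchar), '')
    || '|' || cast({{ current_timestamp() }} as varchar))

-- Proposed (simplified): hash only the key plus the checked
-- column values, so the same row state always yields the same id.
md5(coalesce(cast(id as varchar), '')
    || '|' || coalesce(cast(status as varchar), '')
    || '|' || coalesce(cast(amount as varchar), ''))
```

Because the proposed hash depends only on row content, two concurrent runs that see the same data compute the same `dbt_scd_id`, which makes a duplicate detectable (or ignorable) at merge time.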
We need to check how it would behave in the following situation:
- 1st snapshot with row A present in the data set
- 2nd snapshot with row A removed from the data set
- 3rd snapshot with row A returned to the data set