Originally posted by @ataft in #11235:
@dbeatty10 This issue continues to pop up for us, but how the duplicate rows get in there cannot be reproduced. For us, we started noticing it on CI/CD jobs, possibly a race condition because CI/CD jobs do not use a specific PR schema for snapshots like everything else does. However, we removed snapshots from our CI process and continue to see this issue. Lengthy back and forth with dbt support through an Enterprise account yielded no solutions either.
At the end of the day, if you search for duplicate row issues with snapshots, there are more than a dozen separate issues, most of them ending in non-reproducible. Is there enough community messaging to know that there are bugs in dbt snapshots that allow duplicate rows to be inserted? From my perspective, dbt needs to put in a safeguard to never allow a duplicate `dbt_scd_id` row. Is that a possibility?
Known causes
There are two known causes of duplicate rows in snapshots:
- the source data itself contains duplicate rows
- concurrent executions of `dbt snapshot`
There are other known causes of additional (but not truly duplicate) rows:
- `new_record` mode + custom `dbt_valid_to_current` (fixed in #1241)
- duplicate column names in the `check_cols` configuration (dbt-core #9656)
Recommendations for prevention
- add a uniqueness data test to confirm the source data does not contain duplicate data
- avoid concurrent executions of `dbt snapshot`
- when possible, use the `timestamp` strategy with an `updated_at` column. This makes `dbt_scd_id` deterministic (dbt-labs/dbt-core#4661 (comment))
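As a sketch of the first and last recommendations (all source, table, and column names below are hypothetical), a snapshot using the `timestamp` strategy plus a `unique` test on the source key might look like:

```sql
-- snapshots/orders_snapshot.sql — sketch only, names are placeholders
{% snapshot orders_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}
select * from {{ source('app', 'orders') }}
{% endsnapshot %}
```

```yaml
# schema.yml — sketch only; guards the snapshot's key at the source
sources:
  - name: app
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
```

With the `unique` test in place, a build can fail fast on duplicate keys instead of silently inserting duplicate snapshot rows.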
Workarounds
- add a deduplication step in a post-hook (dbt-labs/dbt-core#11235 (comment))
- for Snowflake: override `snapshot_merge_sql` to add a post-MERGE duplicate check with rollback (dbt-labs/dbt-core#4661 (comment))
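To illustrate the deduplication workaround (this is not the exact SQL from the linked comments; the table name is hypothetical), a one-off cleanup on Snowflake could keep a single row per `dbt_scd_id`:

```sql
-- Snowflake sketch: rebuild the snapshot, keeping one row per dbt_scd_id.
-- Assumes the snapshot table is analytics.snapshots.orders_snapshot.
create or replace table analytics.snapshots.orders_snapshot as
select *
from analytics.snapshots.orders_snapshot
qualify row_number() over (
    partition by dbt_scd_id
    order by dbt_updated_at desc
) = 1;
```

The same filter can be wrapped in a post-hook so it runs after every `dbt snapshot`, at the cost of masking (rather than surfacing) the underlying bug.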
Feature request
There is a feature request in dbt-labs/dbt-core#10236 to validate uniqueness before merge (configurable on/off). I believe this would prevent duplicates when the `unique_key` is not actually unique (by failing the build). But it would not prevent duplicates during concurrent runs of `dbt snapshot`.
This is a separate feature request to put something in place to make it impossible for there to be duplicate `dbt_scd_id` rows, even if there are concurrent executions of `dbt snapshot`.
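A pre-merge check along the lines of #10236 could be as simple as failing the run whenever the staged snapshot data contains more than one row per key (column and relation names below are placeholders, not dbt internals):

```sql
-- Sketch: run against the staged snapshot data before the MERGE.
-- Any row returned indicates a unique_key violation that would
-- produce duplicate snapshot rows downstream.
select
    order_id,            -- placeholder for the configured unique_key
    count(*) as n_rows
from staged_snapshot_data  -- placeholder for the snapshot's staging relation
group by order_id
having count(*) > 1
```

Note that this guards against non-unique source keys within a single run; it does not by itself serialize two concurrent `dbt snapshot` invocations.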
Potential implementation
Make `dbt_scd_id` deterministic for the `check` strategy (e.g., hash `unique_key` + `check_cols` instead of `unique_key` + `{{ current_timestamp() }}` (dbt-adapters #1224 (comment))).
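A minimal sketch of the idea (this is not the actual dbt-adapters implementation; `id`, `status`, and `amount` are hypothetical key and `check_cols` columns):

```sql
-- Today (simplified): hash input includes the run timestamp,
-- so the id differs on every run — non-deterministic.
md5(coalesce(cast(id as varchar), '')
    || '|' || cast({{ current_timestamp() }} as varchar))

-- Proposed (simplified): hash only the key plus the checked
-- column values, so the same row state always yields the same id.
md5(coalesce(cast(id as varchar), '')
    || '|' || coalesce(cast(status as varchar), '')
    || '|' || coalesce(cast(amount as varchar), ''))
```

Because the proposed hash depends only on row content, two concurrent runs that see the same data compute the same `dbt_scd_id`, which makes a duplicate detectable (or ignorable) at merge time.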
We need to check how it would behave in the following situation:
- 1st snapshot with row A present in the data set
- 2nd snapshot with row A removed from the data set
- 3rd snapshot with row A returned to the data set