Add autofix rule to remove duplicate column definitions #284

Open
dataders wants to merge 1 commit into main from fix/duplicate-column-definitions

@dataders
Contributor

Summary

  • Adds a new changeset rule that removes duplicate column definitions in YAML schema files
  • Keeps the last occurrence to match dbt's runtime behavior ("only the last definition will be used")
  • Gated behind --behavior-change flag since removing duplicates may lose config information

Closes #283

Changes

  • Add DUPLICATE_COLUMN_DEFINITION_DEPRECATION type to deprecations.py
  • Add changeset_remove_duplicate_column_definitions function and deduplicate_columns_list helper
  • Handle columns in models, seeds, snapshots, model versions, and source tables
  • Register in behavior_change_rules (requires --behavior-change flag)
  • Add comprehensive test suite (11 tests)
  • Update README with new deprecation coverage
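A minimal sketch of the keep-last deduplication listed above, assuming each column is parsed into a plain dict with a `name` key. The helper name `dedupe_keep_last` is hypothetical; it is not the PR's `deduplicate_columns_list`, which also returns a changed flag and a list of refactor records.

```python
from typing import Any, Dict, List


def dedupe_keep_last(columns: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop earlier duplicates (by name), keeping the last definition.

    Hypothetical helper for illustration; entries without a name are kept.
    """
    # Map each column name to the index of its last occurrence.
    last_index: Dict[str, int] = {}
    for i, col in enumerate(columns):
        name = col.get("name")
        if name is not None:
            last_index[name] = i

    # Keep a column only if it is unnamed or is the last one with its name.
    return [
        col
        for i, col in enumerate(columns)
        if col.get("name") is None or last_index[col["name"]] == i
    ]


columns = [
    {"name": "id", "description": "First definition"},
    {"name": "amount"},
    {"name": "id", "description": "Second definition"},
]
deduped = dedupe_keep_last(columns)
# deduped == [{"name": "amount"}, {"name": "id", "description": "Second definition"}]
```

In this sketch the surviving definition stays at the position of its last occurrence, so the relative order of the remaining columns is unchanged.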

Test plan

  • All existing tests pass (109 tests)
  • New tests pass (11 tests for duplicate column removal)
  • Manual test with YAML file containing duplicate columns
  • Verify only runs with --behavior-change flag

🤖 Generated with Claude Code

Addresses #283. Adds a new changeset rule that removes duplicate column
definitions in YAML schema files, keeping the last occurrence to match
dbt's runtime behavior ("only the last definition will be used").

The rule is gated behind the --behavior-change flag since removing
duplicates may lose config information (descriptions, tests,
masking_policy, etc.) that differs between duplicate definitions.

Changes:
- Add DUPLICATE_COLUMN_DEFINITION_DEPRECATION type
- Add changeset_remove_duplicate_column_definitions function
- Handle columns in models, seeds, snapshots, model versions, and
  source tables
- Add comprehensive test suite (11 tests)
- Update README with new deprecation coverage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dataders dataders requested a review from chayac as a code owner January 14, 2026 02:55
Contributor

I was wondering: since you didn't add any new dependencies or changes to the pyproject.toml, should this lockfile be excluded from this PR?

And for my learnings, do you happen to remember why it updated?

Contributor Author

ah yeah great call. it was updated bc my local uv env was different than this package... i'm happy to delete, but likely this means some deps are worth updating?

Contributor

Most def - the first thing that popped into my head was whether it was intentional or not, since I believe uv.lock would only be modified heavily with version bumps if something like uv lock --upgrade was run during dev. If that was the intent, 100% good to keep!

But given the current existing file conflict maybe it's best to resolve that, sync uv.lock to main, commit that back to your branch, then re-run uv sync and see if anything changes?

Collaborator

There shouldn't be any changes to the lockfile since no dependencies were added so you can just revert this change.

Collaborator

Yep agreed, Chaya! On the main branch there is no lockfile drift right now.

dbt-autofix main
 ❯ uv sync --extra test
Resolved 72 packages in 1ms
Audited 71 packages in 0.42ms

dbt-autofix main
 ❯ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

Collaborator

@chayac chayac left a comment

Love this, thank you for taking it on! Could you please add an example change to the integration tests too? I also made a note about reverting the lockfile changes.

@chayac
Collaborator

chayac commented Feb 2, 2026

@davidharting could you please take a look at this too?

@davidharting davidharting self-requested a review February 2, 2026 20:17
Collaborator

@davidharting davidharting left a comment

Changes look really good, thanks Anders!

Please:

  • Revert your change to lockfile
  • Get this branch up to date with main
  • uvx ruff@0.14.4 format
  • uvx ruff@0.14.14 check
  • Add an integration test

Thank you!

| `ResourceNamesWithSpacesDeprecation` | SQL files, YAML files | Replaces spaces with underscores in resource names, updating .sql filenames as necessary | Full | Yes |
| `SourceFreshnessProjectHooksNotRun` | `dbt_project.yml` | Set `source_freshness_run_project_hooks` in `dbt_project.yml` "flags" to true | Full | Yes |
| `MissingArgumentsPropertyInGenericTestDeprecation` | YAML files | Move any keyword arguments defined as top-level property on generic test to `arguments` property | Full | No |
| `DuplicateColumnDefinitionDeprecation` | YAML files | Remove duplicate column definitions in columns list, keeping the last occurrence | Full | Yes |
Collaborator

Keeping the last occurrence seems like a reasonable heuristic to me! Easy to automate, and very clear in the diff if a user wants to pick the other instead.
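To make the trade-off concrete, here is a small sketch of how one might list the fields a user would lose when an earlier definition is dropped in favor of the last one. The function `fields_lost` and the sample column dicts are hypothetical and are not part of the PR.

```python
from typing import Any, Dict, List


def fields_lost(dropped: Dict[str, Any], kept: Dict[str, Any]) -> List[str]:
    """Return keys of the dropped definition that are absent from, or differ in, the kept one."""
    return sorted(k for k, v in dropped.items() if k not in kept or kept[k] != v)


# Hypothetical duplicate definitions of the same column.
dropped = {"name": "id", "description": "Primary key", "tests": ["unique"]}
kept = {"name": "id", "description": "Identifier"}

lost = fields_lost(dropped, kept)
# lost == ["description", "tests"]
```

A report like this is only a convenience; the diff produced by the autofix already shows the removed definition in full.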

columns: List[Dict[str, Any]],
parent_name: str,
parent_type: str
) -> Tuple[List[Dict[str, Any]], bool, List[DbtDeprecationRefactor]]:
Collaborator

I would prefer a new Dataclass, frozen for the return type here. Less mental load to have named properties than a tuple to unpack imo.

I'm not sure exactly, but it may have been this article or this one that led me to swear off tuples and named tuples in almost all situations.
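A sketch of what the suggested frozen dataclass could look like. The class name `DeduplicateColumnsResult` and its field names are assumptions, and plain strings stand in for `DbtDeprecationRefactor` so the example is self-contained.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class DeduplicateColumnsResult:
    """Hypothetical named replacement for the (columns, changed, refactors) tuple."""

    columns: List[Dict[str, Any]]
    refactored: bool
    # Stand-in for List[DbtDeprecationRefactor]; strings keep the sketch runnable.
    deprecation_refactors: List[str] = field(default_factory=list)


result = DeduplicateColumnsResult(
    columns=[{"name": "id"}],
    refactored=True,
    deprecation_refactors=["Removed duplicate column 'id'"],
)

# Callers read named attributes instead of unpacking by position.
if result.refactored:
    print(len(result.deprecation_refactors))
```

Note that `frozen=True` only blocks reassigning the attributes; the lists themselves remain mutable.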


deprecation_refactors: List[DbtDeprecationRefactor] = []
seen_names: Dict[str, int] = {} # name -> last index
duplicate_indices: set = set()
Collaborator

Could you type-hint this as set[int] to make it stricter on what the type-checker will allow us to do with the set?
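For illustration, a sketch of the same bookkeeping with the stricter `set[int]` annotation, assuming Python 3.9+ (where built-in generics are allowed in annotations). The function name `find_duplicate_indices` is hypothetical and the loop is a guess at the surrounding logic, not the PR's code.

```python
from typing import Dict, List


def find_duplicate_indices(names: List[str]) -> set[int]:
    """Return the indices of every occurrence that a later duplicate supersedes."""
    seen_names: Dict[str, int] = {}  # name -> last index seen so far
    duplicate_indices: set[int] = set()
    for i, name in enumerate(names):
        if name in seen_names:
            # The earlier occurrence is superseded by this one.
            duplicate_indices.add(seen_names[name])
        seen_names[name] = i
    return duplicate_indices


indices = find_duplicate_indices(["id", "amount", "id", "id"])
# indices == {0, 2}
```

With `set[int]`, a type checker can flag a call such as `duplicate_indices.add("id")`, which a bare `set` annotation would accept.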

id_column = [c for c in table["columns"] if c["name"] == "id"][0]
assert id_column["description"] == "Second definition"

def test_triple_duplicate_column(self, schema_specs: SchemaSpecs):
Collaborator

missed opportunity to use the word triplicate!

"""
result = changeset_remove_duplicate_column_definitions(input_yaml, schema_specs)
assert result.refactored
assert len(result.deprecation_refactors) == 2 # Two duplicates removed
Collaborator

This is making me realize - do we want refactors to count the number of duplicates removed, or the number of unique columns dealt with?

I'm not sure which is more useful information to the user, or whether it matters.
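The two candidate counts can be compared with a short sketch. Both function names are hypothetical, and the input is just a list of column names rather than the PR's parsed YAML.

```python
from collections import Counter
from typing import List


def definitions_removed(names: List[str]) -> int:
    """Count the duplicate definitions that would be deleted."""
    return sum(n - 1 for n in Counter(names).values() if n > 1)


def columns_affected(names: List[str]) -> int:
    """Count the distinct column names that had at least one duplicate."""
    return sum(1 for n in Counter(names).values() if n > 1)


names = ["id", "id", "id", "amount"]
removed = definitions_removed(names)   # 2: two earlier "id" definitions are dropped
affected = columns_affected(names)     # 1: only "id" was duplicated
```

For the triple-duplicate test above, counting removals gives 2 (matching the current assertion), while counting affected columns would give 1.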

@davidharting
Collaborator

I'm also happy to take this over if you don't have bandwidth for those follow-ups!


Development

Successfully merging this pull request may close these issues.

resolve redundancy of column keys from dbt yaml

4 participants