Add autofix rule to remove duplicate column definitions by dataders · Pull Request #284 · dbt-labs/dbt-autofix

dataders · 2026-01-14T02:55:48Z

Summary

Adds a new changeset rule that removes duplicate column definitions in YAML schema files
Keeps the last occurrence to match dbt's runtime behavior ("only the last definition will be used")
Gated behind --behavior-change flag since removing duplicates may lose config information

Closes #283

Changes

Add DUPLICATE_COLUMN_DEFINITION_DEPRECATION type to deprecations.py
Add changeset_remove_duplicate_column_definitions function and deduplicate_columns_list helper
Handle columns in models, seeds, snapshots, model versions, and source tables
Register in behavior_change_rules (requires --behavior-change flag)
Add comprehensive test suite (11 tests)
Update README with new deprecation coverage
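For readers skimming the change list, the "keep the last occurrence" behavior can be sketched roughly as below. This is only an illustration, not the code in this PR: `dedupe_keep_last` is a hypothetical name, and it works on plain dicts rather than the parsed YAML structures and `DbtDeprecationRefactor` objects that the real `deduplicate_columns_list` helper deals with.

```python
from typing import Any, Dict, List, Tuple


def dedupe_keep_last(
    columns: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[int]]:
    """Hypothetical helper: return (kept columns, indices of removed entries)."""
    # First pass: remember the index of the last occurrence of each name.
    last_index: Dict[str, int] = {}
    for i, col in enumerate(columns):
        name = col.get("name")
        if isinstance(name, str):
            last_index[name] = i

    # Second pass: drop every named entry that is not the last occurrence.
    kept: List[Dict[str, Any]] = []
    removed: List[int] = []
    for i, col in enumerate(columns):
        name = col.get("name")
        if isinstance(name, str) and last_index[name] != i:
            removed.append(i)
        else:
            kept.append(col)
    return kept, removed


columns = [
    {"name": "id", "description": "First definition"},
    {"name": "amount"},
    {"name": "id", "description": "Second definition"},
]
kept, removed = dedupe_keep_last(columns)
print(kept)
print(removed)
```

The surviving entry stays at the position of the last definition, so the relative order of the remaining columns is unchanged.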

Test plan

All existing tests pass (109 tests)
New tests pass (11 tests for duplicate column removal)
Manual test with YAML file containing duplicate columns
Verify only runs with --behavior-change flag

🤖 Generated with Claude Code

Addresses #283. Adds a new changeset rule that removes duplicate column definitions in YAML schema files, keeping the last occurrence to match dbt's runtime behavior ("only the last definition will be used").

The rule is gated behind the --behavior-change flag since removing duplicates may lose config information (descriptions, tests, masking_policy, etc.) that differs between duplicate definitions.

Changes:

- Add DUPLICATE_COLUMN_DEFINITION_DEPRECATION type
- Add changeset_remove_duplicate_column_definitions function
- Handle columns in models, seeds, snapshots, model versions, and source tables
- Add comprehensive test suite (11 tests)
- Update README with new deprecation coverage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

jairus-m · 2026-01-31T01:35:20Z

uv.lock

Was wondering, since you didn't add any new dependencies/changes to the pyproject.toml, should this lockfile be excluded from this PR?

And for my learnings, do you happen to remember why it updated?

ah yeah great call. it was updated bc my local uv env was different than this package... i'm happy to delete, but likely this means some deps are worth updating?

Most def - the first thing that popped into my head was whether it was intentional or not, since I believe uv.lock would only be modified this heavily with version bumps if something like uv lock --upgrade was run during dev. If that was the intent, 100% good to keep!

But given the current existing file conflict maybe it's best to resolve that, sync uv.lock to main, commit that back to your branch, then re-run uv sync and see if anything changes?

There shouldn't be any changes to the lockfile since no dependencies were added so you can just revert this change.

Yep agreed, Chaya! On the main branch there is no lockfile drift right now.

dbt-autofix main ❯ uv sync --extra test
Resolved 72 packages in 1ms
Audited 71 packages in 0.42ms
dbt-autofix main ❯ git status
On branch main
Your branch is up to date with 'origin/main'.
nothing to commit, working tree clean

chayac

Love this, thank you for taking it on! Could you please add an example change to the integration tests too? I also made a note about reverting the lockfile changes.

chayac · 2026-02-02T18:17:18Z

uv.lock

There shouldn't be any changes to the lockfile since no dependencies were added so you can just revert this change.

chayac · 2026-02-02T18:19:37Z

@davidharting could you please take a look at this too?

davidharting

Changes look really good, thanks Anders!

Please:

Revert your change to lockfile
Get this branch up to date with main
uvx ruff@0.14.4 format
uvx ruff@0.14.14 check
Add an integration test

Thank you!

davidharting · 2026-02-02T21:10:26Z

README.md

 | `ResourceNamesWithSpacesDeprecation` | SQL files, YAML files | Replaces spaces with underscores in resource names, updating .sql filenames as necessary | Full | Yes |  
 | `SourceFreshnessProjectHooksNotRun` | `dbt_project.yml` | Set `source_freshness_run_project_hooks` in `dbt_project.yml` "flags" to true | Full | Yes |
 | `MissingArgumentsPropertyInGenericTestDeprecation` | YAML files | Move any keyword arguments defined as top-level property on generic test to `arguments` property | Full | No |
+| `DuplicateColumnDefinitionDeprecation` | YAML files | Remove duplicate column definitions in columns list, keeping the last occurrence | Full | Yes |


Keeping the last occurrence seems like a reasonable heuristic to me! Easy to automate, and very clear in the diff if a user wants to pick the other instead.

davidharting · 2026-02-02T21:32:38Z

src/dbt_autofix/refactors/changesets/dbt_schema_yml.py

+    columns: List[Dict[str, Any]],
+    parent_name: str,
+    parent_type: str
+) -> Tuple[List[Dict[str, Any]], bool, List[DbtDeprecationRefactor]]:


I would prefer a new Dataclass, frozen for the return type here. Less mental load to have named properties than a tuple to unpack imo.

I'm not sure exactly, but it may have been this article or this one that led me to swear off tuples and named tuples in almost all situations.
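A rough sketch of what the suggested frozen dataclass could look like. The class name `DeduplicateColumnsResult` and its field names are hypothetical, and `removed_names` stands in for the `List[DbtDeprecationRefactor]` in the real signature so that the example is self-contained.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class DeduplicateColumnsResult:
    """Hypothetical named replacement for the 3-tuple return value."""

    columns: List[Dict[str, Any]]
    refactored: bool
    removed_names: Tuple[str, ...]  # stand-in for List[DbtDeprecationRefactor]


result = DeduplicateColumnsResult(
    columns=[{"name": "id", "description": "Second definition"}],
    refactored=True,
    removed_names=("id",),
)

# Callers read named fields instead of unpacking by position.
print(result.refactored)

# frozen=True blocks attribute reassignment after construction.
try:
    result.refactored = False
except FrozenInstanceError:
    print("result is read-only")
```

Note that `frozen=True` only prevents rebinding the fields; the list held in `columns` can still be mutated in place.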

davidharting · 2026-02-02T21:34:39Z

src/dbt_autofix/refactors/changesets/dbt_schema_yml.py

+
+    deprecation_refactors: List[DbtDeprecationRefactor] = []
+    seen_names: Dict[str, int] = {}  # name -> last index
+    duplicate_indices: set = set()


Could you type-hint this as set[int] to make it stricter on what the type-checker will allow us to do with the set?
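For illustration, the stricter annotation might look like this in isolation. The loop is a simplified guess at how the two variables interact, not the PR's actual code, and the built-in generic `set[int]` assumes Python 3.9 or newer.

```python
from typing import Dict

seen_names: Dict[str, int] = {}  # name -> last index
duplicate_indices: set[int] = set()  # indices of earlier, superseded definitions

for index, name in enumerate(["id", "amount", "id"]):
    if name in seen_names:
        # The previously seen definition is superseded by this later one.
        duplicate_indices.add(seen_names[name])
    seen_names[name] = index

print(duplicate_indices)
```

With `set[int]`, a type checker would flag something like `duplicate_indices.add("id")`, which a bare `set` annotation lets through.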

davidharting · 2026-02-02T21:45:40Z

tests/unit_tests/test_refactor.py

+        id_column = [c for c in table["columns"] if c["name"] == "id"][0]
+        assert id_column["description"] == "Second definition"
+
+    def test_triple_duplicate_column(self, schema_specs: SchemaSpecs):


missed opportunity to use the word triplicate!

davidharting · 2026-02-02T21:46:28Z

tests/unit_tests/test_refactor.py

+"""
+        result = changeset_remove_duplicate_column_definitions(input_yaml, schema_specs)
+        assert result.refactored
+        assert len(result.deprecation_refactors) == 2  # Two duplicates removed


This is making me realize - do we want refactors to count the number of duplicates removed, or the number of unique columns dealt with?

I'm not sure which is more useful information to the user or if it matters
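To make the two options concrete, here is a small sketch that computes both numbers for a list of column names. `count_duplicates` is a hypothetical helper for discussion only and is not part of dbt-autofix.

```python
from collections import Counter
from typing import Dict, List


def count_duplicates(names: List[str]) -> Dict[str, int]:
    """Hypothetical helper: compare the two ways of counting refactors."""
    counts = Counter(names)
    return {
        # One refactor per removed definition (what the test above asserts).
        "duplicates_removed": sum(n - 1 for n in counts.values() if n > 1),
        # One refactor per column name that had any duplicates.
        "columns_affected": sum(1 for n in counts.values() if n > 1),
    }


# The triple-duplicate case: three definitions of "id".
print(count_duplicates(["id", "id", "id"]))
```

For the triplicate case the first count is 2 and the second is 1, so the two only diverge when a column is defined more than twice.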

davidharting · 2026-02-03T00:05:53Z

I'm also happy to take this over if you don't have bandwidth for those follow-ups!

dataders requested a review from chayac as a code owner January 14, 2026 02:55

jairus-m reviewed Jan 31, 2026

chayac requested changes Feb 2, 2026

davidharting self-requested a review February 2, 2026 20:17

davidharting requested changes Feb 2, 2026

4 participants
