Add autofix rule to remove duplicate column definitions#284
Add autofix rule to remove duplicate column definitions#284
Conversation
Addresses #283. Adds a new changeset rule that removes duplicate column definitions in YAML schema files, keeping the last occurrence to match dbt's runtime behavior ("only the last definition will be used"). The rule is gated behind the --behavior-change flag since removing duplicates may lose config information (descriptions, tests, masking_policy, etc.) that differs between duplicate definitions. Changes: - Add DUPLICATE_COLUMN_DEFINITION_DEPRECATION type - Add changeset_remove_duplicate_column_definitions function - Handle columns in models, seeds, snapshots, model versions, and source tables - Add comprehensive test suite (11 tests) - Update README with new deprecation coverage Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Was wondering, since you didn't add any new dependencies/changes to thepyproject.toml should this lockfile be excluded from this PR?
And for my learnings, do you happen to remember why it updated?
There was a problem hiding this comment.
ah yeah great call. it was updated bc my local uv env was different than this package... i'm happy to delete, but likely this means some deps are worth updating?
There was a problem hiding this comment.
Most def - the first thing that popped in my head was if it was intentional or not since I do believe that uv.lock would only be modified heavily with version bumps if something like uv lock --upgrade was ran during dev. If that was the intent, 100% good to keep!
But given the current existing file conflict maybe it's best to resolve that, sync uv.lock to main, commit that back to your branch, then re-run uv sync and see if anything changes?
There was a problem hiding this comment.
There shouldn't be any changes to the lockfile since no dependencies were added so you can just revert this change.
There was a problem hiding this comment.
Yep agreed, Chaya! On the main branch there is no lockfile drift right now.
dbt-autofix main
❯ uv sync --extra test
Resolved 72 packages in 1ms
Audited 71 packages in 0.42ms
dbt-autofix main
❯ git status
On branch main
Your branch is up to date with 'origin/main'.
nothing to commit, working tree clean
chayac
left a comment
There was a problem hiding this comment.
Love this, thank you for taking it on! Could you please add an example change to the integration tests too? I also made a note about reverting the lockfile changes.
There was a problem hiding this comment.
There shouldn't be any changes to the lockfile since no dependencies were added so you can just revert this change.
|
@davidharting could you please take a look at this too? |
| | `ResourceNamesWithSpacesDeprecation` | SQL files, YAML files | Replaces spaces with underscores in resource names, updating .sql filenames as necessary | Full | Yes | | ||
| | `SourceFreshnessProjectHooksNotRun` | `dbt_project.yml` | Set `source_freshness_run_project_hooks` in `dbt_project.yml` "flags" to true | Full | Yes | | ||
| | `MissingArgumentsPropertyInGenericTestDeprecation` | YAML files | Move any keyword arguments defined as top-level property on generic test to `arguments` property | Full | No | | ||
| | `DuplicateColumnDefinitionDeprecation` | YAML files | Remove duplicate column definitions in columns list, keeping the last occurrence | Full | Yes | |
There was a problem hiding this comment.
Keeping the last occurrence seems like a reasonable heuristic to me! Easy to automate, and very clear in the diff if a user wants to pick the other instead.
| columns: List[Dict[str, Any]], | ||
| parent_name: str, | ||
| parent_type: str | ||
| ) -> Tuple[List[Dict[str, Any]], bool, List[DbtDeprecationRefactor]]: |
There was a problem hiding this comment.
I would prefer a new Dataclass, frozen for the return type here. Less mental load to have named properties than a tuple to unpack imo.
I'm not sure exactly, but it may have been this article or this one that lead me to swear off tuples and named tuples in almost all situations.
|
|
||
| deprecation_refactors: List[DbtDeprecationRefactor] = [] | ||
| seen_names: Dict[str, int] = {} # name -> last index | ||
| duplicate_indices: set = set() |
There was a problem hiding this comment.
Could you type-hint this as set[int] to make it stricter on what the type-checker will allow us to do with the set?
| id_column = [c for c in table["columns"] if c["name"] == "id"][0] | ||
| assert id_column["description"] == "Second definition" | ||
|
|
||
| def test_triple_duplicate_column(self, schema_specs: SchemaSpecs): |
There was a problem hiding this comment.
missed opportunity to use the word triplicate!
| """ | ||
| result = changeset_remove_duplicate_column_definitions(input_yaml, schema_specs) | ||
| assert result.refactored | ||
| assert len(result.deprecation_refactors) == 2 # Two duplicates removed |
There was a problem hiding this comment.
This is making me realize - do we want refactors to count the number of duplicates removed, or the number of unique columns dealt with?
I'm not sure which is more useful information to the user or if it matters
|
I'm also happy to take this over if you don't have bandwidth for those follow-ups! |
Summary
--behavior-changeflag since removing duplicates may lose config informationCloses #283
Changes
DUPLICATE_COLUMN_DEFINITION_DEPRECATIONtype todeprecations.pychangeset_remove_duplicate_column_definitionsfunction anddeduplicate_columns_listhelperbehavior_change_rules(requires--behavior-changeflag)Test plan
--behavior-changeflag🤖 Generated with Claude Code