-
Notifications
You must be signed in to change notification settings - Fork 134
fix(subtract): make it work according docs #1569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| ] | ||
|
|
||
|
|
||
| def test_subtract(test_session): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C: these three tests moved as is
73412e8 to
f7b5c16
Compare
Deploying datachain with
|
| Latest commit: |
a1deac2
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://af8ea7af.datachain-2g6.pages.dev |
| Branch Preview URL: | https://fix-subtract.datachain-2g6.pages.dev |
|
|
||
| dr = self.catalog.warehouse.dataset_rows(self.dataset, self.dataset_version) | ||
| # Use a short alias with dataset ID suffix for uniqueness and SQL brevity | ||
| ds_id = dr.table.name.rsplit("_", 1)[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C: this hopefully should reduce all queries length a lot
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes the subtract operation which was incorrectly including system columns (like sys__rand) in row comparisons, leading to unpredictable results. The fix excludes system columns from the comparison and uses an anti-join pattern instead of the previous EXCEPT-based approach.
Changes:
- Refactored
subtractimplementation to use LEFT JOIN anti-join pattern with system column exclusion - Moved and expanded test coverage to dedicated test file
tests/unit/lib/test_subtract.py - Added table aliasing for cleaner SQL generation and improved performance
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
tests/unit/lib/test_subtract.py |
New comprehensive test file with multiple test cases covering duplicates, chaining, edge cases, and sys column preservation |
tests/unit/lib/test_datachain.py |
Removed subtract tests that were moved to dedicated file |
src/datachain/query/dataset.py |
Added table aliasing, excluded sys columns from target query in subtract, normalized key pairs handling |
src/datachain/data_storage/warehouse.py |
Implemented default subtract_query method using anti-join pattern with CTEs and unique naming via counter |
pyproject.toml |
Added --ignore=local to pytest options to skip local development folder |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f7b5c16 to
fc53ac9
Compare
dreadatour
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, only tiny comment 👍
fc53ac9 to
3da5bdb
Compare
3da5bdb to
a1deac2
Compare
Fixes
subtract.We had an issue where were not properly comparing items to subtract (not only
oncolumns, but on all columns in reality). E.g.sys__randwas included. It means if source query had a few matching rows but with a different sys__rand we were never really subtracting it. It was leading to random / unpredictable results in production.Key Changes
EXCEPTwithNOT EXISTScorrelated subquery for ClickHouseLEFT JOIN ... WHERE IS NULLfor SQLite (base implementation)__dc_src_cte_*,__dc_tgt_cte_*) to avoid repeated subqueries__ds_t_*) to reduce SQL verbosityBefore / after comparison
Note: OLD results are even worse in reality, perf script is monkey patching only certain parts to the existing implementation
Execution Time (15,000 rows dataset):
Generated SQL Size (diff query)