fix(subtract): make it work according docs #1569

shcheklein · 2026-01-23T19:16:32Z

Fixes subtract.

We had an issue where we were not properly comparing the rows to subtract: the comparison ran not only on the `on` columns but, in reality, on all columns, so e.g. sys__rand was included. This means that if the source query had matching rows with a different sys__rand, they were never actually subtracted. This led to random, unpredictable results in production.
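To make the failure mode concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, the `name` column, and the values are hypothetical; only sys__rand comes from this PR. It is not the query datachain generates, just an illustration of why comparing whole rows defeats the subtraction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (name TEXT, sys__rand INTEGER);
    CREATE TABLE target (name TEXT, sys__rand INTEGER);
    INSERT INTO source VALUES ('a', 11), ('b', 22), ('c', 33);
    -- Same names as in source, but every row carries its own random value.
    INSERT INTO target VALUES ('a', 91), ('b', 92);
""")

# Whole-row comparison: sys__rand never matches, so nothing is subtracted.
all_columns = conn.execute(
    "SELECT name, sys__rand FROM source "
    "EXCEPT SELECT name, sys__rand FROM target ORDER BY name"
).fetchall()

# Comparison restricted to the requested column: 'a' and 'b' are removed.
key_only = conn.execute(
    "SELECT name FROM source EXCEPT SELECT name FROM target ORDER BY name"
).fetchall()

print(all_columns)
print(key_only)
```

Note that the key-only EXCEPT also drops sys__rand from the output, which is one reason an anti-join is a better fit when the remaining source columns must be preserved.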

Key Changes

Replaced EXCEPT with NOT EXISTS correlated subquery for ClickHouse
Replaced EXCEPT with LEFT JOIN ... WHERE IS NULL for SQLite (base implementation)
Used CTEs (__dc_src_cte_*, __dc_tgt_cte_*) to avoid repeated subqueries
Added table aliasing (__ds_t_*) to reduce SQL verbosity
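The two anti-join shapes listed above can be sketched as follows. This is an illustration under assumptions, not the SQL emitted by this PR: the tables, columns, and the `_1` suffix are hypothetical, and only the `__dc_src_cte_` / `__dc_tgt_cte_` prefixes are taken from the description. Both variants are run on SQLite here purely for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (name TEXT, size INTEGER, sys__rand INTEGER);
    CREATE TABLE target (name TEXT, size INTEGER, sys__rand INTEGER);
    INSERT INTO source VALUES ('a', 1, 11), ('b', 2, 22), ('c', 3, 33);
    INSERT INTO target VALUES ('a', 1, 91), ('b', 9, 92);
""")

# The target CTE selects only the key columns, so sys__rand cannot
# take part in the comparison.
CTES = """
WITH __dc_src_cte_1 AS (SELECT * FROM source),
     __dc_tgt_cte_1 AS (SELECT name, size FROM target)
"""

# LEFT JOIN ... WHERE ... IS NULL: keep source rows with no match.
# Assumes the checked key column (t.name) is never NULL in target.
LEFT_JOIN_SQL = CTES + """
SELECT s.name, s.size, s.sys__rand
FROM __dc_src_cte_1 AS s
LEFT JOIN __dc_tgt_cte_1 AS t
  ON s.name = t.name AND s.size = t.size
WHERE t.name IS NULL
ORDER BY s.name
"""

# NOT EXISTS with a correlated subquery: same result, different shape.
NOT_EXISTS_SQL = CTES + """
SELECT s.name, s.size, s.sys__rand
FROM __dc_src_cte_1 AS s
WHERE NOT EXISTS (
    SELECT 1 FROM __dc_tgt_cte_1 AS t
    WHERE s.name = t.name AND s.size = t.size
)
ORDER BY s.name
"""

left_join_rows = conn.execute(LEFT_JOIN_SQL).fetchall()
not_exists_rows = conn.execute(NOT_EXISTS_SQL).fetchall()
print(left_join_rows)
print(not_exists_rows)
```

In both variants the surviving rows keep their own sys__rand, because the system column is read from the source side only.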

Before / after comparison

Note: in reality the OLD results are even worse, since the perf script monkey-patches only certain parts of the existing implementation.

Execution Time (15,000 rows dataset):

Backend	Old (EXCEPT)	New (Anti-Join)	Speedup
ClickHouse	3.129s	1.860s	1.68x faster
SQLite	0.968s	0.684s	1.42x faster

Generated SQL Size (diff query)

Backend	Old (EXCEPT)	New (Anti-Join)	Reduction
ClickHouse	1,033 chars	797 chars	23% smaller
SQLite	1,043 chars	911 chars	13% smaller
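The derived columns in both tables can be re-checked from the raw numbers. The snippet below assumes speedup is old time divided by new time, and reduction is the relative drop in character count, rounded to a whole percent.

```python
# (old, new) pairs copied from the tables above.
timings = {"ClickHouse": (3.129, 1.860), "SQLite": (0.968, 0.684)}
sql_sizes = {"ClickHouse": (1033, 797), "SQLite": (1043, 911)}

# Speedup = old time / new time, rounded to two decimals.
speedups = {
    backend: round(old / new, 2) for backend, (old, new) in timings.items()
}

# Reduction = relative drop in generated SQL length, as a whole percent.
reductions = {
    backend: round(100 * (old - new) / old)
    for backend, (old, new) in sql_sizes.items()
}

print(speedups)
print(reductions)
```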

shcheklein · 2026-01-23T19:17:31Z

tests/unit/lib/test_datachain.py

    ]


-def test_subtract(test_session):


C: these three tests moved as is

cloudflare-workers-and-pages · 2026-01-23T19:25:23Z

Deploying datachain with Cloudflare Pages

Latest commit:	`a1deac2`
Status:	✅ Deploy successful!
Preview URL:	https://af8ea7af.datachain-2g6.pages.dev
Branch Preview URL:	https://fix-subtract.datachain-2g6.pages.dev


shcheklein · 2026-01-23T19:36:37Z

src/datachain/query/dataset.py


        dr = self.catalog.warehouse.dataset_rows(self.dataset, self.dataset_version)
+        # Use a short alias with dataset ID suffix for uniqueness and SQL brevity
+        ds_id = dr.table.name.rsplit("_", 1)[-1]


C: this hopefully should reduce all queries length a lot
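As a rough illustration of the aliasing idea in the diff above: `rsplit("_", 1)[-1]` takes whatever follows the last underscore in the table name. The helper name and the sample table name below are hypothetical; the `__ds_t_` prefix is the one mentioned in the PR description, and how the alias is applied to the query is not shown here.

```python
def short_alias(table_name: str, prefix: str = "__ds_t_") -> str:
    """Build a short alias from the last underscore-separated part of a name.

    Hypothetical helper: it mirrors only the rsplit call from the diff.
    """
    ds_id = table_name.rsplit("_", 1)[-1]
    return f"{prefix}{ds_id}"


# Hypothetical table name ending in a numeric dataset ID.
alias = short_alias("ds_rows_example_42")
print(alias)
```

A short alias replaces the full table name in every column reference, which is where the reduction in query length would come from.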

codecov · 2026-01-23T19:37:10Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


Copilot

Pull request overview

This PR fixes the subtract operation which was incorrectly including system columns (like sys__rand) in row comparisons, leading to unpredictable results. The fix excludes system columns from the comparison and uses an anti-join pattern instead of the previous EXCEPT-based approach.

Changes:

Refactored subtract implementation to use LEFT JOIN anti-join pattern with system column exclusion
Moved and expanded test coverage to dedicated test file tests/unit/lib/test_subtract.py
Added table aliasing for cleaner SQL generation and improved performance

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


File	Description
`tests/unit/lib/test_subtract.py`	New comprehensive test file with multiple test cases covering duplicates, chaining, edge cases, and sys column preservation
`tests/unit/lib/test_datachain.py`	Removed subtract tests that were moved to dedicated file
`src/datachain/query/dataset.py`	Added table aliasing, excluded sys columns from target query in subtract, normalized key pairs handling
`src/datachain/data_storage/warehouse.py`	Implemented default `subtract_query` method using anti-join pattern with CTEs and unique naming via counter
`pyproject.toml`	Added `--ignore=local` to pytest options to skip local development folder
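The "unique naming via counter" mentioned for `warehouse.py` could look roughly like the sketch below. The class and method names are hypothetical and not taken from the PR; only the CTE name prefixes appear in the description. The point is that each subtract call gets its own pair of CTE names, so chained subtracts do not collide.

```python
from itertools import count


class CteNamer:
    """Hypothetical helper that hands out unique CTE name pairs."""

    def __init__(self) -> None:
        # Monotonic counter; each call to next_pair() consumes one value.
        self._counter = count(1)

    def next_pair(self) -> tuple[str, str]:
        n = next(self._counter)
        return f"__dc_src_cte_{n}", f"__dc_tgt_cte_{n}"


namer = CteNamer()
first = namer.next_pair()
second = namer.next_pair()
print(first, second)
```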


tests/unit/lib/test_subtract.py

dreadatour

Looks good to me, only tiny comment 👍

src/datachain/query/dataset.py

shcheklein self-assigned this Jan 23, 2026

shcheklein force-pushed the fix-subtract branch from 73412e8 to f7b5c16 Compare January 23, 2026 19:25

shcheklein requested a review from Copilot January 23, 2026 19:41

Copilot started reviewing on behalf of shcheklein January 23, 2026 19:41 View session

Copilot AI reviewed Jan 23, 2026

shcheklein force-pushed the fix-subtract branch from f7b5c16 to fc53ac9 Compare January 23, 2026 23:50

shcheklein requested a review from a team January 24, 2026 01:00

dreadatour approved these changes Jan 24, 2026

shcheklein force-pushed the fix-subtract branch from fc53ac9 to 3da5bdb Compare January 24, 2026 23:47

fix(subtract): make it work according docs

a1deac2

shcheklein force-pushed the fix-subtract branch from 3da5bdb to a1deac2 Compare January 24, 2026 23:47
