
@shcheklein

Fixes subtract.

We had an issue where we were not properly comparing items to subtract: the comparison ran not only on the `on` columns but, in reality, on all columns, so e.g. sys__rand was included. This means that if the source query had a few matching rows but with a different sys__rand, we never actually subtracted them. This led to random, unpredictable results in production.
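The failure mode described above can be reproduced in a few lines. This is a minimal sketch using Python's built-in sqlite3 module, not the datachain code itself: the tables `src` and `tgt` and their contents are made up for illustration, and only the column name sys__rand comes from the description.

```python
import sqlite3

# Hypothetical tables: one user column plus a random system column.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src (name TEXT, sys__rand INTEGER);
    CREATE TABLE tgt (name TEXT, sys__rand INTEGER);
    INSERT INTO src VALUES ('a', 11), ('b', 22), ('c', 33);
    INSERT INTO tgt VALUES ('b', 99);
    """
)

# EXCEPT compares whole rows. ('b', 22) differs from ('b', 99) only in
# sys__rand, so 'b' is NOT removed even though the names match.
whole_row = conn.execute(
    "SELECT name, sys__rand FROM src "
    "EXCEPT SELECT name, sys__rand FROM tgt ORDER BY name"
).fetchall()

# Restricting EXCEPT to the key column removes 'b', but the result no
# longer carries the other columns of the source rows.
key_only = conn.execute(
    "SELECT name FROM src EXCEPT SELECT name FROM tgt ORDER BY name"
).fetchall()

print(whole_row)
print(key_only)
```

Since sys__rand is random per row, whether a row was subtracted depended on chance, which matches the unpredictable behavior reported.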

Key Changes

  • Replaced EXCEPT with NOT EXISTS correlated subquery for ClickHouse
  • Replaced EXCEPT with LEFT JOIN ... WHERE IS NULL for SQLite (base implementation)
  • Used CTEs (__dc_src_cte_*, __dc_tgt_cte_*) to avoid repeated subqueries
  • Added table aliasing (__ds_t_*) to reduce SQL verbosity
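The two anti-join shapes listed above can be sketched as follows. This is an illustration under stated assumptions, not the SQL that datachain generates: the tables, the CTE names `src_cte` / `tgt_cte`, and the aliases `s` / `t` are hypothetical, and both queries are run on SQLite purely to show the pattern (the NOT EXISTS form is the one the PR uses for ClickHouse).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE src (name TEXT NOT NULL, sys__rand INTEGER);
    CREATE TABLE tgt (name TEXT NOT NULL, sys__rand INTEGER);
    INSERT INTO src VALUES ('a', 11), ('a', 12), ('b', 22), ('c', 33);
    INSERT INTO tgt VALUES ('b', 99), ('b', 100);
    """
)

# Anti-join via LEFT JOIN ... WHERE IS NULL. Only the key column is
# compared; sys__rand is carried through from the source untouched.
# The IS NULL test assumes the key column itself is never NULL.
LEFT_JOIN_SQL = """
WITH src_cte AS (SELECT name, sys__rand FROM src),
     tgt_cte AS (SELECT DISTINCT name FROM tgt)
SELECT s.name, s.sys__rand
FROM src_cte AS s
LEFT JOIN tgt_cte AS t ON s.name = t.name
WHERE t.name IS NULL
ORDER BY s.name, s.sys__rand
"""

# Anti-join via a NOT EXISTS correlated subquery on the same key.
NOT_EXISTS_SQL = """
SELECT s.name, s.sys__rand
FROM src AS s
WHERE NOT EXISTS (SELECT 1 FROM tgt AS t WHERE t.name = s.name)
ORDER BY s.name, s.sys__rand
"""

left_join_rows = conn.execute(LEFT_JOIN_SQL).fetchall()
not_exists_rows = conn.execute(NOT_EXISTS_SQL).fetchall()
print(left_join_rows)
print(not_exists_rows)
```

Both queries drop every `b` row regardless of its sys__rand value. Unlike EXCEPT, an anti-join also keeps duplicate source rows (the two `a` rows here) instead of collapsing them.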

Before / after comparison

Note: in reality the OLD results are even worse; the perf script monkey-patches only certain parts back to the existing implementation.

Execution Time (15,000-row dataset):

| Backend    | Old (EXCEPT) | New (Anti-Join) | Speedup      |
|------------|--------------|-----------------|--------------|
| ClickHouse | 3.129s       | 1.860s          | 1.68x faster |
| SQLite     | 0.968s       | 0.684s          | 1.42x faster |

Generated SQL Size (diff query)

| Backend    | Old (EXCEPT) | New (Anti-Join) | Reduction   |
|------------|--------------|-----------------|-------------|
| ClickHouse | 1,033 chars  | 797 chars       | 23% smaller |
| SQLite     | 1,043 chars  | 911 chars       | 13% smaller |

@shcheklein shcheklein self-assigned this Jan 23, 2026
]


def test_subtract(test_session):

C: these three tests were moved as-is

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Jan 23, 2026

Deploying datachain with Cloudflare Pages

Latest commit: a1deac2
Status: ✅  Deploy successful!
Preview URL: https://af8ea7af.datachain-2g6.pages.dev
Branch Preview URL: https://fix-subtract.datachain-2g6.pages.dev



dr = self.catalog.warehouse.dataset_rows(self.dataset, self.dataset_version)
# Use a short alias with dataset ID suffix for uniqueness and SQL brevity
ds_id = dr.table.name.rsplit("_", 1)[-1]

C: this should hopefully reduce the length of all queries a lot
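The alias derivation in the snippet above can be sketched in isolation. The helper name `short_alias` and the example table names are hypothetical; only the `rsplit("_", 1)[-1]` expression and the `__ds_t_` prefix come from this PR, and the real table naming scheme may differ.

```python
def short_alias(table_name: str, prefix: str = "__ds_t_") -> str:
    """Build a short SQL alias from the last underscore-separated part
    of a table name (assumed here to be a unique ID suffix)."""
    suffix = table_name.rsplit("_", 1)[-1]
    return f"{prefix}{suffix}"


# Hypothetical table names, for illustration only.
print(short_alias("ds_example_1234"))
print(short_alias("plain"))
```

If the name contains no underscore, `rsplit` returns the whole name, so the alias simply falls back to the prefix plus the full table name.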

@codecov

codecov bot commented Jan 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



Copilot AI left a comment


Pull request overview

This PR fixes the subtract operation which was incorrectly including system columns (like sys__rand) in row comparisons, leading to unpredictable results. The fix excludes system columns from the comparison and uses an anti-join pattern instead of the previous EXCEPT-based approach.

Changes:

  • Refactored subtract implementation to use LEFT JOIN anti-join pattern with system column exclusion
  • Moved and expanded test coverage to dedicated test file tests/unit/lib/test_subtract.py
  • Added table aliasing for cleaner SQL generation and improved performance

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
|------|-------------|
| tests/unit/lib/test_subtract.py | New comprehensive test file with multiple test cases covering duplicates, chaining, edge cases, and sys column preservation |
| tests/unit/lib/test_datachain.py | Removed subtract tests that were moved to dedicated file |
| src/datachain/query/dataset.py | Added table aliasing, excluded sys columns from target query in subtract, normalized key pairs handling |
| src/datachain/data_storage/warehouse.py | Implemented default subtract_query method using anti-join pattern with CTEs and unique naming via counter |
| pyproject.toml | Added --ignore=local to pytest options to skip local development folder |
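The "unique naming via counter" mentioned for warehouse.py could look roughly like the sketch below. This is an assumption about the general approach, not the actual implementation: the function `next_cte_names` and the module-level counter are hypothetical, and only the `__dc_src_cte_` / `__dc_tgt_cte_` prefixes come from the PR description.

```python
import itertools

# Hypothetical process-wide counter; each subtract call takes the next value.
_cte_counter = itertools.count()


def next_cte_names() -> tuple[str, str]:
    """Return a fresh (source, target) pair of CTE names."""
    n = next(_cte_counter)
    return f"__dc_src_cte_{n}", f"__dc_tgt_cte_{n}"


first = next_cte_names()
second = next_cte_names()
print(first, second)
```

Distinct names matter when subtract calls are chained, since two CTEs with the same name in one generated query would collide.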


@shcheklein shcheklein requested a review from a team January 24, 2026 01:00

@dreadatour dreadatour left a comment


Looks good to me, only tiny comment 👍
