shcheklein (Member) commented Sep 7, 2025

Updated: might also fix #722

When we run multiple joins within the same chain, this line in SQLJoin is applied recursively:

temp_tables.extend(dq.temp_table_names)

As a result, we might end up with a list of 8K+ items, most of them duplicates.

This means the query can run for a very long time at the end.
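
To illustrate how such a list can blow up, here is a toy model, not datachain's actual code. It assumes that a cloned query shares its temp_table_names list with the original (a shallow copy) and that every join step extends the list with the upstream query's names before adding its own. ToyQuery, run_chain, and the tmp_N names are hypothetical.

```python
from copy import copy


class ToyQuery:
    """Hypothetical stand-in for a query object (not datachain's DatasetQuery)."""

    def __init__(self):
        self.temp_table_names = []

    def clone(self, isolate):
        obj = copy(self)  # shallow copy: the list attribute is shared
        if isolate:
            obj.temp_table_names = []  # give the clone its own list
        return obj


def run_chain(n_joins, isolate):
    """Simulate n_joins chained joins and return the final list of names."""
    query = ToyQuery()
    query.temp_table_names.append("tmp_0")
    for i in range(1, n_joins + 1):
        child = query.clone(isolate)
        # Assumed join step: collect upstream names, then register a new table.
        child.temp_table_names.extend(query.temp_table_names)
        child.temp_table_names.append(f"tmp_{i}")
        query = child
    return query.temp_table_names


shared = run_chain(12, isolate=False)
isolated = run_chain(12, isolate=True)
print(len(shared), len(set(shared)))      # 8191 13
print(len(isolated), len(set(isolated)))  # 13 13
```

In this model the shared list roughly doubles at every join, which is one way a handful of joins could reach thousands of entries. The real growth pattern in datachain may differ.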

Script to reproduce this. Note that it runs both save and show at the end, which essentially doubles the list again.

from dotenv import load_dotenv

import datachain as dc


load_dotenv("local/.env.test")

all_files = dc.read_storage("gs://datachain-demo", anon=False).mutate(s3_path=dc.C("file.path")).persist()

laion_files = all_files.filter(dc.C("file.path").glob("50k-laion-files/*")).mutate(laion_file=dc.C("file"))
aspset510_files = all_files.filter(dc.C("file.path").glob("aspset510/*")).mutate(aspset510_file=dc.C("file"))
coco2017_files = all_files.filter(dc.C("file.path").glob("coco2017/*")).mutate(coco2017_file=dc.C("file"))
datacomp_small_files = all_files.filter(dc.C("file.path").glob("datacomp-small/*")).mutate(datacomp_small_file=dc.C("file"))
open_images_v6_files = all_files.filter(dc.C("file.path").glob("open-images-v6/*")).mutate(open_images_v6_file=dc.C("file"))
nlp_cnn_stories_files = all_files.filter(dc.C("file.path").glob("nlp-cnn-stories/*")).mutate(nlp_cnn_stories_file=dc.C("file"))

raw_data = (
    laion_files
    .merge(aspset510_files, on="s3_path", inner=True)
    .merge(coco2017_files, on="s3_path", inner=True)
    .merge(datacomp_small_files, on="s3_path", inner=True)
    .merge(open_images_v6_files, on="s3_path", inner=True)
    .merge(nlp_cnn_stories_files, on="s3_path", inner=False)
    .select(
        "s3_path",
        "laion_file",
        "aspset510_file",
        "coco2017_file",
        "datacomp_small_file",
        "open_images_v6_file",
        "nlp_cnn_stories_file",
    )
)

raw_data.save("datachain-demo-merge")
raw_data.show(1000)

TODO:

  • Run all tests
  • Add description
  • Add proper tests
  • Confirm semantics one more time

sourcery-ai bot (Contributor) commented Sep 7, 2025

Reviewer's Guide

When a dataset query object is cloned, the clone method now assigns a new empty list to temp_table_names, so the list of temporary table names is no longer shared across instances.

Class diagram for updated clone method in dataset query

classDiagram
    class DatasetQuery {
        steps
        table
        temp_table_names
        clone(new_table=True)
    }
    DatasetQuery : clone() resets temp_table_names to []

File-Level Changes

Change: Isolate temp_table_names in clone
Details: Add assignment of empty list to temp_table_names in clone()
Files: src/datachain/query/dataset.py
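
A minimal sketch of the change described above, assuming clone() is built on a shallow copy. QuerySketch is a hypothetical class; the real DatasetQuery.clone() carries more state and may differ in detail.

```python
from copy import copy


class QuerySketch:
    """Hypothetical, simplified query object used only for illustration."""

    def __init__(self):
        self.steps = []
        self.temp_table_names = []

    def clone(self):
        obj = copy(self)              # shallow copy: attributes still alias the originals
        obj.steps = obj.steps.copy()  # the clone gets its own list of steps
        obj.temp_table_names = []     # the isolation step: a fresh, empty list
        return obj


original = QuerySketch()
original.temp_table_names.append("tmp_a")

cloned = original.clone()
cloned.temp_table_names.append("tmp_b")

print(original.temp_table_names)  # ['tmp_a']
print(cloned.temp_table_names)    # ['tmp_b']
```

Without the reset, both names would end up in one list visible through both objects, because copy() does not duplicate list attributes.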


cloudflare-workers-and-pages bot commented Sep 7, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 33e80f8
Status: ✅  Deploy successful!
Preview URL: https://bf9664f0.datachain-documentation.pages.dev
Branch Preview URL: https://isolate-temp-table-names.datachain-documentation.pages.dev


@shcheklein shcheklein force-pushed the isolate-temp-table-names branch from ee46cea to be6eeaf Compare September 7, 2025 17:47

codecov bot commented Sep 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.84%. Comparing base (6eec7e7) to head (33e80f8).
⚠️ Report is 1 commit behind head on main.


@@           Coverage Diff           @@
##             main    #1321   +/-   ##
=======================================
  Coverage   88.84%   88.84%           
=======================================
  Files         155      155           
  Lines       14240    14241    +1     
  Branches     2025     2025           
=======================================
+ Hits        12652    12653    +1     
  Misses       1124     1124           
  Partials      464      464           
Flag Coverage Δ
datachain 88.78% <100.00%> (+<0.01%) ⬆️


Files with missing lines Coverage Δ
src/datachain/query/dataset.py 93.46% <100.00%> (+<0.01%) ⬆️

@dreadatour (Contributor)

Fixes for the tests in separate PR: #1322

@shcheklein (Member, Author)

> Fixes for the tests in separate PR: #1322

yep, thanks @dreadatour ... I'll keep looking into this PR ... it is probably right for the current approach with temp tables, but I need to understand the whole temp table mechanics a bit better

@shcheklein shcheklein self-assigned this Sep 8, 2025
@shcheklein shcheklein added the bug and performance labels Sep 8, 2025
Successfully merging this pull request may close these issues.

It hangs for cleaning up tables