@ilongin ilongin commented Nov 20, 2025

Implementing UDF checkpoints for aggregator.

@ilongin ilongin marked this pull request as draft November 20, 2025 03:18
@ilongin ilongin linked an issue Nov 20, 2025 that may be closed by this pull request
codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 92.30769% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/datachain/lib/udf.py | 75.00% | 0 Missing and 2 partials ⚠️ |
| src/datachain/query/dataset.py | 95.00% | 2 Missing ⚠️ |


@ilongin ilongin marked this pull request as ready for review November 20, 2025 14:37
@ilongin ilongin mentioned this pull request Nov 20, 2025
self.setup()

# Check if partition_id is available (when partition_by is used)
partition_id_idx = None

Suggested change
- partition_id_idx = None
+ partition_id_idx: int | None = None

)
# Include sys__input_id to track which partition produced each output
output = [
{"sys__input_id": input_id}

Do we want to define sys__input_id as a constant, the same way we did with PARTITION_COLUMN_ID (from datachain.data_storage.schema import PARTITION_COLUMN_ID)?
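
A minimal sketch of what that could look like. The constant name `INPUT_ID_COLUMN` and the helper `tag_outputs` are hypothetical and not part of the PR. How the real output dict is completed is not shown in the diff excerpt, so merging each row into the tagged dict is an assumption.

```python
# Hypothetical constant name; it would live next to PARTITION_COLUMN_ID
# if the suggestion were adopted.
INPUT_ID_COLUMN = "sys__input_id"


def tag_outputs(input_id, rows):
    """Attach the input id to every output row (assumed row shape: dict)."""
    # dict union (Python 3.9+) keeps the input id first, then the row fields.
    return [{INPUT_ID_COLUMN: input_id} | row for row in rows]


tagged = tag_outputs(7, [{"total": 3}, {"total": 5}])
```

With a single constant, the writer of the column and any later reader of it cannot drift apart on the column name.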

Comment on lines +645 to +649
Create table with partition mappings (sys__id -> partition_id).

Args:
    query: Input query with sys__id column
    table_name: Name for the partition table.
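
For readers unfamiliar with the mapping this docstring describes, here is a rough in-memory sketch of the idea. It is not the PR's implementation, which builds a database table from a query. The function name `build_partition_mapping`, the `partition_by` argument, and the choice to number partitions from 1 in first-seen order are all assumptions made for illustration.

```python
def build_partition_mapping(rows, partition_by):
    """Map each row's sys__id to a partition id; equal keys share one id."""
    key_to_partition = {}  # partition key tuple -> partition id
    mapping = {}  # sys__id -> partition id
    for row in rows:
        key = tuple(row[col] for col in partition_by)
        # New keys get the next id, numbered from 1 in first-seen order.
        partition_id = key_to_partition.setdefault(key, len(key_to_partition) + 1)
        mapping[row["sys__id"]] = partition_id
    return mapping


rows = [
    {"sys__id": 1, "label": "cat"},
    {"sys__id": 2, "label": "dog"},
    {"sys__id": 3, "label": "cat"},
]
mapping = build_partition_mapping(rows, ["label"])
```

Rows 1 and 3 share a key and therefore a partition id, which is what lets each aggregator output be traced back to the partition that produced it.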

This comment was marked as off-topic.

Development

Successfully merging this pull request may close these issues.

Implement UDF checkpoints for aggregator

3 participants