UDF Checkpoints #1422
Conversation
Reviewer's Guide

This PR implements a comprehensive, fine-grained checkpointing framework for UDF operations.

Sequence diagram for UDF checkpointed execution and recovery

```mermaid
sequenceDiagram
actor User
participant DataChain
participant UDFStep
participant Metastore
participant Warehouse
User->>DataChain: Run .map()/.gen() with UDF
DataChain->>UDFStep: Apply UDF step (with job context)
UDFStep->>Metastore: Check for existing checkpoint (partial/final)
alt Final checkpoint exists
UDFStep->>Warehouse: Reuse output table, skip UDF execution
else Partial checkpoint exists
UDFStep->>Warehouse: Copy partial table, compute unprocessed rows
UDFStep->>Warehouse: Resume UDF only for unprocessed rows
Warehouse->>Metastore: Update processed table and checkpoint
else No checkpoint
UDFStep->>Warehouse: Create input table, run UDF from scratch
Warehouse->>Metastore: Create partial checkpoint, update processed table
end
UDFStep->>Metastore: On success, promote to final checkpoint
DataChain->>User: Return results, progress tracked
```
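In code, the three branches of the diagram map onto a dispatch inside `UDFStep.apply`. The following is a simplified sketch assembled from the method names in the class diagram and the review comments below, not the verbatim implementation:

```python
def apply_with_checkpoints(self, query, hash_input: str, hash_output: str):
    """Illustrative sketch of checkpoint dispatch (not the actual code)."""
    ms = self.session.catalog.metastore
    if checkpoint := ms.find_checkpoint(self.job.id, hash_output, partial=False):
        # Final checkpoint exists: reuse the output table, skip the UDF.
        result = self._skip_udf(checkpoint, hash_input, query)
    elif checkpoint := ms.find_checkpoint(self.job.id, hash_input, partial=True):
        # Partial checkpoint: copy the partial table, resume on unprocessed rows.
        result = self._continue_udf(checkpoint, hash_output, query)
    else:
        # No checkpoint: create the input table and run the UDF from scratch.
        result = self._run_from_scratch(hash_input, hash_output, query)
    # On success, record a final checkpoint so future runs can skip this step.
    ms.create_checkpoint(self.job.id, hash_output)
    return result
```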
ER diagram for job-scoped UDF checkpoint tables

```mermaid
erDiagram
Job {
string id PK
string parent_job_id
}
Checkpoint {
string id PK
string job_id FK
string hash
boolean partial
datetime created_at
}
Table {
string name PK
string type
}
Job ||--o{ Checkpoint : "has"
Job ||--o{ Table : "owns"
Checkpoint ||--o{ Table : "tracks"
```
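The Checkpoint entity translates to a table roughly like the following SQLAlchemy Core sketch; the columns and the `(job_id, hash)` uniqueness follow the `create_checkpoint` code quoted in Comment 12 below, while the table name here is assumed:

```python
import sqlalchemy as sa

metadata = sa.MetaData()

checkpoints = sa.Table(
    "checkpoints",  # assumed name; the real table lives in the metastore schema
    metadata,
    sa.Column("id", sa.Text, primary_key=True),
    sa.Column("job_id", sa.Text, nullable=False),
    sa.Column("hash", sa.Text, nullable=False),
    sa.Column("partial", sa.Boolean, nullable=False, default=False),
    sa.Column("created_at", sa.DateTime(timezone=True), nullable=False),
    # create_checkpoint() relies on (job_id, hash) uniqueness to stay
    # idempotent under concurrent inserts (on_conflict_do_nothing).
    sa.UniqueConstraint("job_id", "hash"),
)
```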
Class diagram for UDFStep and checkpoint management

```mermaid
classDiagram
class UDFStep {
+Session session
+apply(query_generator, temp_tables, *args, **kwargs)
+create_output_table(name)
+get_input_query(input_table_name, original_query)
+create_processed_table(checkpoint, copy_from_parent)
+populate_udf_output_table(udf_table, query, processed_table)
+get_or_create_input_table(query, hash)
+_checkpoint_exist(hash, partial)
+_skip_udf(checkpoint, hash_input, query)
+_run_from_scratch(hash_input, hash_output, query)
+_continue_udf(checkpoint, hash_output, query)
+calculate_unprocessed_rows(input_table, partial_table, processed_table, original_query)
+job
+metastore
+warehouse
}
class UDFSignal {
+create_output_table(name)
+create_result_query(udf_table, query)
+inherits UDFStep
}
class RowGenerator {
+create_output_table(name)
+create_processed_table(checkpoint, copy_from_parent)
+create_result_query(udf_table, query)
+inherits UDFStep
}
class Metastore {
+create_checkpoint(job_id, hash, partial)
+remove_checkpoint(checkpoint)
+get_ancestor_job_ids(job_id)
+find_checkpoint(job_id, hash, partial)
+get_last_checkpoint(job_id)
}
class Warehouse {
+insert_rows(table, rows, batch_size, batch_callback, tracking_field)
+rename_table(old_table, new_name)
+create_pre_udf_table(query, name)
+get_table(name)
+drop_table(table, if_exists)
+copy_table(target, query)
}
UDFSignal --|> UDFStep
RowGenerator --|> UDFStep
```
Deploying datachain-documentation with Cloudflare Pages

| Latest commit: | f914b7f |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://3ee78108.datachain-documentation.pages.dev |
| Branch Preview URL: | https://ilongin-1392-udf-checkpoints.datachain-documentation.pages.dev |
Hey there - I've reviewed your changes - here's some feedback:
- You have the checkpoint table prefix (`udf_{job_id}_{hash}`) hard-coded in multiple places—extract it into a shared constant or helper to avoid duplication and address the TODO you left about centralizing this.
- There are quite a few TODO placeholders (e.g. partial checkpoint handling, debug prints, renaming tables) still in the core logic—please either implement or remove them before merging to avoid shipping unfinished behavior.
- The repeated addition of `*args, **kwargs` to every `apply` signature suggests a heavier refactor; consider making the base `Step.apply` accept a uniform context object or keyword-only params so subclasses don't have so much boilerplate.
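One way to read the last point, as a hedged sketch (`StepContext` and its fields are hypothetical, not part of the PR):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepContext:
    """Hypothetical uniform context object passed to every Step.apply."""
    job_id: str | None = None
    udf_mode: str = "safe"
    extra: dict[str, Any] = field(default_factory=dict)

class Step:
    def apply(self, query_generator, temp_tables, *, ctx: StepContext):
        # Subclasses receive one well-typed context instead of *args/**kwargs.
        raise NotImplementedError
```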
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- You have the checkpoint table prefix (`udf_{job_id}_{hash}`) hard-coded in multiple places—extract it into a shared constant or helper to avoid duplication and address the TODO you left about centralizing this.
- There are quite a few TODO placeholders (e.g. partial checkpoint handling, debug prints, renaming tables) still in the core logic—please either implement or remove them before merging to avoid shipping unfinished behavior.
- The repeated addition of `*args, **kwargs` to every `apply` signature suggests a heavier refactor; consider making the base `Step.apply` accept a uniform context object or keyword-only params so subclasses don’t have so much boilerplate.
## Individual Comments
### Comment 1
<location> `src/datachain/query/dataset.py:720-722` </location>
<code_context>
- temp_tables.append(udf_table.name)
- self.populate_udf_table(udf_table, query)
- q, cols = self.create_result_query(udf_table, query)
+ if self._checkpoint_exist(hash_after):
+ result = self._skip_udf(hash_before, query)
+ elif self._checkpoint_exist(hash_before) and udf_mode == "unsafe":
+ # TODO implement continuing with partial checkpoint
+ result = self._run_from_scratch(hash_before, query)
</code_context>
<issue_to_address>
**issue (bug_risk):** The checkpoint skipping and unsafe mode logic is not fully implemented.
Since partial checkpoint continuation in unsafe mode is not implemented, raise a NotImplementedError or emit a warning when this code path is reached to prevent silent failures.
</issue_to_address>
### Comment 2
<location> `src/datachain/query/dataset.py:728-730` </location>
<code_context>
+ else:
+ result = self._run_from_scratch(hash_before, query)
+
+ # TODO rename tables to have new job_id in table names since maybe we are
+ # just skipping this as we found checkpoint but they have old job_id in name
+ self.session.catalog.metastore.create_checkpoint(self.job.id, hash_after)
+
+ return result
</code_context>
<issue_to_address>
**issue (bug_risk):** Table renaming for checkpoint reuse is not implemented.
Leaving table renaming as a TODO may cause name collisions or incorrect data linkage. Please address this or implement interim safeguards.
</issue_to_address>
### Comment 3
<location> `src/datachain/query/dataset.py:453-454` </location>
<code_context>
+ If query cache is enabled, use the cached table; otherwise use the original
+ query.
+ """
+ if os.getenv("DATACHAIN_DISABLE_QUERY_CACHE", "") not in ("", "0"):
+ return original_query
+ table = self.session.catalog.warehouse.db.get_table(input_table_name)
</code_context>
<issue_to_address>
**suggestion:** Environment variable checks for query cache may be error-prone.
Using direct string comparison for environment variables can lead to inconsistent behavior. A utility function for parsing boolean values would improve reliability.
```suggestion
def env_var_is_true(env_value: str) -> bool:
"""Parse environment variable string to boolean."""
return env_value.strip().lower() in ("1", "true", "yes", "on")
if env_var_is_true(os.getenv("DATACHAIN_DISABLE_QUERY_CACHE", "")):
return original_query
```
</issue_to_address>
### Comment 4
<location> `src/datachain/catalog/catalog.py:2057` </location>
<code_context>
+ # UDF table prefix pattern: udf_{job_id}_{hash}
+ # TODO move this table prefix pattern to some common place as we
+ # repeat this in multiple places (e.g in UDFStep and here)
+ table_prefix = f"udf_{checkpoint.job_id}_{checkpoint.hash}"
+ matching_tables = self.warehouse.db.list_tables(prefix=table_prefix)
+ if matching_tables:
</code_context>
<issue_to_address>
**suggestion:** Table prefix pattern is duplicated across codebase.
Centralize the UDF table prefix pattern in a shared utility or constant to prevent duplication and simplify future updates.
Suggested implementation:
```python
from datachain.catalog.utils import get_udf_table_prefix
# Get UDF table prefix for this checkpoint
table_prefix = get_udf_table_prefix(checkpoint.job_id, checkpoint.hash)
matching_tables = self.warehouse.db.list_tables(prefix=table_prefix)
if matching_tables:
```
You need to create a new file `src/datachain/catalog/utils.py` (or add to an existing shared utils module) with the following function:
```python
def get_udf_table_prefix(job_id: str, hash: str) -> str:
return f"udf_{job_id}_{hash}"
```
Also, update any other places in the codebase (e.g. UDFStep) that construct the UDF table prefix to use this utility function for full deduplication.
</issue_to_address>
### Comment 5
<location> `tests/func/test_checkpoints.py:80-83` </location>
<code_context>
</code_context>
<issue_to_address>
**issue (code-quality):** Avoid loops in tests. ([`no-loop-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-loop-in-tests))
<details><summary>Explanation</summary>Avoid complex code, like loops, in test functions.
Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To achieve that, avoid complex code in tests:
* loops
* conditionals
Some ways to fix this:
* Use parametrized tests to get rid of the loop.
* Move the complex logic into helpers.
* Move the complex part into pytest fixtures.
> Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / [Don't Put Logic in Tests](https://abseil.io/resources/swe-book/html/ch12.html#donapostrophet_put_logic_in_tests)
</details>
</issue_to_address>
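For illustration, loop-based assertions can usually be folded into `pytest.mark.parametrize`; a minimal, self-contained sketch (not the repo's actual tests):

```python
import pytest

# Each (factor, expected) pair becomes its own straight-line test case,
# replacing a for-loop over cases inside a single test body.
@pytest.mark.parametrize("factor, expected", [(2, [2, 4, 6]), (3, [3, 6, 9])])
def test_mapped_values(factor, expected):
    nums = [1, 2, 3]
    assert [n * factor for n in nums] == expected
```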
### Comment 6
<location> `tests/func/test_checkpoints.py:91-96` </location>
<code_context>
</code_context>
<issue_to_address>
**issue (code-quality):** Avoid loops in tests. ([`no-loop-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-loop-in-tests))
</issue_to_address>
### Comment 7
<location> `tests/func/test_checkpoints.py:106-107` </location>
<code_context>
</code_context>
<issue_to_address>
**issue (code-quality):** Avoid loops in tests. ([`no-loop-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-loop-in-tests))
</issue_to_address>
### Comment 8
<location> `tests/func/test_checkpoints.py:132-137` </location>
<code_context>
</code_context>
<issue_to_address>
**issue (code-quality):** Avoid loops in tests. ([`no-loop-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-loop-in-tests))
</issue_to_address>
### Comment 9
<location> `tests/func/test_checkpoints.py:172-177` </location>
<code_context>
</code_context>
<issue_to_address>
**issue (code-quality):** Avoid loops in tests. ([`no-loop-in-tests`](https://docs.sourcery.ai/Reference/Rules-and-In-Line-Suggestions/Python/Default-Rules/no-loop-in-tests))
</issue_to_address>
### Comment 10
<location> `src/datachain/catalog/catalog.py:2058-2059` </location>
<code_context>
def _remove_checkpoint(self, checkpoint: Checkpoint) -> None:
"""
Remove a checkpoint and its associated UDF tables.
Internal helper method for checkpoint cleanup operations.
Args:
checkpoint: The checkpoint object to remove.
"""
# Find and drop UDF tables for this checkpoint
# UDF table prefix pattern: udf_{job_id}_{hash}
# TODO move this table prefix pattern to some common place as we
# repeat this in multiple places (e.g in UDFStep and here)
table_prefix = f"udf_{checkpoint.job_id}_{checkpoint.hash}"
matching_tables = self.warehouse.db.list_tables(prefix=table_prefix)
if matching_tables:
self.warehouse.cleanup_tables(matching_tables)
# Remove the checkpoint from metastore
self.metastore.remove_checkpoint(checkpoint)
</code_context>
<issue_to_address>
**suggestion (code-quality):** Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
```suggestion
if matching_tables := self.warehouse.db.list_tables(prefix=table_prefix):
```
</issue_to_address>
### Comment 11
<location> `src/datachain/catalog/catalog.py:2074-2077` </location>
<code_context>
def remove_checkpoint_by_hash(self, job_id: str, checkpoint_hash: str) -> None:
"""
Remove a specific checkpoint by job_id and hash, along with its UDF tables.
Args:
job_id: The job ID of the checkpoint to remove.
checkpoint_hash: The hash of the checkpoint to remove.
"""
# Find the checkpoint
checkpoint = self.metastore.find_checkpoint(job_id, checkpoint_hash)
if not checkpoint:
# Checkpoint doesn't exist, nothing to do
return
self._remove_checkpoint(checkpoint)
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Lift code into else after jump in control flow ([`reintroduce-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/reintroduce-else/))
- Swap if/else branches ([`swap-if-else-branches`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/swap-if-else-branches/))
</issue_to_address>
### Comment 12
<location> `src/datachain/data_storage/metastore.py:1890-1892` </location>
<code_context>
def create_checkpoint(
self,
job_id: str,
_hash: str,
partial: bool = False,
conn: Any | None = None,
) -> Checkpoint:
"""
Creates a new checkpoint or returns existing one if already exists.
This is idempotent - calling it multiple times with the same job_id and hash
will not create duplicates.
"""
# First check if checkpoint already exists
existing = self.find_checkpoint(job_id, _hash, partial=partial, conn=conn)
if existing:
return existing
checkpoint_id = str(uuid4())
query = self._checkpoints_insert().values(
id=checkpoint_id,
job_id=job_id,
hash=_hash,
partial=partial,
created_at=datetime.now(timezone.utc),
)
# Use on_conflict_do_nothing to handle race conditions
if hasattr(query, "on_conflict_do_nothing"):
query = query.on_conflict_do_nothing(index_elements=["job_id", "hash"])
self.db.execute(query, conn=conn)
return self.find_checkpoint(job_id, _hash, partial=partial, conn=conn) # type: ignore[return-value]
</code_context>
<issue_to_address>
**suggestion (code-quality):** Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
```suggestion
if existing := self.find_checkpoint(
job_id, _hash, partial=partial, conn=conn
):
```
</issue_to_address>
### Comment 13
<location> `tests/func/test_checkpoints.py:86` </location>
<code_context>
def test_cleanup_checkpoints_with_ttl(test_session, monkeypatch, nums_dataset):
"""Test that cleanup_checkpoints removes old checkpoints and their UDF tables."""
from datetime import datetime, timedelta, timezone
catalog = test_session.catalog
metastore = catalog.metastore
warehouse = catalog.warehouse
# Create some checkpoints by running a chain with map (which creates UDF tables)
reset_session_job_state()
chain = dc.read_dataset("nums", session=test_session)
chain.map(doubled=lambda num: num * 2, output=int).save("nums_doubled")
chain.map(tripled=lambda num: num * 3, output=int).save("nums_tripled")
job_id = test_session.get_or_create_job().id
checkpoints_before = list(metastore.list_checkpoints(job_id))
assert len(checkpoints_before) == 6
# Verify UDF tables exist
udf_tables = []
for checkpoint in checkpoints_before:
table_prefix = f"udf_{checkpoint.job_id}_{checkpoint.hash}"
matching_tables = warehouse.db.list_tables(prefix=table_prefix)
udf_tables.extend(matching_tables)
# At least some UDF tables should exist
assert len(udf_tables) > 0
# Modify checkpoint created_at to be older than TTL (4 hours by default)
ch = metastore._checkpoints
old_time = datetime.now(timezone.utc) - timedelta(hours=5)
for checkpoint in checkpoints_before:
metastore.db.execute(
metastore._checkpoints.update()
.where(ch.c.id == checkpoint.id)
.values(created_at=old_time)
)
# Run cleanup_checkpoints
catalog.cleanup_checkpoints()
# Verify checkpoints were removed
checkpoints_after = list(metastore.list_checkpoints(job_id))
assert len(checkpoints_after) == 0
# Verify UDF tables were removed
for table_name in udf_tables:
assert not warehouse.db.has_table(table_name)
</code_context>
<issue_to_address>
**issue (code-quality):** Simplify sequence length comparison [×2] ([`simplify-len-comparison`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/simplify-len-comparison/))
</issue_to_address>
### Comment 14
<location> `tests/func/test_checkpoints.py:143` </location>
<code_context>
def test_cleanup_checkpoints_with_custom_ttl(test_session, monkeypatch, nums_dataset):
"""Test that cleanup_checkpoints respects custom TTL from environment variable."""
from datetime import datetime, timedelta, timezone
catalog = test_session.catalog
metastore = catalog.metastore
# Set custom TTL to 1 hour
monkeypatch.setenv("CHECKPOINT_TTL", "3600")
# Create some checkpoints
reset_session_job_state()
chain = dc.read_dataset("nums", session=test_session)
chain.map(doubled=lambda num: num * 2, output=int).save("nums_doubled")
job_id = test_session.get_or_create_job().id
checkpoints = list(metastore.list_checkpoints(job_id))
assert len(checkpoints) == 3
# Modify all checkpoints to be 2 hours old (older than custom TTL)
ch = metastore._checkpoints
old_time = datetime.now(timezone.utc) - timedelta(hours=2)
for checkpoint in checkpoints:
metastore.db.execute(
metastore._checkpoints.update()
.where(ch.c.id == checkpoint.id)
.values(created_at=old_time)
)
# Run cleanup with custom TTL
catalog.cleanup_checkpoints()
# Verify checkpoints were removed
assert len(list(metastore.list_checkpoints(job_id))) == 0
</code_context>
<issue_to_address>
**suggestion (code-quality):** Simplify sequence length comparison ([`simplify-len-comparison`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/simplify-len-comparison/))
```suggestion
assert not list(metastore.list_checkpoints(job_id))
```
</issue_to_address>
### Comment 15
<location> `tests/func/test_checkpoints.py:183` </location>
<code_context>
def test_cleanup_checkpoints_for_specific_job(test_session, monkeypatch, nums_dataset):
"""Test that cleanup_checkpoints can target a specific job."""
from datetime import datetime, timedelta, timezone
catalog = test_session.catalog
metastore = catalog.metastore
# Create checkpoints for two different jobs
reset_session_job_state()
chain = dc.read_dataset("nums", session=test_session)
chain.map(doubled=lambda num: num * 2, output=int).save("nums_doubled")
first_job_id = test_session.get_or_create_job().id
reset_session_job_state()
chain.map(tripled=lambda num: num * 3, output=int).save("nums_tripled")
second_job_id = test_session.get_or_create_job().id
# Verify both jobs have checkpoints
first_checkpoints = list(metastore.list_checkpoints(first_job_id))
second_checkpoints = list(metastore.list_checkpoints(second_job_id))
assert len(first_checkpoints) == 3
assert len(second_checkpoints) == 3
# Make both checkpoints old
ch = metastore._checkpoints
old_time = datetime.now(timezone.utc) - timedelta(hours=5)
for checkpoint in first_checkpoints + second_checkpoints:
metastore.db.execute(
metastore._checkpoints.update()
.where(ch.c.id == checkpoint.id)
.values(created_at=old_time)
)
# Clean up only first job's checkpoints
catalog.cleanup_checkpoints(job_id=first_job_id)
# Verify only first job's checkpoints were removed
assert len(list(metastore.list_checkpoints(first_job_id))) == 0
assert len(list(metastore.list_checkpoints(second_job_id))) == 3
</code_context>
<issue_to_address>
**suggestion (code-quality):** Simplify sequence length comparison ([`simplify-len-comparison`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/simplify-len-comparison/))
```suggestion
assert not list(metastore.list_checkpoints(first_job_id))
```
</issue_to_address>
Codecov Report
```python
    self.job.id, checkpoint.hash
)
output_table = self.create_output_table(current_output_table_name)
# Select only columns that exist in the source table
```
I'm not sure why we need to do any exclusions here. Shouldn't it be just the same table?
We are using the output table to get the results, but sys__input_id and sys__partial are reserved fields for the UDF checkpoints' internal logic and must not appear in the result.
Why do we keep those fields then? Why don't we snapshot the proper result once it is done (we are probably copying the table anyway, even the first time)?
We are not copying; that's the reason these fields are left. We just rename the table, since that is much faster than doing a copy without those fields. On the other hand, exclusion on read is cheap (just column filtering in the SELECT).
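For illustration, "exclusion on read" could look like this sketch in SQLAlchemy Core (the reserved column names come from the discussion above; the helper itself is hypothetical):

```python
import sqlalchemy as sa

RESERVED_COLUMNS = {"sys__input_id", "sys__partial"}  # checkpoint bookkeeping

def select_result_columns(table: sa.Table) -> sa.Select:
    """Select everything from the renamed output table except the
    columns reserved for UDF checkpoint internals."""
    cols = [c for c in table.columns if c.name not in RESERVED_COLUMNS]
    return sa.select(*cols)
```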
dreadatour left a comment
Couple more findings and questions below.
It is really hard to review this PR because of its size. It would be great to separate unrelated changes (like session usage instead of catalog, and many, many more) into a few separate PRs to ease the review of this one.
In general it looks good to me, but I could miss some edge cases or potential issues because it is hard to keep all the changes in mind.
Also, tests are failing — it would be great to fix them.
Implements fine-grained checkpointing for UDF operations, enabling automatic recovery from failures with minimal recomputation.
Incremental Progress Tracking

- Progress is tracked incrementally during `.map()` and `.gen()` execution

Checkpoint Invalidation

New Environment Variable

- `DATACHAIN_UDF_RESET=1` - Force UDF restart from scratch, ignoring partial progress
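A usage sketch for the new variable (assumes the `nums` dataset from the tests quoted above; set the variable before the chain runs):

```python
import os

# Ignore any partial checkpoint and force the UDF to re-run from scratch.
os.environ["DATACHAIN_UDF_RESET"] = "1"

import datachain as dc

(
    dc.read_dataset("nums")
    .map(doubled=lambda num: num * 2, output=int)
    .save("nums_doubled")
)
```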
Implementation Highlights

Job-Scoped Architecture

- All UDF tables are job-scoped, named `udf_{job_id}_{hash}_{suffix}`

Checkpoint Types

- Input-hash checkpoints (`hash_input`) - Allow continuation with modified code
- Output-hash checkpoints (`hash_output`) - Invalidated if UDF changes
- Row generators (`.gen()`)

Aggregations

- `.agg()` creates checkpoints on completion only

Files Changed

- `src/datachain/query/dataset.py` - Core UDF checkpoint logic
- `src/datachain/data_storage/metastore.py` - Checkpoint management
- `docs/guide/checkpoints.md` - Updated documentation
- `tests/func/test_checkpoints.py`

Related
Fixes #1392
Summary by Sourcery
Introduce a comprehensive checkpointing framework for UDF execution: track and manage intermediate UDF results per job hash, support skipping or resuming work based on stored checkpoints, and add cleanup APIs to remove stale checkpoints and temporary tables.
Summary by Sourcery
Implement UDF-level checkpointing to enable incremental progress tracking and automatic recovery for map, gen, and aggregation operations.