@ilongin (Contributor) commented Jan 23, 2026

Added a new InsertBuffer class in data_storage/buffer.py (moved from clickhouse.py) that provides unified batch insert handling with time-based flushing. Results are flushed to the database at regular intervals (default: 60 seconds) even when the batch is not yet full, which reduces data loss when a UDF fails.
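
To illustrate the idea, here is a minimal sketch of a buffer that flushes on either batch size or elapsed time. It is not the code from this PR: the class name `TimedInsertBuffer`, the `flush_fn` callback, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Any, Callable


class TimedInsertBuffer:
    """Hypothetical stand-in for InsertBuffer; not datachain's actual class."""

    def __init__(
        self,
        flush_fn: Callable[[list[Any]], None],
        batch_size: int = 1000,
        flush_interval: float = 60.0,  # seconds, matching the default in the description
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._clock = clock
        self._rows: list[Any] = []
        self._last_flush = clock()

    def insert(self, row: Any) -> None:
        """Buffer a row, then flush if the batch is full or too old."""
        self._rows.append(row)
        too_big = len(self._rows) >= self._batch_size
        too_old = self._clock() - self._last_flush >= self._flush_interval
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Write any buffered rows and restart the interval timer."""
        if self._rows:
            self._flush_fn(self._rows)
            self._rows = []
        self._last_flush = self._clock()

    def close(self) -> None:
        """Flush whatever is left; call this when the producer is done."""
        self.flush()
```

Note that in this sketch the time check only runs when a row is inserted, so an idle buffer is not flushed until the next `insert` or `close`. Flushing an idle buffer would need a background timer, which is left out here.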

Introduced a new flush_interval setting, used by InsertBuffer, that can be configured via DataChain.settings(flush_interval=...).

Refactored SQLite's insert_rows to use the new InsertBuffer.

TODO: InsertBuffer unit tests

@ilongin ilongin marked this pull request as draft January 23, 2026 15:42
@ilongin ilongin linked an issue Jan 23, 2026 that may be closed by this pull request
cloudflare-workers-and-pages bot commented Jan 23, 2026

Deploying datachain with Cloudflare Pages

Latest commit: 2cd3f56
Status: ✅  Deploy successful!
Preview URL: https://26dc203a.datachain-2g6.pages.dev
Branch Preview URL: https://ilongin-1519-flush-insert-bu.datachain-2g6.pages.dev

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 82.10526% with 17 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/datachain/lib/settings.py | 31.25% | 8 Missing and 3 partials ⚠️ |
| src/datachain/data_storage/sqlite.py | 84.00% | 2 Missing and 2 partials ⚠️ |
| src/datachain/data_storage/buffer.py | 95.91% | 1 Missing and 1 partial ⚠️ |

_project: str | None
_min_task_size: int | None
_batch_size: int | None
_flush_interval: float | None

If I am not mistaken, we do have tests for settings; it would be great to update them to cover flush_interval.
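
One possible shape for such a test, as a sketch only: `SettingsSketch` and `DEFAULT_FLUSH_INTERVAL` are hypothetical stand-ins rather than datachain's real settings class, and the fallback to 60 seconds is assumed from the PR description. The real code may resolve the default elsewhere.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed from the PR description ("default: 60 seconds").
DEFAULT_FLUSH_INTERVAL = 60.0


@dataclass
class SettingsSketch:
    """Hypothetical stand-in; not datachain's Settings class."""

    _batch_size: int | None = None
    _flush_interval: float | None = None

    @property
    def flush_interval(self) -> float:
        # Fall back to the default when the setting was never provided.
        if self._flush_interval is None:
            return DEFAULT_FLUSH_INTERVAL
        return self._flush_interval


def test_flush_interval_defaults_when_unset() -> None:
    assert SettingsSketch().flush_interval == DEFAULT_FLUSH_INTERVAL


def test_flush_interval_uses_explicit_value() -> None:
    assert SettingsSketch(_flush_interval=5.0).flush_interval == 5.0
```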

Development

Successfully merging this pull request may close these issues.

Flush insert buffer based on time

3 participants