Opening a large archive (100GB+) can take 10+ minutes with no meaningful feedback during the DuckDB import step, which is likely the dominant bottleneck. The current create_from_core_files uses a single CREATE TABLE AS SELECT * FROM read_csv(...) — one atomic SQL call with no progress API.
Proposed approach
Replace the read_csv SQL with Rust-side CSV streaming into DuckDB's Appender API in batches. Thread a progress: f64 callback (0.0–1.0) up through create_from_core_files → Archive::open → open_archive, and surface it as real percentage progress on the loading screen.
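The callback threading could look roughly like the sketch below. It is a minimal illustration, not the existing code: `OpenProgress`, `create_database`, `open` and `loading_percent` are hypothetical stand-ins for `ArchiveOpenProgress`, `create_from_core_files`, `Archive::open` and the frontend derivation, and the four fixed steps only simulate batch flushes. The 40–95% mapping is the range mentioned under Key changes.

```rust
/// Hypothetical stand-in for ArchiveOpenProgress; only the variant relevant here.
#[derive(Debug, Clone, PartialEq)]
enum OpenProgress {
    CreatingDatabase { progress: f64 },
}

/// Stand-in for the import step: reports a fraction in 0.0..=1.0 as work completes.
/// The four fixed steps simulate four batch flushes.
fn create_database<F: FnMut(f64)>(mut on_progress: F) {
    for step in 1..=4 {
        on_progress(step as f64 / 4.0);
    }
}

/// Stand-in for the layer above: wraps the raw fraction in the progress enum
/// and forwards it to whatever sink the caller supplies (e.g. an event emitter).
fn open<F: FnMut(OpenProgress)>(mut emit: F) {
    create_database(|p| emit(OpenProgress::CreatingDatabase { progress: p }));
}

/// Maps the import fraction into the 40-95% slice of the overall loading bar.
fn loading_percent(event: &OpenProgress) -> f64 {
    match event {
        OpenProgress::CreatingDatabase { progress } => {
            40.0 + (*progress).clamp(0.0, 1.0) * 55.0
        }
    }
}
```

Passing the enum itself through the channel, rather than a `String`, keeps the numeric value typed all the way to the point where it is serialised for the frontend.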
Key changes
src-tauri/Cargo.toml: add csv = "1.3.1"
ArchiveOpenProgress::CreatingDatabase: add progress: f64 field; change internal progress channel from String to the enum directly
Database::create_from_core_files: add on_progress: F callback parameter; count rows first (cheap line scan), build CREATE TABLE DDL from sniffed columns + TYPE_OVERRIDES, stream rows via Appender in 50k-row batches, calling on_progress after each flush
dwca/archive.rs: thread progress closure into create_from_core_files
+page.svelte: read progress.progress on creatingDatabase events; update archiveLoadingProgress derivation to use actual value in the 40–95% range
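The batching loop in `create_from_core_files` might be shaped like the following sketch. It deliberately avoids the `csv` and `duckdb` crates so it stays self-contained: `stream_in_batches` is a hypothetical helper, `flush` stands in for "append this batch through the Appender and flush it", and `total_rows` is assumed to come from the up-front line scan.

```rust
/// Streams `rows` into `flush` in fixed-size batches and reports progress
/// as a fraction in 0.0..=1.0 after each flush. Returns the rows written.
fn stream_in_batches<R, I, S, P>(
    rows: I,
    total_rows: usize,
    batch_size: usize,
    mut flush: S,
    mut on_progress: P,
) -> Result<usize, String>
where
    I: IntoIterator<Item = R>,
    S: FnMut(&[R]) -> Result<(), String>,
    P: FnMut(f64),
{
    // Empty archive: nothing to import, so report completion immediately.
    if total_rows == 0 {
        on_progress(1.0);
        return Ok(0);
    }

    let mut batch: Vec<R> = Vec::with_capacity(batch_size);
    let mut written = 0usize;

    for row in rows {
        batch.push(row);
        if batch.len() == batch_size {
            flush(&batch)?;
            written += batch.len();
            batch.clear();
            // Clamp in case the pre-count was lower than the real row count.
            on_progress((written as f64 / total_rows as f64).min(1.0));
        }
    }

    // Flush the final partial batch, if any.
    if !batch.is_empty() {
        flush(&batch)?;
        written += batch.len();
        on_progress((written as f64 / total_rows as f64).min(1.0));
    }

    Ok(written)
}
```

In the real function the batch size would be 50,000. One thing to watch: a plain line count can differ from the parsed row count when quoted fields contain newlines, so the reported fraction should be treated as an estimate.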
Type conversion per row
VARCHAR: pass as &str
DOUBLE (decimalLatitude, decimalLongitude): parse::<f64>().ok() → Option<f64>, empty → None
BOOLEAN (captive, hasCoordinate, etc.): match "true"/"1" → Some(true), "false"/"0" → Some(false), "" → None
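The DOUBLE and BOOLEAN rules above can be sketched as two small functions. `parse_double` and `parse_boolean` are hypothetical names; trimming whitespace and lowercasing before matching are assumptions beyond what is listed, and returning `Err` for unrecognised booleans is one way to let the caller log a warning.

```rust
/// Converts a DOUBLE cell: empty or unparseable text becomes NULL (None).
/// Assumption: surrounding whitespace is trimmed before parsing.
fn parse_double(raw: &str) -> Option<f64> {
    let trimmed = raw.trim();
    if trimmed.is_empty() {
        return None;
    }
    trimmed.parse::<f64>().ok()
}

/// Converts a BOOLEAN cell. Empty text becomes NULL (Ok(None)); values
/// outside the recognised set are returned as Err so the caller can warn.
fn parse_boolean(raw: &str) -> Result<Option<bool>, String> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "" => Ok(None),
        "true" | "1" => Ok(Some(true)),
        "false" | "0" => Ok(Some(false)),
        other => Err(format!("unrecognised boolean value: {other:?}")),
    }
}
```

Note that `parse::<f64>()` also accepts strings such as "NaN" and "inf"; whether those should become NULL instead is left open here.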
Extension tables stay on read_csv for now, since they are usually much smaller than the occurrences table.
Risks
| Risk | Mitigation |
| --- | --- |
| Appender column order diverges from CREATE TABLE | Build both from a single shared Vec<(col, type)> |
| BOOLEAN format variation across archives | Lowercase + check both forms; warn on unrecognised values |
| Archives with no rows | Guard total_rows == 0, call on_progress(1.0) immediately |
| Regression on NULL / empty-column-drop behaviour | Existing tests cover this; add explicit test for NULL handling in Appender path |
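As an illustration of the first mitigation, the sketch below derives one ordered column list and builds the DDL from it; the same list would then drive the per-row append order. `ColumnType`, `resolve_columns` and `create_table_ddl` are hypothetical names, and the `overrides` slice is only an assumed shape for `TYPE_OVERRIDES`.

```rust
/// The three column types discussed in this proposal.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Varchar,
    Double,
    Boolean,
}

impl ColumnType {
    fn sql_name(self) -> &'static str {
        match self {
            ColumnType::Varchar => "VARCHAR",
            ColumnType::Double => "DOUBLE",
            ColumnType::Boolean => "BOOLEAN",
        }
    }
}

/// Resolves a type for each sniffed header, preserving header order.
/// Columns without an override default to VARCHAR.
fn resolve_columns(
    headers: &[&str],
    overrides: &[(&str, ColumnType)],
) -> Vec<(String, ColumnType)> {
    headers
        .iter()
        .map(|header| {
            let ty = overrides
                .iter()
                .find(|(name, _)| name == header)
                .map(|(_, ty)| *ty)
                .unwrap_or(ColumnType::Varchar);
            (header.to_string(), ty)
        })
        .collect()
}

/// Builds CREATE TABLE DDL from the shared column list, quoting identifiers.
fn create_table_ddl(table: &str, columns: &[(String, ColumnType)]) -> String {
    let cols: Vec<String> = columns
        .iter()
        .map(|(name, ty)| format!("\"{}\" {}", name.replace('"', "\"\""), ty.sql_name()))
        .collect();
    format!(
        "CREATE TABLE \"{}\" ({})",
        table.replace('"', "\"\""),
        cols.join(", ")
    )
}
```

Because the append loop would iterate the same `Vec` to pick a conversion for each field, the column order in the table and in the Appender cannot drift apart.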
Why this is an experiment
Estimated 50k–100k tokens to implement, with the wide range driven by how finicky the Appender type handling turns out to be. The improvement is real but only matters for very large archives. Worth doing if those users are a priority.