refactor(db): typed read/write pools with retry-on-busy #666
split each sqlite database into a read pool (multi-connection) and a dedicated single-connection write pool. funnels in-process writes through a queue rather than fighting for the writer lock, leaving WAL-mode reads to run concurrently.

centralise the sqlite connection builder in microsandbox-db so host and runtime apply identical PRAGMAs (WAL, busy_timeout, foreign_keys, synchronous=normal) from one place. this drops the SQLITE_PRAGMAS string constant that was previously executed via raw SQL.

add with_retry_transaction: write paths retry on SQLITE_BUSY (5) and SQLITE_BUSY_SNAPSHOT (517) with exponential backoff. busy detection matches sea_orm::DbErr structurally rather than string-searching the display output.

expose busy_timeout via the global config.json only. the runtime uses microsandbox_db::pool::DEFAULT_BUSY_TIMEOUT_SECS since it is not user-configurable.
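the retry-with-backoff shape described above can be sketched in a std-only form. this is illustrative only: the real helper works with `sea_orm::DbErr` and `sqlx::Error`, while `DbError`, its `code` field, and the `max_attempts` signature here are stand-ins invented for the sketch.

```rust
use std::{thread, time::Duration};

// stand-in for the structural error the real code matches on
// (sea_orm::DbErr -> sqlx::Error -> sqlite result code).
#[derive(Debug)]
struct DbError {
    code: Option<u32>,
}

// sqlite result codes the pr retries on.
const SQLITE_BUSY: u32 = 5;
const SQLITE_BUSY_SNAPSHOT: u32 = 517;

// structural busy detection: no string-searching of Display output.
fn is_busy(err: &DbError) -> bool {
    matches!(err.code, Some(SQLITE_BUSY) | Some(SQLITE_BUSY_SNAPSHOT))
}

// retry `op` with exponential backoff while it fails with a busy code;
// any other error (or success) is returned immediately.
fn with_retry_transaction<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, DbError>,
) -> Result<T, DbError> {
    let mut delay = Duration::from_millis(10);
    let mut attempt = 0;
    loop {
        attempt += 1;
        match op() {
            Err(e) if is_busy(&e) && attempt < max_attempts => {
                thread::sleep(delay);
                delay *= 2; // exponential backoff between attempts
            }
            other => return other,
        }
    }
}
```

the key design point is that only the two busy codes are retried; constraint violations and other hard errors surface on the first attempt.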
stacks on #666.

## summary

we already split each database into a read pool and a write pool, but both were just `sea_orm::DatabaseConnection`. nothing in the type system stopped a write from sneaking onto the read pool, and the retry-on-busy logic had to look up the right pool dynamically. this pr makes the read/write split a property of the type.

`DbReadConnection` and `DbWriteConnection` are newtypes over the underlying connection. both implement `ConnectionTrait` so existing query builders work unchanged, but write transactions are only available on `DbWriteConnection`. every host call site now declares its intent in its signature: read helpers take `&DbReadConnection`, write helpers take `&DbWriteConnection`, and the few that do both take `&DbPools`.

with the typed handles in place, the dynamic write-pool lookup and the per-call max-connections override are gone, and two writes that were managing their own transactions pick up retry-on-busy for free.

a 200-concurrent-boot benchmark shows a ~18% throughput improvement over the previous design (21.3 sandboxes/sec vs 18.2), with median wall time falling from ~6992 ms to ~5950 ms. zero `SQLITE_BUSY` errors observed in either run.

## test plan

- [x] cargo test -p microsandbox --lib
- [x] cargo check --workspace --all-targets
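the newtype split can be sketched without the sea_orm machinery. a minimal sketch, assuming stand-in types: `DatabaseConnection` here is a placeholder for `sea_orm::DatabaseConnection`, and `count_sandboxes` / `record_boot` are hypothetical call sites invented to show the signature convention.

```rust
// placeholder for sea_orm::DatabaseConnection; only the newtype shape
// and the signatures are the point here.
struct DatabaseConnection;

// multi-connection pool: read-only queries under WAL.
struct DbReadConnection(DatabaseConnection);
// single-connection pool: the only handle that can open write transactions.
struct DbWriteConnection(DatabaseConnection);

// bundle passed to the few helpers that both read and write.
struct DbPools {
    read: DbReadConnection,
    write: DbWriteConnection,
}

impl DbWriteConnection {
    // write transactions exist only on the write handle, so starting one
    // on the read pool is a compile error rather than a runtime surprise.
    fn transaction<T>(&self, f: impl FnOnce(&DatabaseConnection) -> T) -> T {
        f(&self.0)
    }
}

// hypothetical call sites: each declares its intent in its signature.
fn count_sandboxes(_db: &DbReadConnection) -> usize {
    0 // read-only query would go here
}
fn record_boot(db: &DbWriteConnection) -> bool {
    db.transaction(|_conn| true) // multi-statement atomic write
}
```

because the read handle simply has no `transaction` method, the "write sneaking onto the read pool" failure class is ruled out at compile time rather than by convention.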
addendum: linux benchmark

ran the same benchmark on a fresh gcp instance: 200 concurrent boots × 3 runs, alpine, 64 mib, fresh db on each branch:
main loses 117–143 sandboxes per run to sqlite contention. wall time is similar on both branches (~46s) because at this concurrency the wall clock is bounded by per-vm boot time, not by db throughput. the fix removes that failure mode entirely.
override `execute` / `execute_unprepared` / `query_one` / `query_all` on `DbWriteConnection` to wrap each call in `retry_on_busy`. every `entity::insert(...).exec(db_write)` site across the workspace now retries on SQLITE_BUSY / SQLITE_BUSY_SNAPSHOT transparently. the dual-pool design already eliminated intra-process contention, but inter-process contention is fundamental at high concurrency and was previously unhandled at the trait level.

drop the name parameter from `retry_on_busy` and `DbWriteConnection::transaction`. callers that want operation context in retry warnings can wrap in a tracing span. updated the 7 existing `transaction()` callers in sandbox/mod.rs and image/mod.rs accordingly.

tighten doc comments on `transaction()` to explain when to use it (multi-statement atomic writes) versus auto-commit (single-statement writes that now retry automatically via the trait).

validated at 50-conc on a 4-vcpu host (12× oversubscription): success rate improved from 32/50 (18 SQLITE_BUSY errors) to 50/50 (0 errors).
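the trait-level wrapping can be sketched with a toy trait. assumptions to note: `Execute` is a stand-in for the slice of `sea_orm::ConnectionTrait` being overridden, errors are bare sqlite codes rather than `DbErr`, and the loop here is unbounded where the real helper would cap attempts.

```rust
use std::{thread, time::Duration};

// sqlite's two "busy" result codes, per the pr description.
const SQLITE_BUSY: u32 = 5;
const SQLITE_BUSY_SNAPSHOT: u32 = 517;

// stand-in for the overridden ConnectionTrait methods; the real impl
// forwards execute / execute_unprepared / query_one / query_all the
// same way.
trait Execute {
    fn execute(&self, sql: &str) -> Result<u64, u32>; // Err carries a sqlite code
}

// retry only on the busy codes, with exponential backoff; all other
// outcomes are returned to the caller unchanged.
fn retry_on_busy<T>(mut op: impl FnMut() -> Result<T, u32>) -> Result<T, u32> {
    let mut delay = Duration::from_millis(10);
    loop {
        match op() {
            Err(SQLITE_BUSY | SQLITE_BUSY_SNAPSHOT) => {
                thread::sleep(delay);
                delay *= 2; // exponential backoff between attempts
            }
            other => return other,
        }
    }
}

struct DbWriteConnection<C: Execute> {
    inner: C,
}

impl<C: Execute> Execute for DbWriteConnection<C> {
    fn execute(&self, sql: &str) -> Result<u64, u32> {
        // every statement issued through the write handle retries
        // transparently; call sites don't change.
        retry_on_busy(|| self.inner.execute(sql))
    }
}
```

because the wrapper implements the same trait as the raw connection, existing `.exec(db_write)` call sites pick up the retry behaviour without edits.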
the sdk's resolve-binary.js prefers the platform-pkg-bundled msb (node_modules/@superradcompany/microsandbox-linux-x64-gnu/bin/msb) over ~/.microsandbox/bin/msb. in ci that bundled msb is the last published release (e.g. 0.4.3), while the freshly-built msb installed at ~/.microsandbox/bin/msb is from the pr branch. when the sdk passes a flag the older msb doesn't recognise, the spawn fails with clap exit 2 (unix_wait_status(512)) and an empty startup line, causing the smoke suite to fail before exercising anything.

set MSB_PATH explicitly so the sdk uses the just-built binary. confirmed locally: the smoke suite went from 1 failed / 4 skipped to 5/5 passed. the integration test job is unaffected: it uses the host crate's resolve_msb_path(), which prefers ~/.microsandbox/bin/msb over PATH.
resolved conflicts:

- .github/workflows/check.yml: kept both env vars from each side. main's MSB_HOME isolates per-job ~/.microsandbox state across ci runs; our MSB_PATH forces the sdk to use the freshly-built msb instead of the bundled platform-pkg one.
- crates/microsandbox/lib/db/mod.rs: took main's microsandbox_utils::resolve_home() helper instead of the local dirs::home_dir + base join. resolve_home() respects MSB_HOME, which is the point of pr #669.
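the MSB_HOME precedence that resolve_home() implements can be sketched as a pure function. this is a guess at the shape, not the actual helper: the real `microsandbox_utils::resolve_home()` reads the environment itself and may differ in fallbacks; here the inputs are passed in so the precedence is visible and testable.

```rust
use std::path::PathBuf;

// hypothetical sketch of MSB_HOME-aware home resolution: an explicit
// MSB_HOME wins outright; otherwise fall back to ~/.microsandbox under
// the user's home directory.
fn resolve_home(msb_home: Option<&str>, user_home: &str) -> PathBuf {
    match msb_home {
        // MSB_HOME, when set, isolates state per job/run (the point of pr #669)
        Some(dir) => PathBuf::from(dir),
        // default location
        None => PathBuf::from(user_home).join(".microsandbox"),
    }
}
```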
tl;dr: msb's "slowness" is nested-kvm's tax, not msb's.

benchmarked on google cloud:
at 200 concurrent boots, nested kvm cliffs catastrophically: success collapses from 128/200 in run 1 to 0/200 by run 3. bare metal: 200/200 every run in ~1.5 s. perf showed kernel time on nested kvm dominated by ept mapping + tlb flushes; every vmexit pays a ~5–10× tax that disappears on metal.

next: a guide for nested-virt tuning. early testing suggests the available knobs (numa balancing, transparent hugepages, scheduler) only buy ~10%.
summary
sqlite is single-writer, so when several callers in the same process try to write at once they fight for the writer lock and surface `SQLITE_BUSY` errors. this pr serialises in-process writes, makes the read/write split a property of the type system, and centralises busy handling so that contention turns into a deterministic queue rather than a flurry of retries.

each database now opens two pools wrapped as distinct types:

- `DbReadConnection` (multi-connection, concurrent reads under wal)
- `DbWriteConnection` (single connection, retry-on-busy built into its `transaction` method)

both implement `ConnectionTrait`, so existing query builders work unchanged. function signatures across the codebase declare intent: read helpers take `&DbReadConnection`, write helpers take `&DbWriteConnection`, and the few that do both take `&DbPools`.

cross-process busy contention (e.g. host vs in-vm runtime) is absorbed by an exponential-backoff retry helper. busy detection matches the `sqlx::Error` structure rather than searching the display string.

the sqlite connection builder also moves into the `microsandbox-db` crate so the host cli and the runtime apply identical pragmas (wal, busy_timeout, foreign_keys, synchronous=normal) from one place. the only user-facing tuning knob is `database.busy_timeout_secs` in ~/.microsandbox/config.json.

benchmark
200 concurrent boots × 3 runs, alpine, 64 mib, fresh db on each branch. the 73 vm-layer failures per run on both sides are a macos hypervisor.framework resource ceiling (memory pressure at 200 × 64 mib) and not affected by this pr, so the table normalises against the 127 boots that cleared the vm ceiling.
main loses 49–114 sandboxes per run to `SQLITE_BUSY` contention; the branch eliminates those failures entirely.
results are also far more deterministic on the branch: every run hits exactly 127 successful boots, while main swings between 25 and 99.
test plan