fix(kine): remove corrupt WAL artefacts on startup to break boot loop#148

Open
stevensbkang wants to merge 3 commits into develop from
feat/ks-51/kine-db-corruption
Member

@stevensbkang stevensbkang commented May 7, 2026

After an unclean shutdown such as a power loss, the SQLite WAL and SHM files can be left in an inconsistent state, so the apiserver identity lease blob decodes with an empty UID. The empty UID trips the storage precondition check (ErrCodeInvalidObj), and the failure cascades through the RBAC bootstrap hook into a KubeSolo abort on every subsequent boot. Recovery currently requires deleting state.db-wal and state.db-shm by hand.

On startup, kine now runs PRAGMA quick_check against state.db before handing control to the kine server. If the check fails, or the file cannot be opened, state.db-wal and state.db-shm are removed so that SQLite falls back to the last cleanly checkpointed state in state.db. A clean database passes through untouched.
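The startup behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `repairWAL`, `errCorrupt`, and the injected `check` function are hypothetical names. In the real change the check would run PRAGMA quick_check through a SQLite driver; here it is passed in as a plain function so the sketch needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errCorrupt is a hypothetical stand-in for a failed integrity check.
var errCorrupt = errors.New("integrity check failed")

// repairWAL calls check on dbPath. If the check passes, nothing is touched.
// If it fails (including "file cannot be opened"), the -wal and -shm sidecar
// files are removed so SQLite reopens from the last checkpointed state.
// It reports whether a repair was attempted.
func repairWAL(dbPath string, check func(path string) error) (bool, error) {
	if err := check(dbPath); err == nil {
		return false, nil // clean database: pass through untouched
	}
	for _, suffix := range []string{"-wal", "-shm"} {
		err := os.Remove(dbPath + suffix)
		// A sidecar file that is already absent is not an error.
		if err != nil && !errors.Is(err, os.ErrNotExist) {
			return false, fmt.Errorf("remove %s: %w", dbPath+suffix, err)
		}
	}
	return true, nil
}

func main() {
	// Placeholder check that always passes; a real check would open the
	// database and run PRAGMA quick_check (assumption, not shown here).
	alwaysClean := func(string) error { return nil }

	repaired, err := repairWAL("state.db", alwaysClean)
	fmt.Println("repaired:", repaired, "err:", err)
}
```

Injecting the check keeps the file-removal logic testable without a database; any write that existed only in the removed WAL is lost, which is the trade-off this recovery path accepts.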

Closes #145

linear Bot commented May 7, 2026

…default)

The integrity check and WAL artefact removal introduced for power-loss
recovery are now opt-in via --db-wal-repair (KUBESOLO_DB_WAL_REPAIR=true).
The flag defaults to false to avoid any risk of silent WAL data loss on
deployments that do not need it.
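
A rough sketch of how such an opt-in switch could be resolved, using only Go's standard `flag` package. The flag and environment variable names come from the commit message above; the function name `walRepairEnabled`, the injected `getenv`, and the precedence (command-line flag overrides the environment variable) are assumptions made for illustration and may not match how KubeSolo actually wires its flags.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
	"strconv"
)

// walRepairEnabled resolves the opt-in setting. The environment variable
// supplies the default (false when unset); an explicit flag overrides it.
func walRepairEnabled(args []string, getenv func(string) string) (bool, error) {
	def := false
	if v := getenv("KUBESOLO_DB_WAL_REPAIR"); v != "" {
		parsed, err := strconv.ParseBool(v)
		if err != nil {
			return false, fmt.Errorf("invalid KUBESOLO_DB_WAL_REPAIR %q: %w", v, err)
		}
		def = parsed
	}

	fs := flag.NewFlagSet("kubesolo", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on parse errors
	enabled := fs.Bool("db-wal-repair", def, "remove WAL artefacts when the integrity check fails")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

func main() {
	enabled, err := walRepairEnabled(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("WAL repair enabled:", enabled)
}
```

With neither the flag nor the variable set, the function returns false, so the destructive repair path never runs unless an operator asks for it.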
Development

Successfully merging this pull request may close these issues.

Kubesolo fails to start on boot due to apiserver lease UID mismatch / RBAC bootstrap failure / Database Corruption