# Improve error messages for phantom db entries #18101

tugbataluy wants to merge 4 commits into canonical:main

## Conversation
…rupted cluster migrations (#17969)

## Checklist

- [x] I have read the [contributing guidelines](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md) and attest that all commits in this PR are [signed off](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md#including-a-signed-off-by-line-in-your-commits), [cryptographically signed](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md#commit-signature-verification), and follow this project's [commit structure](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md#commit-structure).
- [x] I have checked and added or updated relevant documentation.

## Problem

When a cluster migration is interrupted mid-transfer (network drop, `kill -9` on the target), phantom `storage_volumes` records can be left in the global database with no corresponding storage on disk (as mentioned in #17787). On retry, the consistency checks in `CreateInstanceFromMigration` and `CreateCustomVolumeFromMigration` return "Volume exists in database but not on storage", permanently blocking migration for the affected instance.

Separately, dead migration connections were not detected promptly, leaving migrations hanging for minutes before timing out.

## Fix

- Fast dead-connection detection (`lxd/migration_connection.go`): enable `ws.StartKeepAlive()` on both incoming (`AcceptIncoming`) and outgoing (`WebSocket`) migration WebSocket connections. Dead connections are now detected in ~15 seconds instead of relying on TCP timeout defaults.
- Integration test (`test/suites/clustering_move.sh`): add a sub-test under `test_clustering_move` that:
  - injects a phantom `storage_volumes` row via `lxd sql global`,
  - verifies the migration fails with the expected error,
  - removes the phantom row,
  - verifies the migration then succeeds.

## What this does NOT do

This PR does not auto-repair storage inconsistencies. [A follow-up PR](#18101) will improve the error message with recovery guidance and add documentation for the manual cleanup steps.

Fixes #17787
…ume database entries. Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
…e and doc link. Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
…ror message Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
43f3b9c to 48dd3b5
```diff
 	if dbVol != nil && !volExists {
-		return errors.New("Volume exists in database but not on storage")
+		return fmt.Errorf("Volume %q exists in database on member %q but not on storage, this may be a phantom entry from a previous failed migration. See https://documentation.ubuntu.com/lxd/latest/howto/cluster_recover_volumes/", args.Name, dbVol.Location)
```
@minaelee do you have any guidance on including doc URLs in errors? Should we avoid it, or if we do it, how should we handle the potential for broken or out-of-date URLs in the future?
Perhaps instead we should say something like "search for X in the docs"?
I agree that we should avoid it due to the maintenance burden. I suggest something like "Refer to the how-to guide on recovering phantom volume entries in the documentation." Thanks!
List all volume entries for the affected custom volume across cluster members:

```
lxd sql global "SELECT storage_volumes.id, storage_volumes.name, nodes.name AS member FROM storage_volumes JOIN nodes ON storage_volumes.node_id = nodes.id WHERE storage_volumes.name = '<volume-name>'"
```
Replace `<volume-name>` with the name of the custom volume.

```
lxd sql global "DELETE FROM storage_volumes WHERE name='<volume-name>' AND node_id=(SELECT id FROM nodes WHERE name='<member-name>')"
```

Replace `<volume-name>` with the name of the custom volume and `<member-name>` with the cluster member that holds the phantom entry.
Suggested change:

```suggestion
Replace `<volume-name>` with the name of the custom volume, and `<member-name>` with the cluster member that holds the phantom entry.
```
List all volume entries for the affected instance across cluster members:

```
lxd sql global "SELECT storage_volumes.id, storage_volumes.name, nodes.name AS member FROM storage_volumes JOIN nodes ON storage_volumes.node_id = nodes.id WHERE storage_volumes.name = '<instance-name>'"
```
Replace `<instance-name>` with the name of the affected instance.

```
lxd sql global "DELETE FROM storage_volumes WHERE name='<instance-name>' AND node_id=(SELECT id FROM nodes WHERE name='<member-name>')"
```

Replace `<instance-name>` with the name of the instance and `<member-name>` with the cluster member that holds the phantom entry.
Suggested change:

```suggestion
Replace `<instance-name>` with the name of the instance, and `<member-name>` with the cluster member that holds the phantom entry.
```
(cluster-recover-volumes)=
# How to recover phantom volume database entries

When a cluster migration (`lxc move <instance> --target <member>`) is interrupted mid-transfer (for example, due to a network failure or a killed LXD process), the target member may be left with a volume record in the global database that has no corresponding storage on disk.
Suggested change:

```suggestion
When a {ref}`cluster instance migration <howto-cluster-manage-instance-migrate>` or {ref}`custom volume migration <howto-storage-move-volume-cluster>` is interrupted mid-transfer (for example, due to a network failure or a killed LXD process), the target member may be left with a volume record in the global database that has no corresponding storage on disk.
```
I would like to see both instance migration and volume migration mentioned in the introduction like this, since this page covers how to deal with failures in both.
In `doc/howto/storage_move_volume.md`, line 79 (above `## Copy or migrate between cluster members`), you will need to add the ref target used above, like this:

```
(howto-storage-move-volume-cluster)=
```
After removing the phantom entry, retry the migration:

```
lxc move <instance> --target <member>
```
Suggested change:

```suggestion
After removing the phantom entry, retry the {ref}`instance migration <howto-cluster-manage-instance-migrate>`.
```

We don't need to duplicate the command here when we can point to the how-to guide for this. If the use of the command changes in the future, then we won't need to update it in multiple places.
After removing the phantom entry, retry the migration:

```
lxc storage volume move <pool>/<volume-name> --target <member>
```
Suggested change:

```suggestion
After removing the phantom entry, retry the {ref}`volume migration <howto-storage-move-volume-cluster>`.
```
## Checklist

Follow-up to #17969 (the first phantom DB entries PR):

- Update `CreateInstanceFromMigration` and `CreateCustomVolumeFromMigration` to include the volume name, affected cluster member, and a link to the recovery documentation.