
Improve error messages for phantom db entries #18101

Draft
tugbataluy wants to merge 4 commits into canonical:main from
tugbataluy:improve_error_messages_for_phantom_db_entries

Conversation

@tugbataluy
Contributor

@tugbataluy tugbataluy commented Apr 10, 2026

Checklist

Follow-up to #17969 (the first phantom DB entries PR):

  • Add a documentation page covering manual recovery of phantom volume database entries left by interrupted cluster migrations.
  • Improve the error message in CreateInstanceFromMigration and CreateCustomVolumeFromMigration to include the volume name, affected cluster member, and a link to the recovery documentation.
  • Update the integration test assertion to match the improved error message (after rebasing once the first PR has landed).

@github-actions github-actions bot added the Documentation label (Documentation needs updating) Apr 10, 2026
@tomponline tomponline requested a review from minaelee April 10, 2026 11:23
tomponline added a commit that referenced this pull request Apr 10, 2026
…rupted cluster migrations (#17969)

## Checklist

- [x] I have read the [contributing
guidelines](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md)
and attest that all commits in this PR are [signed
off](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md#including-a-signed-off-by-line-in-your-commits),
[cryptographically
signed](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md#commit-signature-verification),
and follow this project's [commit
structure](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md#commit-structure).
- [x] I have checked and added or updated relevant documentation.

## Problem
When a cluster migration is interrupted mid-transfer (network drop, `kill -9` on the target), phantom `storage_volumes` records can be left in the global database with no corresponding storage on disk (as mentioned in #17787).

On retry, the consistency checks in `CreateInstanceFromMigration` and
`CreateCustomVolumeFromMigration` return "Volume exists in database but
not on storage", permanently blocking migration for the affected
instance.

Separately, dead migration connections were not detected promptly,
leaving migrations hanging for minutes before timing out.

## Fix
- Fast dead-connection detection (`lxd/migration_connection.go`)
Enable `ws.StartKeepAlive()` on both incoming (`AcceptIncoming`) and
outgoing (`WebSocket`) migration WebSocket connections. Dead connections
are now detected in ~15 seconds instead of relying on TCP timeout
defaults.

- Integration test (`test/suites/clustering_move.sh`)

Add a sub-test under `test_clustering_move` that:
- Injects a phantom `storage_volumes` row via `lxd sql global`.
- Verifies the migration fails with the expected error.
- Removes the phantom row.
- Verifies the migration then succeeds.

## What this does NOT do
This PR does not auto-repair storage inconsistencies. [A follow-up
PR](#18101) will improve the error
message with recovery guidance and add documentation for the manual
cleanup steps.

Fixes #17787
…ume database entries.

Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
…e and doc link.

Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
…ror message

Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
@tugbataluy tugbataluy force-pushed the improve_error_messages_for_phantom_db_entries branch from 43f3b9c to 48dd3b5 Compare April 10, 2026 14:02
@tugbataluy tugbataluy marked this pull request as ready for review April 10, 2026 14:24

if dbVol != nil && !volExists {
-	return errors.New("Volume exists in database but not on storage")
+	return fmt.Errorf("Volume %q exists in database on member %q but not on storage, this may be a phantom entry from a previous failed migration. See https://documentation.ubuntu.com/lxd/latest/howto/cluster_recover_volumes/", args.Name, dbVol.Location)
Member

@minaelee do you have any guidance on including doc URLs in error messages? Should we avoid it, or if we do it, how should we handle the potential for broken or out-of-date URLs in the future?

Perhaps instead we should say something like "search for X in the docs"?

Contributor

I agree that we should avoid it due to the maintenance burden. I suggest something like "Refer to the how-to guide on recovering phantom volume entries in the documentation." Thanks!

List all volume entries for the affected custom volume across cluster members:

lxd sql global "SELECT storage_volumes.id, storage_volumes.name, nodes.name AS member FROM storage_volumes JOIN nodes ON storage_volumes.node_id = nodes.id WHERE storage_volumes.name = '<volume-name>'"

Contributor

@minaelee minaelee Apr 11, 2026

Suggested change
Replace `<volume-name>` with the name of the custom volume.


lxd sql global "DELETE FROM storage_volumes WHERE name='<volume-name>' AND node_id=(SELECT id FROM nodes WHERE name='<member-name>')"

Replace `<volume-name>` with the name of the custom volume and `<member-name>` with the cluster member that holds the phantom entry.
Contributor

Suggested change
Replace `<volume-name>` with the name of the custom volume and `<member-name>` with the cluster member that holds the phantom entry.
Replace `<volume-name>`, as well as `<member-name>` with the cluster member that holds the phantom entry.

List all volume entries for the affected instance across cluster members:

lxd sql global "SELECT storage_volumes.id, storage_volumes.name, nodes.name AS member FROM storage_volumes JOIN nodes ON storage_volumes.node_id = nodes.id WHERE storage_volumes.name = '<instance-name>'"

Contributor

Suggested change
Replace `<instance-name>` with the name of the affected instance.


lxd sql global "DELETE FROM storage_volumes WHERE name='<instance-name>' AND node_id=(SELECT id FROM nodes WHERE name='<member-name>')"

Replace `<instance-name>` with the name of the instance and `<member-name>` with the cluster member that holds the phantom entry.
Contributor

Suggested change
Replace `<instance-name>` with the name of the instance and `<member-name>` with the cluster member that holds the phantom entry.
Replace `instance-name`, as well as `<member-name>` with the cluster member that holds the phantom entry.

(cluster-recover-volumes)=
# How to recover phantom volume database entries

When a cluster migration (`lxc move <instance> --target <member>`) is interrupted mid-transfer (for example, due to a network failure or a killed LXD process), the target member may be left with a volume record in the global database that has no corresponding storage on disk.
Contributor

Suggested change
When a cluster migration (`lxc move <instance> --target <member>`) is interrupted mid-transfer (for example, due to a network failure or a killed LXD process), the target member may be left with a volume record in the global database that has no corresponding storage on disk.
When a {ref}`cluster instance migration <howto-cluster-manage-instance-migrate>` or {ref}`custom volume migration <howto-storage-move-volume-cluster>` is interrupted mid-transfer (for example, due to a network failure or a killed LXD process), the target member may be left with a volume record in the global database that has no corresponding storage on disk.

I would like to see both instance migration and volume migration mentioned in the introduction like this, since this page covers how to deal with failures in both.

In doc > howto > storage_move_volume.md, line 79 (above `## Copy or migrate between cluster members`), you will need to add the ref target used above, like this:

(howto-storage-move-volume-cluster)=

Comment on lines +27 to +29
After removing the phantom entry, retry the migration:

lxc move <instance> --target <member>
Contributor

Suggested change
After removing the phantom entry, retry the migration:
lxc move <instance> --target <member>
After removing the phantom entry, retry the {ref}`instance migration <howto-cluster-manage-instance-migrate>`.

We don't need to duplicate the command here when we can point to the how-to guide for this. If the use of the command changes in the future, then we won't need to update it in multiple places.

Comment on lines +51 to +53
After removing the phantom entry, retry the migration:

lxc storage volume move <pool>/<volume-name> --target <member>
Contributor

Suggested change
After removing the phantom entry, retry the migration:
lxc storage volume move <pool>/<volume-name> --target <member>
After removing the phantom entry, retry the {ref}`volume migration <howto-storage-move-volume-cluster>`.



@tomponline tomponline marked this pull request as draft April 11, 2026 09:53

Labels

Documentation (Documentation needs updating)

3 participants