# Improve error messages for phantom db entries #18101

tugbataluy wants to merge 4 commits into canonical:main

## Conversation
…rupted cluster migrations (#17969)

## Checklist

- [x] I have read the [contributing guidelines](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md) and attest that all commits in this PR are [signed off](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md#including-a-signed-off-by-line-in-your-commits), [cryptographically signed](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md#commit-signature-verification), and follow this project's [commit structure](https://github.com/canonical/lxd/blob/main/CONTRIBUTING.md#commit-structure).
- [x] I have checked and added or updated relevant documentation.

## Problem

When a cluster migration is interrupted mid-transfer (network drop, `kill -9` on the target), phantom `storage_volumes` records can be left in the global database with no corresponding storage on disk (as mentioned in #17787). On retry, the consistency checks in `CreateInstanceFromMigration` and `CreateCustomVolumeFromMigration` return "Volume exists in database but not on storage", permanently blocking migration for the affected instance.

Separately, dead migration connections were not detected promptly, leaving migrations hanging for minutes before timing out.

## Fix

- Fast dead-connection detection (`lxd/migration_connection.go`): enable `ws.StartKeepAlive()` on both incoming (`AcceptIncoming`) and outgoing (`WebSocket`) migration WebSocket connections. Dead connections are now detected in ~15 seconds instead of relying on TCP timeout defaults.
- Integration test (`test/suites/clustering_move.sh`): add a sub-test under `test_clustering_move` that:
  - injects a phantom `storage_volumes` row via `lxd sql global`,
  - verifies the migration fails with the expected error,
  - removes the phantom row,
  - verifies the migration then succeeds.

## What this does NOT do

This PR does not auto-repair storage inconsistencies. [A follow-up PR](#18101) will improve the error message with recovery guidance and add documentation for the manual cleanup steps.

Fixes #17787
…ume database entries. Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
…e and doc link. Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
…ror message Signed-off-by: tugbataluy <tugba.taluy@canonical.com>
43f3b9c to 48dd3b5
```diff
 	if dbVol != nil && !volExists {
-		return errors.New("Volume exists in database but not on storage")
+		return fmt.Errorf("Volume %q exists in database on member %q but not on storage, this may be a phantom entry from a previous failed migration. See https://documentation.ubuntu.com/lxd/latest/howto/cluster_recover_volumes/", args.Name, dbVol.Location)
```
@minaelee do you have any guidance on including doc URLs in errors? Should we avoid it, or if we do it, how should we handle the potential for broken or out-of-date URLs in the future?
Perhaps instead we should say something like "search for X in the docs"?
I agree that we should avoid it due to the maintenance burden. I suggest something like "Refer to the how-to guide on recovering phantom volume entries in the documentation." Thanks!
List all volume entries for the affected custom volume across cluster members:

```
lxd sql global "SELECT storage_volumes.id, storage_volumes.name, nodes.name AS member FROM storage_volumes JOIN nodes ON storage_volumes.node_id = nodes.id WHERE storage_volumes.name = '<volume-name>'"
```
Replace `<volume-name>` with the name of the custom volume.

```
lxd sql global "DELETE FROM storage_volumes WHERE name='<volume-name>' AND node_id=(SELECT id FROM nodes WHERE name='<member-name>')"
```

Replace `<volume-name>` with the name of the custom volume and `<member-name>` with the cluster member that holds the phantom entry.
Suggested change:

```suggestion
Replace `<volume-name>` with the name of the custom volume, and `<member-name>` with the cluster member that holds the phantom entry.
```
List all volume entries for the affected instance across cluster members:

```
lxd sql global "SELECT storage_volumes.id, storage_volumes.name, nodes.name AS member FROM storage_volumes JOIN nodes ON storage_volumes.node_id = nodes.id WHERE storage_volumes.name = '<instance-name>'"
```
Replace `<instance-name>` with the name of the affected instance.

```
lxd sql global "DELETE FROM storage_volumes WHERE name='<instance-name>' AND node_id=(SELECT id FROM nodes WHERE name='<member-name>')"
```

Replace `<instance-name>` with the name of the instance and `<member-name>` with the cluster member that holds the phantom entry.
Suggested change:

```suggestion
Replace `<instance-name>` with the name of the instance, and `<member-name>` with the cluster member that holds the phantom entry.
```
(cluster-recover-volumes)=
# How to recover phantom volume database entries

When a cluster migration (`lxc move <instance> --target <member>`) is interrupted mid-transfer (for example, due to a network failure or a killed LXD process), the target member may be left with a volume record in the global database that has no corresponding storage on disk.
Suggested change:

```suggestion
When a {ref}`cluster instance migration <howto-cluster-manage-instance-migrate>` or {ref}`custom volume migration <howto-storage-move-volume-cluster>` is interrupted mid-transfer (for example, due to a network failure or a killed LXD process), the target member may be left with a volume record in the global database that has no corresponding storage on disk.
```
I would like to see both instance migration and volume migration mentioned in the introduction like this, since this page covers how to deal with failures in both.
In `doc/howto/storage_move_volume.md`, line 79 (above `## Copy or migrate between cluster members`), you will need to add the ref target used above, like this:

```
(howto-storage-move-volume-cluster)=
```
After removing the phantom entry, retry the migration:

```
lxc move <instance> --target <member>
```
Suggested change:

```suggestion
After removing the phantom entry, retry the {ref}`instance migration <howto-cluster-manage-instance-migrate>`.
```

We don't need to duplicate the command here when we can point to the how-to guide for this. If the use of the command changes in the future, then we won't need to update it in multiple places.
After removing the phantom entry, retry the migration:

```
lxc storage volume move <pool>/<volume-name> --target <member>
```
Suggested change:

```suggestion
After removing the phantom entry, retry the {ref}`volume migration <howto-storage-move-volume-cluster>`.
```
## Checklist

Follow-up to #17969 (the first phantom DB entries PR):

- Update `CreateInstanceFromMigration` and `CreateCustomVolumeFromMigration` to include the volume name, affected cluster member, and a link to the recovery documentation.