Skip to content

xapi-cli-server: stop using SR records for cross pool migrations #6905

Open
psafont wants to merge 2 commits intoxapi-project:26.1-lcmfrom
psafont:pau/dev/cross-pool-fix
Open

xapi-cli-server: stop using SR records for cross pool migrations #6905
psafont wants to merge 2 commits intoxapi-project:26.1-lcmfrom
psafont:pau/dev/cross-pool-fix

Conversation

@psafont
Copy link
Member

@psafont psafont commented Feb 10, 2026

SR records have variant fields that add new variants when storage features are added, as is the case with current_operations.
This means that fetching records from remote pool cause deserialization exceptions when their records contain newer, unknown variants when using OCaml-based clients.

xe uses SR records to minimize remote calls to select the SR with the largest amount of free space during cross-pool migrations.

Change xe to use expressions for filtering SRs and minimize the amount of calls done.
This means that now PBD records are fetched, but these don't contain any variants, and have been unchanged since early 2008.

Now the amount of calls have changed from 1 per host-attached PBD to 2, but the amount of calls can be limited to shared SRs, if there are any.

This is a prerequisite to reintroduce this change: 723a498#diff-22f9f03f35e06c26ffd8f5ee9a8573167dae627ac8927d260d798c47eacbc9ceR29, which is needed to be able to add new SR operations, like VDI_revert.

I'm opening it as a draft PR to gather comments while I test that it fixes the migration of a vm to a host with SRs with new, unknown capabilities.

Copy link
Member

@edwintorok edwintorok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need a new dedicated operation to do the lookups on the coordinator's database directly where it has fast,local access to the DB?
In extreme cases there could be a lot of shared SRs, e.g. in one use-case there would be 1 PBD for an NFS SR per VM, per host! So you could have hundreds of these/host.

(* The preferred SR is determined to be as the SR that the destine host has a PDB attached to it,
and among the choices of that the shared is preferred first(as it is recommended to have shared storage
in pool to host VMs), and then the one with the maximum available space *)
(* The preferred SR is determined to be as the SR that the destine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo destination

@psafont
Copy link
Member Author

psafont commented Feb 10, 2026

In extreme cases there could be a lot of shared SRs, e.g. in one use-case there would be 1 PBD for an NFS SR per VM, per host! So you could have hundreds of these/host.

The VDI map should be able to bypass the selection of the destination SR, when all the vdis are correctly mapped. Interestingly it does the SR selection just in case, and only if it's succesful it will overlay that onto the vdi_map

…ross-pool migration

SR records have variant fields that add new variants when storage
features are added, as is the case with current_operations. This means
that fetching records from remote pools cause deserialization exceptions
when their records contain newer, unknown variants.

xe uses SR records to minimize the amount of remote calls needed for
selecting the SR with the highest amount of free space during cross-pool
migrations.

Change xe to fetch references of only non-ISO SRs, and filter PBD
records with them. This means that now PBD records are fetched, but
these don't contain any variants, and have been unchanged since early
2008.

This has the side effect of reducing the amount of calls from 2 per
host-attached PBDS to 1 per non-iso, host-attached PBD with the
additional cost of a single call to fetch the non-iso SRs

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
SR records have variant fields that add new variants when storage features are added, as is the case with current_operations.
This means that fetching records from remote pool cause deserialization exceptions when their records contain newer, unknown variants when using OCaml-based clients.

Use expression-based queries to separate shared from local SRs without records, and add 2 queries per SR to gather the free space.

Now the amount of calls have changed from 1 per host-attached PBD to 2,
but the amount of calls can be limited to shared SRs, if there are any.

Signed-off-by: Pau Ruiz Safont <pau.safont@vates.tech>
@psafont psafont force-pushed the pau/dev/cross-pool-fix branch from 2effd6a to 13d23f3 Compare February 12, 2026 14:57
@psafont
Copy link
Member Author

psafont commented Feb 12, 2026

I just migrated using xe from a 25.33.1 host to another one with 26.1.0, to an SR with VDI.revert and I didn't see any errors, so I'm puzzled as to what caused the failure in late 2024 that made us revert the change about allowed SR operations.

Maybe it's because the VM is halted? but I don't see how it should affect if it's not a live migration

[dest] # xe sr-list allowed-operations:contains=VDI.revert
uuid ( RO)                : ace8472f-7e33-9a9a-0716-dd8699ad9926
          name-label ( RW): local-files
    name-description ( RW):
                host ( RO): metavega
                type ( RO): file
        content-type ( RO):
[source] # yum info xapi-core
Installed Packages
Name        : xapi-core
Arch        : x86_64
Version     : 25.33.1
Release     : 2.3.xcpng8.3
[source] # xe vm-migrate vm=$VM remote-master=$DEST remote-username=root remote-password=xxx vdi:$VDI=ace8472f-7e33-9a9a-0716-dd8699ad9926
Performing a storage live migration. Your VM's VDIs will be migrated with the VM.
Will migrate to remote host: metavega, using remote network: Pool-wide network associated with eth0. Here is the VDI mapping:
VDI a32bfbc3-cd16-4480-8c7f-45742aa49c8f -> SR ace8472f-7e33-9a9a-0716-dd8699ad9926
[dest] # xe vm-list name-label=$VM --minimal
c04d1180-3921-50e8-19bb-d0a251f62eae

I'll try a live migration, and adding all operations to the allowed ops to try to force an error.

@psafont
Copy link
Member Author

psafont commented Feb 13, 2026

I've tested with allowing to advertise all the operations in the receiving SR, without issues.

destination:

      allowed-operations (SRO): VDI.enable_cbt; VDI.list_changed_blocks; VDI.forget; unplug; VDI.blocked; plug; PBD.create; VDI.disable_cbt; VDI.force_unlock; update; VDI.generate_config; PBD.destroy; VDI.resize; VDI.copy; VDI.clone; VDI.data_destroy; VDI.resize_online; scan; VDI.snapshot; VDI.mirror; VDI.create; VDI.revert; VDI.destroy; VDI.update

advertised features until now:

let all_ops : API.storage_operations_set =
  [
    `scan
  ; `destroy
  ; `forget
  ; `plug
  ; `unplug
  ; `vdi_create
  ; `vdi_destroy
  ; `vdi_resize
  ; `vdi_clone
  ; `vdi_snapshot
  ; `vdi_mirror
  ; `vdi_enable_cbt
  ; `vdi_disable_cbt
  ; `vdi_data_destroy
  ; `vdi_list_changed_blocks
  ; `vdi_set_on_boot
  ; `vdi_introduce
  ; `vdi_revert
  ; `update
  ; `pbd_create
  ; `pbd_destroy
  ]

Additional features exposed:

  • VDI.forget
  • VDI.blocked
  • VDI.force_unlock
  • VDI.generate_config
  • VDI.copy
  • VDI.resize_online
  • VDI.revert
  • VDI.update

I still need to properly test a live migration, but I'm surprised it works without this change, I'm now left wondering in what triggered the error in CA-402263

@minglumlu
Copy link
Member

I understand deserializing on a new/unknown value is a general problem on all variants and enums. To avoid the potential exceptions, it seems that the SDK needs to provide a default unknown always.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants