[BUG] V2 volume stuck in volume attachment (V2 interrupt mode) #11816

@mcerveny

Description

Describe the Bug

Attaching a V2 volume (V2 data engine in interrupt mode; poll mode works) never finishes. The disks are native NVMe.
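
For context, the only configuration difference from the working setup is the V2 data engine's interrupt mode. A minimal sketch of how that mode is toggled via a Longhorn setting is below; the setting name data-engine-interrupt-mode and the "true"/"false" value type are assumptions here (check the settings reference for your Longhorn release). The failing attach then produces the log excerpt that follows.

# Hypothetical setting name; verify against the Longhorn v1.10 settings reference.
# Longhorn settings are CRs in the longhorn-system namespace.
kubectl -n longhorn-system get settings.longhorn.io | grep -i interrupt

# Enable interrupt mode for the V2 data engine (value type assumed to be "true"/"false").
kubectl -n longhorn-system patch settings.longhorn.io data-engine-interrupt-mode \
  --type merge -p '{"value":"true"}'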

[longhorn-instance-manager] time="2025-09-18T13:57:29.642404886Z" level=info msg="Creating instance" func="instance.(*Server).InstanceCreate" file="instance.go:116" dataEngine=DATA_ENGINE_V2 name=v2test36-r-a9f41bfc type=replica upgradeRequired=false
[2025-09-18 13:57:29.662629] bdev.c:8723:bdev_open_ext: *NOTICE*: Currently unable to find bdev with name: d83833ab97eacecdeb0a188f234799cfn1/v2test36-r-a9f41bfc
[longhorn-instance-manager] time="2025-09-18T13:57:29.674625164Z" level=info msg="Replica created a new head lvol" func="log.(*SafeLogger).Info" file="log.go:66" lvsName=d83833ab97eacecdeb0a188f234799cfn1 lvsUUID=fb68f02b-b1f2-4ac9-8b35-18621a8e7f93 replicaName=v2test36-r-a9f41bfc
[2025-09-18 13:57:29.718621] tcp.c: 759:nvmf_tcp_create: *NOTICE*: *** TCP Transport Init ***
[2025-09-18 13:57:29.742770] tcp.c:1103:nvmf_tcp_listen: *NOTICE*: *** NVMe/TCP Target Listening on 10.33.200.8 port 20001 ***
[longhorn-instance-manager] time="2025-09-18T13:57:29.751752709Z" level=info msg="Created replica" func="log.(*SafeLogger).Info" file="log.go:66" lvsName=d83833ab97eacecdeb0a188f234799cfn1 lvsUUID=fb68f02b-b1f2-4ac9-8b35-18621a8e7f93 replicaName=v2test36-r-a9f41bfc
[longhorn-instance-manager] time="2025-09-18T13:57:30.83629071Z" level=info msg="Creating instance" func="instance.(*Server).InstanceCreate" file="instance.go:116" dataEngine=DATA_ENGINE_V2 name=v2test36-e-0 type=engine upgradeRequired=false
[longhorn-instance-manager] time="2025-09-18T13:57:30.837017197Z" level=info msg="Creating engine" func="spdk.(*Engine).Create" file="engine.go:203" engineName=v2test36-e-0 frontend=spdk-tcp-blockdev initiatorAddress=10.33.200.8 portCount=1 replicaAddressMap="map[v2test36-r-23f3f887:10.33.200.4:20001 v2test36-r-a28dabf0:10.33.200.3:20001 v2test36-r-a9f41bfc:10.33.200.8:20001]" salvageRequested=false targetAddress=10.33.200.8 volumeName=v2test36
[longhorn-instance-manager] time="2025-09-18T13:57:30.840589422Z" level=info msg="Creating both initiator and target instances" func="log.(*SafeLogger).Info" file="log.go:66" engineName=v2test36-e-0 frontend=spdk-tcp-blockdev volumeName=v2test36
[2025-09-18 13:57:30.842613] bdev.c:8723:bdev_open_ext: *NOTICE*: Currently unable to find bdev with name: v2test36-e-0
[2025-09-18 13:57:30.850597] bdev_nvme.c:7088:spdk_bdev_nvme_delete: *ERROR*: Failed to find NVMe bdev controller
[2025-09-18 13:57:30.858604] bdev_nvme.c:6762:spdk_bdev_nvme_create: *NOTICE*: Updating global NVMe transport type (g_nvme_trtype) from PCIe to TCP (base-name: v2test36-r-a9f41bfc)
[2025-09-18 13:57:30.917166] nvme_transport.c: 580:nvme_qpair_connect_completion_cb: *NOTICE*: NVMe qpair 0x3522e00 connected successfully.

Expected next log line, building the RAID1 (taken from a working poll-mode V2 volume):

[longhorn-instance-manager] time="2025-09-18T12:44:08.448979351Z" level=info msg="Connecting all available replicas map[v2test33-r-007d8fdc:0xc001183a10 v2test33-r-67ddf076:0xc001183410 v2test33-r-b48a5efb:0xc001302bd0], then launching raid during engine creation" func="log.(*SafeLogger).Infof" file="log.go:73" engineName=v2test33-e-0 frontend=spdk-tcp-blockdev initiatorIP=10.33.200.4 replicaStatusMap="map[v2test33-r-007d8fdc:0xc001183a10 v2test33-r-67ddf076:0xc001183410 v2test33-r-b48a5efb:0xc001302bd0]" targetIP=10.33.200.4 volumeName=v2test33

To Reproduce

Expected Behavior

successful attachment

Support Bundle for Troubleshooting

The support bundle contains many attempts; the last one is volume "v2test36": created around 13:56:*, attach attempted at 13:57:*, delete attempted at 14:09:*, V2 instance-managers restarted at 14:11:*, and orphaned V2 volumes deleted at 14:13:*.

supportbundle_09afb238-c96d-4010-9f95-f6de59c721df_2025-09-18T14-20-06Z.zip

Environment

  • Longhorn version: v1.10.0-rc3
  • Impacted volume (PV): v2test36
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Harvester v1.6.0 (RKE2)
    • Number of control plane nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: SLE Micro 5.5 / Harvester v1.6.0
    • Kernel version: 5.14.21-150500.55.116-default
    • CPU per node: 8C/16T
    • Memory per node: >=64GB
    • Disk type (e.g. SSD/NVMe/HDD): 2xNVMe
    • Network bandwidth between the nodes (Gbps): LACP-2x2.5Gb/s
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal NVMe
  • Number of Longhorn volumes in the cluster: many V1, none V2

Additional context

No response

Workaround and Mitigation

No response

Labels

  • area/spdk: SPDK upstream/downstream
  • area/v2-data-engine: v2 data engine (SPDK)
  • area/volume-attach-detach: Volume attach & detach related
  • backport/1.10.1: Require to backport to 1.10.1 release branch
  • kind/bug
  • priority/0: Must be implemented or fixed in this release (managed by PO)
  • require/auto-e2e-test: Require adding/updating auto e2e test cases if they can be automated
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.

Status

Closed