
Deadlock when creating networks in parallel in a cluster #18023

@axinojolais

Please confirm

  • I have searched existing issues to check if an issue already exists for the bug I encountered.

Distribution

Ubuntu

Distribution version

24.04.2

Output of "snap list --all lxd core20 core22 core24 snapd"

Name    Version         Rev    Tracking       Publisher   Notes
core22  20260128        2339   latest/stable  canonical✓  base,disabled
core22  20260225        2411   latest/stable  canonical✓  base
core24  20260211        1499   latest/stable  canonical✓  base,disabled
core24  20260317        1587   latest/stable  canonical✓  base
lxd     5.21.4-8caf727  37923  5.21/edge      canonical✓  disabled,in-cohort,held
lxd     git-b7b411c     38243  5.21/edge      canonical✓  in-cohort,held
snapd   2.73            25935  latest/stable  canonical✓  snapd,disabled
snapd   2.74.1          26382  latest/stable  canonical✓  snapd

Output of "lxc info" or system info if it fails

(I just found a secret in "lxc info" so I won't share the output)

lxd version git-b7b411c (5.21/edge) - I don't think the rest is helpful here

Issue description

We have a Microcloud cluster with 3 nodes.
The LXD API URL is https://domain.foo:8443/, with domain.foo resolving to the 3 IPs of the nodes.
We drive LXD through Terraform, and at this stage our plan was creating, among other things, two networks.
Applying the plan sometimes worked, but most of the time the network creations were timing out.

Steps to reproduce

@tomponline is aware and found the likely culprit. Quoting him: "a deadlock would be possible in a clustered env as https://github.com/canonical/lxd/blob/main/lxd/networks.go#L455 is taken for both external api requests and internal notification requests spawned from an external request. So if two networks are being created at same time on different members this will lock. This will be best solved by cluster wide locking, which will come in the 6.x series, but will not be in 5.21."

The workaround is to only talk to the API via a single node.
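One way to apply the workaround when the API name resolves to several members is to pin the client to a single address instead of the round-robin name. The sketch below is only an illustration: `pinEndpoint` is a hypothetical helper, the addresses come from the documentation range rather than our cluster, and it assumes the server certificate is accepted for the chosen address (a per-member DNS name would work as well).

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"sort"
)

// pinEndpoint picks one address from a multi-address DNS answer in a
// deterministic way (the lowest one after sorting) and builds the API URL,
// so that every request from the client goes to the same cluster member.
func pinEndpoint(addrs []string, port string) (string, error) {
	if len(addrs) == 0 {
		return "", errors.New("no addresses to choose from")
	}
	sorted := append([]string(nil), addrs...) // copy, leave the input untouched
	sort.Strings(sorted)
	return "https://" + net.JoinHostPort(sorted[0], port), nil
}

func main() {
	// Placeholder addresses; in practice these would come from resolving
	// the API host name (for example with net.LookupHost).
	addrs := []string{"192.0.2.12", "192.0.2.10", "192.0.2.11"}
	url, err := pinEndpoint(addrs, "8443")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(url) // prints "https://192.0.2.10:8443"
}
```

The resulting URL would then be used as the remote address in the Terraform provider configuration in place of the name that resolves to all three nodes.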

Information to attach

  • Any relevant kernel output (dmesg)
  • Instance log (lxc info NAME --show-log)
  • Instance configuration (lxc config show NAME --expanded)
  • Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
  • Output of the client with --debug
  • Output of the daemon with --debug (or use lxc monitor while reproducing the issue)
