On offline node, cancelled context doesn't terminate transaction #877

@skozina

I've noticed that on an offline node, if a transaction context is cancelled, the transaction is not terminated. It only terminates when the node gets back online.
More specifically, this is how I noticed this behavior:

  1. Create a cluster of 3 nodes
  2. Disconnect node1 from the other nodes (I've done this with ip link set enp5s0 down on the node1 VM, where enp5s0 was the only network interface used for the dqlite connection)
  3. Initiate a write transaction from this node (the specific SQL command was UPDATE operations SET updated_at = ?, status_code = ?, metadata = ?, error = ?, error_code = ? WHERE uuid = ?)
  4. Cancel the context used for the transaction
  5. Notice that the transaction doesn't end immediately

It doesn't matter whether node1 was initially the database leader or not. I can see the same behavior with either the leader or a voter node.
I've only reproduced this in LXD, using custom debugging output. The relevant debug output after the node was disconnected from the rest of the cluster:

# Initiate the write transaction
ERROR  [2026-03-30T19:06:53Z] Committing operation metadata to database with context  context="&{0x9cf01459dc0 0x109ad00}" operation=019d4024-38c2-7a93-a643-e1330f3ed689
WARNING[2026-03-30T19:07:03Z] Transaction timed out. Retrying once          err="Failed beginning transaction: context deadline exceeded" member=3
ERROR  [2026-03-30T19:07:07Z] Heartbeat not received in time, cancelling durable operations 
# This is when the transaction context is cancelled
ERROR  [2026-03-30T19:07:07Z] Cancelling durable operation due to missed heartbeat context  operation=019d4024-38c2-7a93-a643-e1330f3ed689
# This is when the transaction actually terminates.
ERROR  [2026-03-30T19:07:13Z] Finished committing operation metadata to database with context  context="&{0x9cf01459dc0 0x109ad00}" err="Failed updating operation \"019d4024-38c2-7a93-a643-e1330f3ed689\" record: Failed beginning transaction: failed to create dqlite connection: no available dqlite leader server found" operation=019d4024-38c2-7a93-a643-e1330f3ed689    

Note the roughly 6s gap between the cancellation and the actual termination (19:07:07 until 19:07:13), and the error message ("failed to create dqlite connection: no available dqlite leader server found"). This shows that the transaction was not terminated because the context was cancelled, but because it failed to reach the cluster leader.

Is this intended behavior? Should the transaction terminate immediately when the context is cancelled?
