
Partition ownership overlap during unideal circumstances #732

@tobiajo

Description


I have observed that during unideal circumstances, such as Kafka maintenance, network issues, or many restarts during high load, there is a possibility of having multiple active PartitionFlows for the same partition, and therefore multiple snapshot store writers for the same key. In the worst case this can lead to newer snapshots being overwritten by older snapshots.

On the consumer side, partition ownership is properly guaranteed by Kafka, but when a partition is revoked on session timeout, there is a delay before the previous owner becomes aware of it and the partition flow is stopped. In this time window there is a risk of data corruption: only when the previous owner next polls or commits does it discover that it has been kicked out of the group.

Scenario

Here is a figure describing what can happen:

```mermaid
sequenceDiagram
    autonumber

    participant inputTopic
    participant nodeA
    participant nodeB
    participant nodeC
    participant snapshotStore

    inputTopic-->>+nodeA: assign partition 3
    nodeA->>snapshotStore: read
    snapshotStore-->>nodeA: state_0
    nodeA->>inputTopic: poll events
    inputTopic-->>nodeA: seqNr 1 to 5
    nodeA->>nodeA: state_5 = state_0 + events seqNr 1 to 5
    Note over inputTopic, nodeA: unideal circumstances start,<br/>nodeA revoked from partition 3

    inputTopic-->>+nodeB: assign partition 3
    nodeB->>snapshotStore: read
    snapshotStore-->>nodeB: state_0
    nodeB->>inputTopic: poll events
    inputTopic-->>nodeB: seqNr 1 to 5
    nodeB->>nodeB: state_5 = state_0 + events seqNr 1 to 5
    nodeB->>inputTopic: poll events
    inputTopic-->>nodeB: seqNr 6 to 10
    nodeB->>nodeB: state_10 = state_5 + events seqNr 6 to 10
    nodeB->>snapshotStore: write state_10
    nodeB->>inputTopic: commit seqNr 10

    nodeA->>snapshotStore: ⚠️ write state_5
    nodeA->>inputTopic: ❌ commit seqNr 5
    inputTopic-->>nodeA: not part of active group
    nodeA->>-nodeA: stop partition flow

    nodeB->>-nodeB: stop partition flow

    inputTopic-->>+nodeC: assign partition 3
    nodeC->>snapshotStore: read
    snapshotStore-->>nodeC: state_5
    nodeC->>inputTopic: poll events
    inputTopic-->>nodeC: seqNr 11 to 15
    nodeC->>-nodeC: 💥 corrupt = state_5 + events seqNr 11 to 15
```

In a very extreme case, similar to the scenario in the figure, I observed that nodeA stopped its partition flow only 45 seconds after nodeB had started a flow for the same partition, since nodeA did not become aware of being kicked out until it tried to commit the offset (step 19 in the diagram) and got an error response (step 20).
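The core failure can be reproduced with a tiny in-memory sketch (hypothetical names; a plain last-write-wins dict stands in for the real snapshot store):

```python
# Minimal simulation of the scenario: a last-write-wins snapshot store
# lets a slow, already-revoked owner (nodeA) overwrite a newer snapshot.
store = {}  # key -> state; stands in for the snapshot store

def write_snapshot(key, state):
    # No ownership or freshness check: whoever writes last wins.
    store[key] = state

# nodeB, the current owner, processes events 1..10 and persists state_10.
write_snapshot("partition-3", "state_10")

# nodeA, unaware it was revoked, still holds state_5 and writes it late.
write_snapshot("partition-3", "state_5")

# The newer snapshot has been silently overwritten by the stale one:
# nodeC will now recover from state_5 and re-apply from the wrong offset.
print(store["partition-3"])  # prints "state_5", not "state_10"
```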

Solution

I don't have a straightforward solution to the problem, but the main thing that needs to be prevented is stale writes. Here are some ideas:

  1. Transactional snapshot write + offset commit
    If a snapshot could only be written when the consumer can commit, it would ensure that only the proper partition owner can write.
  2. Transactional snapshot read + snapshot write
    If the value in the snapshot store is asserted to still be the previously read one, stale writes could be prevented.
  3. Offset tracking in snapshot
    If snapshots contained the offset, on recovery the lowest offset across all keys in the partition could be used.
  4. Distributed lock
    If partition ownership could be made fully exclusive through some lease, multiple writers would be impossible.
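Idea 2 amounts to a compare-and-set on the snapshot value. A minimal sketch of what that could look like, with hypothetical names (a real implementation would use the conditional-write primitive of the underlying store):

```python
class SnapshotStore:
    """In-memory stand-in for a snapshot store with conditional writes."""

    def __init__(self):
        self._data = {}  # key -> state

    def read(self, key):
        return self._data.get(key)

    def compare_and_set(self, key, expected_previous, new_state):
        # Write only if the stored value is still the one we read earlier;
        # a stale writer holding an outdated `expected_previous` is rejected.
        if self._data.get(key) != expected_previous:
            return False
        self._data[key] = new_state
        return True

store = SnapshotStore()

# Both nodeA and nodeB read the same initial snapshot.
state_0 = store.read("key")

# nodeB makes progress and persists state_10 on top of state_0.
assert store.compare_and_set("key", state_0, "state_10")

# nodeA's late write of state_5 is rejected: the store no longer
# holds state_0, so the stale write cannot corrupt the snapshot.
assert not store.compare_and_set("key", state_0, "state_5")
```

The same shape also covers idea 4 if the expected value is an ownership epoch or lease token rather than the previous state.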

Workarounds

In the meanwhile, I would appreciate any ideas for workarounds. The only thing I can think of is using static partition assignments, making it impossible for multiple nodes to be active for the same partition.
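A static assignment could be as simple as a fixed, deterministic node-to-partitions mapping that bypasses the group rebalance protocol entirely (sketch with hypothetical names; with the plain Kafka consumer this corresponds to using `assign()` instead of `subscribe()`):

```python
# Hypothetical static assignment: each node is configured with a fixed,
# non-overlapping set of partitions, so no rebalance can ever hand the
# same partition to two nodes at once.
NUM_PARTITIONS = 6
NODES = ["nodeA", "nodeB", "nodeC"]

def static_assignment(node):
    # Deterministic round-robin split of partitions across nodes.
    idx = NODES.index(node)
    return [p for p in range(NUM_PARTITIONS) if p % len(NODES) == idx]

assignments = {node: static_assignment(node) for node in NODES}

# Sanity check: no partition appears in more than one node's assignment.
all_partitions = [p for ps in assignments.values() for p in ps]
assert len(all_partitions) == len(set(all_partitions))
```

The trade-off is that a crashed node's partitions are not taken over automatically; some external mechanism has to restart it.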
