
Partition ownership overlap during unideal circumstances #732

@tobiajo

Description


I have observed that during unideal circumstances, such as Kafka maintenance, network issues, or many restarts during high load, there is a possibility of having multiple active PartitionFlows for the same partition, and therefore multiple snapshot store writers for the same key. In the worst case this can lead to newer snapshots being overwritten by older snapshots.

On the consumer side, partition ownership is properly guaranteed by Kafka, but when a partition is revoked on session timeout, there is a delay before the previous owner becomes aware of it and the partition flow is stopped. In this time window there is a risk of data corruption: only when the previous owner next polls or commits does it discover that it has been kicked out of the group.

Scenario

Here is a figure describing what can happen:

```mermaid
sequenceDiagram
    autonumber

    participant inputTopic
    participant nodeA
    participant nodeB
    participant nodeC
    participant snapshotStore

    inputTopic-->>+nodeA: assign partition 3
    nodeA->>snapshotStore: read
    snapshotStore-->>nodeA: state_0
    nodeA->>inputTopic: poll events
    inputTopic-->>nodeA: seqNr 1 to 5
    nodeA->>nodeA: state_5 = state_0 + events seqNr 1 to 5
    Note over inputTopic, nodeA: unideal circumstances start,<br/>nodeA revoked from partition 3

    inputTopic-->>+nodeB: assign partition 3
    nodeB->>snapshotStore: read
    snapshotStore-->>nodeB: state_0
    nodeB->>inputTopic: poll events
    inputTopic-->>nodeB: seqNr 1 to 5
    nodeB->>nodeB: state_5 = state_0 + events seqNr 1 to 5
    nodeB->>inputTopic: poll events
    inputTopic-->>nodeB: seqNr 6 to 10
    nodeB->>nodeB: state_10 = state_5 + events seqNr 6 to 10
    nodeB->>snapshotStore: write state_10
    nodeB->>inputTopic: commit seqNr 10

    nodeA->>snapshotStore: ⚠️ write state_5
    nodeA->>inputTopic: ❌ commit seqNr 5
    inputTopic-->>nodeA: not part of active group
    nodeA->>-nodeA: stop partition flow

    nodeB->>-nodeB: stop partition flow

    inputTopic-->>+nodeC: assign partition 3
    nodeC->>snapshotStore: read
    snapshotStore-->>nodeC: state_5
    nodeC->>inputTopic: poll events
    inputTopic-->>nodeC: seqNr 11 to 15
    nodeC->>-nodeC: 💥 corrupt = state_5 + events seqNr 11 to 15
```

In a very extreme case, similar to the scenario in the figure, I observed that nodeA stopped its partition flow only 45 seconds after nodeB had started a flow for the same partition, since nodeA did not become aware of being kicked out until it tried to commit the offset (step 19 in the diagram) and got an error response (step 20).
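The core failure can be reproduced with a tiny in-memory sketch (hypothetical names; a plain last-write-wins dict stands in for the real snapshot store):

```python
# Minimal simulation of the scenario: a last-write-wins snapshot store
# lets a slow, already-revoked owner (nodeA) overwrite a newer snapshot.
store = {}  # key -> state; stands in for the snapshot store

def write_snapshot(key, state):
    # No ownership or freshness check: whoever writes last wins.
    store[key] = state

# nodeB, the current owner, processes events 1..10 and persists state_10.
write_snapshot("partition-3", "state_10")

# nodeA, unaware it was revoked, still holds state_5 and writes it late.
write_snapshot("partition-3", "state_5")

# The newer snapshot has been silently overwritten by the stale one:
# nodeC will now recover from state_5 and re-apply from the wrong offset.
print(store["partition-3"])  # prints "state_5", not "state_10"
```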

Solution

I don't have a straightforward solution to the problem, but the main thing that needs to be prevented is stale writes. Here are some ideas:

  1. Transactional snapshot write + offset commit
    If a snapshot could only be written when the consumer can commit, it would ensure that only the proper partition owner can write.
  2. Transactional snapshot read + snapshot write
    If the value in the snapshot store is asserted to still be the previously read one, stale writes could be prevented.
  3. Offset tracking in snapshot
    If snapshots contained the offset, on recovery the lowest offset across all keys in the partition could be used.
  4. Distributed lock
    If partition ownership could be made fully exclusive through some lease, multiple writers would be impossible.
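Idea 2 amounts to a compare-and-set on the snapshot value. A minimal sketch of what that could look like, with hypothetical names (a real implementation would use the conditional-write primitive of the underlying store):

```python
class SnapshotStore:
    """In-memory stand-in for a snapshot store with conditional writes."""

    def __init__(self):
        self._data = {}  # key -> state

    def read(self, key):
        return self._data.get(key)

    def compare_and_set(self, key, expected_previous, new_state):
        # Write only if the stored value is still the one we read earlier;
        # a stale writer holding an outdated `expected_previous` is rejected.
        if self._data.get(key) != expected_previous:
            return False
        self._data[key] = new_state
        return True

store = SnapshotStore()

# Both nodeA and nodeB read the same initial snapshot.
state_0 = store.read("key")

# nodeB makes progress and persists state_10 on top of state_0.
assert store.compare_and_set("key", state_0, "state_10")

# nodeA's late write of state_5 is rejected: the store no longer
# holds state_0, so the stale write cannot corrupt the snapshot.
assert not store.compare_and_set("key", state_0, "state_5")
```

The same shape also covers idea 4 if the expected value is an ownership epoch or lease token rather than the previous state.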

Workarounds

In the meanwhile, I would appreciate any ideas for workarounds. The only thing I can think of is using static partition assignments, making it impossible for multiple nodes to be active for the same partition.
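A static assignment could be as simple as a fixed, deterministic node-to-partitions mapping that bypasses the group rebalance protocol entirely (sketch with hypothetical names; with the plain Kafka consumer this corresponds to using `assign()` instead of `subscribe()`):

```python
# Hypothetical static assignment: each node is configured with a fixed,
# non-overlapping set of partitions, so no rebalance can ever hand the
# same partition to two nodes at once.
NUM_PARTITIONS = 6
NODES = ["nodeA", "nodeB", "nodeC"]

def static_assignment(node):
    # Deterministic round-robin split of partitions across nodes.
    idx = NODES.index(node)
    return [p for p in range(NUM_PARTITIONS) if p % len(NODES) == idx]

assignments = {node: static_assignment(node) for node in NODES}

# Sanity check: no partition appears in more than one node's assignment.
all_partitions = [p for ps in assignments.values() for p in ps]
assert len(all_partitions) == len(set(all_partitions))
```

The trade-off is that a crashed node's partitions are not taken over automatically; some external mechanism has to restart it.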
