kzalys commented Oct 31, 2025

For issue https://issues.apache.org/jira/browse/CASSANDRA-20996

The auto-repair history cleanup mechanism does not use LWTs for removing auto-repair history of stale nodes. This can lead to the following race condition:

  1. Node A sends out a query to vote for Node D to be removed from auto-repair history.
  2. Node B also sends out a query to vote to remove Node D.
  3. Node A's vote is applied and is now present in the auto-repair history table.
  4. Node C sees that enough nodes have voted to remove Node D and sends out a query to delete the auto-repair history for Node D.
  5. The delete query is executed and inserts a tombstone for Node D's auto-repair history.
  6. Finally, Node B's vote to remove Node D is applied. Because this vote query carries a higher timestamp than the tombstone inserted by Node C, it resurrects Node D's auto-repair history.
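
The sequence above can be modeled with a small standalone sketch of timestamp-based (last-write-wins) reconciliation. This is only an illustration, not Cassandra's storage engine and not code from this PR: the class `LastWriteWinsSketch`, its methods, and the timestamps 10, 20 and 30 are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of Node D's auto-repair history row.
// It only captures the rule that a write survives a tombstone when its
// timestamp is higher than the tombstone's timestamp.
final class LastWriteWinsSketch
{
    // voter -> timestamp of that voter's unconditional upsert
    private final Map<String, Long> voteTimestamps = new HashMap<>();
    // timestamp of the newest row deletion, if any
    private long tombstoneTimestamp = Long.MIN_VALUE;

    void upsertVote(String voter, long timestamp)
    {
        voteTimestamps.merge(voter, timestamp, Long::max);
    }

    void deleteRow(long timestamp)
    {
        tombstoneTimestamp = Math.max(tombstoneTimestamp, timestamp);
    }

    // A vote is live only if it was written after the newest tombstone.
    Set<String> liveVotes()
    {
        Set<String> live = new TreeSet<>();
        for (Map.Entry<String, Long> entry : voteTimestamps.entrySet())
        {
            if (entry.getValue() > tombstoneTimestamp)
                live.add(entry.getKey());
        }
        return live;
    }

    // Replays steps 3, 5 and 6 with assumed timestamps.
    static Set<String> simulateRace()
    {
        LastWriteWinsSketch nodeD = new LastWriteWinsSketch();
        nodeD.upsertVote("A", 10); // step 3: Node A's vote is applied
        nodeD.deleteRow(20);       // step 5: Node C's delete inserts a tombstone
        nodeD.upsertVote("B", 30); // step 6: Node B's late vote is applied
        return nodeD.liveVotes();
    }

    public static void main(String[] args)
    {
        System.out.println("Live votes for Node D after the race: " + simulateRace());
    }
}
```

In this model Node A's vote is shadowed by the tombstone, while Node B's later vote remains live, so the row for Node D is visible again with a single vote.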

This PR introduces a unit test to simulate this out-of-order deletion/upsert and updates the auto-repair history mutations to use LWTs in order to prevent the race condition from happening.

kzalys changed the title from "Use LWTs for all auto-repair history mutations" to "CASSANDRA-20996 Use LWTs for all auto-repair history mutations" on Oct 31, 2025

kzalys commented Nov 11, 2025

@jaydeepkumar1984 can I have a review on this please?


// 2. Then, a vote to delete arrives after the row has already been deleted
AutoRepairUtils.addHostIdToDeleteHosts(repairType, votingNode, nodeToDelete);

Also clear deleted hosts as part of the late event

AutoRepairUtils.clearDeleteHosts(repairType, nodeToDelete);

@@ -538,4 +538,42 @@ public void testSkipSystemTraces()
    {
        assertFalse(AutoRepairUtils.shouldConsiderKeyspace(Keyspace.open(SchemaConstants.TRACE_KEYSPACE_NAME)));
    }

    @Test
    public void testAutoRepairHistoryOutOfOrderDeleteRaceCondition()

This test passes even without any of the changes in this PR, because ADD_HOST_ID_TO_DELETE_HOSTS already had IF EXISTS. The changes in this PR are still necessary, however.
Please clarify in the description that the PR includes the test case and covers certain cases we missed earlier.
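
As a rough illustration of why an IF EXISTS guard turns the late vote into a no-op, here is a minimal in-memory sketch. It is not the actual CQL or the AutoRepairUtils code: `ConditionalVoteSketch` and its methods are hypothetical, and `synchronized` merely stands in for the serialization that a real LWT provides.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for conditional (IF EXISTS) mutations on the
// auto-repair history table. Each method returns whether it was applied,
// loosely mirroring the [applied] flag of an LWT result.
final class ConditionalVoteSketch
{
    // host id -> set of nodes that voted to delete that host
    private final Map<String, Set<String>> rows = new HashMap<>();

    synchronized void insertRow(String hostId)
    {
        rows.putIfAbsent(hostId, new HashSet<>());
    }

    // Adds a vote only when the row is still present.
    synchronized boolean addVoteIfExists(String hostId, String votingNode)
    {
        Set<String> voters = rows.get(hostId);
        if (voters == null)
            return false; // row already deleted: the late vote is not applied
        voters.add(votingNode);
        return true;
    }

    synchronized boolean deleteIfExists(String hostId)
    {
        return rows.remove(hostId) != null;
    }

    synchronized boolean exists(String hostId)
    {
        return rows.containsKey(hostId);
    }

    // Replays the out-of-order sequence: vote, delete, then a late vote.
    static boolean lateVoteResurrectsRow()
    {
        ConditionalVoteSketch history = new ConditionalVoteSketch();
        history.insertRow("D");
        history.addVoteIfExists("D", "A"); // Node A's vote is applied
        history.deleteIfExists("D");       // Node C deletes the row
        history.addVoteIfExists("D", "B"); // Node B's late vote is rejected
        return history.exists("D");
    }

    public static void main(String[] args)
    {
        System.out.println("Row resurrected by late vote: " + lateVoteResurrectsRow());
    }
}
```

Under these assumptions the late vote finds no row and is not applied, so the deleted history stays deleted.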
