HDFS-17821. Fix the SNN repeatedly checkpoint after fsimage transfer failure on one of the multiple NNs #7876
+51
−2
In our cluster with observer NNs, when the standby NN performs a checkpoint and sends the fsimage to the other NNs, the transfer to any one NN can fail because of network anomalies, an NN restart, or other exceptions. When that happens, the standby treats the whole checkpoint as failed, does not update its lastCheckpointTime, and retries the checkpoint.
However, the active or observer NNs that received the fsimage successfully have already updated their lastCheckpointTime, while the NN whose transfer failed has not, so lastCheckpointTime becomes inconsistent across the NNs. Subsequent checkpoints then repeatedly fail to send the fsimage to some or all of the active or observer NNs, because those NNs do not satisfy the DFS_NAMENODE_CHECKPOINT_PERIOD_KEY condition.
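To make the inconsistency concrete, here is a minimal sketch of how a receiver-side period check could produce it. This is a simplified model, not the actual NameNode code: the class `CheckpointPeriodSketch`, the method `acceptsImage`, and the timestamps are all hypothetical, and the real check may involve additional conditions.

```java
public class CheckpointPeriodSketch {

    /**
     * Hypothetical, simplified receiver-side check: an NN accepts a new
     * fsimage only if at least one checkpoint period has elapsed since
     * its own lastCheckpointTime. All values are in seconds.
     */
    static boolean acceptsImage(long nowSec, long lastCheckpointTimeSec, long checkpointPeriodSec) {
        return nowSec - lastCheckpointTimeSec >= checkpointPeriodSec;
    }

    public static void main(String[] args) {
        long period = 3600;        // assumed checkpoint period
        long firstAttempt = 3600;  // time of the checkpoint whose transfer partly failed

        // NN-A received the fsimage and updated its lastCheckpointTime;
        // the transfer to NN-B failed, so NN-B still has the old value.
        long lastCheckpointA = firstAttempt;
        long lastCheckpointB = 0;

        // The standby retries one minute later.
        long retry = firstAttempt + 60;
        System.out.println("NN-A accepts retry: " + acceptsImage(retry, lastCheckpointA, period));
        System.out.println("NN-B accepts retry: " + acceptsImage(retry, lastCheckpointB, period));
    }
}
```

In this model NN-A rejects the retried upload because its period has not elapsed, so the standby sees another failed transfer even though NN-B would now accept the image.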
As a result, the SNN keeps failing the checkpoint and retrying. I think the SNN should consider the checkpoint successful and update its lastCheckpointTime if the fsimage transfer succeeds on at least half of the NNs.
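The proposed rule can be sketched as follows. This is only an illustration of the "at least half" decision, not the code in this patch: `CheckpointQuorumSketch`, `checkpointSucceeded`, and `nextLastCheckpointTime` are hypothetical names, and each boolean stands for the outcome of the fsimage transfer to one remote NN.

```java
import java.util.List;

public class CheckpointQuorumSketch {

    /**
     * Returns true if the fsimage transfer succeeded on at least half of
     * the target NNs. An empty list counts as success here, since no
     * transfer could have failed (an assumption of this sketch).
     */
    static boolean checkpointSucceeded(List<Boolean> uploadResults) {
        long succeeded = uploadResults.stream().filter(Boolean::booleanValue).count();
        return succeeded * 2 >= uploadResults.size();
    }

    /**
     * Hypothetical helper: advance lastCheckpointTime to "now" only when
     * the checkpoint is considered successful; otherwise keep the old value.
     */
    static long nextLastCheckpointTime(long previous, long now, List<Boolean> uploadResults) {
        return checkpointSucceeded(uploadResults) ? now : previous;
    }

    public static void main(String[] args) {
        // One active and two observer NNs; the transfer to one observer failed.
        List<Boolean> results = List.of(true, true, false);
        System.out.println("checkpoint succeeded: " + checkpointSucceeded(results));
        System.out.println("lastCheckpointTime: " + nextLastCheckpointTime(1000L, 4600L, results));
    }
}
```

With this rule, a single failed transfer among three NNs no longer forces the standby to retry immediately, which is the loop described above.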