HDFS-17821. Fix the SNN repeatedly checkpoint after fsimage transfer failure on one of the multiple NNs #7876

Open
wants to merge 1 commit into
base: trunk
from


lfxy
Contributor

@lfxy lfxy commented Aug 16, 2025

In our cluster with observer NNs, when the standby NN (SNN) performs a checkpoint and sends the fsimage to the other NNs, the transfer to one NN can fail because of network anomalies, an NN restart, or other exceptions. The standby then treats the whole checkpoint as failed, does not update its lastCheckpointTime, and retries the checkpoint.
However, the active or observer NNs that received the fsimage successfully have already updated their lastCheckpointTime, while the NN whose transfer failed has not, so lastCheckpointTime becomes inconsistent across the NNs. Subsequent checkpoints then repeatedly fail to send the fsimage to some or all of the active and observer NNs, because those NNs do not satisfy the DFS_NAMENODE_CHECKPOINT_PERIOD_KEY condition.
As a result, the SNN keeps failing the checkpoint and retrying it. I think the SNN should consider the checkpoint successful and update its lastCheckpointTime if the fsimage transfer succeeds on at least half of the NNs.
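
To make the proposed rule concrete, here is a minimal standalone sketch of the "at least half" check. It is not the code in this patch: the class `CheckpointQuorumSketch`, the enum `UploadResult`, and both method names are hypothetical and exist only for illustration, and the sketch assumes that exactly half counts as success, as the description states.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: names here are hypothetical, not Hadoop APIs. */
public class CheckpointQuorumSketch {

  /** Hypothetical outcome of uploading the fsimage to one remote NN. */
  enum UploadResult { SUCCESS, FAILURE }

  /**
   * Returns true when the fsimage upload succeeded on at least half of
   * the remote NNs. With an empty list the rule is trivially satisfied.
   */
  static boolean isCheckpointSuccessful(List<UploadResult> results) {
    long succeeded = results.stream()
        .filter(r -> r == UploadResult.SUCCESS)
        .count();
    // "At least half": succeeded / total >= 1/2, written without division.
    return succeeded * 2 >= results.size();
  }

  /**
   * Returns the value the standby would keep as its last checkpoint time:
   * the current time if the checkpoint counts as successful, otherwise
   * the previous value, which leads to a retry.
   */
  static long nextLastCheckpointTime(long previous, long now,
      List<UploadResult> results) {
    return isCheckpointSuccessful(results) ? now : previous;
  }

  public static void main(String[] args) {
    // Two remote NNs accepted the image, one upload failed.
    List<UploadResult> oneFailure = Arrays.asList(
        UploadResult.SUCCESS, UploadResult.SUCCESS, UploadResult.FAILURE);
    System.out.println(isCheckpointSuccessful(oneFailure));

    // Only one of three uploads succeeded.
    List<UploadResult> twoFailures = Arrays.asList(
        UploadResult.SUCCESS, UploadResult.FAILURE, UploadResult.FAILURE);
    System.out.println(isCheckpointSuccessful(twoFailures));
  }
}
```

Under this rule a single failed upload among three remote NNs no longer forces a retry, while a checkpoint in which most uploads fail is still retried.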

@lfxy lfxy force-pushed the feature/HDFS-17821 branch from 3bf0ae1 to 867b137 Compare August 17, 2025 15:20
@hadoop-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|------|-----------|---------|---------|---------|
| +0 🆗 | reexec | 9m 45s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 25m 47s | | trunk passed |
| +1 💚 | compile | 0m 44s | | trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 |
| +1 💚 | compile | 0m 39s | | trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| +1 💚 | checkstyle | 0m 39s | | trunk passed |
| +1 💚 | mvnsite | 0m 44s | | trunk passed |
| +1 💚 | javadoc | 0m 43s | | trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 |
| +1 💚 | javadoc | 1m 7s | | trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| +1 💚 | spotbugs | 1m 46s | | trunk passed |
| +1 💚 | shadedclient | 22m 4s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 38s | | the patch passed |
| +1 💚 | compile | 0m 38s | | the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 |
| +1 💚 | javac | 0m 38s | | the patch passed |
| +1 💚 | compile | 0m 34s | | the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| +1 💚 | javac | 0m 34s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | checkstyle | 0m 29s | | the patch passed |
| +1 💚 | mvnsite | 0m 37s | | the patch passed |
| +1 💚 | javadoc | 0m 34s | | the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 |
| +1 💚 | javadoc | 1m 1s | | the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| +1 💚 | spotbugs | 1m 40s | | the patch passed |
| +1 💚 | shadedclient | 21m 45s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 117m 28s | | hadoop-hdfs in the patch passed. |
| +1 💚 | asflicense | 0m 30s | | The patch does not generate ASF License warnings. |
| | | 209m 14s | | |

| Subsystem | Report/Notes |
|-----------|--------------|
| Docker | ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7876/2/artifact/out/Dockerfile |
| GITHUB PR | #7876 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 3f6ba28b028c 5.15.0-143-generic #153-Ubuntu SMP Fri Jun 13 19:10:45 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 867b137 |
| Default Java | Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7876/2/testReport/ |
| Max. process+thread count | 4084 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7876/2/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
