HDFS-17821. Fix the SNN repeatedly checkpoint after fsimage transfer failure on one of the multiple NNs #7876

lfxy · 2025-08-16T16:33:54Z

In our cluster with observer NNs, when the standby NN (SNN) performs a checkpoint and sends the fsimage to the other NNs, a failed transfer to any one NN (due to a network anomaly, an NN restart, or another exception) causes the standby to treat the whole checkpoint as failed. It does not update its lastCheckpointTime and retries the checkpoint.
However, the active or observer NNs that successfully received the fsimage have already updated their lastCheckpointTime, while the NN whose transfer failed has not, so lastCheckpointTime becomes inconsistent across the NNs. Subsequent checkpoints then repeatedly fail to send the fsimage to some or all of the active or observer NNs, because those NNs do not yet satisfy the DFS_NAMENODE_CHECKPOINT_PERIOD_KEY condition.
As a result, the SNN keeps failing the checkpoint and retrying. I think the SNN should consider the checkpoint successful and update its lastCheckpointTime if the fsimage transfer succeeds on at least half of the NNs.
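
To illustrate the proposed rule, here is a minimal sketch, not the code in this patch. The class `CheckpointQuorumSketch` and the method `isCheckpointSuccessful` are hypothetical names, and representing each upload result as a `Boolean` is an assumption made only for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the "at least half" rule; not taken from the patch. */
public class CheckpointQuorumSketch {

    /**
     * Returns true when at least half of the remote NNs accepted the fsimage.
     * Each list element is the (assumed) outcome of one upload attempt.
     */
    public static boolean isCheckpointSuccessful(List<Boolean> uploadResults) {
        if (uploadResults.isEmpty()) {
            throw new IllegalArgumentException("no upload targets");
        }
        long successes = uploadResults.stream().filter(ok -> ok).count();
        // Integer arithmetic avoids rounding: successes >= size / 2.
        return successes * 2 >= uploadResults.size();
    }

    public static void main(String[] args) {
        // Two of three remote NNs accepted the image, one transfer failed.
        List<Boolean> results = Arrays.asList(true, true, false);
        boolean accepted = isCheckpointSuccessful(results);
        // In the proposal, this is the point where the SNN would update
        // its lastCheckpointTime instead of retrying the checkpoint.
        System.out.println("checkpoint accepted: " + accepted);
    }
}
```

With this rule, a single failed transfer out of three targets no longer forces a retry, while a checkpoint that reaches fewer than half of the NNs is still treated as failed.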

hadoop-yetus · 2025-08-17T18:50:39Z

🎊 +1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	9m 45s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 0s		codespell was not available.
+0 🆗	detsecrets	0m 0s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
+1 💚	test4tests	0m 0s		The patch appears to include 1 new or modified test files.
			_ trunk Compile Tests _
+1 💚	mvninstall	25m 47s		trunk passed
+1 💚	compile	0m 44s		trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚	compile	0m 39s		trunk passed with JDK Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
+1 💚	checkstyle	0m 39s		trunk passed
+1 💚	mvnsite	0m 44s		trunk passed
+1 💚	javadoc	0m 43s		trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚	javadoc	1m 7s		trunk passed with JDK Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
+1 💚	spotbugs	1m 46s		trunk passed
+1 💚	shadedclient	22m 4s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	0m 38s		the patch passed
+1 💚	compile	0m 38s		the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚	javac	0m 38s		the patch passed
+1 💚	compile	0m 34s		the patch passed with JDK Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
+1 💚	javac	0m 34s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 29s		the patch passed
+1 💚	mvnsite	0m 37s		the patch passed
+1 💚	javadoc	0m 34s		the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚	javadoc	1m 1s		the patch passed with JDK Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
+1 💚	spotbugs	1m 40s		the patch passed
+1 💚	shadedclient	21m 45s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
+1 💚	unit	117m 28s		hadoop-hdfs in the patch passed.
+1 💚	asflicense	0m 30s		The patch does not generate ASF License warnings.
		209m 14s

Subsystem	Report/Notes
Docker	ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7876/2/artifact/out/Dockerfile
GITHUB PR	#7876
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname	Linux 3f6ba28b028c 5.15.0-143-generic #153-Ubuntu SMP Fri Jun 13 19:10:45 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `867b137`
Default Java	Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
Multi-JDK versions	/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-ga~~us1-0ubuntu1~~20.04-b09
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7876/2/testReport/
Max. process+thread count	4084 (vs. ulimit of 5500)
modules	C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7876/2/console
versions	git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by	Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

github-actions bot added HDFS trunk labels Aug 16, 2025

Fix the SNN repeatedly checkpoint after fsimage transfer failure on o…

867b137

…ne of the multiple NNs

lfxy force-pushed the feature/HDFS-17821 branch from 3bf0ae1 to 867b137 Compare August 17, 2025 15:20
