feat: implement gpu health check script, without organizing code structure #764

Eric-Wei-10 · 2025-09-03T09:41:51Z

Implement a GPU health check script, including:

Testing communication within tp/pp/dp subgroups.
Testing GPU hardware information.
Testing GPU calculation functionality.
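
For readers less familiar with the tp/pp/dp terminology, here is a minimal sketch of how a flat list of ranks might be split into tensor-, data-, and pipeline-parallel subgroups. The rank layout (tensor-parallel ranks contiguous, then data-parallel, then pipeline-parallel) and the name `build_parallel_groups` are assumptions for illustration only; the script in this PR may build its groups differently.

```python
def build_parallel_groups(world_size, tp_size, pp_size):
    """Partition ranks 0..world_size-1 into tp, dp and pp subgroups.

    Assumed layout (hypothetical, not taken from the PR):
        rank = (pp_idx * dp_size + dp_idx) * tp_size + tp_idx
    """
    if world_size % (tp_size * pp_size) != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    dp_size = world_size // (tp_size * pp_size)

    def rank_of(pp_idx, dp_idx, tp_idx):
        return (pp_idx * dp_size + dp_idx) * tp_size + tp_idx

    # Each group holds the ranks that differ only in one coordinate.
    tp_groups = [[rank_of(p, d, t) for t in range(tp_size)]
                 for p in range(pp_size) for d in range(dp_size)]
    dp_groups = [[rank_of(p, d, t) for d in range(dp_size)]
                 for p in range(pp_size) for t in range(tp_size)]
    pp_groups = [[rank_of(p, d, t) for p in range(pp_size)]
                 for d in range(dp_size) for t in range(tp_size)]
    return {"tp": tp_groups, "dp": dp_groups, "pp": pp_groups}


groups = build_parallel_groups(world_size=8, tp_size=2, pp_size=2)
print(groups["tp"])  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(groups["dp"])  # [[0, 2], [1, 3], [4, 6], [5, 7]]
print(groups["pp"])  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

A communication test would then run a collective inside each of these subgroups, so a failure can be narrowed down to the ranks that share a group.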

aoyulong · 2025-09-05T01:14:31Z

@gemini-code-assist, please review this PR in detail.

gemini-code-assist

Code Review

This pull request introduces a GPU health check script. The script is well-structured and covers communication, hardware, and calculation tests. I've identified a few areas for improvement, including removing duplicated code, correcting a misleading help message and an error message, and refactoring a function to reduce code duplication. These changes will improve the script's clarity and maintainability.

gemini-code-assist · 2025-09-05T01:15:37Z

gpu_health_check.py

+    rank = dist.get_rank()
+    device = torch.device(f'cuda:{rank % torch.cuda.device_count()}')


These lines are duplicates of lines 272-273 and can be removed to avoid redundancy.

gemini-code-assist · 2025-09-05T01:15:37Z

gpu_health_check.py

+        '--tensor-model-parallel-size',
+        type=int,
+        default=1,
+        help='Degree of pipeline model parallelism.',


The help message for --tensor-model-parallel-size is incorrect and seems to be a copy-paste from --pipeline-model-parallel-size. It should describe tensor model parallelism to avoid confusion.

Suggested change

-        help='Degree of pipeline model parallelism.',
+        help='Degree of tensor model parallelism.',
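
To show the two options side by side with matching help strings, here is a small sketch. The `build_parser` function, the parser description, and the default of 1 for `--pipeline-model-parallel-size` are assumptions for illustration; only the two flag names and the tensor-parallel default come from this review.

```python
import argparse


def build_parser():
    """Hypothetical parser holding just the two parallelism options."""
    parser = argparse.ArgumentParser(description="GPU health check (illustrative)")
    parser.add_argument(
        '--tensor-model-parallel-size',
        type=int,
        default=1,
        help='Degree of tensor model parallelism.',
    )
    parser.add_argument(
        '--pipeline-model-parallel-size',
        type=int,
        default=1,  # assumed default, not confirmed by the PR
        help='Degree of pipeline model parallelism.',
    )
    return parser


# argparse converts dashes in option names to underscores on the namespace.
args = build_parser().parse_args(['--tensor-model-parallel-size', '2'])
print(args.tensor_model_parallel_size, args.pipeline_model_parallel_size)  # 2 1
```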

gemini-code-assist · 2025-09-05T01:15:38Z

gpu_health_check.py

+        result2 = torch.inverse(c)
+
+        if torch.any(torch.isnan(result1)) or torch.any(torch.isnan(result2)):
+            print(f"test_calculation_float failed: nan detected in iteration {iteration}")


The error message here incorrectly states that test_calculation_float failed. It should be test_calculation_endurance to provide accurate information for debugging.

Suggested change

-            print(f"test_calculation_float failed: nan detected in iteration {iteration}")
+            print(f"test_calculation_endurance failed: nan detected in iteration {iteration}")
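
One possible way to keep this kind of copy-paste mismatch from recurring is to build the message from the function object rather than a hand-typed name. The sketch below is only an illustration: `nan_failure_message` and `demo_calculation_endurance` are hypothetical names, and the stand-in check uses plain Python floats instead of the tensors in the PR.

```python
import math


def nan_failure_message(test_func, iteration):
    """Build the failure text from the function's own name."""
    return f"{test_func.__name__} failed: nan detected in iteration {iteration}"


def demo_calculation_endurance(values=(1.0, float("nan"))):
    """Stand-in for an endurance test: fail on the first NaN value."""
    for iteration, value in enumerate(values):
        if math.isnan(value):
            print(nan_failure_message(demo_calculation_endurance, iteration))
            return False
    return True


demo_calculation_endurance()
# prints: demo_calculation_endurance failed: nan detected in iteration 1
```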

gemini-code-assist · 2025-09-05T01:15:38Z

gpu_health_check.py

+    args = get_args()
+
+    result = test_calculation_float()
+    result_tensor = torch.zeros(args.world_size).cuda()
+    result_tensor[args.rank] = 1.0 if result else 0.0
+    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
+    if args.rank == 0:
+        check_test_result("test_calculation_float", result_tensor)
+
+    result = test_calculation_double()
+    result_tensor = torch.zeros(args.world_size).cuda()
+    result_tensor[args.rank] = 1.0 if result else 0.0
+    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
+    if args.rank == 0:
+        check_test_result("test_calculation_double", result_tensor)
+
+    result = test_calculation_half()
+    result_tensor = torch.zeros(args.world_size).cuda()
+    result_tensor[args.rank] = 1.0 if result else 0.0
+    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
+    if args.rank == 0:
+        check_test_result("test_calculation_half", result_tensor)
+
+    result = test_calculation_endurance()
+    result_tensor = torch.zeros(args.world_size).cuda()
+    result_tensor[args.rank] = 1.0 if result else 0.0
+    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
+    if args.rank == 0:
+        check_test_result("test_calculation_endurance", result_tensor)


The logic for running each calculation test and checking the result is repeated four times. This can be refactored into a loop to reduce code duplication and improve maintainability.

Suggested change

-    args = get_args()
-
-    result = test_calculation_float()
-    result_tensor = torch.zeros(args.world_size).cuda()
-    result_tensor[args.rank] = 1.0 if result else 0.0
-    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
-    if args.rank == 0:
-        check_test_result("test_calculation_float", result_tensor)
-
-    result = test_calculation_double()
-    result_tensor = torch.zeros(args.world_size).cuda()
-    result_tensor[args.rank] = 1.0 if result else 0.0
-    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
-    if args.rank == 0:
-        check_test_result("test_calculation_double", result_tensor)
-
-    result = test_calculation_half()
-    result_tensor = torch.zeros(args.world_size).cuda()
-    result_tensor[args.rank] = 1.0 if result else 0.0
-    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
-    if args.rank == 0:
-        check_test_result("test_calculation_half", result_tensor)
-
-    result = test_calculation_endurance()
-    result_tensor = torch.zeros(args.world_size).cuda()
-    result_tensor[args.rank] = 1.0 if result else 0.0
-    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
-    if args.rank == 0:
-        check_test_result("test_calculation_endurance", result_tensor)
+    args = get_args()
+
+    calculation_tests = {
+        "test_calculation_float": test_calculation_float,
+        "test_calculation_double": test_calculation_double,
+        "test_calculation_half": test_calculation_half,
+        "test_calculation_endurance": test_calculation_endurance,
+    }
+
+    for test_name, test_func in calculation_tests.items():
+        result = test_func()
+        result_tensor = torch.zeros(args.world_size).cuda()
+        result_tensor[args.rank] = 1.0 if result else 0.0
+        dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
+        if args.rank == 0:
+            check_test_result(test_name, result_tensor)
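
To make the aggregation step concrete: each rank writes its pass/fail flag into its own slot of an otherwise zero tensor, so a SUM all-reduce leaves every rank holding the full per-rank result vector. The sketch below simulates this with plain Python lists and no GPU. All function names are hypothetical, and `failed_ranks` is only a guess at what a check such as `check_test_result` might do, since its body is not shown in this review.

```python
def local_result_tensor(rank, world_size, passed):
    """What one rank contributes: zeros everywhere except its own slot."""
    tensor = [0.0] * world_size
    tensor[rank] = 1.0 if passed else 0.0
    return tensor


def simulate_all_reduce_sum(per_rank_tensors):
    """Element-wise sum across ranks, mimicking a SUM all-reduce."""
    return [sum(column) for column in zip(*per_rank_tensors)]


def failed_ranks(reduced):
    """Ranks whose slot is not 1.0 did not report a pass."""
    return [rank for rank, flag in enumerate(reduced) if flag != 1.0]


world_size = 4
local_results = [True, True, False, True]  # pretend rank 2 failed its test
tensors = [local_result_tensor(r, world_size, ok)
           for r, ok in enumerate(local_results)]
reduced = simulate_all_reduce_sum(tensors)
print(reduced)                # [1.0, 1.0, 0.0, 1.0]
print(failed_ranks(reduced))  # [2]
```

Because each slot is written by exactly one rank, the sum cannot mix results from different ranks, which is why rank 0 alone can report which GPUs failed.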

Implement gpu health check script, without organizing code structure

62811d7

Eric-Wei-10 requested a review from a team as a code owner September 3, 2025 09:41

gemini-code-assist bot reviewed Sep 5, 2025


