feat: implement gpu health check script, without organizing code structure #764
Conversation
@gemini-code-assist, please review this PR in detail.
Code Review
This pull request introduces a GPU health check script. The script is well-structured and covers communication, hardware, and calculation tests. I've identified a few areas for improvement, including removing duplicated code, correcting a misleading help message and an error message, and refactoring a function to reduce code duplication. These changes will improve the script's clarity and maintainability.
```python
rank = dist.get_rank()
device = torch.device(f'cuda:{rank % torch.cuda.device_count()}')
```
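The reviewer's comment body for this snippet did not survive extraction. For context, these lines assume an already-initialized process group; a minimal sketch of that setup follows (the NCCL backend and torchrun-style environment launch are assumptions, not taken from the PR):

```python
import torch
import torch.distributed as dist

# Rank and world size come from environment variables set by the launcher
# (e.g. torchrun); env:// is the default init method.
dist.init_process_group(backend='nccl')
rank = dist.get_rank()
# Map each rank onto a local GPU; ranks beyond the GPU count wrap around.
device = torch.device(f'cuda:{rank % torch.cuda.device_count()}')
torch.cuda.set_device(device)
```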
```python
'--tensor-model-parallel-size',
type=int,
default=1,
help='Degree of pipeline model parallelism.',
```
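The comment body here is also missing, but the review summary above flags a misleading help message, and the mismatch is visible in the snippet: the flag is `--tensor-model-parallel-size` while the help text describes pipeline parallelism. A likely fix (the exact wording is an assumption):

```python
parser.add_argument(
    '--tensor-model-parallel-size',
    type=int,
    default=1,
    # Presumed fix: this flag controls tensor parallelism, so the help
    # text should not describe pipeline parallelism.
    help='Degree of tensor model parallelism.',
)
```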
```python
result2 = torch.inverse(c)

if torch.any(torch.isnan(result1)) or torch.any(torch.isnan(result2)):
    print(f"test_calculation_float failed: nan detected in iteration {iteration}")
```
The error message here incorrectly states that test_calculation_float failed. It should be test_calculation_endurance to provide accurate information for debugging.
print(f"test_calculation_float failed: nan detected in iteration {iteration}") | |
print(f"test_calculation_endurance failed: nan detected in iteration {iteration}") |
```python
args = get_args()

result = test_calculation_float()
result_tensor = torch.zeros(args.world_size).cuda()
result_tensor[args.rank] = 1.0 if result else 0.0
dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
if args.rank == 0:
    check_test_result("test_calculation_float", result_tensor)

result = test_calculation_double()
result_tensor = torch.zeros(args.world_size).cuda()
result_tensor[args.rank] = 1.0 if result else 0.0
dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
if args.rank == 0:
    check_test_result("test_calculation_double", result_tensor)

result = test_calculation_half()
result_tensor = torch.zeros(args.world_size).cuda()
result_tensor[args.rank] = 1.0 if result else 0.0
dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
if args.rank == 0:
    check_test_result("test_calculation_half", result_tensor)

result = test_calculation_endurance()
result_tensor = torch.zeros(args.world_size).cuda()
result_tensor[args.rank] = 1.0 if result else 0.0
dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
if args.rank == 0:
    check_test_result("test_calculation_endurance", result_tensor)
```
The logic for running each calculation test and checking the result is repeated four times. This can be refactored into a loop to reduce code duplication and improve maintainability.
Suggested change (replacing the four repeated blocks above):

```python
args = get_args()
calculation_tests = {
    "test_calculation_float": test_calculation_float,
    "test_calculation_double": test_calculation_double,
    "test_calculation_half": test_calculation_half,
    "test_calculation_endurance": test_calculation_endurance,
}
for test_name, test_func in calculation_tests.items():
    result = test_func()
    result_tensor = torch.zeros(args.world_size).cuda()
    result_tensor[args.rank] = 1.0 if result else 0.0
    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
    if args.rank == 0:
        check_test_result(test_name, result_tensor)
```
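Taking the suggested refactor one step further, the all-reduce-and-report pattern could itself live in a small helper. A sketch under the same assumptions as the suggestion (`args` exposes `rank` and `world_size`, a default process group is initialized, and `check_test_result` is the script's own reporting function):

```python
import torch
import torch.distributed as dist

def gather_and_check(test_name, test_func, args):
    # Run the test locally, then gather a pass/fail flag from every rank
    # so rank 0 can report which GPUs failed.
    result = test_func()
    result_tensor = torch.zeros(args.world_size).cuda()
    result_tensor[args.rank] = 1.0 if result else 0.0
    dist.all_reduce(result_tensor, dist.ReduceOp.SUM)
    if args.rank == 0:
        # check_test_result is the PR's existing reporting helper.
        check_test_result(test_name, result_tensor)
```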
Implement a GPU health check script, including: