@ansschh ansschh commented Oct 3, 2025

Adds unit tests for the torch distributed initialization utilities in gpt_oss/torch/utils.py.

Tests Added:

  • suppress_output for different ranks
  • init_distributed with single/multi-GPU
  • CUDA availability and device validation
  • NCCL warmup and error handling
  • Cleanup on exceptions

ansschh added 2 commits August 9, 2025 11:25
- Add return type hint to get_tokenizer()
- Add type hints and checkpoint validation to generate.py main()
- Add parameter type hints to suppress_output() in torch/utils.py

Improves IDE support and catches potential bugs early.
- Test suppress_output for different ranks
- Test init_distributed with single/multi-GPU
- Test CUDA availability and device validation
- Test NCCL warmup and error handling
- Test cleanup on exceptions

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +52 to +55
mock_torch.cuda.is_available.return_value = False

with pytest.raises(RuntimeError, match="CUDA is not available"):
init_distributed()


P1: Align torch utils tests with actual init_distributed behavior

The new tests around init_distributed expect the helper to raise descriptive RuntimeErrors when CUDA is unavailable and to perform cleanup on various failures (e.g. rank out of range, NCCL warm‑up errors). However, gpt_oss/torch/utils.py:init_distributed does not call torch.cuda.is_available, check torch.cuda.device_count, or wrap its warm‑up code in try/except, so it will happily return even when CUDA is mocked as unavailable and will never invoke dist.destroy_process_group. As written, test_init_distributed_cuda_not_available_raises (and the similar tests that follow) will fail because the function under test has no such error handling. Either extend init_distributed to implement these safety checks and cleanup logic or relax the assertions in these tests.
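One way to resolve the mismatch is to extend the helper with the checks and cleanup the tests expect. A hedged sketch follows; the name `init_distributed_checked` and the injected `torch`/`dist` parameters are hypothetical (the real init_distributed takes no such arguments and uses module-level imports), the injection exists only so the sketch can be driven with mocks:

```python
import os

def init_distributed_checked(torch, dist):
    """Sketch of init_distributed with the safety checks the new tests
    assume. `torch` and `dist` are injected for mockability; the actual
    helper in gpt_oss/torch/utils.py imports them at module level."""
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if rank >= torch.cuda.device_count():
        raise RuntimeError(
            f"rank {rank} out of range: only "
            f"{torch.cuda.device_count()} CUDA device(s) visible"
        )
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    try:
        # NCCL warm-up: a tiny all_reduce so communicator failures surface now
        dist.all_reduce(torch.ones(1, device=f"cuda:{rank}"))
    except Exception:
        # cleanup on exceptions, as test_init_distributed_* asserts
        dist.destroy_process_group()
        raise
    return torch.device(f"cuda:{rank}")
```

With this shape, a mock whose `cuda.is_available` returns False now produces the "CUDA is not available" RuntimeError the quoted test matches on, and a warm-up failure triggers `dist.destroy_process_group` before re-raising.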

