Skip to content

Conversation

@rsalagame-nvidia
Copy link

  • Fix virtual_pipeline_model_parallel_size check from '!= -1' to 'is not None'
  • Update vp default value from -1 to None for consistency
  • Add checkpoint save/load configuration handling
  • Enable save_dir, load_dir, save_interval, most_recent_k options
  • Wire up checkpoint arguments to recipe configuration

Manually applied changes from commits:

- Fix virtual_pipeline_model_parallel_size check from '!= -1' to 'is not None'
- Update vp default value from -1 to None for consistency
- Add checkpoint save/load configuration handling
- Enable save_dir, load_dir, save_interval, most_recent_k options
- Wire up checkpoint arguments to recipe configuration

Manually applied changes from commits:
- 5e78174 (vp check fix)
- 4c9881f (vp default value)
- 853348f/b25d66d3 (checkpoint functionality)
@copy-pr-bot
Copy link

copy-pr-bot bot commented Jan 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Rahul Salagame added 7 commits January 21, 2026 17:56
- Add nsys_trace=["cuda"] for CUDA tracing
- Add nsys_extra_args with detailed profiling options:
  - Force overwrite existing profiles
  - Capture range using CUDA profiler API
  - Enable CUDA graph tracing at node level
  - Disable CUDA event trace to reduce overhead
  - Include NCCL domain in NVTX traces

Partially adapted from commit a514e44
- Profile all ranks by default when profiling_ranks is None
- Maintains user override capability via --profiling_ranks argument
- Matches behavior from original commit a514e44
- Use default directory /nemo_run/code/nemo_experiments/default/checkpoints
- Only when save_interval is set but save_dir is not specified
- Matches behavior from original checkpoint implementation
- Checkpoint save enabled by default to /nemo_run/code/nemo_experiments/default/checkpoints
- Save interval defaults to train_iters (checkpoint at end of training)
- Users can override via --save_dir and --save_interval arguments
- Removes checkpoint.save = None from common perf overrides
- Remove conditional logic for profiling_ranks
- Always profile all GPUs: profile_ranks=list(range(num_gpus))
- Simplifies configuration as all ranks should always be profiled
- When --save_interval is provided, enable checkpointing with default or specified save_dir
- Default save_dir to /nemo_run/code/nemo_experiments/default/checkpoints when not specified
- Default save_interval to train_iters when only save_dir is provided
- Removed manual train_iters=1 setting for checkpoint loading (handled by --max_steps from bash script)
- Aligns with ENABLE_CHECKPOINT behavior in bash wrapper script
@rsalagame-nvidia rsalagame-nvidia added performance performance/release Performance items related with NeMo release labels Jan 22, 2026
@rsalagame-nvidia rsalagame-nvidia added this to the 26.02 milestone Jan 22, 2026
@rsalagame-nvidia rsalagame-nvidia self-assigned this Jan 22, 2026
recipe.checkpoint.save = args.save_dir
logger.info(f"Checkpoint save directory set to: {args.save_dir}")
else:
recipe.checkpoint.save = "/nemo_run/code/nemo_experiments/default/checkpoints"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
recipe.checkpoint.save = "/nemo_run/code/nemo_experiments/default/checkpoints"
recipe.checkpoint.save = "/nemo_run/checkpoints"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malay-nagda We want the default location of the checkpoints to be under the code/nemo_experiments/default folder. This is to keep consistent with other workloads (ex: nemotron4, etc...).

Rahul Salagame added 3 commits January 23, 2026 20:27
- Change default from None to -1 in argument_parser.py
- Change condition from hasattr() to != -1 in overrides.py
- Allows users to explicitly set vp to None when needed
- Distinguishes between 'not provided' and 'explicitly None'
- Created _set_checkpoint_overrides() to handle all checkpoint save/load logic
- Simplified set_user_overrides() by calling the new method
- Improves code organization and reusability
- Consistent with other helper methods like _set_megatron_fsdp_overrides()
- Add --nsys_trace CLI arg to customize traced events (defaults to nemo_run defaults)
- Add --nsys_extra_args CLI arg to provide additional nsys arguments
- Fix profile_ranks to use --profiling_ranks from CLI instead of hardcoding to all ranks
- Update NsysPlugin to combine default extra args with user-provided args
- Preserve default behavior: profiles rank 0 only when --profiling_ranks not specified

This fixes the issue where --profiling_ranks was being ignored and all ranks
were always profiled, and allows users to customize nsys trace and extra args.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance/release Performance items related with NeMo release performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants