Fix vp check and add checkpoint configuration support #2020

rsalagame-nvidia · 2026-01-22T00:49:42Z

Fix virtual_pipeline_model_parallel_size check from '!= -1' to 'is not None'
Update vp default value from -1 to None for consistency
Add checkpoint save/load configuration handling
Enable save_dir, load_dir, save_interval, most_recent_k options
Wire up checkpoint arguments to recipe configuration

Manually applied changes from commits:

5e78174 (vp check fix)
4c9881f (vp default value)
853348f/b25d66d3 (checkpoint functionality)

- Fix virtual_pipeline_model_parallel_size check from '!= -1' to 'is not None' - Update vp default value from -1 to None for consistency - Add checkpoint save/load configuration handling - Enable save_dir, load_dir, save_interval, most_recent_k options - Wire up checkpoint arguments to recipe configuration Manually applied changes from commits: - 5e78174 (vp check fix) - 4c9881f (vp default value) - 853348f/b25d66d3 (checkpoint functionality)

copy-pr-bot · 2026-01-22T00:49:45Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- Add nsys_trace=["cuda"] for CUDA tracing - Add nsys_extra_args with detailed profiling options: - Force overwrite existing profiles - Capture range using CUDA profiler API - Enable CUDA graph tracing at node level - Disable CUDA event trace to reduce overhead - Include NCCL domain in NVTX traces Partially adapted from commit a514e44

- Profile all ranks by default when profiling_ranks is None - Maintains user override capability via --profiling_ranks argument - Matches behavior from original commit a514e44

- Use default directory /nemo_run/code/nemo_experiments/default/checkpoints - Only when save_interval is set but save_dir is not specified - Matches behavior from original checkpoint implementation

- Checkpoint save enabled by default to /nemo_run/code/nemo_experiments/default/checkpoints - Save interval defaults to train_iters (checkpoint at end of training) - Users can override via --save_dir and --save_interval arguments - Removes checkpoint.save = None from common perf overrides

- Remove conditional logic for profiling_ranks - Always profile all GPUs: profile_ranks=list(range(num_gpus)) - Simplifies configuration as all ranks should always be profiled

…r is provided

- When --save_interval is provided, enable checkpointing with default or specified save_dir - Default save_dir to /nemo_run/code/nemo_experiments/default/checkpoints when not specified - Default save_interval to train_iters when only save_dir is provided - Removed manual train_iters=1 setting for checkpoint loading (handled by --max_steps from bash script) - Aligns with ENABLE_CHECKPOINT behavior in bash wrapper script

scripts/performance/utils/overrides.py

malay-nagda · 2026-01-22T08:46:50Z

scripts/performance/utils/overrides.py

+            recipe.checkpoint.save = args.save_dir
+            logger.info(f"Checkpoint save directory set to: {args.save_dir}")
+        else:
+            recipe.checkpoint.save = "/nemo_run/code/nemo_experiments/default/checkpoints"


Suggested change

recipe.checkpoint.save = "/nemo_run/code/nemo_experiments/default/checkpoints"

recipe.checkpoint.save = "/nemo_run/checkpoints"

@malay-nagda We want the default location of the checkpoints to be under the code/nemo_experiments/default folder. This is to keep consistent with other workloads (ex: nemotron4, etc...).

scripts/performance/utils/overrides.py

scripts/performance/setup_experiment.py

- Change default from None to -1 in argument_parser.py - Change condition from hasattr() to != -1 in overrides.py - Allows users to explicitly set vp to None when needed - Distinguishes between 'not provided' and 'explicitly None'

- Created _set_checkpoint_overrides() to handle all checkpoint save/load logic - Simplified set_user_overrides() by calling the new method - Improves code organization and reusability - Consistent with other helper methods like _set_megatron_fsdp_overrides()

- Add --nsys_trace CLI arg to customize traced events (defaults to nemo_run defaults) - Add --nsys_extra_args CLI arg to provide additional nsys arguments - Fix profile_ranks to use --profiling_ranks from CLI instead of hardcoding to all ranks - Update NsysPlugin to combine default extra args with user-provided args - Preserve default behavior: profiles rank 0 only when --profiling_ranks not specified This fixes the issue where --profiling_ranks was being ignored and all ranks were always profiled, and allows users to customize nsys trace and extra args.

rsalagame-nvidia requested review from a team, erhoo82 and malay-nagda as code owners January 22, 2026 00:49

Rahul Salagame added 7 commits January 21, 2026 17:56

Fix nsys profiling to default to all ranks

01dc695

- Profile all ranks by default when profiling_ranks is None - Maintains user override capability via --profiling_ranks argument - Matches behavior from original commit a514e44

Add default checkpoint directory when save_interval is specified

e081bf1

- Use default directory /nemo_run/code/nemo_experiments/default/checkpoints - Only when save_interval is set but save_dir is not specified - Matches behavior from original checkpoint implementation

Simplify nsys profiling to always profile all ranks

0a3ff91

- Remove conditional logic for profiling_ranks - Always profile all GPUs: profile_ranks=list(range(num_gpus)) - Simplifies configuration as all ranks should always be profiled

Revert default checkpointing to disabled - only enable when --save_di…

f289054

…r is provided

rsalagame-nvidia added performance performance/release Performance items related with NeMo release labels Jan 22, 2026

rsalagame-nvidia added this to the 26.02 milestone Jan 22, 2026

rsalagame-nvidia self-assigned this Jan 22, 2026

scsudhakaran reviewed Jan 22, 2026

View reviewed changes

scripts/performance/utils/overrides.py Outdated Show resolved Hide resolved

malay-nagda reviewed Jan 22, 2026

View reviewed changes

Rahul Salagame added 3 commits January 23, 2026 20:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix vp check and add checkpoint configuration support #2020

Fix vp check and add checkpoint configuration support #2020

rsalagame-nvidia commented Jan 22, 2026

Uh oh!

copy-pr-bot bot commented Jan 22, 2026

Uh oh!

Uh oh!

malay-nagda Jan 22, 2026

Uh oh!

rsalagame-nvidia Jan 24, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	recipe.checkpoint.save = "/nemo_run/code/nemo_experiments/default/checkpoints"
	recipe.checkpoint.save = "/nemo_run/checkpoints"

Fix vp check and add checkpoint configuration support #2020

Are you sure you want to change the base?

Fix vp check and add checkpoint configuration support #2020

Conversation

rsalagame-nvidia commented Jan 22, 2026

Uh oh!

copy-pr-bot bot commented Jan 22, 2026

Uh oh!

Uh oh!

malay-nagda Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

rsalagame-nvidia Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants