-
Notifications
You must be signed in to change notification settings - Fork 148
Fix vp check and add checkpoint configuration support #2020
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Fix virtual_pipeline_model_parallel_size check from '!= -1' to 'is not None' - Update vp default value from -1 to None for consistency - Add checkpoint save/load configuration handling - Enable save_dir, load_dir, save_interval, most_recent_k options - Wire up checkpoint arguments to recipe configuration Manually applied changes from commits: - 5e78174 (vp check fix) - 4c9881f (vp default value) - 853348f/b25d66d3 (checkpoint functionality)
- Add nsys_trace=["cuda"] for CUDA tracing - Add nsys_extra_args with detailed profiling options: - Force overwrite existing profiles - Capture range using CUDA profiler API - Enable CUDA graph tracing at node level - Disable CUDA event trace to reduce overhead - Include NCCL domain in NVTX traces Partially adapted from commit a514e44
- Profile all ranks by default when profiling_ranks is None - Maintains user override capability via --profiling_ranks argument - Matches behavior from original commit a514e44
- Use default directory /nemo_run/code/nemo_experiments/default/checkpoints - Only when save_interval is set but save_dir is not specified - Matches behavior from original checkpoint implementation
- Checkpoint save enabled by default to /nemo_run/code/nemo_experiments/default/checkpoints - Save interval defaults to train_iters (checkpoint at end of training) - Users can override via --save_dir and --save_interval arguments - Removes checkpoint.save = None from common perf overrides
- Remove conditional logic for profiling_ranks - Always profile all GPUs: profile_ranks=list(range(num_gpus)) - Simplifies configuration as all ranks should always be profiled
- When --save_interval is provided, enable checkpointing with default or specified save_dir - Default save_dir to /nemo_run/code/nemo_experiments/default/checkpoints when not specified - Default save_interval to train_iters when only save_dir is provided - Removed manual train_iters=1 setting for checkpoint loading (handled by --max_steps from bash script) - Aligns with ENABLE_CHECKPOINT behavior in bash wrapper script
| recipe.checkpoint.save = args.save_dir | ||
| logger.info(f"Checkpoint save directory set to: {args.save_dir}") | ||
| else: | ||
| recipe.checkpoint.save = "/nemo_run/code/nemo_experiments/default/checkpoints" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| recipe.checkpoint.save = "/nemo_run/code/nemo_experiments/default/checkpoints" | |
| recipe.checkpoint.save = "/nemo_run/checkpoints" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@malay-nagda We want the default location of the checkpoints to be under the code/nemo_experiments/default folder. This is to keep consistent with other workloads (ex: nemotron4, etc...).
- Change default from None to -1 in argument_parser.py - Change condition from hasattr() to != -1 in overrides.py - Allows users to explicitly set vp to None when needed - Distinguishes between 'not provided' and 'explicitly None'
- Created _set_checkpoint_overrides() to handle all checkpoint save/load logic - Simplified set_user_overrides() by calling the new method - Improves code organization and reusability - Consistent with other helper methods like _set_megatron_fsdp_overrides()
- Add --nsys_trace CLI arg to customize traced events (defaults to nemo_run defaults) - Add --nsys_extra_args CLI arg to provide additional nsys arguments - Fix profile_ranks to use --profiling_ranks from CLI instead of hardcoding to all ranks - Update NsysPlugin to combine default extra args with user-provided args - Preserve default behavior: profiles rank 0 only when --profiling_ranks not specified This fixes the issue where --profiling_ranks was being ignored and all ranks were always profiled, and allows users to customize nsys trace and extra args.
Manually applied changes from commits: