@aroshanghias-nvd aroshanghias-nvd commented Jan 20, 2026

What does this PR do?

Implements Phase 2 of MIMO upstreaming: adds DDP wrapping utilities, embedding group support for PP > 1, and improved validation for heterogeneous deployment.

⚠️ Dependency: This PR depends on #2040 (Phase 1: unified MimoModelProvider with ModuleSpec-based API). Please merge #2040 first.

Changelog

  • Add wrap_mimo_model_distributed() for rank-aware DDP wrapping of MIMO submodules
  • Add embedding group helpers to mimo_builder.py:
    • populate_embedding_and_position_groups() for PP > 1 support
    • is_pp_first_stage() / is_pp_last_stage() helpers
    • is_current_rank_in_grid() for rank participation checks
  • Improve gap detection in MimoParallelismConfig._validate_heterogeneous():
    • Error for gaps between modules (likely misconfiguration)
    • Warning for leading unused ranks (could be intentional)
  • Extend _get_pg_collections_from_grids() to populate pos_embd and embd process groups
  • Add comprehensive unit tests for all new functionality
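
The gap-detection rule in the changelog (error for gaps between modules, warning for leading unused ranks) can be illustrated with a small standalone sketch. This is not the code of `MimoParallelismConfig._validate_heterogeneous()`; the `ModuleRanks` dataclass, its field names, and `validate_rank_layout` are hypothetical stand-ins, and the sketch assumes each module occupies one contiguous block of global ranks.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleRanks:
    """Hypothetical stand-in for one module's rank allocation."""

    name: str
    rank_offset: int  # first global rank used by the module (assumed field)
    num_ranks: int    # number of consecutive ranks it occupies (assumed field)


def validate_rank_layout(modules):
    """Raise on gaps between modules; warn on leading unused ranks.

    Overlapping allocations are deliberately not checked in this sketch.
    """
    if not modules:
        return
    ordered = sorted(modules, key=lambda m: m.rank_offset)

    # Leading unused ranks may be intentional, so only warn.
    first = ordered[0]
    if first.rank_offset > 0:
        warnings.warn(
            f"Ranks 0..{first.rank_offset - 1} are not assigned to any module."
        )

    # A gap between two modules is treated as a likely misconfiguration.
    for prev, cur in zip(ordered, ordered[1:]):
        prev_end = prev.rank_offset + prev.num_ranks
        if cur.rank_offset > prev_end:
            raise ValueError(
                f"Unused ranks {prev_end}..{cur.rank_offset - 1} "
                f"between '{prev.name}' and '{cur.name}'."
            )
```

With this sketch, a layout such as `[ModuleRanks("vision", 2, 2), ModuleRanks("llm", 4, 4)]` only warns about ranks 0..1, while `[ModuleRanks("vision", 0, 2), ModuleRanks("llm", 4, 4)]` raises because ranks 2..3 sit unused between the two modules.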

Files Changed

| File | Change |
| --- | --- |
| models/mimo/mimo_builder.py | Added embedding group and rank participation helpers |
| models/mimo/mimo_provider.py | Imports helpers from mimo_builder, populates embedding groups in pg_collections |
| training/mimo_config.py | Improved gap detection (error for middle gaps, warning for leading) |
| training/mimo_ddp.py | New file: DDP wrapping utilities for MIMO models |
| tests/ | Added tests for embedding groups, gap detection, and DDP wrapping |
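
As a rough illustration of what the embedding group and rank participation helpers are for, here is a minimal pure-Python sketch. It does not reproduce the signatures in `mimo_builder.py`, which this page does not show; `GridSketch`, `rank_in_grid`, `first_stage`, `last_stage`, and `embedding_and_position_ranks` are hypothetical names. It also assumes the usual Megatron-style convention that the embedding group spans the first and last pipeline stages while the position-embedding group contains only the first stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridSketch:
    """Hypothetical stand-in for a module's rank grid (contiguous ranks)."""

    rank_offset: int  # first global rank in the grid (assumed field)
    size: int         # number of ranks in the grid (assumed field)


def rank_in_grid(grid, rank):
    """True if the global rank participates in this module's grid."""
    return grid.rank_offset <= rank < grid.rank_offset + grid.size


def first_stage(pp_ranks, rank):
    """True if rank is the first stage of its pipeline-parallel group."""
    return rank == pp_ranks[0]


def last_stage(pp_ranks, rank):
    """True if rank is the last stage of its pipeline-parallel group."""
    return rank == pp_ranks[-1]


def embedding_and_position_ranks(pp_ranks):
    """Return (embedding ranks, position-embedding ranks) for one PP group.

    Assumed convention: embeddings live on the first and last stages,
    position embeddings only on the first stage.
    """
    position = [pp_ranks[0]]
    if len(pp_ranks) > 1:
        embedding = [pp_ranks[0], pp_ranks[-1]]
    else:
        embedding = [pp_ranks[0]]
    return embedding, position
```

For a pipeline group on global ranks `[4, 5, 6, 7]`, the sketch places ranks 4 and 7 in the embedding group and rank 4 alone in the position-embedding group; with PP = 1 both groups collapse to the single rank.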

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • N/A - no new optional dependencies

Additional Information

copy-pr-bot bot commented Jan 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Ali Roshan Ghias added 2 commits January 22, 2026 15:33
Remove EncoderProvider abstraction in favor of passing ModuleSpecs directly
to MimoModelProvider. Rename MIMOConfig to MimoParallelismConfig for clarity.

Key changes:
- Add MimoModelProvider that accepts language_model_spec and modality_submodules_spec
- Remove EncoderProvider, EncoderTransformerConfig, GenericVisionEncoderProvider
- Rename MIMOConfig to MimoParallelismConfig (cfg.mimo field name unchanged)
- Update build_hypercomm_grids and build_colocated_comm_config signatures
- Fix ColocatedCommConfig logic to apply to both "colocated" and "homogeneous" modes
- Fix import: ProcessGroupCollection is in megatron.core.process_groups_config
- Add LlavaMimoProvider convenience subclass
- Remove obsolete encoder_providers validation from ConfigContainer
- Add comprehensive unit tests for MimoModelProvider
- Update and fix test_mimo_config.py (remove encoder_provider tests)

All MIMO tests passing (5/5 config tests, provider tests).

Move MIMO grid/topology helpers into the mimo package so model construction stays self-contained, and
remove unused MIMO convenience accessors while updating the config union.

from megatron.bridge.training.mimo_config import MimoParallelismConfig


def wrap_mimo_model_distributed(

Maybe move this to the mimo folder?

Ali Roshan Ghias added 2 commits January 26, 2026 10:30
- Remove unused get_module_data_parallel_size from ConfigContainer
- Move mimo_config.py from training/ to models/mimo/ for better cohesion
- Export MimoParallelismConfig and ModuleParallelismConfig from mimo __init__

2 participants