Conversation

jennifurhe (Contributor) commented on Jul 21, 2025

Purpose

FIX (partial) #18343

Test Plan

WIP: Need to set up a smaller version of Grok1 to test locally

Test Result

WIP

mergify bot commented Jul 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jennifurhe.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jul 21, 2025
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces Expert Parallelism Load Balancing (EPLB) support for the Grok-1 model. My review identified a few critical issues that could lead to runtime errors, particularly in distributed setups with pipeline parallelism. Addressing these will be crucial for the stability and correctness of the implementation.

super().__init__()
self.hidden_size = hidden_size

self.ep_group = get_ep_group().device_group

critical

The function get_ep_group() returns a torch.distributed.ProcessGroup, which does not have a device_group attribute. This will cause an AttributeError at runtime. You should use get_ep_group() directly.

Suggested change
- self.ep_group = get_ep_group().device_group
+ self.ep_group = get_ep_group()

@@ -376,7 +410,9 @@ def load_weights(self, weights: Iterable[tuple[str,
             ckpt_gate_proj_name="linear", # Grok1 specific
             ckpt_down_proj_name="linear_1", # Grok1 specific
             ckpt_up_proj_name="linear_v", # Grok1 specific
-            num_experts=num_experts)
+            num_experts=num_experts,
+            num_redundant_experts=self.num_redundant_experts,

critical

The attribute self.num_redundant_experts is used here, but it is not defined on the Grok1Model class; it is defined on Grok1ForCausalLM, while load_weights is a method of Grok1Model, so this will raise an AttributeError.

To fix this, initialize num_redundant_experts in Grok1Model.__init__ and set it as an attribute there, similar to how enable_eplb is handled.
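A minimal sketch of that pattern is below. It uses stand-in classes rather than the real vLLM config objects, and the config attribute names (enable_eplb, num_redundant_experts) are assumptions for illustration, not taken from this diff:

    # Hypothetical sketch only: stand-ins for vLLM's classes, illustrating why
    # num_redundant_experts should live on the object that owns load_weights().
    from dataclasses import dataclass

    import torch.nn as nn


    @dataclass
    class ParallelConfigStub:
        """Stand-in for the parallel config; real attribute names may differ."""
        enable_eplb: bool = False
        num_redundant_experts: int = 0


    class Grok1ModelSketch(nn.Module):

        def __init__(self, parallel_config: ParallelConfigStub):
            super().__init__()
            # Store the EPLB settings on the same object whose load_weights()
            # consumes them, mirroring how enable_eplb is handled.
            self.enable_eplb = parallel_config.enable_eplb
            self.num_redundant_experts = parallel_config.num_redundant_experts

        def load_weights(self) -> int:
            # load_weights() can now reference self.num_redundant_experts
            # instead of an attribute that only exists on Grok1ForCausalLM.
            return self.num_redundant_experts


    model = Grok1ModelSketch(ParallelConfigStub(enable_eplb=True,
                                                num_redundant_experts=2))
    assert model.load_weights() == 2

In the actual model, these values would presumably come from the parallel configuration available in Grok1Model.__init__, as the comment above suggests.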

Comment on lines +570 to +586
# All layers in Grok1 have MoE blocks.
self.expert_weights = [
    layer.moe_block.get_expert_weights() for layer in self.model.layers
]

# Set MoE hyperparameters.
self.moe_layers: list[FusedMoE] = [
    layer.moe_block.experts for layer in self.model.layers
]

example_moe = typing.cast(Grok1MoE, self.model.layers[0].moe_block)
self.num_logical_experts = example_moe.n_logical_experts
self.num_physical_experts = example_moe.n_physical_experts
self.num_local_physical_experts = example_moe.n_local_physical_experts
self.num_routed_experts = example_moe.n_routed_experts
self.num_redundant_experts = example_moe.n_redundant_experts
self.num_shared_experts = example_moe.n_shared_experts

critical

This block of code will fail when pipeline parallelism is enabled. self.model.layers can contain PPMissingLayer placeholders, which do not have a moe_block attribute. Accessing self.model.layers[0].moe_block directly is not safe.

You should filter out PPMissingLayer placeholders when building expert_weights and moe_layers (for example, by checking for a moe_block attribute or by excluding PPMissingLayer instances) and use the first remaining layer as the example for the MoE hyperparameters.

# All layers in Grok1 have MoE blocks.
# NOTE: This may be empty if no layers are on this rank.
local_layers = [
    layer for layer in self.model.layers
    if hasattr(layer, "moe_block")
]
self.expert_weights = [
    layer.moe_block.get_expert_weights() for layer in local_layers
]

# Set MoE hyperparameters.
self.moe_layers: list[FusedMoE] = [
    layer.moe_block.experts for layer in local_layers
]

if local_layers:
    example_moe = typing.cast(Grok1MoE, local_layers[0].moe_block)
    self.num_logical_experts = example_moe.n_logical_experts
    self.num_physical_experts = example_moe.n_physical_experts
    self.num_local_physical_experts = example_moe.n_local_physical_experts
    self.num_routed_experts = example_moe.n_routed_experts
    self.num_redundant_experts = example_moe.n_redundant_experts
    self.num_shared_experts = example_moe.n_shared_experts
else:
    # No layers on this rank, set defaults for the interface.
    self.num_logical_experts = 0
    self.num_physical_experts = 0
    self.num_local_physical_experts = 0
    self.num_routed_experts = 0
    self.num_redundant_experts = 0
    self.num_shared_experts = 0

self.n_logical_experts = self.n_routed_experts
self.n_physical_experts = (self.n_logical_experts +
                           self.n_redundant_experts)
self.n_local_physical_experts = self.n_physical_experts // self.ep_size

high

The number of physical experts (self.n_physical_experts) should be divisible by the expert parallelism size (self.ep_size). If it's not, this integer division will lead to an incorrect number of local experts on some ranks, which can cause issues later. It's best to add an assertion to ensure this condition is met.

Suggested change
- self.n_local_physical_experts = self.n_physical_experts // self.ep_size
+ assert self.n_physical_experts % self.ep_size == 0, \
+     "The number of physical experts must be divisible by the expert parallel size."
+ self.n_local_physical_experts = self.n_physical_experts // self.ep_size
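As a small illustration of why the check matters (the numbers below are made up for the example, not Grok-1's real configuration):

    # Illustrative numbers only: 9 physical experts cannot be split evenly
    # across 2 expert-parallel ranks.
    n_logical_experts = 8        # assumed routed-expert count
    n_redundant_experts = 1      # chosen so the total is not divisible
    ep_size = 2                  # expert-parallel world size

    n_physical_experts = n_logical_experts + n_redundant_experts  # 9
    n_local_physical_experts = n_physical_experts // ep_size      # 4 per rank

    # 2 ranks * 4 local experts = 8 slots for 9 physical experts, so one
    # expert is silently left without a home; the suggested assert surfaces
    # this misconfiguration immediately instead.
    print(n_physical_experts, n_local_physical_experts * ep_size)  # 9 8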


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀
