WIP: Add EPLB support for Grok1 #21273
base: main
Conversation
Signed-off-by: <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces Expert Parallelism Load Balancing (EPLB) support for the Grok-1 model. My review identified a few critical issues that could lead to runtime errors, particularly in distributed setups with pipeline parallelism. Addressing these will be crucial for the stability and correctness of the implementation.
super().__init__()
self.hidden_size = hidden_size

self.ep_group = get_ep_group().device_group
@@ -376,7 +410,9 @@ def load_weights(self, weights: Iterable[tuple[str,
     ckpt_gate_proj_name="linear",  # Grok1 specific
     ckpt_down_proj_name="linear_1",  # Grok1 specific
     ckpt_up_proj_name="linear_v",  # Grok1 specific
-    num_experts=num_experts)
+    num_experts=num_experts,
+    num_redundant_experts=self.num_redundant_experts,
The attribute self.num_redundant_experts is used here, but it is not defined on the Grok1Model class. It is defined on Grok1ForCausalLM, while load_weights is a method of Grok1Model, so this will result in an AttributeError.

To fix this, initialize num_redundant_experts in Grok1Model.__init__ and set it as an attribute there, similar to how enable_eplb is handled.
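A minimal sketch of that fix, assuming a constructor in which enable_eplb and num_redundant_experts are passed in (or derived from the EPLB/parallel config) when Grok1Model is built; the exact signature in the PR may differ:

import torch.nn as nn


class Grok1Model(nn.Module):
    """Sketch only: shows where the EPLB attributes would live."""

    def __init__(self,
                 *,
                 enable_eplb: bool = False,
                 num_redundant_experts: int = 0,
                 **kwargs):
        super().__init__()
        # Store the EPLB settings on the model itself so that Grok1Model
        # methods such as load_weights can read self.num_redundant_experts
        # without reaching up to Grok1ForCausalLM.
        self.enable_eplb = enable_eplb
        self.num_redundant_experts = num_redundant_experts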
# All layers in Grok1 have MoE blocks.
self.expert_weights = [
    layer.moe_block.get_expert_weights() for layer in self.model.layers
]

# Set MoE hyperparameters.
self.moe_layers: list[FusedMoE] = [
    layer.moe_block.experts for layer in self.model.layers
]

example_moe = typing.cast(Grok1MoE, self.model.layers[0].moe_block)
self.num_logical_experts = example_moe.n_logical_experts
self.num_physical_experts = example_moe.n_physical_experts
self.num_local_physical_experts = example_moe.n_local_physical_experts
self.num_routed_experts = example_moe.n_routed_experts
self.num_redundant_experts = example_moe.n_redundant_experts
self.num_shared_experts = example_moe.n_shared_experts
This block of code will fail when pipeline parallelism is enabled. self.model.layers can contain PPMissingLayer placeholders, which do not have a moe_block attribute, so accessing self.model.layers[0].moe_block directly is not safe.

You should iterate through the layers to find the first valid one to use as an example, and filter out PPMissingLayer when building expert_weights and moe_layers. A safer approach is to check for the existence of moe_block, or to filter for layers that are not PPMissingLayer.
# All layers in Grok1 have MoE blocks.
# NOTE: This may be empty if no layers are on this rank.
local_layers = [
    layer for layer in self.model.layers
    if hasattr(layer, "moe_block")
]
self.expert_weights = [
    layer.moe_block.get_expert_weights() for layer in local_layers
]

# Set MoE hyperparameters.
self.moe_layers: list[FusedMoE] = [
    layer.moe_block.experts for layer in local_layers
]

if local_layers:
    example_moe = typing.cast(Grok1MoE, local_layers[0].moe_block)
    self.num_logical_experts = example_moe.n_logical_experts
    self.num_physical_experts = example_moe.n_physical_experts
    self.num_local_physical_experts = example_moe.n_local_physical_experts
    self.num_routed_experts = example_moe.n_routed_experts
    self.num_redundant_experts = example_moe.n_redundant_experts
    self.num_shared_experts = example_moe.n_shared_experts
else:
    # No layers on this rank, set defaults for the interface.
    self.num_logical_experts = 0
    self.num_physical_experts = 0
    self.num_local_physical_experts = 0
    self.num_routed_experts = 0
    self.num_redundant_experts = 0
    self.num_shared_experts = 0
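If filtering by attribute feels too implicit, an equivalent variant is to filter by type instead; this sketch assumes PPMissingLayer is importable from vllm.model_executor.models.utils, as in other vLLM model implementations:

from vllm.model_executor.models.utils import PPMissingLayer

# Keep only the transformer layers that actually live on this pipeline rank;
# PPMissingLayer placeholders stand in for layers hosted on other ranks.
local_layers = [
    layer for layer in self.model.layers
    if not isinstance(layer, PPMissingLayer)
]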
self.n_logical_experts = self.n_routed_experts
self.n_physical_experts = (self.n_logical_experts +
                           self.n_redundant_experts)
self.n_local_physical_experts = self.n_physical_experts // self.ep_size
The number of physical experts (self.n_physical_experts) should be divisible by the expert parallelism size (self.ep_size). If it is not, this integer division will assign an incorrect number of local experts to some ranks, which can cause issues later. It's best to add an assertion to ensure this condition is met.
assert self.n_physical_experts % self.ep_size == 0, \
    "The number of physical experts must be divisible by the expert parallel size."
self.n_local_physical_experts = self.n_physical_experts // self.ep_size
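As a quick sanity check with made-up numbers (not taken from the PR or from Grok-1's actual configuration), the assertion behaves as follows:

# Hypothetical example: 8 logical (routed) experts plus 8 redundant copies
# gives 16 physical experts, which splits evenly across ep_size = 4 ranks.
n_logical_experts = 8
n_redundant_experts = 8
n_physical_experts = n_logical_experts + n_redundant_experts  # 16
ep_size = 4

assert n_physical_experts % ep_size == 0, (
    "The number of physical experts must be divisible by the expert parallel size.")
n_local_physical_experts = n_physical_experts // ep_size  # 4 experts per rank

# With only 2 redundant experts (10 physical), 10 % 4 != 0 and the assertion
# would fire instead of letting integer division silently drop capacity.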
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Purpose
FIX (partial) #18343
Test Plan
WIP: Need to set up a smaller version of Grok1 to test locally
Test Result
WIP