WIP: Add EPLB support for Grok1 #21273
base: main
Conversation
Signed-off-by: <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces Expert Parallelism Load Balancing (EPLB) support for the Grok-1 model. My review identified a few critical issues that could lead to runtime errors, particularly in distributed setups with pipeline parallelism. Addressing these will be crucial for the stability and correctness of the implementation.
super().__init__()
self.hidden_size = hidden_size

self.ep_group = get_ep_group().device_group
@@ -376,7 +410,9 @@ def load_weights(self, weights: Iterable[tuple[str,
     ckpt_gate_proj_name="linear",  # Grok1 specific
     ckpt_down_proj_name="linear_1",  # Grok1 specific
     ckpt_up_proj_name="linear_v",  # Grok1 specific
-    num_experts=num_experts)
+    num_experts=num_experts,
+    num_redundant_experts=self.num_redundant_experts,
The attribute self.num_redundant_experts is used here, but it is not defined on the Grok1Model class. It is defined on Grok1ForCausalLM, while load_weights is a method of Grok1Model, so this will result in an AttributeError.

To fix this, initialize num_redundant_experts in Grok1Model.__init__ and set it as an attribute there, similar to how enable_eplb is handled.
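A minimal sketch of that fix, assuming a constructor in which enable_eplb and num_redundant_experts are passed in (or derived from the EPLB/parallel config) when Grok1Model is built; the exact signature in the PR may differ:

import torch.nn as nn


class Grok1Model(nn.Module):
    """Sketch only: shows where the EPLB attributes would live."""

    def __init__(self,
                 *,
                 enable_eplb: bool = False,
                 num_redundant_experts: int = 0,
                 **kwargs):
        super().__init__()
        # Store the EPLB settings on the model itself so that Grok1Model
        # methods such as load_weights can read self.num_redundant_experts
        # without reaching up to Grok1ForCausalLM.
        self.enable_eplb = enable_eplb
        self.num_redundant_experts = num_redundant_experts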
# All layers in Grok1 have MoE blocks.
self.expert_weights = [
    layer.moe_block.get_expert_weights() for layer in self.model.layers
]

# Set MoE hyperparameters.
self.moe_layers: list[FusedMoE] = [
    layer.moe_block.experts for layer in self.model.layers
]

example_moe = typing.cast(Grok1MoE, self.model.layers[0].moe_block)
self.num_logical_experts = example_moe.n_logical_experts
self.num_physical_experts = example_moe.n_physical_experts
self.num_local_physical_experts = example_moe.n_local_physical_experts
self.num_routed_experts = example_moe.n_routed_experts
self.num_redundant_experts = example_moe.n_redundant_experts
self.num_shared_experts = example_moe.n_shared_experts
This block of code will fail when pipeline parallelism is enabled. self.model.layers can contain PPMissingLayer placeholders, which do not have a moe_block attribute, so accessing self.model.layers[0].moe_block directly is not safe.

You should iterate through the layers to find the first valid one to use as an example, and filter out PPMissingLayer when building expert_weights and moe_layers. A safer approach is to check for the existence of moe_block, or to filter for layers that are not PPMissingLayer.
# All layers in Grok1 have MoE blocks.
# NOTE: This may be empty if no layers are on this rank.
local_layers = [
    layer for layer in self.model.layers
    if hasattr(layer, "moe_block")
]
self.expert_weights = [
    layer.moe_block.get_expert_weights() for layer in local_layers
]

# Set MoE hyperparameters.
self.moe_layers: list[FusedMoE] = [
    layer.moe_block.experts for layer in local_layers
]

if local_layers:
    example_moe = typing.cast(Grok1MoE, local_layers[0].moe_block)
    self.num_logical_experts = example_moe.n_logical_experts
    self.num_physical_experts = example_moe.n_physical_experts
    self.num_local_physical_experts = example_moe.n_local_physical_experts
    self.num_routed_experts = example_moe.n_routed_experts
    self.num_redundant_experts = example_moe.n_redundant_experts
    self.num_shared_experts = example_moe.n_shared_experts
else:
    # No layers on this rank, set defaults for the interface.
    self.num_logical_experts = 0
    self.num_physical_experts = 0
    self.num_local_physical_experts = 0
    self.num_routed_experts = 0
    self.num_redundant_experts = 0
    self.num_shared_experts = 0
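If filtering by attribute feels too implicit, an equivalent variant is to filter by type instead; this sketch assumes PPMissingLayer is importable from vllm.model_executor.models.utils, as in other vLLM model implementations:

from vllm.model_executor.models.utils import PPMissingLayer

# Keep only the transformer layers that actually live on this pipeline rank;
# PPMissingLayer placeholders stand in for layers hosted on other ranks.
local_layers = [
    layer for layer in self.model.layers
    if not isinstance(layer, PPMissingLayer)
]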
self.n_logical_experts = self.n_routed_experts
self.n_physical_experts = (self.n_logical_experts +
                           self.n_redundant_experts)
self.n_local_physical_experts = self.n_physical_experts // self.ep_size
The number of physical experts (self.n_physical_experts) should be divisible by the expert parallelism size (self.ep_size). If it is not, this integer division will assign an incorrect number of local experts to some ranks, which can cause issues later. It's best to add an assertion to ensure this condition is met.
assert self.n_physical_experts % self.ep_size == 0, \
    "The number of physical experts must be divisible by the expert parallel size."
self.n_local_physical_experts = self.n_physical_experts // self.ep_size
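As a quick sanity check with made-up numbers (not taken from the PR or from Grok-1's actual configuration), the assertion behaves as follows:

# Hypothetical example: 8 logical (routed) experts plus 8 redundant copies
# gives 16 physical experts, which splits evenly across ep_size = 4 ranks.
n_logical_experts = 8
n_redundant_experts = 8
n_physical_experts = n_logical_experts + n_redundant_experts  # 16
ep_size = 4

assert n_physical_experts % ep_size == 0, (
    "The number of physical experts must be divisible by the expert parallel size.")
n_local_physical_experts = n_physical_experts // ep_size  # 4 experts per rank

# With only 2 redundant experts (10 physical), 10 % 4 != 0 and the assertion
# would fire instead of letting integer division silently drop capacity.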
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Purpose
FIX (partial) #18343
Test Plan
WIP: Need to set up a smaller version of Grok1 to test locally
Test Result
WIP