[Feature] Add UVM-based MoE expert offloading with all-GPU compute#20126

Open
lichang98 wants to merge 2 commits into sgl-project:main from lichang98:feat/expert_offload

Conversation

@lichang98 lichang98 commented Mar 8, 2026

Motivation

GLM-5-FP8 has 256 experts per MoE layer. Even with TP=8 on a single host of 8× H20 (96 GB each, 768 GB total VRAM), it cannot fit in GPU memory. Rather than requiring multi-host deployment, this PR offloads a subset of MoE expert weights to host memory using CUDA Unified Virtual Memory (UVM).

Unlike KTransformers, which moves computation to CPU for offloaded experts, this implementation keeps all computation on GPU — offloaded experts are accessed transparently via PCIe read-through. The existing prefill and decode paths remain compatible, and decode can still benefit from CUDA graph replay.

Modifications

Commit 1: UVM-based expert weight offloading infrastructure

  • New module python/sglang/srt/layers/moe/expert_offload/ with:
    • config.py — ExpertOffloadConfig dataclass and factory from ServerArgs
    • manager.py — ExpertOffloadManager that allocates UVM tensors, applies cudaMemAdvise (PREFER_GPU for resident, PREFER_CPU + ACCESSED_BY_GPU for offloaded)
    • uvm.py — Low-level UVM allocation/deallocation via ctypes bindings to CUDA runtime
    • uvm_ops.cu — JIT-compiled CUDA kernels for UVM operations
    • prefetch.py — Speculative and frequency-based prefetch strategies
    • wrapper.py — ExpertOffloadWrapperMethod that integrates with FusedMoE layer
  • server_args.py — Four new CLI flags: --expert-offload-num-resident, --expert-offload-prefetch, --expert-offload-resident-selection, --expert-offload-resident-ids
  • cuda_graph_runner.py — Hook to ensure UVM compatibility with CUDA graph capture
  • utils/common.py — Shared helper utilities
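The resident/offloaded split described above boils down to issuing different cudaMemAdvise calls per expert. As a rough illustration, here is a hypothetical pure-Python plan builder (not code from this PR; `build_advise_plan` and its signature are invented for this sketch, though the enum values match the CUDA runtime's `cudaMemAdvise` constants):

```python
# CUDA runtime constants (from cuda_runtime_api.h):
CUDA_MEM_ADVISE_SET_PREFERRED_LOCATION = 3  # cudaMemAdviseSetPreferredLocation
CUDA_MEM_ADVISE_SET_ACCESSED_BY = 5         # cudaMemAdviseSetAccessedBy
CUDA_CPU_DEVICE_ID = -1                     # cudaCpuDeviceId


def build_advise_plan(num_experts, resident_ids, gpu_device):
    """Return a list of (expert_id, advice, device) tuples to issue via cudaMemAdvise.

    Resident experts prefer GPU memory (served at VRAM bandwidth).
    Offloaded experts prefer host memory but are marked as accessed by the
    GPU, so reads go over PCIe without page faults.
    """
    resident = set(resident_ids)
    plan = []
    for eid in range(num_experts):
        if eid in resident:
            plan.append((eid, CUDA_MEM_ADVISE_SET_PREFERRED_LOCATION, gpu_device))
        else:
            plan.append((eid, CUDA_MEM_ADVISE_SET_PREFERRED_LOCATION, CUDA_CPU_DEVICE_ID))
            plan.append((eid, CUDA_MEM_ADVISE_SET_ACCESSED_BY, gpu_device))
    return plan
```

In the actual manager the tuples would be fed to the ctypes binding around cudaMemAdvise, one call per expert slice of the managed tensor.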

Commit 2: Per-layer adaptive resident expert selection via frequency warmup

  • manager.py — Added record_expert_usage() to collect per-layer routing frequencies during early prefill passes; after accumulating enough tokens (default 4096), recompute the optimal resident set per layer and promote/demote experts via cudaMemAdvise
  • config.py — Added warmup_tokens config field
  • fused_moe_triton/layer.py — Wired frequency tracking into FusedMoE.forward_impl()
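The warmup flow above can be sketched as a small selector class. This is a hypothetical, simplified stand-in for the real manager logic (the class name and method signatures are invented here; the 4096-token warmup threshold and the small-batch filter mirror the description and commit message):

```python
class FrequencyWarmupSelector:
    """Accumulate per-layer expert routing counts during eager prefill and,
    once enough tokens have been seen, emit the new resident set."""

    def __init__(self, num_experts: int, num_resident: int, warmup_tokens: int = 4096):
        self.counts = [0] * num_experts
        self.num_resident = num_resident
        self.warmup_tokens = warmup_tokens
        self.tokens_seen = 0
        self.done = False

    def record(self, topk_ids, num_tokens):
        """Record one batch of routing decisions.

        Returns None while still warming up; returns the chosen resident set
        (a set of expert IDs) exactly once, when the warmup budget is reached.
        """
        if self.done or num_tokens < 64:  # filter out dummy warmup batches
            return None
        for eid in topk_ids:
            self.counts[eid] += 1
        self.tokens_seen += num_tokens
        if self.tokens_seen < self.warmup_tokens:
            return None
        self.done = True
        # Keep the most frequently routed experts resident.
        order = sorted(range(len(self.counts)), key=lambda e: -self.counts[e])
        return set(order[: self.num_resident])
```

In the real implementation, the returned set would be diffed against the current resident set and each promoted/demoted expert re-advised via cudaMemAdvise.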

Accuracy Tests

Benchmarking and Profiling

Tested with GLM-5-FP8 on 8× H20. Run the following after installing sglang:

pip install --upgrade transformers

Test command:

python3 -m sglang.launch_server --model-path /models/GLM-5-FP8 \
  --host 0.0.0.0 --port 9090 \
  --page-size 256 \
  --mem-fraction-static 0.85 --tensor-parallel-size 8 \
  --hicache-io-backend direct \
  --hicache-mem-layout page_first \
  --chunked-prefill-size 16384 \
  --attention-backend flashinfer \
  --reasoning-parser glm45 \
  --decode-log-interval 2 \
  --speculative-algorithm EAGLE \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-num-steps 3 \
  --tool-call-parser glm47 \
  --expert-offload-num-resident 200 \
  --expert-offload-prefetch speculative \
  --expert-offload-resident-selection frequency \
  --enable-flashinfer-allreduce-fusion

This is an early-stage implementation and there is plenty of room for improvement. I'd be glad to collaborate with anyone interested in making it better!

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

lichang98 and others added 2 commits March 8, 2026 19:26
Implement CUDA Unified Memory (cudaMallocManaged) based expert offloading
as an alternative to KTransformers. Resident experts use PREFER_GPU advice
(VRAM bandwidth), offloaded experts use PREFER_CPU + ACCESSED_BY_GPU
(PCIe read-through without page faults). No ID remapping, no LRU cache,
no assembly buffer — UVM is fully transparent to CUDA graphs.

Profile analysis on GLM-5-FP8 (TP=8, 200/256 resident, EAGLE spec decode)
shows ~8.7% decode overhead from UVM PCIe reads, zero CPU hot-path cost
during CUDA graph replay. Speculative prefetch is not yet active during
decode and needs follow-up work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… warmup

During early eager forward passes (prefill), collect per-layer expert
routing frequencies. After accumulating enough tokens, recompute the
optimal resident set per layer and call cudaMemAdvise to promote/demote
experts. This replaces the static first_n selection when
--expert-offload-resident-selection frequency is used.

- Add warmup_tokens config field (default 4096)
- Add record_expert_usage() with CUDA graph capture guard
- Filter out small batches (< 64 tokens) to ignore dummy warmup data
- Wire frequency tracking into FusedMoE.forward_impl()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions bot added the documentation label Mar 8, 2026
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and efficient UVM-based expert offloading solution for Mixture-of-Experts (MoE) layers. The primary goal is to enable the execution of very large MoE models, such as GLM-5-FP8, on single-host GPU setups by intelligently managing expert weights across GPU and host memory. By keeping all computation on the GPU and ensuring full compatibility with CUDA graphs, this approach offers a seamless and high-performance experience, addressing memory constraints without compromising model quality or inference speed during decode.

Highlights

  • UVM-based Expert Offloading: Introduced a new mechanism leveraging CUDA Unified Memory (UVM) to offload MoE expert weights to host memory, enabling larger models to fit into GPU memory without multi-host deployment.
  • All-GPU Compute: Ensured that all computation remains on the GPU, with offloaded experts accessed transparently via PCIe read-through, unlike previous methods that moved computation to the CPU.
  • CUDA Graph Compatibility: Designed the UVM offloading to be fully compatible with CUDA graph capture and replay, preserving full decode quality and avoiding degraded performance or masked experts.
  • Adaptive Resident Expert Selection: Implemented a per-layer adaptive strategy to select resident experts based on routing frequencies collected during prefill passes, allowing dynamic optimization of which experts reside on the GPU.
  • Simplified Architecture: Eliminated significant complexity from prior offloading designs, such as pinned-memory, assembly buffers, ID remapping, and dual static/dynamic execution paths, by utilizing UVM's inherent transparency.


Changelog
  • expert_offload_plan.md
    • Added a comprehensive design document detailing the UVM-based expert offloading mechanism, its rationale, implementation, and performance analysis.
  • python/sglang/srt/layers/moe/expert_offload/__init__.py
    • Added a new module initialization file to export core components of the UVM expert offloading system.
  • python/sglang/srt/layers/moe/expert_offload/config.py
    • Added a new configuration file defining the ExpertOffloadConfig dataclass and a factory function to create it from server arguments, including new fields for warmup tokens and removing obsolete cache-related fields.
  • python/sglang/srt/layers/moe/expert_offload/manager.py
    • Added a new manager file implementing ExpertOffloadManager, responsible for UVM allocation, memory advising, prefetching, and the new adaptive resident expert selection logic based on routing frequencies.
  • python/sglang/srt/layers/moe/expert_offload/prefetch.py
    • Added a new prefetch file defining an abstract ExpertPrefetchStrategy and concrete implementations for 'none', 'speculative', and 'frequency' based prefetching.
  • python/sglang/srt/layers/moe/expert_offload/uvm.py
    • Added a new UVM wrapper file providing Python bindings for low-level CUDA UVM operations, loading a JIT-compiled CUDA extension.
  • python/sglang/srt/layers/moe/expert_offload/uvm_ops.cu
    • Added a new CUDA C++ extension file that exposes functions for UVM memory allocation, advising, and asynchronous prefetching to Python.
  • python/sglang/srt/layers/moe/expert_offload/wrapper.py
    • Added a new wrapper file defining ExpertOffloadWrapperMethod, which integrates the UVM offloading logic with the FusedMoEMethodBase, simplifying the overall offloading process.
  • python/sglang/srt/layers/moe/fused_moe_triton/layer.py
    • Imported new expert offloading components.
    • Wrapped the quantization method with ExpertOffloadWrapperMethod if UVM offloading is enabled.
    • Added a call to record expert routing frequencies in the forward pass for adaptive resident expert selection.
  • python/sglang/srt/model_executor/cuda_graph_runner.py
    • Added a log message to inform users about the behavior of expert offloading when CUDA graph mode is active, clarifying that decode uses a static path for resident experts.
  • python/sglang/srt/server_args.py
    • Added new CLI arguments for configuring expert offloading, including number of resident experts, prefetch strategy, resident selection method, and manual resident IDs.
    • Implemented validation logic to ensure expert offloading is mutually exclusive with KTransformers offloading.
  • python/sglang/srt/utils/common.py
    • Added a mechanism to register UVM-evictable memory, which helps in accurately reporting available GPU memory and preventing false out-of-memory errors during initialization.
Activity
  • The initial commit focused on establishing the core UVM-based expert weight offloading infrastructure, including new modules for configuration, manager, UVM operations, prefetch strategies, and a wrapper for integration.
  • A subsequent commit introduced per-layer adaptive resident expert selection via frequency warmup, adding logic to manager.py and config.py to track expert usage and re-advise memory based on observed frequencies.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment:

Code Review

This pull request introduces a significant and well-designed feature for UVM-based MoE expert offloading. The use of CUDA Unified Memory to simplify the architecture while maintaining full quality during decode is a great approach. The code is well-structured, and the inclusion of a detailed design document is highly appreciated. My review identifies a few areas for improvement, including a potential bug in argument parsing, some leftover dead code, and a misleading log message that should be corrected to accurately reflect the feature's capabilities. Overall, this is a solid contribution that will be even better with these minor adjustments.

Comment on lines +67 to +70
resident_ids = [
    int(x.strip())
    for x in server_args.expert_offload_resident_ids.split(",")
]

high

The current implementation for parsing expert_offload_resident_ids is susceptible to a ValueError if the string contains consecutive commas (e.g., '1,2,,3') or a trailing comma, as x.strip() would be an empty string, and int('') is invalid. It's safer to filter out empty strings before converting to integers.

Suggested change:

resident_ids = [
    int(x.strip())
    for x in server_args.expert_offload_resident_ids.split(",")
    if x.strip()
]

Comment on lines +498 to +504
if model_runner.server_args.expert_offload_num_resident >= 0:
    log_info_on_rank0(
        logger,
        "[ExpertOffload] CUDA graph mode: decode uses static path (resident experts only). "
        "Prefill always uses dynamic path (full quality). "
        "Use --disable-cuda-graph for full decode quality.",
    )

high

This log message is misleading. It states that with CUDA graph mode, decode uses a static path with 'resident experts only', implying a loss of quality. However, the design document (expert_offload_plan.md) and the implementation show that UVM allows transparent access to offloaded experts via PCIe read-through, ensuring full quality even during CUDA graph replay. The log message should be corrected to reflect that there is no quality degradation.

        if model_runner.server_args.expert_offload_num_resident >= 0:
            log_info_on_rank0(
                logger,
                "[ExpertOffload] CUDA graph mode enabled. Offloaded experts will be accessed "
                "transparently via PCIe, ensuring full decode quality.",
            )

--host 0.0.0.0 --port 9090 \
--page-size 256 \
--mem-fraction-static 0.85 --tensor-parallel-size 8 \
--hicache-io-backend kernel \

medium

There's a minor inconsistency in the test command. Here, --hicache-io-backend is set to kernel, but in the pull request description, it's set to direct. For clarity and reproducibility, it would be best to ensure these commands are consistent.

Comment on lines +264 to +294
def prefetch_experts(
    self,
    expert_ids: List[int],
    stream: Optional[torch.cuda.Stream] = None,
) -> None:
    """Speculatively prefetch expert pages to GPU VRAM.

    Call this with predicted expert IDs *before* a CUDA graph decode step
    to upgrade PCIe-speed accesses to VRAM-speed accesses for those experts.
    Correctness does not depend on this call — it is a pure optimisation.

    ``expert_ids`` should only contain offloaded (non-resident) expert IDs.
    """
    if stream is None:
        stream = self.prefetch_stream

    # We need managed tensor references to issue prefetch on.
    # Iterate over the first param to get the managed tensor.
    if not self._expert_param_names:
        return

    # NOTE: We assume all params have the same expert ordering.
    # The caller is responsible for providing valid offloaded expert IDs.
    for expert_id in expert_ids:
        if expert_id < 0 or expert_id >= self.config.num_local_experts:
            continue
        # Each managed tensor holds expert weights; prefetch per-param.
        # Access managed tensors stored on the layer via the stored names.
        # (The actual managed tensors are on the layer object itself.)
        # This method is called by the wrapper which has access to layer.
        pass  # See ExpertOffloadWrapperMethod.prefetch_experts_on_layer

medium

The prefetch_experts method appears to be unused. The prefetching logic is implemented in ExpertOffloadWrapperMethod.prefetch_experts_on_layer, which does not call this method. To improve clarity and remove dead code, consider removing this method from the ExpertOffloadManager.

Comment on lines +1 to +137
# Copyright 2024 SGLang Team
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Prefetch strategies for expert weight offloading."""

from __future__ import annotations

from abc import ABC, abstractmethod
from typing import List, Optional

import torch


class ExpertPrefetchStrategy(ABC):
    """Abstract base class for expert prefetch strategies."""

    @abstractmethod
    def predict(self, topk_ids: torch.Tensor, layer_idx: int) -> List[int]:
        """Predict which expert IDs to prefetch for the next layer.

        Args:
            topk_ids: Current layer's top-k expert IDs [batch, top_k].
            layer_idx: Current layer index.

        Returns:
            List of expert IDs to prefetch (may be empty).
        """
        ...

    def update(self, topk_ids: torch.Tensor, layer_idx: int) -> None:
        """Update internal statistics after seeing actual expert usage."""
        pass


class NoPrefetch(ExpertPrefetchStrategy):
    """No-op prefetch strategy."""

    def predict(self, topk_ids: torch.Tensor, layer_idx: int) -> List[int]:
        return []


class SpeculativePrefetch(ExpertPrefetchStrategy):
    """Cross-layer correlation-based speculative prefetch.

    Maintains a co-occurrence matrix: ``corr[i, j]`` counts how often expert
    ``i`` in layer L predicts expert ``j`` in layer L+1. After a warmup
    period, for each expert selected in the current layer we look up its
    top-correlated next-layer experts and return those as prefetch candidates.
    """

    WARMUP_STEPS = 32

    def __init__(self, num_experts: int, num_layers: int, num_predictions: int):
        self.num_experts = num_experts
        self.num_layers = num_layers
        self.num_predictions = num_predictions
        self._step = 0

        # corr[layer, expert_prev, expert_next] — updated online.
        self.corr = torch.zeros(num_layers, num_experts, num_experts, dtype=torch.float32)
        # Expert selections from the previous step, per layer.
        self._prev_ids: Optional[torch.Tensor] = None

    def predict(self, topk_ids: torch.Tensor, layer_idx: int) -> List[int]:
        if self._step < self.WARMUP_STEPS:
            return []
        next_layer = layer_idx + 1
        if next_layer >= self.num_layers:
            return []
        # Accumulate correlation scores for the next layer.
        scores = torch.zeros(self.num_experts, dtype=torch.float32)
        for eid in topk_ids.unique().tolist():
            eid = int(eid)
            if 0 <= eid < self.num_experts:
                scores += self.corr[next_layer, eid]
        _, top_ids = scores.topk(min(self.num_predictions, self.num_experts))
        return top_ids.tolist()

    def update(self, topk_ids: torch.Tensor, layer_idx: int) -> None:
        self._step += 1
        unique_ids = [int(x) for x in topk_ids.unique().tolist() if x >= 0]
        if self._prev_ids is not None and layer_idx > 0:
            prev_unique = [int(x) for x in self._prev_ids.unique().tolist() if x >= 0]
            for prev in prev_unique:
                for curr in unique_ids:
                    if 0 <= prev < self.num_experts and 0 <= curr < self.num_experts:
                        self.corr[layer_idx, prev, curr] += 1.0
        # Save current ids as "previous" for the next layer's update call.
        self._prev_ids = topk_ids.detach().cpu()


class FrequencyPrefetch(ExpertPrefetchStrategy):
    """Always prefetch the globally most-frequent experts."""

    def __init__(self, num_experts: int, num_predictions: int):
        self.num_experts = num_experts
        self.num_predictions = num_predictions
        self.freq = torch.zeros(num_experts, dtype=torch.long)
        self._top_ids: List[int] = []

    def predict(self, topk_ids: torch.Tensor, layer_idx: int) -> List[int]:
        return self._top_ids

    def update(self, topk_ids: torch.Tensor, layer_idx: int) -> None:
        for eid in topk_ids.unique().tolist():
            eid = int(eid)
            if 0 <= eid < self.num_experts:
                self.freq[eid] += 1
        _, top_ids = self.freq.topk(min(self.num_predictions, self.num_experts))
        self._top_ids = top_ids.tolist()


def create_prefetch_strategy(
    strategy_name: str,
    num_experts: int,
    num_layers: int,
    num_cache_slots: int,
) -> ExpertPrefetchStrategy:
    """Factory function for prefetch strategies."""
    if strategy_name == "none":
        return NoPrefetch()
    elif strategy_name == "speculative":
        return SpeculativePrefetch(num_experts, num_layers, num_predictions=num_cache_slots)
    elif strategy_name == "frequency":
        return FrequencyPrefetch(num_experts, num_predictions=num_cache_slots)
    else:
        raise ValueError(f"Unknown prefetch strategy: {strategy_name!r}")

medium

This file seems to contain code from an earlier design and is not imported or used anywhere in the current implementation. It should be removed to avoid confusion and keep the codebase clean.


Labels

documentation Improvements or additions to documentation
