[Feature] Add UVM-based MoE expert offloading with all-GPU compute#20126

Open
lichang98 wants to merge 2 commits into sgl-project:main from lichang98:feat/expert_offload

Conversation

@lichang98 lichang98 commented Mar 8, 2026

Motivation

GLM-5-FP8 has 256 experts per MoE layer. Even with TP=8 on a single host of 8× H20 (96 GB each, 768 GB total VRAM), it cannot fit in GPU memory. Rather than requiring multi-host deployment, this PR offloads a subset of MoE expert weights to host memory using CUDA Unified Virtual Memory (UVM).

Unlike KTransformers, which moves computation to CPU for offloaded experts, this implementation keeps all computation on GPU — offloaded experts are accessed transparently via PCIe read-through. The existing prefill and decode paths remain compatible, and decode can still benefit from CUDA graph replay.

Modifications

Commit 1: UVM-based expert weight offloading infrastructure

  • New module python/sglang/srt/layers/moe/expert_offload/ with:
    • config.py — ExpertOffloadConfig dataclass and factory from ServerArgs
    • manager.py — ExpertOffloadManager that allocates UVM tensors, applies cudaMemAdvise (PREFER_GPU for resident, PREFER_CPU + ACCESSED_BY_GPU for offloaded)
    • uvm.py — Low-level UVM allocation/deallocation via ctypes bindings to CUDA runtime
    • uvm_ops.cu — JIT-compiled CUDA kernels for UVM operations
    • prefetch.py — Speculative and frequency-based prefetch strategies
    • wrapper.py — ExpertOffloadWrapperMethod that integrates with FusedMoE layer
  • server_args.py — Four new CLI flags: --expert-offload-num-resident, --expert-offload-prefetch, --expert-offload-resident-selection, --expert-offload-resident-ids
  • cuda_graph_runner.py — Hook to ensure UVM compatibility with CUDA graph capture
  • utils/common.py — Shared helper utilities
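The resident/offloaded split described above boils down to issuing different cudaMemAdvise calls per expert. As a rough illustration, here is a hypothetical pure-Python plan builder (not code from this PR; `build_advise_plan` and its signature are invented for this sketch, though the enum values match the CUDA runtime's `cudaMemAdvise` constants):

```python
# CUDA runtime constants (from cuda_runtime_api.h):
CUDA_MEM_ADVISE_SET_PREFERRED_LOCATION = 3  # cudaMemAdviseSetPreferredLocation
CUDA_MEM_ADVISE_SET_ACCESSED_BY = 5         # cudaMemAdviseSetAccessedBy
CUDA_CPU_DEVICE_ID = -1                     # cudaCpuDeviceId


def build_advise_plan(num_experts, resident_ids, gpu_device):
    """Return a list of (expert_id, advice, device) tuples to issue via cudaMemAdvise.

    Resident experts prefer GPU memory (served at VRAM bandwidth).
    Offloaded experts prefer host memory but are marked as accessed by the
    GPU, so reads go over PCIe without page faults.
    """
    resident = set(resident_ids)
    plan = []
    for eid in range(num_experts):
        if eid in resident:
            plan.append((eid, CUDA_MEM_ADVISE_SET_PREFERRED_LOCATION, gpu_device))
        else:
            plan.append((eid, CUDA_MEM_ADVISE_SET_PREFERRED_LOCATION, CUDA_CPU_DEVICE_ID))
            plan.append((eid, CUDA_MEM_ADVISE_SET_ACCESSED_BY, gpu_device))
    return plan
```

In the actual manager the tuples would be fed to the ctypes binding around cudaMemAdvise, one call per expert slice of the managed tensor.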

Commit 2: Per-layer adaptive resident expert selection via frequency warmup

  • manager.py — Added record_expert_usage() to collect per-layer routing frequencies during early prefill passes; after accumulating enough tokens (default 4096), recompute the optimal resident set per layer and promote/demote experts via cudaMemAdvise
  • config.py — Added warmup_tokens config field
  • fused_moe_triton/layer.py — Wired frequency tracking into FusedMoE.forward_impl()
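The warmup flow above can be sketched as a small selector class. This is a hypothetical, simplified stand-in for the real manager logic (the class name and method signatures are invented here; the 4096-token warmup threshold and the small-batch filter mirror the description and commit message):

```python
class FrequencyWarmupSelector:
    """Accumulate per-layer expert routing counts during eager prefill and,
    once enough tokens have been seen, emit the new resident set."""

    def __init__(self, num_experts: int, num_resident: int, warmup_tokens: int = 4096):
        self.counts = [0] * num_experts
        self.num_resident = num_resident
        self.warmup_tokens = warmup_tokens
        self.tokens_seen = 0
        self.done = False

    def record(self, topk_ids, num_tokens):
        """Record one batch of routing decisions.

        Returns None while still warming up; returns the chosen resident set
        (a set of expert IDs) exactly once, when the warmup budget is reached.
        """
        if self.done or num_tokens < 64:  # filter out dummy warmup batches
            return None
        for eid in topk_ids:
            self.counts[eid] += 1
        self.tokens_seen += num_tokens
        if self.tokens_seen < self.warmup_tokens:
            return None
        self.done = True
        # Keep the most frequently routed experts resident.
        order = sorted(range(len(self.counts)), key=lambda e: -self.counts[e])
        return set(order[: self.num_resident])
```

In the real implementation, the returned set would be diffed against the current resident set and each promoted/demoted expert re-advised via cudaMemAdvise.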

Accuracy Tests

Benchmarking and Profiling

Tested with GLM-5-FP8 on 8× H20. Run the following after installing sglang:

pip install --upgrade transformers

Test command:

python3 -m sglang.launch_server --model-path /models/GLM-5-FP8 \
  --host 0.0.0.0 --port 9090 \
  --page-size 256 \
  --mem-fraction-static 0.85 --tensor-parallel-size 8 \
  --hicache-io-backend direct \
  --hicache-mem-layout page_first \
  --chunked-prefill-size 16384 \
  --attention-backend flashinfer \
  --reasoning-parser glm45 \
  --decode-log-interval 2 \
  --speculative-algorithm EAGLE \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-num-steps 3 \
  --tool-call-parser glm47 \
  --expert-offload-num-resident 200 \
  --expert-offload-prefetch speculative \
  --expert-offload-resident-selection frequency \
  --enable-flashinfer-allreduce-fusion

This is an early-stage implementation and there is plenty of room for improvement. I'd be glad to collaborate with anyone interested in making it better!

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

lichang98 and others added 2 commits March 8, 2026 19:26
Implement CUDA Unified Memory (cudaMallocManaged) based expert offloading
as an alternative to KTransformers. Resident experts use PREFER_GPU advice
(VRAM bandwidth), offloaded experts use PREFER_CPU + ACCESSED_BY_GPU
(PCIe read-through without page faults). No ID remapping, no LRU cache,
no assembly buffer — UVM is fully transparent to CUDA graphs.

Profile analysis on GLM-5-FP8 (TP=8, 200/256 resident, EAGLE spec decode)
shows ~8.7% decode overhead from UVM PCIe reads, zero CPU hot-path cost
during CUDA graph replay. Speculative prefetch is not yet active during
decode and needs follow-up work.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… warmup

During early eager forward passes (prefill), collect per-layer expert
routing frequencies. After accumulating enough tokens, recompute the
optimal resident set per layer and call cudaMemAdvise to promote/demote
experts. This replaces the static first_n selection when
--expert-offload-resident-selection frequency is used.

- Add warmup_tokens config field (default 4096)
- Add record_expert_usage() with CUDA graph capture guard
- Filter out small batches (< 64 tokens) to ignore dummy warmup data
- Wire frequency tracking into FusedMoE.forward_impl()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions bot added the documentation label Mar 8, 2026
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and efficient UVM-based expert offloading solution for Mixture-of-Experts (MoE) layers. The primary goal is to enable the execution of very large MoE models, such as GLM-5-FP8, on single-host GPU setups by intelligently managing expert weights across GPU and host memory. By keeping all computation on the GPU and ensuring full compatibility with CUDA graphs, this approach offers a seamless and high-performance experience, addressing memory constraints without compromising model quality or inference speed during decode.

Highlights

  • UVM-based Expert Offloading: Introduced a new mechanism leveraging CUDA Unified Memory (UVM) to offload MoE expert weights to host memory, enabling larger models to fit into GPU memory without multi-host deployment.
  • All-GPU Compute: Ensured that all computation remains on the GPU, with offloaded experts accessed transparently via PCIe read-through, unlike previous methods that moved computation to the CPU.
  • CUDA Graph Compatibility: Designed the UVM offloading to be fully compatible with CUDA graph capture and replay, preserving full decode quality and avoiding degraded performance or masked experts.
  • Adaptive Resident Expert Selection: Implemented a per-layer adaptive strategy to select resident experts based on routing frequencies collected during prefill passes, allowing dynamic optimization of which experts reside on the GPU.
  • Simplified Architecture: Eliminated significant complexity from prior offloading designs, such as pinned-memory, assembly buffers, ID remapping, and dual static/dynamic execution paths, by utilizing UVM's inherent transparency.


Changelog
  • expert_offload_plan.md
    • Added a comprehensive design document detailing the UVM-based expert offloading mechanism, its rationale, implementation, and performance analysis.
  • python/sglang/srt/layers/moe/expert_offload/__init__.py
    • Added a new module initialization file to export core components of the UVM expert offloading system.
  • python/sglang/srt/layers/moe/expert_offload/config.py
    • Added a new configuration file defining the ExpertOffloadConfig dataclass and a factory function to create it from server arguments, including new fields for warmup tokens and removing obsolete cache-related fields.
  • python/sglang/srt/layers/moe/expert_offload/manager.py
    • Added a new manager file implementing ExpertOffloadManager, responsible for UVM allocation, memory advising, prefetching, and the new adaptive resident expert selection logic based on routing frequencies.
  • python/sglang/srt/layers/moe/expert_offload/prefetch.py
    • Added a new prefetch file defining an abstract ExpertPrefetchStrategy and concrete implementations for 'none', 'speculative', and 'frequency' based prefetching.
  • python/sglang/srt/layers/moe/expert_offload/uvm.py
    • Added a new UVM wrapper file providing Python bindings for low-level CUDA UVM operations, loading a JIT-compiled CUDA extension.
  • python/sglang/srt/layers/moe/expert_offload/uvm_ops.cu
    • Added a new CUDA C++ extension file that exposes functions for UVM memory allocation, advising, and asynchronous prefetching to Python.
  • python/sglang/srt/layers/moe/expert_offload/wrapper.py
    • Added a new wrapper file defining ExpertOffloadWrapperMethod, which integrates the UVM offloading logic with the FusedMoEMethodBase, simplifying the overall offloading process.
  • python/sglang/srt/layers/moe/fused_moe_triton/layer.py
    • Imported new expert offloading components.
    • Wrapped the quantization method with ExpertOffloadWrapperMethod if UVM offloading is enabled.
    • Added a call to record expert routing frequencies in the forward pass for adaptive resident expert selection.
  • python/sglang/srt/model_executor/cuda_graph_runner.py
    • Added a log message to inform users about the behavior of expert offloading when CUDA graph mode is active, clarifying that decode uses a static path for resident experts.
  • python/sglang/srt/server_args.py
    • Added new CLI arguments for configuring expert offloading, including number of resident experts, prefetch strategy, resident selection method, and manual resident IDs.
    • Implemented validation logic to ensure expert offloading is mutually exclusive with KTransformers offloading.
  • python/sglang/srt/utils/common.py
    • Added a mechanism to register UVM-evictable memory, which helps in accurately reporting available GPU memory and preventing false out-of-memory errors during initialization.
Activity
  • The initial commit focused on establishing the core UVM-based expert weight offloading infrastructure, including new modules for configuration, manager, UVM operations, prefetch strategies, and a wrapper for integration.
  • A subsequent commit introduced per-layer adaptive resident expert selection via frequency warmup, adding logic to manager.py and config.py to track expert usage and re-advise memory based on observed frequencies.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment:

Code Review

This pull request introduces a significant and well-designed feature for UVM-based MoE expert offloading. The use of CUDA Unified Memory to simplify the architecture while maintaining full quality during decode is a great approach. The code is well-structured, and the inclusion of a detailed design document is highly appreciated. My review identifies a few areas for improvement, including a potential bug in argument parsing, some leftover dead code, and a misleading log message that should be corrected to accurately reflect the feature's capabilities. Overall, this is a solid contribution that will be even better with these minor adjustments.

Comment on lines +67 to +70
resident_ids = [
    int(x.strip())
    for x in server_args.expert_offload_resident_ids.split(",")
]

high

The current implementation for parsing expert_offload_resident_ids is susceptible to a ValueError if the string contains consecutive commas (e.g., '1,2,,3') or a trailing comma, as x.strip() would be an empty string, and int('') is invalid. It's safer to filter out empty strings before converting to integers.

Suggested change:

resident_ids = [
    int(x.strip())
    for x in server_args.expert_offload_resident_ids.split(",")
    if x.strip()
]

Comment on lines +498 to +504
if model_runner.server_args.expert_offload_num_resident >= 0:
    log_info_on_rank0(
        logger,
        "[ExpertOffload] CUDA graph mode: decode uses static path (resident experts only). "
        "Prefill always uses dynamic path (full quality). "
        "Use --disable-cuda-graph for full decode quality.",
    )

high

This log message is misleading. It states that with CUDA graph mode, decode uses a static path with 'resident experts only', implying a loss of quality. However, the design document (expert_offload_plan.md) and the implementation show that UVM allows transparent access to offloaded experts via PCIe read-through, ensuring full quality even during CUDA graph replay. The log message should be corrected to reflect that there is no quality degradation.

        if model_runner.server_args.expert_offload_num_resident >= 0:
            log_info_on_rank0(
                logger,
                "[ExpertOffload] CUDA graph mode enabled. Offloaded experts will be accessed "
                "transparently via PCIe, ensuring full decode quality.",
            )

--host 0.0.0.0 --port 9090 \
--page-size 256 \
--mem-fraction-static 0.85 --tensor-parallel-size 8 \
--hicache-io-backend kernel \

medium

There's a minor inconsistency in the test command. Here, --hicache-io-backend is set to kernel, but in the pull request description, it's set to direct. For clarity and reproducibility, it would be best to ensure these commands are consistent.

Comment on lines +264 to +294
def prefetch_experts(
    self,
    expert_ids: List[int],
    stream: Optional[torch.cuda.Stream] = None,
) -> None:
    """Speculatively prefetch expert pages to GPU VRAM.

    Call this with predicted expert IDs *before* a CUDA graph decode step
    to upgrade PCIe-speed accesses to VRAM-speed accesses for those experts.
    Correctness does not depend on this call — it is a pure optimisation.

    ``expert_ids`` should only contain offloaded (non-resident) expert IDs.
    """
    if stream is None:
        stream = self.prefetch_stream

    # We need managed tensor references to issue prefetch on.
    # Iterate over the first param to get the managed tensor.
    if not self._expert_param_names:
        return

    # NOTE: We assume all params have the same expert ordering.
    # The caller is responsible for providing valid offloaded expert IDs.
    for expert_id in expert_ids:
        if expert_id < 0 or expert_id >= self.config.num_local_experts:
            continue
        # Each managed tensor holds expert weights; prefetch per-param.
        # Access managed tensors stored on the layer via the stored names.
        # (The actual managed tensors are on the layer object itself.)
        # This method is called by the wrapper which has access to layer.
        pass  # See ExpertOffloadWrapperMethod.prefetch_experts_on_layer

medium

The prefetch_experts method appears to be unused. The prefetching logic is implemented in ExpertOffloadWrapperMethod.prefetch_experts_on_layer, which does not call this method. To improve clarity and remove dead code, consider removing this method from the ExpertOffloadManager.

Comment on lines +1 to +137
# Copyright 2024 SGLang Team
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Prefetch strategies for expert weight offloading."""

from __future__ import annotations

from abc import ABC, abstractmethod
from typing import List, Optional

import torch


class ExpertPrefetchStrategy(ABC):
    """Abstract base class for expert prefetch strategies."""

    @abstractmethod
    def predict(self, topk_ids: torch.Tensor, layer_idx: int) -> List[int]:
        """Predict which expert IDs to prefetch for the next layer.

        Args:
            topk_ids: Current layer's top-k expert IDs [batch, top_k].
            layer_idx: Current layer index.

        Returns:
            List of expert IDs to prefetch (may be empty).
        """
        ...

    def update(self, topk_ids: torch.Tensor, layer_idx: int) -> None:
        """Update internal statistics after seeing actual expert usage."""
        pass


class NoPrefetch(ExpertPrefetchStrategy):
    """No-op prefetch strategy."""

    def predict(self, topk_ids: torch.Tensor, layer_idx: int) -> List[int]:
        return []


class SpeculativePrefetch(ExpertPrefetchStrategy):
    """Cross-layer correlation-based speculative prefetch.

    Maintains a co-occurrence matrix: ``corr[i, j]`` counts how often expert
    ``i`` in layer L predicts expert ``j`` in layer L+1. After a warmup
    period, for each expert selected in the current layer we look up its
    top-correlated next-layer experts and return those as prefetch candidates.
    """

    WARMUP_STEPS = 32

    def __init__(self, num_experts: int, num_layers: int, num_predictions: int):
        self.num_experts = num_experts
        self.num_layers = num_layers
        self.num_predictions = num_predictions
        self._step = 0

        # corr[layer, expert_prev, expert_next] — updated online.
        self.corr = torch.zeros(num_layers, num_experts, num_experts, dtype=torch.float32)
        # Expert selections from the previous step, per layer.
        self._prev_ids: Optional[torch.Tensor] = None

    def predict(self, topk_ids: torch.Tensor, layer_idx: int) -> List[int]:
        if self._step < self.WARMUP_STEPS:
            return []
        next_layer = layer_idx + 1
        if next_layer >= self.num_layers:
            return []
        # Accumulate correlation scores for the next layer.
        scores = torch.zeros(self.num_experts, dtype=torch.float32)
        for eid in topk_ids.unique().tolist():
            eid = int(eid)
            if 0 <= eid < self.num_experts:
                scores += self.corr[next_layer, eid]
        _, top_ids = scores.topk(min(self.num_predictions, self.num_experts))
        return top_ids.tolist()

    def update(self, topk_ids: torch.Tensor, layer_idx: int) -> None:
        self._step += 1
        unique_ids = [int(x) for x in topk_ids.unique().tolist() if x >= 0]
        if self._prev_ids is not None and layer_idx > 0:
            prev_unique = [int(x) for x in self._prev_ids.unique().tolist() if x >= 0]
            for prev in prev_unique:
                for curr in unique_ids:
                    if 0 <= prev < self.num_experts and 0 <= curr < self.num_experts:
                        self.corr[layer_idx, prev, curr] += 1.0
        # Save current ids as "previous" for the next layer's update call.
        self._prev_ids = topk_ids.detach().cpu()


class FrequencyPrefetch(ExpertPrefetchStrategy):
    """Always prefetch the globally most-frequent experts."""

    def __init__(self, num_experts: int, num_predictions: int):
        self.num_experts = num_experts
        self.num_predictions = num_predictions
        self.freq = torch.zeros(num_experts, dtype=torch.long)
        self._top_ids: List[int] = []

    def predict(self, topk_ids: torch.Tensor, layer_idx: int) -> List[int]:
        return self._top_ids

    def update(self, topk_ids: torch.Tensor, layer_idx: int) -> None:
        for eid in topk_ids.unique().tolist():
            eid = int(eid)
            if 0 <= eid < self.num_experts:
                self.freq[eid] += 1
        _, top_ids = self.freq.topk(min(self.num_predictions, self.num_experts))
        self._top_ids = top_ids.tolist()


def create_prefetch_strategy(
    strategy_name: str,
    num_experts: int,
    num_layers: int,
    num_cache_slots: int,
) -> ExpertPrefetchStrategy:
    """Factory function for prefetch strategies."""
    if strategy_name == "none":
        return NoPrefetch()
    elif strategy_name == "speculative":
        return SpeculativePrefetch(num_experts, num_layers, num_predictions=num_cache_slots)
    elif strategy_name == "frequency":
        return FrequencyPrefetch(num_experts, num_predictions=num_cache_slots)
    else:
        raise ValueError(f"Unknown prefetch strategy: {strategy_name!r}")

medium

This file seems to contain code from an earlier design and is not imported or used anywhere in the current implementation. It should be removed to avoid confusion and keep the codebase clean.


Labels

documentation Improvements or additions to documentation
