
Fix noncontiguous input for rmsnorm#117

Open
yangw1234 wants to merge 3 commits into sgl-project:main from yangw1234:fix_norn_noncontiguous

Conversation


@yangw1234 yangw1234 commented Mar 7, 2026

The issue

Deepseek VL2 Small and Deepseek Coder Lite produce garbage output.
The root cause is that rmsnorm does not support non-contiguous input.
This PR provides a workaround.
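For illustration (this sketch is not part of the PR), the failure mode can be reproduced in NumPy, which exposes the same contiguity model as torch: a transposed view shares the original buffer with swapped strides, so a stride-unaware kernel that walks the raw buffer row-major reads the wrong element order. `np.ascontiguousarray` plays the role of torch's `.contiguous()` here.

```python
import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)   # [[0,1,2],[3,4,5]]
b = a.T                                            # transposed *view*: [[0,3],[1,4],[2,5]]
assert not b.flags["C_CONTIGUOUS"]                 # same buffer, non-contiguous strides

# A kernel that ignores strides and reads b's underlying buffer sequentially
# would effectively see a's element order reshaped to b's shape -- i.e.
# "garbage" from b's point of view:
naive = np.frombuffer(a.tobytes(), dtype=np.float32).reshape(b.shape)
assert not bool((naive == b).all())

# The workaround in this PR, in NumPy terms: materialize a contiguous copy.
c = np.ascontiguousarray(b)
assert c.flags["C_CONTIGUOUS"] and bool((c == b).all())
```

The copy fixes correctness at the cost of an extra memcpy, which is exactly the trade-off debated below.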

@yangw1234
Author

@airMeng could you take a look?

    output: torch.Tensor
        Normalized tensor, shape (batch_size, hidden_size).
    """
    input = input.contiguous()
Collaborator


How about adding a check for whether the input is contiguous?

Author

Isn't .contiguous() a no-op when the tensor is already contiguous?
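For what it's worth, this claim checks out: torch's `Tensor.contiguous()` returns `self` when the tensor is already contiguous. The NumPy analogue, sketched here as the lighter dependency, behaves the same way:

```python
import numpy as np

a = np.zeros((2, 3), dtype=np.float32)   # freshly allocated: C-contiguous
assert np.ascontiguousarray(a) is a      # already contiguous: no copy, same object

b = a.T                                  # non-contiguous view
assert np.ascontiguousarray(b) is not b  # non-contiguous: a new contiguous copy
```

So the only cost on the already-contiguous path is the contiguity check itself.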

Collaborator

yes, something like:

    if not input.is_contiguous():
        # requires `import warnings` at the top of the module
        warnings.warn("Input of RMSNorm is not contiguous. Converting to contiguous tensor.")
        input = input.contiguous()

Author

Won't this print too many log messages in the sglang server log?

@airMeng
Collaborator

airMeng commented Mar 8, 2026

Generally I would suggest refactoring the kernel to support non-contiguous input rather than doing the reordering implicitly, which might make performance analysis complex. If that is not possible, probably throw a warning and then do the reorder.

@yangw1234
Author

> Generally I would suggest refactoring the kernel to support non-contiguous input rather than doing the reordering implicitly, which might make performance analysis complex. If that is not possible, probably throw a warning and then do the reorder.

Yes, I believe we will eventually support it in the kernel, but I don't think this is the bottleneck right now. I don't think adding an op will complicate the analysis, since an op is clearly visible in the profiling trace. Throwing a warning is probably too noisy.

airMeng requested a review from mingfeima on March 9, 2026 at 01:01
    output: torch.Tensor
        Normalized tensor, shape (batch_size, hidden_size).
    """
    input = input.contiguous()
Collaborator

We should not fix this issue by converting the input to a contiguous tensor.

This would result in a silent memcpy whenever non-contiguous input is encountered.

How to fix it properly?

  • fix it in the C++ kernels
  • add non-contiguous support

I prefer to do contiguity checks in the C++ kernels:

    TORCH_CHECK(input.is_contiguous(), "xxx");

If we get a non-contiguous input, just fix it at the call site. That would be better than a silent memcpy.
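The fail-fast approach described above can be sketched at the Python wrapper level as well (the C++ `TORCH_CHECK` behaves analogously). The names here are illustrative, not the actual sgl-project API, and a duck-typed fake tensor is used so the check can be exercised without torch:

```python
def rmsnorm(input, weight, eps=1e-6):
    """Illustrative wrapper only -- not the actual sgl-project code."""
    if not input.is_contiguous():
        # Fail fast instead of doing a silent memcpy; the caller should fix
        # the producer of the non-contiguous tensor (or call .contiguous()
        # explicitly at the call site).
        raise ValueError("rmsnorm expects a contiguous input tensor")
    ...  # dispatch to the (hypothetical) C++ kernel here


# Minimal duck-typed stand-in for a tensor, so the check can be exercised:
class FakeTensor:
    def __init__(self, contiguous):
        self._contiguous = contiguous

    def is_contiguous(self):
        return self._contiguous


try:
    rmsnorm(FakeTensor(contiguous=False), weight=None)
except ValueError as e:
    print("rejected:", e)
```

The design choice is to surface the layout problem at the boundary, so the memcpy (if any) is written explicitly by the caller and shows up in code review rather than only in a profile.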

Author

@yangw1234 yangw1234 Mar 10, 2026

We can wait for the C++ kernel fix; SGLANGT-500 is tracking it. I'll leave this PR open as is so that people are aware of this problem.

@mingfeima
Collaborator

Avoid using input = input.contiguous() in both Python and C++.

@mingfeima
Collaborator

> Generally I would suggest refactoring the kernel to support non-contiguous input rather than doing the reordering implicitly, which might make performance analysis complex. If that is not possible, probably throw a warning and then do the reorder.

> Yes, I believe we will eventually support it in the kernel, but I don't think this is the bottleneck right now. I don't think adding an op will complicate the analysis, since an op is clearly visible in the profiling trace. Throwing a warning is probably too noisy.

That's not how it works. We should nip problems in the bud and avoid letting them accumulate until they are too entrenched to fix.
People writing models can work this way, but people doing performance work must not think this way.

@yangw1234
Author

yangw1234 commented Mar 10, 2026

> Generally I would suggest refactoring the kernel to support non-contiguous input rather than doing the reordering implicitly, which might make performance analysis complex. If that is not possible, probably throw a warning and then do the reorder.

> Yes, I believe we will eventually support it in the kernel, but I don't think this is the bottleneck right now. I don't think adding an op will complicate the analysis, since an op is clearly visible in the profiling trace. Throwing a warning is probably too noisy.

> That's not how it works. We should nip problems in the bud and avoid letting them accumulate until they are too entrenched to fix. People writing models can work this way, but people doing performance work must not think this way.

That's a very good point, @mingfeima; I totally agree from the performance engineering perspective. But from a user experience point of view, a model producing garbage output is such a bad experience that we instantly lose the user's or developer's trust when they hit the problem, while a few percent of performance drop doesn't seem to matter that much. In addition, looking at the whole development process, this workaround not only unblocks people working on model/feature/CI enabling, but also lets the folks working on low-level kernels and performance prioritize the kernels that are truly the bottleneck. That's why I prefer moving fast and working around such issues as soon as possible.

Let me know your thoughts, @mingfeima and @airMeng.
