Fix noncontiguous input for rmsnorm (#117)
Conversation
@airMeng could you take a look?
```python
        output: torch.Tensor
            Normalized tensor, shape (batch_size, hidden_size).
        """
        input = input.contiguous()
```
how about adding a check whether the input is contiguous?
Isn't .contiguous() a no-op when the tensor is already contiguous?
yes, like

```python
if not input.is_contiguous():
    warnings.warn("Input of RMSNorm is not contiguous. Converting to contiguous tensor.")
    input = input.contiguous()
```
Won't this print too many lines in the sglang server log?
Generally I would suggest refactoring the kernel to support non-contiguous input rather than doing the reordering implicitly, which might make performance analysis complex. If that's not possible, probably throw a warning and then do the reorder.
Yes, I believe we will eventually support it in the kernel. But I don't think this is the bottleneck right now. I don't think adding an op will complicate the analysis, since an op is pretty obvious in the profiling trace. Throwing a warning is probably too noisy.
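One way to reconcile the two concerns above (visibility vs. log noise) would be to warn only once per process. A hypothetical sketch, not code from this PR; NumPy is used here as a stand-in for torch, since the contiguity semantics are analogous (`ndarray.flags['C_CONTIGUOUS']` / `np.ascontiguousarray` vs. `Tensor.is_contiguous()` / `Tensor.contiguous()`), and `ensure_contiguous` is an invented helper name:

```python
import warnings

import numpy as np

_warned_noncontiguous = False

def ensure_contiguous(x: np.ndarray) -> np.ndarray:
    """Return a C-contiguous version of x, warning only on the first copy."""
    global _warned_noncontiguous
    if not x.flags['C_CONTIGUOUS']:
        if not _warned_noncontiguous:
            warnings.warn(
                "Input of RMSNorm is not contiguous. Converting to contiguous tensor."
            )
            _warned_noncontiguous = True
        x = np.ascontiguousarray(x)  # still copies, but logs only once
    return x
```

Repeated calls with non-contiguous inputs then produce exactly one warning, so the problem stays visible without flooding the server log.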
```python
        output: torch.Tensor
            Normalized tensor, shape (batch_size, hidden_size).
        """
        input = input.contiguous()
```
we should not fix this issue by "converting input to contiguous" tensor. this would result in a silent memcpy when a non-contiguous input is met.

how to fix it properly?
- fix it in C++ kernels
- add non-contiguous support

i prefer to do contiguous checks in the C++ kernels:

```cpp
TORCH_CHECK(input.is_contiguous(), "xxx");
```

if we have a non-contiguous input, just fix it at the call site. That would be better than a silent memcpy.
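The fail-loudly alternative proposed above can be sketched in Python as well (again with NumPy standing in for torch; `rmsnorm_checked`, the weight handling, and the error message are all illustrative, not the actual sgl-kernel API):

```python
import numpy as np

def rmsnorm_checked(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm that refuses non-contiguous input, mirroring TORCH_CHECK."""
    if not x.flags['C_CONTIGUOUS']:
        # Fail loudly instead of performing a silent memcpy.
        raise ValueError("rmsnorm: input must be contiguous; call .contiguous() first")
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

With this shape of API, a non-contiguous caller gets an immediate, attributable error instead of a hidden copy that only shows up later in a profile.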
We can wait for the C++ kernel fix. SGLANGT-500 is tracking this. I'll leave this PR as is to make sure people know about this problem.
avoid using
That's not the right way to look at it; we should nip problems in the bud rather than let them accumulate until they're too entrenched to fix.
That's a very good point, @mingfeima, I totally agree with this from the performance engineering perspective. But from a user experience point of view, a model producing garbage output is such a bad experience that we will lose the user's/developer's trust instantly when they hit the problem, while a few percent of performance drop doesn't seem to matter that much. In addition, looking at the whole development process, this can not only unblock people working on model/feature/CI enabling but also let the folks working on low-level kernels/performance prioritize the kernels that are truly the bottleneck. That's why I prefer moving fast and working around such things as soon as possible. Let me know your thoughts @mingfeima and @airMeng
The issue
DeepSeek-VL2 small and DeepSeek Coder Lite produce garbage output.
The root cause is that rmsnorm cannot support non-contiguous input.
This PR provides a workaround: convert the input to a contiguous tensor before calling the kernel.
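For context, a minimal illustration of how a non-contiguous tensor can arise in practice and what the workaround does (NumPy used as a stand-in for torch; the shapes are made up):

```python
import numpy as np

# A strided view, such as a slice or transpose of hidden states, is not C-contiguous.
hidden = np.arange(32, dtype=np.float32).reshape(4, 8)
view = hidden[:, ::2]                    # stride over the hidden dimension
assert not view.flags['C_CONTIGUOUS']

# The workaround in this PR: materialize a contiguous copy before the kernel call.
fixed = np.ascontiguousarray(view)
assert fixed.flags['C_CONTIGUOUS']
assert np.array_equal(fixed, view)       # same values, new memory layout
```

The copy restores the dense row-major layout the kernel assumes, at the cost of the extra memcpy discussed in the review thread.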