-
Notifications
You must be signed in to change notification settings - Fork 3.5k
Address MLAS NchwcBlockSize for AMD64 platforms behavior in the minimal builds #26306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes an MLAS issue affecting minimal build behavior where NchwcBlockSize
values differed between minimal and full builds on the same platform, causing inference failures. The fix moves platform-specific optimizations outside the minimal build guard and adds comprehensive test coverage for the NCHWC transformer.
- Corrected MLAS platform detection logic to ensure consistent
NchwcBlockSize
values across build types - Added comprehensive test coverage for NCHWC transformer functionality across different operator scenarios
- Enabled model dumping for debugging purposes in test infrastructure
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
onnxruntime/core/mlas/lib/platform.cpp |
Fixed platform-specific optimization logic to work consistently in both minimal and full builds |
onnxruntime/test/optimizer/graph_transform_test.cc |
Added comprehensive NCHWC transformer tests for Conv, Conv+Relu, Conv+unsupported activation, and MaxPool scenarios |
onnxruntime/test/unittest_util/graph_transform_test_builder.cc |
Enabled model dumping for debugging and added NCHWC domain registration for proper test execution |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
this->QNBitGemmDispatch = &MlasSQNBitGemmDispatchAvx2vnni; | ||
} | ||
|
||
if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not an expert on minimal builds and their use-case but something doesn't add up for me or maybe I am not understanding things right.
I would have thought the NCHWcBlockSize goes hand-in-hand with the kernel that actually uses it which is the NCHWcConvKernel. Why would there be a (minimal) build flavor that updates the NCHWcBlockSize when the kernel that supposedly uses it is not part of the build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC AVX512 kernels are excluded in a minimal build due to the binary size cost, so unless that has changed the ifdef here was to match that the kernels weren't available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly. I think AVX512 kernels seem excluded from minimal builds - that was my understanding as well. So, if they are excluded, metadata of the kernels like the NCHWcBlockSize (which basically support the excluded NCHWc AVX512 Conv kernel) should also be excluded from minimal builds right? (i.e.) isn't the current setup correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, re-read the PR summary. Revisiting ReorderInput
implementation may also be helpful. NCHWcBlockSize
is always present in the build with the default value and it can be used at any time. However, the default value is different in minimal build for the same platform from that of the full build for AMD64. This results in optimizations for NCHWc done with one value and runtime mismatch with another value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this comment was before or after our call, but I understand that NCHWcBlockSize
is always present. I just felt dicey about potentially matching the NCHWc block size that was clearly set for the AVX512 NCHWc Conv kernel with (let's pick an example) the SSE2 NCHWc Conv kernel. Its correctness and perf implications are not very clear.
Could the issue be something else - (i.e.) the ORT model is produced on an AMD64 platform that doesn't support AVX512F and is tried to run on an AMD64 platform that supports AVX512F or vice-versa ? |
We can chat offline |
this->QNBitGemmDispatch = &MlasSQNBitGemmDispatchAvx2vnni; | ||
} | ||
|
||
if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this appears to be copied from here:
onnxruntime/onnxruntime/core/mlas/lib/platform.cpp
Lines 447 to 452 in 878863d
// | |
// Check if the processor supports AVX512F features and the | |
// operating system supports saving AVX512F state. | |
// | |
if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) { |
the expression is kind of cryptic. can we instead store the result in a constant with a meaningful name and reuse that?
Fix MLAS issue that affects minimal build behavior
This PR addresses #26174
When ONNX Runtime optimizes a model that involves an NCHWC layout transformation, it inserts
ReorderInput
andReorderOutput
nodes. The resulting input shapes must be compatible with the W input of the subsequent Conv node. TheReorderInput
node calculates its output shape based on theNchwcBlockSize
for the given platform, which is expected because minimal-build models are platform-dependent.However, MLAS has incorrectly placed the
#ifdef ORT_MINIMAL_BUILD
directive, resulting in a different value for theNchwcBlockSize
constant between minimal and full builds—even though both are intended to run on the same platform.This discrepancy causes inference to fail.
NCHWC Transformer Test Additions
graph_transform_test.cc
that verifyNchwcTransformer
correctly transforms ONNX graphs containingConv
,Conv+Relu
,Conv+unsupported activation
, andMaxPool
nodes into their NCHWC equivalents, including checks for correct operator replacement and attribute handling.nchwc_transformer.h
header in test builds when contrib ops are enabled, ensuring the transformer is available for testing.Platform and Alignment Updates
platform.cpp
to setNchwcBlockSize
andPreferredBufferAlignment
only when specific CPU features are detected, improving accuracy of platform-specific optimizations. [1] [2]Test Infrastructure Improvements
SAVE_TEST_GRAPH
to 1 ingraph_transform_test_builder.cc
.### Description