Skip to content

Conversation

yuslepukhin
Copy link
Member

Fix MLAS issue that affects minimal build behavior

This PR addresses #26174

When ONNX Runtime optimizes a model that involves an NCHWC layout transformation, it inserts ReorderInput and ReorderOutput nodes. The resulting input shapes must be compatible with the W input of the subsequent Conv node. The ReorderInput node calculates its output shape based on the NchwcBlockSize for the given platform, which is expected because minimal-build models are platform-dependent.

However, MLAS has incorrectly placed the #ifdef ORT_MINIMAL_BUILD directive, resulting in a different value for the NchwcBlockSize constant between minimal and full builds—even though both are intended to run on the same platform.
This discrepancy causes inference to fail.

NCHWC Transformer Test Additions

  • Added multiple new tests to graph_transform_test.cc that verify NchwcTransformer correctly transforms ONNX graphs containing Conv, Conv+Relu, Conv+unsupported activation, and MaxPool nodes into their NCHWC equivalents, including checks for correct operator replacement and attribute handling.
  • Included the nchwc_transformer.h header in test builds when contrib ops are enabled, ensuring the transformer is available for testing.

Platform and Alignment Updates

  • Refined logic in platform.cpp to set NchwcBlockSize and PreferredBufferAlignment only when specific CPU features are detected, improving accuracy of platform-specific optimizations. [1] [2]

Test Infrastructure Improvements

  • Updated graph transformer test builder to register the NCHWC domain, enabling proper opset versioning for NCHWC operators in test models.
  • Enabled model dumping for debugging by setting SAVE_TEST_GRAPH to 1 in graph_transform_test_builder.cc.### Description

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an MLAS issue affecting minimal build behavior where NchwcBlockSize values differed between minimal and full builds on the same platform, causing inference failures. The fix moves platform-specific optimizations outside the minimal build guard and adds comprehensive test coverage for the NCHWC transformer.

  • Corrected MLAS platform detection logic to ensure consistent NchwcBlockSize values across build types
  • Added comprehensive test coverage for NCHWC transformer functionality across different operator scenarios
  • Enabled model dumping for debugging purposes in test infrastructure

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
onnxruntime/core/mlas/lib/platform.cpp Fixed platform-specific optimization logic to work consistently in both minimal and full builds
onnxruntime/test/optimizer/graph_transform_test.cc Added comprehensive NCHWC transformer tests for Conv, Conv+Relu, Conv+unsupported activation, and MaxPool scenarios
onnxruntime/test/unittest_util/graph_transform_test_builder.cc Enabled model dumping for debugging and added NCHWC domain registration for proper test execution

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

this->QNBitGemmDispatch = &MlasSQNBitGemmDispatchAvx2vnni;
}

if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert on minimal builds and their use-case but something doesn't add up for me or maybe I am not understanding things right.
I would have thought the NCHWcBlockSize goes hand-in-hand with the kernel that actually uses it which is the NCHWcConvKernel. Why would there be a (minimal) build flavor that updates the NCHWcBlockSize when the kernel that supposedly uses it is not part of the build?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC AVX512 kernels are excluded in a minimal build due to the binary size cost, so unless that has changed the ifdef here was to match that the kernels weren't available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. I think AVX512 kernels seem excluded from minimal builds - that was my understanding as well. So, if they are excluded, metadata of the kernels like the NCHWcBlockSize (which basically support the excluded NCHWc AVX512 Conv kernel) should also be excluded from minimal builds right? (i.e.) isn't the current setup correct?

Copy link
Member Author

@yuslepukhin yuslepukhin Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, re-read the PR summary. Revisiting ReorderInput implementation may also be helpful. NCHWcBlockSize is always present in the build with the default value and it can be used at any time. However, the default value is different in minimal build for the same platform from that of the full build for AMD64. This results in optimizations for NCHWc done with one value and runtime mismatch with another value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this comment was before or after our call, but I understand that NCHWcBlockSize is always present. I just felt dicey about potentially matching the NCHWc block size that was clearly set for the AVX512 NCHWc Conv kernel with (let's pick an example) the SSE2 NCHWc Conv kernel. Its correctness and perf implications are not very clear.

@hariharans29
Copy link
Member

Could the issue be something else - (i.e.) the ORT model is produced on an AMD64 platform that doesn't support AVX512F and is tried to run on an AMD64 platform that supports AVX512F or vice-versa ?

@yuslepukhin
Copy link
Member Author

Could the issue be something else - (i.e.) the ORT model is produced on an AMD64 platform that doesn't support AVX512F and is tried to run on an AMD64 platform that supports AVX512F or vice-versa ?

We can chat offline

this->QNBitGemmDispatch = &MlasSQNBitGemmDispatchAvx2vnni;
}

if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this appears to be copied from here:

//
// Check if the processor supports AVX512F features and the
// operating system supports saving AVX512F state.
//
if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) {

the expression is kind of cryptic. can we instead store the result in a constant with a meaningful name and reuse that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants