Address MLAS NchwcBlockSize for AMD64 platforms behavior in the minimal builds #26306

yuslepukhin · 2025-10-14T21:38:44Z

Fix MLAS issue that affects minimal build behavior

This PR addresses #26174

When ONNX Runtime optimizes a model that involves an NCHWC layout transformation, it inserts ReorderInput and ReorderOutput nodes. The resulting input shapes must be compatible with the W input of the subsequent Conv node. The ReorderInput node calculates its output shape based on the NchwcBlockSize for the given platform, which is expected because minimal-build models are platform-dependent.

However, MLAS places the #ifdef ORT_MINIMAL_BUILD directive incorrectly, so the NchwcBlockSize constant takes different values in minimal and full builds, even though both are intended to run on the same platform.
This discrepancy causes inference to fail.
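
To make the failure mode concrete, here is a minimal sketch of how a block-size disagreement changes the channel count that a reorder step produces. It assumes the reordered channel dimension is rounded up to a multiple of the block size, and that the two builds end up with block sizes of 8 and 16 on the same machine. `PaddedChannels` and the two constants are hypothetical names for illustration, not MLAS or ONNX Runtime APIs.

```cpp
#include <cstdint>

// Hypothetical helper (not an ONNX Runtime API): rounds a channel count up to
// the next multiple of the NCHWc block size. Assumes block_size is a power of two.
constexpr int64_t PaddedChannels(int64_t channels, int64_t block_size) {
    return (channels + block_size - 1) & ~(block_size - 1);
}

// Assumed values for illustration: 8 for a build that keeps the default,
// 16 for a build that detects AVX512F.
constexpr int64_t kBlockSizeDefault = 8;
constexpr int64_t kBlockSizeAvx512 = 16;

// A 24-channel tensor keeps 24 channels with block size 8 but is padded to 32
// with block size 16, so shapes computed under one value do not match the other.
static_assert(PaddedChannels(24, kBlockSizeDefault) == 24, "24 is already a multiple of 8");
static_assert(PaddedChannels(24, kBlockSizeAvx512) == 32, "24 rounds up to 32");
```

Under these assumptions, a model optimized with one block size and executed with the other would see a reordered tensor whose channel dimension disagrees with the weights prepared for the following Conv node.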

NCHWC Transformer Test Additions

Added multiple new tests to graph_transform_test.cc that verify NchwcTransformer correctly transforms ONNX graphs containing Conv, Conv+Relu, Conv+unsupported activation, and MaxPool nodes into their NCHWC equivalents, including checks for correct operator replacement and attribute handling.
Included the nchwc_transformer.h header in test builds when contrib ops are enabled, ensuring the transformer is available for testing.

Platform and Alignment Updates

Refined logic in platform.cpp to set NchwcBlockSize and PreferredBufferAlignment only when specific CPU features are detected, improving accuracy of platform-specific optimizations.
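
The intent of this change can be sketched as follows. This is not the code in platform.cpp: the build flavor is modeled as a plain `bool` instead of a preprocessor guard so the two behaviors can be compared side by side, and the function names and the values 8 and 16 are assumptions made for illustration.

```cpp
#include <cstddef>

// Hypothetical stand-ins; the values are assumptions for illustration only.
constexpr std::size_t kDefaultBlockSize = 8;
constexpr std::size_t kAvx512BlockSize = 16;

// Before (sketch): the AVX512F branch sat behind the minimal-build guard, so
// the result depended on the build flavor as well as on the CPU.
constexpr std::size_t BlockSizeBefore(bool minimal_build, bool has_avx512f) {
    return (!minimal_build && has_avx512f) ? kAvx512BlockSize : kDefaultBlockSize;
}

// After (sketch): only the detected CPU feature decides the value, so minimal
// and full builds agree on the same machine.
constexpr std::size_t BlockSizeAfter(bool /*minimal_build*/, bool has_avx512f) {
    return has_avx512f ? kAvx512BlockSize : kDefaultBlockSize;
}

// On an AVX512F machine the old logic disagrees between build flavors...
static_assert(BlockSizeBefore(true, true) != BlockSizeBefore(false, true), "old mismatch");
// ...while the new logic returns the same value for both.
static_assert(BlockSizeAfter(true, true) == BlockSizeAfter(false, true), "new agreement");
```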

Test Infrastructure Improvements

Updated graph transformer test builder to register the NCHWC domain, enabling proper opset versioning for NCHWC operators in test models.
Enabled model dumping for debugging by setting SAVE_TEST_GRAPH to 1 in graph_transform_test_builder.cc.

Copilot

Pull Request Overview

This PR fixes an MLAS issue affecting minimal build behavior where NchwcBlockSize values differed between minimal and full builds on the same platform, causing inference failures. The fix moves platform-specific optimizations outside the minimal build guard and adds comprehensive test coverage for the NCHWC transformer.

Corrected MLAS platform detection logic to ensure consistent NchwcBlockSize values across build types
Added comprehensive test coverage for NCHWC transformer functionality across different operator scenarios
Enabled model dumping for debugging purposes in test infrastructure

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`onnxruntime/core/mlas/lib/platform.cpp`	Fixed platform-specific optimization logic to work consistently in both minimal and full builds
`onnxruntime/test/optimizer/graph_transform_test.cc`	Added comprehensive NCHWC transformer tests for Conv, Conv+Relu, Conv+unsupported activation, and MaxPool scenarios
`onnxruntime/test/unittest_util/graph_transform_test_builder.cc`	Enabled model dumping for debugging and added NCHWC domain registration for proper test execution

onnxruntime/test/unittest_util/graph_transform_test_builder.cc

hariharans29 · 2025-10-14T22:38:56Z

onnxruntime/core/mlas/lib/platform.cpp

                    this->QNBitGemmDispatch = &MlasSQNBitGemmDispatchAvx2vnni;
                }

+                if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) {


I am not an expert on minimal builds and their use-case but something doesn't add up for me or maybe I am not understanding things right.
I would have thought the NCHWcBlockSize goes hand-in-hand with the kernel that actually uses it which is the NCHWcConvKernel. Why would there be a (minimal) build flavor that updates the NCHWcBlockSize when the kernel that supposedly uses it is not part of the build?

skottmckay · 2025-10-15

IIRC AVX512 kernels are excluded in a minimal build due to the binary size cost, so unless that has changed, the ifdef here was to match that the kernels weren't available.

hariharans29 · 2025-10-15

Exactly. I think AVX512 kernels are excluded from minimal builds; that was my understanding as well. So, if they are excluded, metadata of the kernels like the NCHWcBlockSize (which basically supports the excluded NCHWc AVX512 Conv kernel) should also be excluded from minimal builds, right? I.e., isn't the current setup correct?

yuslepukhin · 2025-10-15 (edited)

Please re-read the PR summary. Revisiting the ReorderInput implementation may also be helpful. NCHWcBlockSize is always present in the build with the default value and it can be used at any time. However, on AMD64 the default value in a minimal build differs from that of the full build for the same platform. As a result, NCHWc optimizations are done with one value and the runtime uses another, which produces a mismatch.

hariharans29 · 2025-10-16

Not sure if this comment was before or after our call, but I understand that NCHWcBlockSize is always present. I just felt dicey about potentially matching the NCHWc block size that was clearly set for the AVX512 NCHWc Conv kernel with (let's pick an example) the SSE2 NCHWc Conv kernel. Its correctness and perf implications are not very clear.

hariharans29 · 2025-10-14T23:22:05Z

Could the issue be something else, i.e. the ORT model is produced on an AMD64 platform that doesn't support AVX512F and is then run on an AMD64 platform that supports AVX512F, or vice versa?

yuslepukhin · 2025-10-15T17:25:03Z

> Could the issue be something else, i.e. the ORT model is produced on an AMD64 platform that doesn't support AVX512F and is then run on an AMD64 platform that supports AVX512F, or vice versa?

We can chat offline

edgchen1 · 2025-10-17T15:02:41Z

onnxruntime/core/mlas/lib/platform.cpp

                    this->QNBitGemmDispatch = &MlasSQNBitGemmDispatchAvx2vnni;
                }

+                if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) {


this appears to be copied from here:

onnxruntime/onnxruntime/core/mlas/lib/platform.cpp

Lines 447 to 452 in 878863d

//
// Check if the processor supports AVX512F features and the
// operating system supports saving AVX512F state.
//
if (((Cpuid7[1] & 0x10000) != 0) && ((xcr0 & 0xE0) == 0xE0)) {

The expression is kind of cryptic. Can we instead store the result in a constant with a meaningful name and reuse that?
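
One possible shape for that suggestion, as a sketch only: it assumes `Cpuid7[1]` holds the EBX register from CPUID leaf 7 (sub-leaf 0), where bit 16 reports AVX512F, and that `xcr0` holds the XCR0 value, where bits 5 to 7 cover the opmask, ZMM_Hi256 and Hi16_ZMM state components. The constant and function names are hypothetical and do not appear in the PR.

```cpp
#include <cstdint>

// CPUID leaf 7, sub-leaf 0, EBX bit 16: the processor supports AVX512F.
constexpr uint32_t kCpuid7EbxAvx512F = 1u << 16;  // 0x10000

// XCR0 bits 5..7: the OS saves opmask, ZMM_Hi256 and Hi16_ZMM state.
constexpr uint64_t kXcr0Avx512State = 0xE0;

// Hypothetical helper: true only when both the CPU feature bit and all three
// OS state bits are set, mirroring the expression quoted in the comment above.
constexpr bool HasAvx512F(uint32_t cpuid7_ebx, uint64_t xcr0) {
    return (cpuid7_ebx & kCpuid7EbxAvx512F) != 0 &&
           (xcr0 & kXcr0Avx512State) == kXcr0Avx512State;
}
```

The call sites could then compute something like `const bool has_avx512f = HasAvx512F(Cpuid7[1], xcr0);` once and reuse the named result in both places.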

yuslepukhin added 3 commits October 14, 2025 14:18

Implement mlas fix and start adding layout transformer tests

2ac99c1

Add NchwcTransformerConvRelu

6113c37

Add more tests

f689ef7

yuslepukhin requested review from Copilot, rui-ren, skottmckay and xadupre October 14, 2025 21:38

Copilot AI reviewed Oct 14, 2025

onnxruntime/test/unittest_util/graph_transform_test_builder.cc (outdated, resolved)

Remove debug define

7723e16

yuslepukhin requested review from edgchen1 and hariharans29 October 14, 2025 21:42

yuslepukhin added the release:1.23.2 label Oct 14, 2025

Remove unused node

eaebb08

hariharans29 reviewed Oct 14, 2025

yuslepukhin added 2 commits October 15, 2025 15:36

Exclude tests for training builds

bdbc1be

Remove tests

5e2466f

edgchen1 reviewed Oct 17, 2025

yuslepukhin removed the release:1.23.2 label Oct 17, 2025
