Skip to content

[Snippets][CPU] Support static and dynamic offsets in JIT Gemm and GemmCopyB emitters #31375

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

aobolensk
Copy link
Contributor

Details:

  • Add offset handling (m_memory_offsets, m_buffer_ids) to AArch64 Gemm/GemmCopyB emitters
  • Introduce new AArch64 utils helpers for safe GPR selection and stack‑based pointer patching

Tickets:

  • 169427

@aobolensk aobolensk requested review from a team as code owners July 17, 2025 16:56
@aobolensk aobolensk added the platform: arm OpenVINO on ARM / ARM64 label Jul 17, 2025
@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Jul 17, 2025
@aobolensk aobolensk force-pushed the snippets-arm-matmul-offsets branch from cc4e8fb to c1ddb94 Compare July 17, 2025 17:01
@a-sidorova
Copy link
Contributor

@chenhu-wang May I ask you to review this PR please since you're an author of these JIT emitters?

@aobolensk aobolensk force-pushed the snippets-arm-matmul-offsets branch from 744ad13 to 36c75d7 Compare July 21, 2025 06:16
Copy link
Contributor

@chenhu-wang chenhu-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@aobolensk
Copy link
Contributor Author

@chenhu-wang had to revert gpr related change, because it crashes on some test suites.

I have investigated the root cause. The stack should be 16 bytes aligned according to ARM spec:

This reverts commit 7dc7b92.
@chenhu-wang
Copy link
Contributor

@chenhu-wang had to revert gpr related change, because it crashes on some test suites.

I have investigated the root cause. The stack should be 16 bytes aligned according to ARM spec:

Thank you for sharing.


namespace ov::intel_cpu::aarch64::utils {

size_t get_buffer_cluster_id(const ov::snippets::lowered::ExpressionPort& port) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this helper for aarch64 and helper for x64 are the same, what's about to move them into one common file in emitters/snippets/utils/util.* for example?
These helpers aren't depended on arch so can be reused on all platforms

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +510 to +511
// TODO: support for long offsets is required for MatMulBias tests with bigger shapes
retVector.emplace_back(R"(smoke_Snippets_MatMulBias.*1023.*)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it functional problem or limitation? Can we handle it in this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed offline, this problem should be resolved in this PR. Let's do our best to find root cause and fix it 😊

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, no

[ RUN      ] smoke_Snippets_MatMulBias/MatMulBias.CompareWithRefImpl/IS[0]=[]_([1.2.95.1023])_IS[1]=[]_([1.2.1023.255])_IS[2]=[]_([1.2.95.255])_T[0]=f32_T[1]=f32_MatMul_#N=1_#S=1_targetDevice=CPU

MEM_USAGE=5129428KB
bad err=15 in Xbyak::Error
src/tests/functional/base_func_tests/src/base/ov_subgraph.cpp:97: Failure
Exception from src/inference/src/cpp/core.cpp:112:
Exception from src/inference/src/dev/plugin.cpp:53:
illegal immediate parameter (range error)


unknown file: Failure
C++ exception with description "Exception from src/inference/src/cpp/compiled_model.cpp:56:
CompiledModel was not initialized.
" thrown in the test body.

}

// For non-zero offsets, apply the offset and then store
Xbyak_aarch64::XReg temp_reg(4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's is the magic number 4? What's happened if ptr_reg is XReg(4) too? We can corrupt initial value in XReg(4).

To solve this problem, can we pass temp_reg as the argument of the helper as it's done in push_ptr_with_runtime_offset_on_stack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


// For large offsets, use a register to hold the offset value
if (ptr_offset > 4095) { // 12-bit immediate limit for add instruction
Xbyak_aarch64::XReg offset_reg(6);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment: we should avoid initialization of register by explicit numbers in any helpers since they might be called in any code parts. This is unsafe way if somewhere XReg(6) is already live register during push_ptr_with_static_offset_on_stack call

Please avoid any initialization of the register by explicit number in helpers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or can we just use TMP_X_0 for example here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

size_t runtime_offset);

/**
* @brief Push data pointer on stack adding static offset `ptr_offset`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the comment that these helpers don't allocate stack pointer - the user should guarantee allocated space on stack

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment should be in the both helpers: push_ptr_with_runtime_offset_on_stack and push_ptr_with_static_offset_on_stack :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

@aobolensk aobolensk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied comments

}

// For non-zero offsets, apply the offset and then store
Xbyak_aarch64::XReg temp_reg(4);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


// For large offsets, use a register to hold the offset value
if (ptr_offset > 4095) { // 12-bit immediate limit for add instruction
Xbyak_aarch64::XReg offset_reg(6);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@aobolensk aobolensk force-pushed the snippets-arm-matmul-offsets branch 3 times, most recently from 1637662 to 9cef4e0 Compare July 22, 2025 14:22
default:
OV_CPU_JIT_EMITTER_THROW("Uknown type of expression port!");
}
OV_CPU_JIT_EMITTER_ASSERT(IMPLICATION(ov::snippets::utils::is_dynamic_value(offset), id != SIZE_MAX),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor:
we can reuse existing implication helper here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

size_t runtime_offset);

/**
* @brief Push data pointer on stack adding static offset `ptr_offset`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment should be in the both helpers: push_ptr_with_runtime_offset_on_stack and push_ptr_with_static_offset_on_stack :)


// Load the runtime offset from abi_param1 (X0) and add it to the pointer
Xbyak_aarch64::XReg abi_param1(0);
Xbyak_aarch64::XReg offset_reg(4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see magic XReg(4) here. Please avoid any initializations of registers by const numbers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 100 to 101
h->mov(temp_reg, ptr_reg);
h->add_imm(temp_reg, temp_reg, ptr_offset, Xbyak_aarch64::XReg(h->X_TMP_1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
h->mov(temp_reg, ptr_reg);
h->add_imm(temp_reg, temp_reg, ptr_offset, Xbyak_aarch64::XReg(h->X_TMP_1));
h->add_imm(temp_reg, ptr_reg, ptr_offset, Xbyak_aarch64::XReg(h->X_TMP_1));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


// Load the offset value from the runtime parameter location
h->add_imm(temp_reg, abi_param1, runtime_offset, Xbyak_aarch64::XReg(h->X_TMP_1));
h->ldr(offset_reg, Xbyak_aarch64::ptr(temp_reg));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to replace offset_reg with temp_reg here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +510 to +511
// TODO: support for long offsets is required for MatMulBias tests with bigger shapes
retVector.emplace_back(R"(smoke_Snippets_MatMulBias.*1023.*)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed offline, this problem should be resolved in this PR. Let's do our best to find root cause and fix it 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: CPU OpenVINO CPU plugin platform: arm OpenVINO on ARM / ARM64
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants