2026.01.13 cat bf16 kernels #538

Open
dbarry9 wants to merge 2 commits into icl-utk-edu:master from dbarry9:2026.01.13_cat-bf16-kernels

@dbarry9 (Contributor) commented Jan 20, 2026

Pull Request Description

This pull request:

  • revises FLOPs kernels to account for speculative FLOPs and
  • introduces kernels that use dot product intrinsics to trigger BF16 operations.

These changes have been tested on the ARM Neoverse V2 and Intel Sapphire Rapids architectures.
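
The kernels themselves are not shown in this conversation. As a rough scalar model of what a BF16 dot-product instruction computes per 32-bit lane (two BF16 products added to an FP32 accumulator), the sketch below may help. The names `bf16_bits`, `float_to_bf16_trunc`, `bf16_to_float`, and `bf16_dot2_acc` are hypothetical and do not come from the PR. Conversion by truncation is an assumption made for simplicity; real hardware rounding and denormal handling can differ.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type: the raw 16 bits of a BF16 value. */
typedef uint16_t bf16_bits;

/* Convert FP32 to BF16 by keeping the upper 16 bits (truncation).
 * Assumption: hardware conversions may round instead of truncating. */
static bf16_bits float_to_bf16_trunc(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    return (bf16_bits)(u >> 16);
}

/* Widen BF16 back to FP32 by placing the bits in the upper half. */
static float bf16_to_float(bf16_bits h)
{
    uint32_t u = (uint32_t)h << 16;
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Scalar model of one lane: acc + a[0]*b[0] + a[1]*b[1],
 * with the products and the sum carried out in FP32. */
static float bf16_dot2_acc(float acc, const bf16_bits a[2], const bf16_bits b[2])
{
    return acc
         + bf16_to_float(a[0]) * bf16_to_float(b[0])
         + bf16_to_float(a[1]) * bf16_to_float(b[1]);
}
```

In this model each lane performs two multiplies and two adds, which is the kind of per-instruction operation count a FLOPs kernel would need to reason about.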

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
  • Commits
    Commits are self contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

@Treece-Burgess (Contributor)

I am reviewing this PR.

#define SET_VEC_SS(_I_) _I_ ;
#define ADD_VEC_SS(_I_,_J_) _I_ + _J_ ;
#define MUL_VEC_SS(_I_,_J_) _I_ * _J_ ;
#include <arm_fp16.h>

Q: Should this be underneath a #if defined?
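
One way such a guard could look, as a sketch only: the include is wrapped in a check of `__ARM_FEATURE_FP16_SCALAR_ARITHMETIC`, a feature-test macro from the Arm C Language Extensions. Whether the PR should key on that macro, on an architecture macro, or on a build-system define is an open choice. `CAT_HAVE_ARM_FP16` and `fp16_header_status` are hypothetical names used for illustration.

```c
/* Only pull in the Arm FP16 header when the compiler reports
 * scalar FP16 arithmetic support; other targets skip it. */
#if defined(__ARM_FEATURE_FP16_SCALAR_ARITHMETIC)
#include <arm_fp16.h>
#define CAT_HAVE_ARM_FP16 1   /* hypothetical flag, not from the PR */
#else
#define CAT_HAVE_ARM_FP16 0
#endif

/* Report which branch of the guard was taken at compile time. */
static const char *fp16_header_status(void)
{
    return CAT_HAVE_ARM_FP16 ? "arm_fp16.h included" : "arm_fp16.h skipped";
}
```

With a guard like this, the same source file can be compiled on targets where the header does not exist.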


typedef __m128 SP_SCALAR_TYPE;
typedef __m128d DP_SCALAR_TYPE;
#if defined(AVX512_BF16_AVAIL)

Q: It appears that for a given architecture (e.g., x86) you have multiple #if defined's for AVX512_BF16_AVAIL and AVX512_FP16_AVAIL. Why take this approach rather than placing everything under a single AVX512_BF16_AVAIL or AVX512_FP16_AVAIL block?
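
To make the trade-off concrete, here is a sketch of independent guards. It assumes, without confirmation from the author, that `AVX512_BF16_AVAIL` and `AVX512_FP16_AVAIL` are set separately by the build and that a target may have one extension without the other. `KERNEL_BF16`, `KERNEL_FP16`, and `compiled_kernels` are hypothetical names.

```c
/* Bit flags naming the optional kernel families (illustrative only). */
enum { KERNEL_BF16 = 1, KERNEL_FP16 = 2 };

/* Each feature macro gates its own code independently, so a build
 * with only one of the two defined still compiles that one family. */
static int compiled_kernels(void)
{
    int mask = 0;
#if defined(AVX512_BF16_AVAIL)
    mask |= KERNEL_BF16;
#endif
#if defined(AVX512_FP16_AVAIL)
    mask |= KERNEL_FP16;
#endif
    return mask;
}
```

Grouping everything under one macro would be shorter, but it would tie the BF16 and FP16 code paths to the same condition.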

if ( PAPI_start( EventSet ) != PAPI_OK ) {
return -1;
}
if ( NULL != fp && PAPI_start( EventSet ) != PAPI_OK ) {

Same as above.

rB = MUL_VEC_SS(rB,rC);

/* Stop PAPI counters */
if ( NULL != fp && PAPI_stop(EventSet, iterValues) != PAPI_OK ) {

Same as above. This is the last instance I will flag so that I don't clog up the review, but a few more instances that I will not comment on will also need to change if you decide to return the error code from PAPI_stop.
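
A minimal sketch of what propagating the underlying error code could look like. So that the example compiles without papi.h, `stub_stop` and `STUB_OK` are hypothetical stand-ins for `PAPI_stop` and `PAPI_OK`, and `stop_and_report` is an illustrative name rather than a function from the PR.

```c
#include <stdio.h>

#define STUB_OK      0     /* stands in for PAPI_OK */
#define STUB_EINVAL  (-1)  /* stand-in error code for this sketch */

/* Stand-in for PAPI_stop(EventSet, values): fails on bad arguments,
 * otherwise writes a dummy counter value. */
static int stub_stop(int event_set, long long *values)
{
    if (event_set < 0 || values == NULL)
        return STUB_EINVAL;
    values[0] = 42;
    return STUB_OK;
}

/* Stop the counters only when an output file is present, and hand the
 * library's own error code back to the caller instead of a fixed -1. */
static int stop_and_report(FILE *fp, int event_set, long long *values)
{
    int retval = STUB_OK;
    if (fp != NULL) {
        retval = stub_stop(event_set, values);
        if (retval != STUB_OK)
            return retval;               /* propagate the real code */
        fprintf(fp, "%lld\n", values[0]);
    }
    return retval;
}
```

Storing the return value in a local variable keeps the `NULL != fp` short-circuit from the diff while still letting the caller see which error occurred.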

INCFLAGS=-I$(PAPI_DIR)/include
CFLAGS+=-g -Wall -Wextra
OPT0=-O0
OPT1=-O1

A behavior I noticed in this PR, which also appears to occur on master, is the following:

  1. On x86, build CAT for the first time. This creates cat_collect.
  2. Move to another architecture, say aarch64, go into the counter_analysis_toolkit directory, run make clean, and then run make. Everything builds successfully.
  3. When I attempt to run ./cat_collect, I run into the following:
[tburgess@hopper1 counter_analysis_toolkit]$ ./cat_collect -in event_list.txt -out OUTPUT_DIRECTORY -flops
-bash: ./cat_collect: cannot execute binary file: Exec format error

I then have to remove cat_collect and rerun make before step 3 succeeds.

@dbarry9 force-pushed the 2026.01.13_cat-bf16-kernels branch 2 times, most recently from eaaa165 to 7f665b3 on February 2, 2026 22:29
This new kernel structure accounts for speculative FLOPs.

These changes have been tested on the ARM Neoverse V2 architecture.

This introduces kernels that use dot product intrinsics to trigger
BF16 operations.

These changes have been tested on the ARM Neoverse V2 and Intel
Sapphire Rapids architectures.
@dbarry9 force-pushed the 2026.01.13_cat-bf16-kernels branch from 7f665b3 to 3b25d11 on February 2, 2026 23:04