
[WebAssembly] Implement getInterleavedMemoryOpCost #146864

Open
sparker-arm wants to merge 3 commits into main from wasm-interleave-mem-cost

Conversation

sparker-arm
Contributor

This is a first pass where we calculate the cost of the memory operation as well as the shuffles required. Interleaving by a factor of two should be relatively cheap, as many ISAs have dedicated instructions to perform the (de)interleaving. Several of these permutations can be combined for an interleave stride of 4, which is the highest stride we allow.

I've costed larger vectors, and more lanes, as more expensive because not only is more work needed but the risk of codegen going 'wrong' rises dramatically. I've also filled in a bit of cost modelling for vector stores.

It appears the main vector plan to avoid is an interleave factor of 4 with v16i8. I've used libyuv and ncnn for benchmarking, running V8 on AArch64, and observe a geomean improvement of ~3%, with some kernels improving by 40-60%.

I know there is still significant performance being left on the table, so this will need further development along with the rest of the cost model.
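
As a rough illustration of why stride 2 is cheap and stride 4 costs more (this sketch uses clang vector extensions; the helper names are only for exposition and are not part of the patch): a factor-2 deinterleave is just an even/odd lane split, and a factor-4 deinterleave composes two rounds of that split, so it needs noticeably more shuffles and more live simd128 values.

typedef int v4i32 __attribute__((vector_size(16)));

// Factor-2 deinterleave: select even and odd lanes of the concatenated input.
// Many ISAs have dedicated unzip-style instructions for exactly this pattern;
// wasm SIMD can express it with i8x16.shuffle.
static inline void deinterleave2(v4i32 Lo, v4i32 Hi, v4i32 *A, v4i32 *B) {
  // e.g. Lo = {a0, b0, a1, b1}, Hi = {a2, b2, a3, b3}
  *A = __builtin_shufflevector(Lo, Hi, 0, 2, 4, 6); // {a0, a1, a2, a3}
  *B = __builtin_shufflevector(Lo, Hi, 1, 3, 5, 7); // {b0, b1, b2, b3}
}

// Factor-4 deinterleave: two rounds of the factor-2 split.
static inline void deinterleave4(v4i32 V0, v4i32 V1, v4i32 V2, v4i32 V3,
                                 v4i32 *A, v4i32 *B, v4i32 *C, v4i32 *D) {
  v4i32 AC0, BD0, AC1, BD1;
  deinterleave2(V0, V1, &AC0, &BD0); // {a,c} and {b,d} pairs, first half
  deinterleave2(V2, V3, &AC1, &BD1); // ... second half
  deinterleave2(AC0, AC1, A, C);     // split {a,c} into a and c
  deinterleave2(BD0, BD1, B, D);     // split {b,d} into b and d
}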

@llvmbot
Member

llvmbot commented Jul 3, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-webassembly

Author: Sam Parker (sparker-arm)

Changes

This is a first pass where we calculate the cost of the memory operation as well as the shuffles required. Interleaving by a factor of two should be relatively cheap, as many ISAs have dedicated instructions to perform the (de)interleaving. Several of these permutations can be combined for an interleave stride of 4, which is the highest stride we allow.

I've costed larger vectors, and more lanes, as more expensive because not only is more work needed but the risk of codegen going 'wrong' rises dramatically. I've also filled in a bit of cost modelling for vector stores.

It appears the main vector plan to avoid is an interleave factor of 4 with v16i8. I've used libyuv and ncnn for benchmarking, running V8 on AArch64, and observe a geomean improvement of ~3%, with some kernels improving by 40-60%.

I know there is still significant performance being left on the table, so this will need further development along with the rest of the cost model.


Patch is 67.50 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/146864.diff

3 Files Affected:

  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp (+105-16)
  • (modified) llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h (+4)
  • (added) llvm/test/Transforms/LoopVectorize/WebAssembly/memory-interleave.ll (+1353)
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp
index 978e08bb89551..19cd0892127a2 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp
@@ -150,12 +150,6 @@ InstructionCost WebAssemblyTTIImpl::getMemoryOpCost(
                                   CostKind);
   }
 
-  int ISD = TLI->InstructionOpcodeToISD(Opcode);
-  if (ISD != ISD::LOAD) {
-    return BaseT::getMemoryOpCost(Opcode, Ty, Alignment, AddressSpace,
-                                  CostKind);
-  }
-
   EVT VT = TLI->getValueType(DL, Ty, true);
   // Type legalization can't handle structs
   if (VT == MVT::Other)
@@ -166,22 +160,117 @@ InstructionCost WebAssemblyTTIImpl::getMemoryOpCost(
   if (!LT.first.isValid())
     return InstructionCost::getInvalid();
 
-  // 128-bit loads are a single instruction. 32-bit and 64-bit vector loads can
-  // be lowered to load32_zero and load64_zero respectively. Assume SIMD loads
-  // are twice as expensive as scalar.
+  int ISD = TLI->InstructionOpcodeToISD(Opcode);
   unsigned width = VT.getSizeInBits();
-  switch (width) {
-  default:
-    break;
-  case 32:
-  case 64:
-  case 128:
-    return 2;
+  if (ISD == ISD::LOAD) {
+    // 128-bit loads are a single instruction. 32-bit and 64-bit vector loads
+    // can be lowered to load32_zero and load64_zero respectively. Assume SIMD
+    // loads are twice as expensive as scalar.
+    switch (width) {
+    default:
+      break;
+    case 32:
+    case 64:
+    case 128:
+      return 2;
+    }
+  } else if (ISD == ISD::STORE) {
+    // For stores, we can use store lane operations.
+    switch (width) {
+    default:
+      break;
+    case 8:
+    case 16:
+    case 32:
+    case 64:
+    case 128:
+      return 2;
+    }
   }
 
   return BaseT::getMemoryOpCost(Opcode, Ty, Alignment, AddressSpace, CostKind);
 }
 
+InstructionCost WebAssemblyTTIImpl::getInterleavedMemoryOpCost(
+    unsigned Opcode, Type *Ty, unsigned Factor, ArrayRef<unsigned> Indices,
+    Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
+    bool UseMaskForCond, bool UseMaskForGaps) const {
+  assert(Factor >= 2 && "Invalid interleave factor");
+
+  auto *VecTy = cast<VectorType>(Ty);
+  if (!ST->hasSIMD128() || !isa<FixedVectorType>(VecTy)) {
+    return InstructionCost::getInvalid();
+  }
+
+  if (UseMaskForCond || UseMaskForGaps)
+    return BaseT::getInterleavedMemoryOpCost(Opcode, Ty, Factor, Indices,
+                                             Alignment, AddressSpace, CostKind,
+                                             UseMaskForCond, UseMaskForGaps);
+
+  constexpr unsigned MaxInterleaveFactor = 4;
+  if (Factor <= MaxInterleaveFactor) {
+    unsigned MinElts = VecTy->getElementCount().getKnownMinValue();
+    // Ensure the number of vector elements is greater than 1.
+    if (MinElts < 2 || MinElts % Factor != 0)
+      return InstructionCost::getInvalid();
+
+    unsigned ElSize = DL.getTypeSizeInBits(VecTy->getElementType());
+    // Ensure the element type is legal.
+    if (ElSize != 8 && ElSize != 16 && ElSize != 32 && ElSize != 64)
+      return InstructionCost::getInvalid();
+
+    auto *SubVecTy =
+        VectorType::get(VecTy->getElementType(),
+                        VecTy->getElementCount().divideCoefficientBy(Factor));
+    InstructionCost MemCost =
+        getMemoryOpCost(Opcode, SubVecTy, Alignment, AddressSpace, CostKind);
+
+    unsigned VecSize = DL.getTypeSizeInBits(SubVecTy);
+    unsigned MaxVecSize = 128;
+    unsigned NumAccesses =
+        std::max<unsigned>(1, (MinElts * ElSize + MaxVecSize - 1) / VecSize);
+
+    // A stride of two is commonly supported via dedicated instructions, so it
+    // should be relatively cheap for all element sizes. A stride of four is
+    // more expensive as it will likely require more shuffles. Using two
+    // simd128 inputs is considered more expensive, and we don't currently
+    // account for shuffling more than two inputs (32 bytes).
+    static const CostTblEntry ShuffleCostTbl[] = {
+        // One reg.
+        {2, MVT::v2i8, 1},  // interleave 2 x 2i8 into 4i8
+        {2, MVT::v4i8, 1},  // interleave 2 x 4i8 into 8i8
+        {2, MVT::v8i8, 1},  // interleave 2 x 8i8 into 16i8
+        {2, MVT::v2i16, 1}, // interleave 2 x 2i16 into 4i16
+        {2, MVT::v4i16, 1}, // interleave 2 x 4i16 into 8i16
+        {2, MVT::v2i32, 1}, // interleave 2 x 2i32 into 4i32
+
+        // Two regs.
+        {2, MVT::v16i8, 2}, // interleave 2 x 16i8 into 32i8
+        {2, MVT::v8i16, 2}, // interleave 2 x 8i16 into 16i16
+        {2, MVT::v4i32, 2}, // interleave 2 x 4i32 into 8i32
+
+        // One reg.
+        {4, MVT::v2i8, 4},  // interleave 4 x 2i8 into 8i8
+        {4, MVT::v4i8, 4},  // interleave 4 x 4i8 into 16i8
+        {4, MVT::v2i16, 4}, // interleave 4 x 2i16 into 8i16
+
+        // Two regs.
+        {4, MVT::v8i8, 16}, // interleave 4 x 8i8 into 32i8
+        {4, MVT::v4i16, 8}, // interleave 4 x 4i16 into 16i16
+        {4, MVT::v2i32, 4}, // interleave 4 x 2i32 into 8i32
+    };
+
+    EVT ETy = TLI->getValueType(DL, SubVecTy);
+    if (const auto *Entry =
+            CostTableLookup(ShuffleCostTbl, Factor, ETy.getSimpleVT()))
+      return Entry->Cost + (NumAccesses * MemCost);
+  }
+
+  return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
+                                           Alignment, AddressSpace, CostKind,
+                                           UseMaskForCond, UseMaskForGaps);
+}
+
 InstructionCost WebAssemblyTTIImpl::getVectorInstrCost(
     unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
     const Value *Op0, const Value *Op1) const {
diff --git a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h
index 6b6d060076a80..e9adaea910847 100644
--- a/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h
+++ b/llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.h
@@ -78,6 +78,10 @@ class WebAssemblyTTIImpl final : public BasicTTIImplBase<WebAssemblyTTIImpl> {
       TTI::TargetCostKind CostKind,
       TTI::OperandValueInfo OpInfo = {TTI::OK_AnyValue, TTI::OP_None},
       const Instruction *I = nullptr) const override;
+  InstructionCost getInterleavedMemoryOpCost(
+      unsigned Opcode, Type *Ty, unsigned Factor, ArrayRef<unsigned> Indices,
+      Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
+      bool UseMaskForCond, bool UseMaskForGaps) const override;
   using BaseT::getVectorInstrCost;
   InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
                                      TTI::TargetCostKind CostKind,
diff --git a/llvm/test/Transforms/LoopVectorize/WebAssembly/memory-interleave.ll b/llvm/test/Transforms/LoopVectorize/WebAssembly/memory-interleave.ll
new file mode 100644
index 0000000000000..c7128340c7a4e
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/WebAssembly/memory-interleave.ll
@@ -0,0 +1,1353 @@
+; REQUIRES: asserts
+; RUN: opt -mattr=+simd128 -passes=loop-vectorize -debug-only=loop-vectorize,vectorutils -disable-output < %s 2>&1 | FileCheck %s
+
+target datalayout = "e-m:e-p:32:32-p10:8:8-p20:8:8-i64:64-n32:64-S128-ni:1:10:20"
+target triple = "wasm32-unknown-wasi"
+
+%struct.TwoInts = type { i32, i32 }
+%struct.ThreeInts = type { i32, i32, i32 }
+%struct.FourInts = type { i32, i32, i32, i32 }
+%struct.ThreeShorts = type { i16, i16, i16 }
+%struct.FourShorts = type { i16, i16, i16, i16 }
+%struct.TwoBytes = type { i8, i8 }
+%struct.ThreeBytes = type { i8, i8, i8 }
+%struct.FourBytes = type { i8, i8, i8, i8 }
+%struct.FiveBytes = type { i8, i8, i8, i8, i8 }
+%struct.EightBytes = type { i8, i8, i8, i8, i8, i8, i8, i8 }
+
+; CHECK-LABEL: two_ints_same_op
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 2 at %10
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 2 at %10
+; CHECK: LV: Scalar loop costs: 12.
+; CHECK: LV: Vector loop of width 2 costs: 13.
+; CHECK: LV: Vector loop of width 4 costs: 6.
+; CHECK: LV: Selecting VF: 4
+define hidden void @two_ints_same_op(ptr noalias nocapture noundef writeonly %0, ptr nocapture noundef readonly %1, ptr nocapture noundef readonly %2, i32 noundef %3) {
+  %5 = icmp eq i32 %3, 0
+  br i1 %5, label %6, label %7
+
+6:                                                ; preds = %7, %4
+  ret void
+
+7:                                                ; preds = %4, %7
+  %8 = phi i32 [ %21, %7 ], [ 0, %4 ]
+  %9 = getelementptr inbounds %struct.TwoInts, ptr %1, i32 %8
+  %10 = load i32, ptr %9, align 4
+  %11 = getelementptr inbounds %struct.TwoInts, ptr %2, i32 %8
+  %12 = load i32, ptr %11, align 4
+  %13 = add i32 %12, %10
+  %14 = getelementptr inbounds %struct.TwoInts, ptr %0, i32 %8
+  store i32 %13, ptr %14, align 4
+  %15 = getelementptr inbounds i8, ptr %9, i32 4
+  %16 = load i32, ptr %15, align 4
+  %17 = getelementptr inbounds i8, ptr %11, i32 4
+  %18 = load i32, ptr %17, align 4
+  %19 = add i32 %18, %16
+  %20 = getelementptr inbounds i8, ptr %14, i32 4
+  store i32 %19, ptr %20, align 4
+  %21 = add nuw i32 %8, 1
+  %22 = icmp eq i32 %21, %3
+  br i1 %22, label %6, label %7
+}
+
+; CHECK-LABEL: two_ints_vary_op
+; CHECK: Cost of 7 for VF 2: INTERLEAVE-GROUP with factor 2 at %10
+; CHECK: Cost of 6 for VF 4: INTERLEAVE-GROUP with factor 2 at %10
+; CHECK: LV: Scalar loop costs: 12.
+; CHECK: LV: Vector loop of width 2 costs: 13.
+; CHECK: LV: Vector loop of width 4 costs: 6.
+; CHECK: LV: Selecting VF: 4
+define hidden void @two_ints_vary_op(ptr noalias nocapture noundef writeonly %0, ptr nocapture noundef readonly %1, ptr nocapture noundef readonly %2, i32 noundef %3) {
+  %5 = icmp eq i32 %3, 0
+  br i1 %5, label %6, label %7
+
+6:                                                ; preds = %7, %4
+  ret void
+
+7:                                                ; preds = %4, %7
+  %8 = phi i32 [ %21, %7 ], [ 0, %4 ]
+  %9 = getelementptr inbounds %struct.TwoInts, ptr %1, i32 %8
+  %10 = load i32, ptr %9, align 4
+  %11 = getelementptr inbounds %struct.TwoInts, ptr %2, i32 %8
+  %12 = load i32, ptr %11, align 4
+  %13 = add i32 %12, %10
+  %14 = getelementptr inbounds %struct.TwoInts, ptr %0, i32 %8
+  store i32 %13, ptr %14, align 4
+  %15 = getelementptr inbounds i8, ptr %9, i32 4
+  %16 = load i32, ptr %15, align 4
+  %17 = getelementptr inbounds i8, ptr %11, i32 4
+  %18 = load i32, ptr %17, align 4
+  %19 = sub i32 %16, %18
+  %20 = getelementptr inbounds i8, ptr %14, i32 4
+  store i32 %19, ptr %20, align 4
+  %21 = add nuw i32 %8, 1
+  %22 = icmp eq i32 %21, %3
+  br i1 %22, label %6, label %7
+}
+
+; CHECK-LABEL: three_ints
+; CHECK: Cost of 14 for VF 2: INTERLEAVE-GROUP with factor 3 at
+; CHECK: Cost of 28 for VF 4: INTERLEAVE-GROUP with factor 3 at
+; CHECK: LV: Scalar loop costs: 16.
+; CHECK: LV: Found an estimated cost of 14 for VF 2 For instruction: %10 = load i32, ptr %9
+; CHECK: LV: Found an estimated cost of 14 for VF 2 For instruction: %12 = load i32, ptr %11
+; CHECK: LV: Found an estimated cost of 14 for VF 2 For instruction: store i32 %25, ptr %26
+; CHECK: LV: Vector loop of width 2 costs: 24.
+; CHECK: LV: Found an estimated cost of 28 for VF 4 For instruction: %10 = load i32, ptr %9
+; CHECK: LV: Found an estimated cost of 28 for VF 4 For instruction: %12 = load i32, ptr %11
+; CHECK: LV: Found an estimated cost of 28 for VF 4 For instruction: store i32 %25, ptr %26
+; CHECK: LV: Vector loop of width 4 costs: 22.
+; CHECK: LV: Selecting VF: 1
+define hidden void @three_ints(ptr noalias nocapture noundef writeonly %0, ptr nocapture noundef readonly %1, ptr nocapture noundef readonly %2, i32 noundef %3) {
+  %5 = icmp eq i32 %3, 0
+  br i1 %5, label %6, label %7
+
+6:                                                ; preds = %7, %4
+  ret void
+
+7:                                                ; preds = %4, %7
+  %8 = phi i32 [ %27, %7 ], [ 0, %4 ]
+  %9 = getelementptr inbounds %struct.ThreeInts, ptr %1, i32 %8
+  %10 = load i32, ptr %9, align 4
+  %11 = getelementptr inbounds %struct.ThreeInts, ptr %2, i32 %8
+  %12 = load i32, ptr %11, align 4
+  %13 = add nsw i32 %12, %10
+  %14 = getelementptr inbounds %struct.ThreeInts, ptr %0, i32 %8
+  store i32 %13, ptr %14, align 4
+  %15 = getelementptr inbounds i8, ptr %9, i32 4
+  %16 = load i32, ptr %15, align 4
+  %17 = getelementptr inbounds i8, ptr %11, i32 4
+  %18 = load i32, ptr %17, align 4
+  %19 = add nsw i32 %18, %16
+  %20 = getelementptr inbounds i8, ptr %14, i32 4
+  store i32 %19, ptr %20, align 4
+  %21 = getelementptr inbounds i8, ptr %9, i32 8
+  %22 = load i32, ptr %21, align 4
+  %23 = getelementptr inbounds i8, ptr %11, i32 8
+  %24 = load i32, ptr %23, align 4
+  %25 = add nsw i32 %24, %22
+  %26 = getelementptr inbounds i8, ptr %14, i32 8
+  store i32 %25, ptr %26, align 4
+  %27 = add nuw i32 %8, 1
+  %28 = icmp eq i32 %27, %3
+  br i1 %28, label %6, label %7
+}
+
+; CHECK-LABEL: three_shorts
+; CHECK: Cost of 26 for VF 4: INTERLEAVE-GROUP with factor 3
+; CHECK: Cost of 52 for VF 8: INTERLEAVE-GROUP with factor 3
+; CHECK: LV: Scalar loop costs: 16.
+; CHECK: LV: Found an estimated cost of 6 for VF 2 For instruction: %10 = load i16
+; CHECK: LV: Found an estimated cost of 6 for VF 2 For instruction: %12 = load i16
+; CHECK: LV: Found an estimated cost of 6 for VF 2 For instruction: store i16 %25
+; CHECK: LV: Vector loop of width 2 costs: 30.
+; CHECK: LV: Found an estimated cost of 26 for VF 4 For instruction: %10 = load i16
+; CHECK: LV: Found an estimated cost of 26 for VF 4 For instruction: %12 = load i16
+; CHECK: LV: Found an estimated cost of 26 for VF 4 For instruction: store i16 %25
+; CHECK: LV: Vector loop of width 4 costs: 21.
+; CHECK: LV: Found an estimated cost of 52 for VF 8 For instruction: %10 = load i16
+; CHECK: LV: Found an estimated cost of 52 for VF 8 For instruction: %12 = load i16
+; CHECK: LV: Found an estimated cost of 52 for VF 8 For instruction: store i16 %25
+; CHECK: LV: Vector loop of width 8 costs: 20.
+; CHECK: LV: Selecting VF: 1
+define hidden void @three_shorts(ptr noalias nocapture noundef writeonly %0, ptr nocapture noundef readonly %1, ptr nocapture noundef readonly %2, i32 noundef %3) {
+  %5 = icmp eq i32 %3, 0
+  br i1 %5, label %6, label %7
+
+6:                                                ; preds = %7, %4
+  ret void
+
+7:                                                ; preds = %4, %7
+  %8 = phi i32 [ %27, %7 ], [ 0, %4 ]
+  %9 = getelementptr inbounds %struct.ThreeShorts, ptr %1, i32 %8
+  %10 = load i16, ptr %9, align 2
+  %11 = getelementptr inbounds %struct.ThreeShorts, ptr %2, i32 %8
+  %12 = load i16, ptr %11, align 2
+  %13 = mul i16 %12, %10
+  %14 = getelementptr inbounds %struct.ThreeShorts, ptr %0, i32 %8
+  store i16 %13, ptr %14, align 2
+  %15 = getelementptr inbounds i8, ptr %9, i32 2
+  %16 = load i16, ptr %15, align 2
+  %17 = getelementptr inbounds i8, ptr %11, i32 2
+  %18 = load i16, ptr %17, align 2
+  %19 = mul i16 %18, %16
+  %20 = getelementptr inbounds i8, ptr %14, i32 2
+  store i16 %19, ptr %20, align 2
+  %21 = getelementptr inbounds i8, ptr %9, i32 4
+  %22 = load i16, ptr %21, align 2
+  %23 = getelementptr inbounds i8, ptr %11, i32 4
+  %24 = load i16, ptr %23, align 2
+  %25 = mul i16 %24, %22
+  %26 = getelementptr inbounds i8, ptr %14, i32 4
+  store i16 %25, ptr %26, align 2
+  %27 = add nuw i32 %8, 1
+  %28 = icmp eq i32 %27, %3
+  br i1 %28, label %6, label %7
+}
+
+; CHECK-LABEL: four_shorts_same_op
+; CHECK: Cost of 18 for VF 2: INTERLEAVE-GROUP with factor 4
+; CHECK: Cost of 18 for VF 4: INTERLEAVE-GROUP with factor 4
+; CHECK: Cost of 18 for VF 4: INTERLEAVE-GROUP with factor 4
+; CHECK: Cost of 68 for VF 8: INTERLEAVE-GROUP with factor 4
+; CHECK: LV: Scalar loop costs: 20.
+; CHECK: LV: Found an estimated cost of 18 for VF 2 For instruction: %10 = load i16
+; CHECK: LV: Found an estimated cost of 18 for VF 2 For instruction: %12 = load i16
+; CHECK: LV: Found an estimated cost of 18 for VF 2 For instruction: store i16
+; CHECK: LV: Vector loop of width 2 costs: 31.
+; CHECK: LV: Found an estimated cost of 18 for VF 4 For instruction: %10 = load i16
+; CHECK: LV: Found an estimated cost of 18 for VF 4 For instruction: %12 = load i16
+; CHECK: LV: Found an estimated cost of 18 for VF 4 For instruction: store i16
+; CHECK: LV: Vector loop of width 4 costs: 15.
+; CHECK: LV: Found an estimated cost of 68 for VF 8 For instruction: %10 = load i16
+; CHECK: LV: Found an estimated cost of 68 for VF 8 For instruction: %12 = load i16
+; CHECK: LV: Found an estimated cost of 68 for VF 8 For instruction: store i16
+; CHECK: LV: Vector loop of width 8 costs: 26
+; CHECK: LV: Selecting VF: 4
+define hidden void @four_shorts_same_op(ptr noalias nocapture noundef writeonly %0, ptr nocapture noundef readonly %1, ptr nocapture noundef readonly %2, i32 noundef %3) {
+  %5 = icmp eq i32 %3, 0
+  br i1 %5, label %6, label %7
+
+6:                                                ; preds = %7, %4
+  ret void
+
+7:                                                ; preds = %4, %7
+  %8 = phi i32 [ %33, %7 ], [ 0, %4 ]
+  %9 = getelementptr inbounds %struct.FourShorts, ptr %1, i32 %8
+  %10 = load i16, ptr %9, align 2
+  %11 = getelementptr inbounds %struct.FourShorts, ptr %2, i32 %8
+  %12 = load i16, ptr %11, align 2
+  %13 = sub i16 %10, %12
+  %14 = getelementptr inbounds %struct.FourShorts, ptr %0, i32 %8
+  store i16 %13, ptr %14, align 2
+  %15 = getelementptr inbounds i8, ptr %9, i32 2
+  %16 = load i16, ptr %15, align 2
+  %17 = getelementptr inbounds i8, ptr %11, i32 2
+  %18 = load i16, ptr %17, align 2
+  %19 = sub i16 %16, %18
+  %20 = getelementptr inbounds i8, ptr %14, i32 2
+  store i16 %19, ptr %20, align 2
+  %21 = getelementptr inbounds i8, ptr %9, i32 4
+  %22 = load i16, ptr %21, align 2
+  %23 = getelementptr inbounds i8, ptr %11, i32 4
+  %24 = load i16, ptr %23, align 2
+  %25 = sub i16 %22, %24
+  %26 = getelementptr inbounds i8, ptr %14, i32 4
+  store i16 %25, ptr %26, align 2
+  %27 = getelementptr inbounds i8, ptr %9, i32 6
+  %28 = load i16, ptr %27, align 2
+  %29 = getelementptr inbounds i8, ptr %11, i32 6
+  %30 = load i16, ptr %29, align 2
+  %31 = sub i16 %28, %30
+  %32 = getelementptr inbounds i8, ptr %14, i32 6
+  store i16 %31, ptr %32, align 2
+  %33 = add nuw i32 %8, 1
+  %34 = icmp eq i32 %33, %3
+  br i1 %34, label %6, label %7
+}
+
+; CHECK-LABEL: four_shorts_split_op
+; CHECK: Cost of 18 for VF 2: INTERLEAVE-GROUP with factor 4
+; CHECK: Cost of 18 for VF 4: INTERLEAVE-GROUP with factor 4
+; CHECK: Cost of 68 for VF 8: INTERLEAVE-GROUP with factor 4
+; CHECK: LV: Scalar loop costs: 20.
+; CHECK: LV: Found an estimated cost of 18 for VF 2 For instruction: %10 = load i16
+; CHECK: LV: Found an estimated cost of 18 for VF 2 For instruction: %12 = load i16
+; CHECK: LV: Found an estimated cost of 18 for VF 2 For instruction: store i16
+; CHECK: LV: Vector loop of width 2 costs: 31.
+; CHECK: LV: Found an estimated cost of 18 for VF 4 For instruction: %10 = load i16
+; CHECK: LV: Found an estimated cost of 18 for VF 4 For instruction: %12 = load i16
+; CHECK: LV: Found an estimated cost of 18 for VF 4 For instruction: store i16 %31
+; CHECK: LV: Vector loop of width 4 costs: 15.
+; CHECK: LV: Found an estimated cost of 68 for VF 8 For instruction: %10 = load i16
+; CHECK: LV: Found an estimated cost of 68 for VF 8 For instruction: %12 = load i16
+; CHECK: LV: Found an estimated cost of 68 for VF 8 For instruction: store i16 %31
+; CHECK: LV: Vector loop of width 8 costs: 26.
+; CHECK: LV: Selecting VF: 4
+define hidden void @four_shorts_split_op(ptr noalias nocapture noundef writeonly %0, ptr nocapture noundef readonly %1, ptr nocapture noundef readonly %2, i32 noundef %3) {
+  %5 = icmp eq i32 %3, 0
+  br i1 %5, label %6, label %7
+
+6:                                                ; preds = %7, %4
+  ret void
+
+7:                                                ; preds = %4, %7
+  %8 = phi i32 [ %33, %7 ], [ 0, %4 ]
+  %9 = getelementptr inbounds %struct.FourShorts, ptr %1, i32 %8
+  %10 = load i16, ptr %9, align 2
+  %11 = getelementptr inbounds %struct.FourShorts, ptr %2, i32 %8
+  %12 = load i16, ptr %11, align 2
+  %13 = or i16 %12, %10
+  %14 = getelementptr inbounds %struct.FourShorts, ptr %0, i32 %8
+  st...
[truncated]
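
Working through the new code for the two_ints_same_op test above: at VF 4 with an interleave factor of 2 over i32, the group type is <8 x i32> and the sub-vector type is <4 x i32>. getMemoryOpCost returns 2 for the 128-bit sub-vector, NumAccesses = (8 * 32 + 127) / 128 = 2 (integer division), and the {2, v4i32} table entry adds 2, giving 2 + 2 * 2 = 6, which matches the 'Cost of 6 for VF 4' CHECK line. At VF 2 the sub-vector is <2 x i32>, so NumAccesses = (4 * 32 + 127) / 64 = 3 and the total is 1 + 3 * 2 = 7, matching 'Cost of 7 for VF 2'.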

@tlively
Collaborator

tlively commented Jul 7, 2025

Is there a simple before-and-after example showing the effect of this change on codegen? Would we expect this to be a performance win or at least neutral on platforms other than AArch64?

@sparker-arm
Contributor Author

Is there a simple before-and-after example showing the effect of this change on codegen?

Are we still allowed to precommit tests without review? I could add the cost test and run it through llc?

Would we expect this to be a performance win or at least neutral on platforms other than AArch64?

I don't see why it wouldn't be beneficial for other platforms; the cost modelling only really assumes it is cheap to select even and odd lanes. It's currently a neutral change on my Xeon machine, and I assume most runtimes will have to do some extra work to take full advantage of this.

@tlively
Collaborator

tlively commented Jul 15, 2025

Is there a simple before-and-after example showing the effect of this change on codegen?

Are we still allowed to precommit tests without review? I could add the cost test and run it through llc?

I don't know what the policy is, but this would be helpful.

@sparker-arm
Contributor Author

I've added the test in #149045.

First pass where we calculate the cost of the memory operation, as
well as the shuffles required. Interleaving by a factor of two should
be relatively cheap, as many ISAs have dedicated instructions to
perform the (de)interleaving. Several of these permutations can be
combined for an interleave stride of 4, which is the highest stride
we allow.

I've costed larger vectors, and more lanes, as more expensive because
not only is more work needed but the risk of codegen going 'wrong'
rises dramatically. I also filled in a bit of cost modelling for
vector stores.

It appears the main vector plan to avoid is an interleave factor of 4
with v16i8. I've used libyuv and ncnn for benchmarking, using V8 on
AArch64, and observe a geomean improvement of ~3%, with some kernels
improving 40-60%.

I know there is still significant performance being left on the
table, so this will need more development along with the rest of the
cost model.
@sparker-arm force-pushed the wasm-interleave-mem-cost branch from abe6d36 to 581326a on July 30, 2025 at 15:29
@sparker-arm
Contributor Author

At least one of the test changes is now dependent on #151145, and I'm not sure how to manage that officially here. So one test should fail because of that.

@sparker-arm
Contributor Author

Now that my wasi-sdk is fully up to date, these are my results for V8 running on my Xeon:

Metric                    Speedup(%)
----------------------  ------------
Mean (filtered)               12.965
Median (filtered)             -1.069
Max (filtered)               401.656
Min (filtered)               -14.286
Geomean (non-filtered)         3.417 

Benchmark           Min      Max    Median    Mean
--------------  -------  -------  --------  ------
bullet           -3.248   -2.476    -2.862  -2.862
doe_proxyapps    -4.405    1.216    -2.011  -1.859
libyuv           -4.018  401.656    -1.036  69.264
lzma             -1.555   -1.555    -1.555  -1.555
maratis          -2.727   -2.727    -2.727  -2.727
oggenc           -1.063   -1.063    -1.063  -1.063
meshoptimizer    -1.164   -1.164    -1.164  -1.164
microkernels      4.096    4.096     4.096   4.096
ncnn            -14.286   36.07     -0.852  -0.102
pairlocalalign   -2.651   -2.651    -2.651  -2.651
pocket_nn        -1.926   -1.926    -1.926  -1.926
quickjs          -4.158   -4.158    -4.158  -4.158
raytracing       -2.638   -2.638    -2.638  -2.638
small3dlib       -1.075   14.478     6.702   6.702
spec2017         -2.619    4.257     2.865   1.842
spiff             2.241    2.241     2.241   2.241
sqlite3           3.681    3.681     3.681   3.681
tsvc              2.174    2.174     2.174   2.174
wasm3             2.776    2.776     2.776   2.776
zlib_bench       -0.023    0.893     0.435   0.435

So there are some regressions, but even for benchmarks like ncnn, where there are swings, the overall result is neutral:

Metric                    Speedup(%)
----------------------  ------------
Mean (filtered)               -0.102
Median (filtered)             -0.852
Max (filtered)                36.07
Min (filtered)               -14.286
Geomean (non-filtered)        -0.429 

Benchmark                             Speedup(%)
----------------------------------  ------------
ncnn-FastestDet-run_times                 -7.429
ncnn-alexnet-run_times                    -0.852
ncnn-blazeface-run_times                  -2.016
ncnn-efficientnet_b0-run_times            -8.698
ncnn-efficientnetv2_b0-run_times          -6.711
ncnn-googlenet-run_times                   0.041
ncnn-googlenet_int8-run_times             -2.545
ncnn-mnasnet-run_times                    -5.136
ncnn-mobilenet-run_times                  -1.182
ncnn-mobilenet_int8-run_times             -5.882
ncnn-mobilenet_ssd-run_times               1.681
ncnn-mobilenet_ssd_int8-run_times         -0.376
ncnn-mobilenet_v2-run_times               -7.895
ncnn-mobilenet_v3-run_times              -14.286
ncnn-mobilenet_yolo-run_times             -3.361
ncnn-mobilenetv2_yolov3-run_times          1.734
ncnn-nanodet_m-run_times                   2.047
ncnn-proxylessnasnet-run_times             2.257
ncnn-regnety_400m-run_times                1.562
ncnn-resnet18-run_times                   -7.028
ncnn-resnet18_int8-run_times              -1.587
ncnn-resnet50-run_times                   -4.343
ncnn-resnet50_int8-run_times               0.301
ncnn-shufflenet-run_times                 36.07
ncnn-shufflenet_v2-run_times              -0
ncnn-squeezenet-run_times                 22.721
ncnn-squeezenet_int8-run_times             4.812
ncnn-squeezenet_ssd-run_times             -0.041
ncnn-squeezenet_ssd_int8-run_times         0.96
ncnn-vgg16-run_times                      -1.288
ncnn-vgg16_int8-run_times                 -0.259
ncnn-vision_transformer-run_times         -3.205
ncnn-yolo_fastest_1.1-run_times           -5.003
ncnn-yolo_fastestv2-run_times              2.779
ncnn-yolov4_tiny-run_times                 8.589

And this is a closer look at libyuv:

Metric                    Speedup(%)
----------------------  ------------
Mean (filtered)               57.306
Median (filtered)             -1.305
Max (filtered)               401.656
Min (filtered)                -4.018
Geomean (non-filtered)        15.651 

Benchmark                                        Speedup(%)
---------------------------------------------  ------------
libyuv-ARGBScaleDownBy2_Bilinear-run_times           75.091
libyuv-ARGBScaleDownBy2_Box-run_times                72.442
libyuv-ARGBScaleDownBy2_Linear-run_times             -1.31
libyuv-ARGBScaleDownBy2_None-run_times               67.661
libyuv-ARGBScaleDownBy3by4_Bilinear-run_times        -1.816
libyuv-ARGBScaleDownBy3by4_Box-run_times             -1.731
libyuv-ARGBScaleDownBy3by4_Linear-run_times          -3.43
libyuv-ARGBScaleDownBy4_Box-run_times                73.035
libyuv-ColourI420-run_times                          -2.784
libyuv-ColourI422-run_times                          -4.018
libyuv-ColourJ420-run_times                          -2.776
libyuv-ColourJ422-run_times                          -2.562
libyuv-NV12ToI420-run_times                         401.656
libyuv-NV21ToI420-run_times                         367.548
libyuv-UVScaleDownBy2_Box-run_times                  -2.198
libyuv-UVScaleDownBy2_Linear-run_times               -1.3
libyuv-UVScaleDownBy3by4_Box-run_times               -0.959
libyuv-UVScaleDownBy3by4_None-run_times              -1.036

@dschuff
Member

dschuff left a comment

I think this is OK. I'll admit I'm not really an expert in the implementation here; I'm mostly going on the output, which does look like the kind of improvement you'd expect from loop vectorization. There are maybe a few more regressions than would be ideal, but they are mostly small, and some of the improvements are quite impressive, so I think it's better overall.
