From f510205c735b598af1bdc0a379af13d82d9f77e8 Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Sat, 27 Feb 2021 17:32:07 +0100 Subject: [PATCH 1/6] SVE-like flexible vectors --- proposals/flexible-vectors/FlexibleVectors.md | 832 ------------ .../FlexibleVectorsSecondTier.md | 58 - .../FlexibleVectorsThirdTier.md | 47 - proposals/flexible-vectors/README.md | 1129 ++++++++++++++++- 4 files changed, 1124 insertions(+), 942 deletions(-) delete mode 100644 proposals/flexible-vectors/FlexibleVectors.md delete mode 100644 proposals/flexible-vectors/FlexibleVectorsSecondTier.md delete mode 100644 proposals/flexible-vectors/FlexibleVectorsThirdTier.md diff --git a/proposals/flexible-vectors/FlexibleVectors.md b/proposals/flexible-vectors/FlexibleVectors.md deleted file mode 100644 index 614cd45..0000000 --- a/proposals/flexible-vectors/FlexibleVectors.md +++ /dev/null @@ -1,832 +0,0 @@ -Flexible vectors overview -========================= - -The goal of this proposal is to provide flexible vector instructions for -WebAssembly as a way to bridge the gap between existing SIMD instruction sets -available on various platforms. More specifically, this proposal aims to enable -better use processing capabilities of existing SIMD hardware and bring -performance of vector operaions available in WebAssembly closer to native. -`simd128` proposal already identified operations that would commonly work on -platforms that are important to WebAssembly, this proposal is attempting to -extend the same operations to work with variable vector lengths. - -The rest of this document contains instructions that have uncontroversial -lowering on all platforms. There are two more tiers of instructions: second -tier containing instructions with more complex lowering without effect on other -instructions, and third containing instructions affecting execution semantics -or lowering of other instructions. - -See [FlexibleVectorsSecondTier.md](FlexibleVectorsSecondTier.md) for the second -tier and [FlexibleVectorsThirdTier.md](FlexibleVectorsThirdTier.md) for the -third tier. - -## Types - -Proposal introduces the following vector types: - -- `vec.i8` : 8-bit integer lanes -- `vec.i16`: 16-bit integer lanes -- `vec.i32`: 32-bit integer lanes -- `vec.i64`: 64-bit integer lanes -- `vec.f32`: single precision floating point lanes -- `vec.f64`: double precision floating point lanes - -### Lane division interpretation - -In semantic pseudocode `S` is the particular vector type, `S.LaneBits` is the -size of the lane in bits, `S.Lanes` is the number of lanes, which is dynamic. - -| S | S.LaneBits | -|-----------|-----------:| -| `vec.i8` | 8 | -| `vec.i16` | 16 | -| `vec.i32` | 32 | -| `vec.f32` | 32 | -| `vec.i64` | 64 | -| `vec.f64` | 64 | - -### Restrictions - -Lane values are intended to be handled exactly like in `simd128` proposal, with -the following differences applying to overall types: - -- Runtime sets maximum vector length for every type -- Number of lanes is set separately for different lane sizes -- Vectors with different lane size are not immediately interoperable - -## Immediate operands - -_TBD_ value range, depends on instruction encoding. - -- `ImmLaneIdxV8`: lane index for 8-bit lanes -- `ImmLaneIdxV16`: lane index for 16-bit lanes -- `ImmLaneIdxV32`: lane index for 32-bit lanes -- `ImmLaneIdxV64`: lane index for 64-bit lanes - -## Operations - -Completely new operations introduced in this proposal are the operations that -provide interface to vector length. 
- -### Vector length - -Querying length of supported vector: - -- `vec.i8.length -> i32` -- `vec.i16.length -> i32` -- `vec.i32.length -> i32` -- `vec.i64.length -> i32` -- `vec.f32.length -> i32` -- `vec.f64.length -> i32` - -### Constructing vector values - -Create vector with identical lanes: - -- `vec.i8.splat(x:i32) -> vec.i8` -- `vec.i16.splat(x:i32) -> vec.i16` -- `vec.i32.splat(x:i32) -> vec.i32` -- `vec.i64.splat(x:i64) -> vec.i64` -- `vec.f32.splat(x:f32) -> vec.f32` -- `vec.f64.splat(x:f64) -> vec.f64` - -Construct vector with `x` replicated to all lanes: - -```python -def S.splat(x): - result = S.New() - for i in range(S.Lanes): - result[i] = x - return result -``` - -### Accessing lanes - -#### Extract lane as a scalar - -- `vec.i8.extract_lane_s(a: vec.i8, imm: ImmLaneIdxV8) -> i32` -- `vec.i8.extract_lane_u(a: vec.i8, imm: ImmLaneIdxV8) -> i32` -- `vec.i16.extract_lane_s(a: vec.i16, imm: ImmLaneIdxV16) -> i32` -- `vec.i16.extract_lane_u(a: vec.i16, imm: ImmLaneIdxV16) -> i32` -- `vec.i32.extract_lane(a: vec.i32, imm: ImmLaneIdxV32) -> i32` -- `vec.i64.extract_lane(a: vec.i64, imm: ImmLaneIdxV64) -> i64` -- `vec.f32.extract_lane(a: vec.f32, imm: ImmLaneIdxV32) -> f32` -- `vec.f64.extract_lane(a: vec.f64, imm: ImmLaneIdxV64) -> f64` - -Extract the scalar value of lane specified in the immediate mode operand `imm` -in `a`. The `{interpretation}.extract_lane{_s}{_u}` instructions are encoded -with one immediate byte providing the index of the lane to extract. - -```python -def S.extract_lane(a, i): - return a[i] -``` - -The `_s` and `_u` variants will sign-extend or zero-extend the lane value to -`i32` respectively. - -#### Replace lane value - -- `vec.i8.replace_lane(a: vec.i8, imm: ImmLaneIdxV8, x: i32) -> vec.i8` -- `vec.i16.replace_lane(a: vec.i16, imm: ImmLaneIdxV16, x: i32) -> vec.i16` -- `vec.i32.replace_lane(a: vec.i32, imm: ImmLaneIdxV32, x: i32) -> vec.i32` -- `vec.i64.replace_lane(a: vec.i64, imm: ImmLaneIdxV64, x: i64) -> vec.i64` -- `vec.f32.replace_lane(a: vec.f32, imm: ImmLaneIdxV32, x: f32) -> vec.f32` -- `vec.f64.replace_lane(a: vec.f64, imm: ImmLaneIdxV64, x: f64) -> vec.f64` - -Return a new vector with lanes identical to `a`, except for the lane specified -in the immediate mode operand `imm` which has the value `x`. The -`{interpretation}.replace_lane` instructions are encoded with an immediate byte -providing the index of the lane the value of which is to be replaced. - -```python -def S.replace_lane(a, i, x): - result = S.New() - for j in range(S.Lanes): - result[j] = a[j] - result[i] = x - return result -``` - -The input lane value, `x`, is interpreted the same way as for the splat -instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored. - -### Shuffles - -#### Left lane-wise shift by scalar - -* `vec.i8.lshl(a: vec.i8, x: i32) -> vec.i8` -* `vec.i16.lshl(a: vec.i16, x: i32) -> vec.i16` -* `vec.i32.lshl(a: vec.i32, x: i32) -> vec.i32` -* `vec.i64.lshl(a: vec.i64, x: i32) -> vec.i64` - -Returns a new vector with lanes selected from the lanes of the two input -vectors `a` and `b` by shifting lanes of the original to the left by the amount -specified in the integer argument and shifting zero values in. 
- -```python -def S.lshl(a, x): - result = S.New() - for i in range(S.Lanes): - if i < x: - result[i] = 0 - else: - result[i] = a[i - x] - return result -``` - -#### Right lane-wise shift by scalar - -* `vec.i8.lshr(a: vec.i8, x: i32) -> vec.i8` -* `vec.i16.lshr(a: vec.i16, x: i32) -> vec.i16` -* `vec.i32.lshr(a: vec.i32, x: i32) -> vec.i32` -* `vec.i64.lshr(a: vec.i64, x: i32) -> vec.i64` - -Returns a new vector with lanes selected from the lanes of the two input -vectors `a` and `b` by shifting lanes of the original to the right by the -amount specified in the integer argument and shifting zero values in. - -```python -def S.lshr(a, x): - result = S.New() - for i in range(S.Lanes): - if i < S.Lanes - x: - result[i] = a[i + x] - else: - result[i] = 0 - return result -``` - -### Integer arithmetic - -Wrapping integer arithmetic discards the high bits of the result. - -```python -def S.Reduce(x): - bitmask = (1 << S.LaneBits) - 1 - return x & bitmask -``` - -Integer division operation is omitted to be compatible with 128-bit SIMD. - -#### Integer addition - -- `vec.i8.add(a: vec.i8, b: vec.i8) -> vec.i8` -- `vec.i16.add(a: vec.i16, b: vec.i16) -> vec.i16` -- `vec.i32.add(a: vec.i32, b: vec.i32) -> vec.i32` -- `vec.i64.add(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise wrapping integer addition: - -```python -def S.add(a, b): - def add(x, y): - return S.Reduce(x + y) - return S.lanewise_binary(add, a, b) -``` - -#### Integer subtraction - -- `vec.i8.sub(a: vec.i8, b: vec.i8) -> vec.i8` -- `vec.i16.sub(a: vec.i16, b: vec.i16) -> vec.i16` -- `vec.i32.sub(a: vec.i32, b: vec.i32) -> vec.i32` -- `vec.i64.sub(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise wrapping integer subtraction: - -```python -def S.sub(a, b): - def sub(x, y): - return S.Reduce(x - y) - return S.lanewise_binary(sub, a, b) -``` - -#### Integer multiplication - -- `vec.i8.mul(a: vec.i8, b: vec.i8) -> vec.i8` -- `vec.i16.mul(a: vec.i16, b: vec.i16) -> vec.i16` -- `vec.i32.mul(a: vec.i32, b: vec.i32) -> vec.i32` -- `vec.i64.mul(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise wrapping integer multiplication: - -```python -def S.mul(a, b): - def mul(x, y): - return S.Reduce(x * y) - return S.lanewise_binary(mul, a, b) -``` - -#### Integer negation - -- `vec.i8.neg(a: vec.i8, b: vec.i8) -> vec.i8` -- `vec.i16.neg(a: vec.i16, b: vec.i16) -> vec.i16` -- `vec.i32.neg(a: vec.i32, b: vec.i32) -> vec.i32` -- `vec.i64.neg(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise wrapping integer negation. In wrapping arithmetic, `y = -x` is the -unique value such that `x + y == 0`. - -```python -def S.neg(a): - def neg(x): - return S.Reduce(-x) - return S.lanewise_unary(neg, a) -``` - -### Saturating integer arithmetic - -Saturating integer arithmetic behaves differently on signed and unsigned lanes. 
- -```python -def S.SignedSaturate(x): - if x < S.Smin: - return S.Smin - if x > S.Smax: - return S.Smax - return x - -def S.UnsignedSaturate(x): - if x < 0: - return 0 - if x > S.Umax: - return S.Umax - return x -``` - -#### Saturating integer addition - -* `vec.i8.add_sat_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.add_sat_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.add_sat_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.add_sat_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.add_sat_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.add_sat_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.add_sat_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.add_sat_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise saturating addition: - -```python -def S.add_sat_s(a, b): - def addsat(x, y): - return S.SignedSaturate(x + y) - return S.lanewise_binary(addsat, S.AsSigned(a), S.AsSigned(b)) - -def S.add_sat_u(a, b): - def addsat(x, y): - return S.UnsignedSaturate(x + y) - return S.lanewise_binary(addsat, S.AsUnsigned(a), S.AsUnsigned(b)) -``` - -#### Saturating integer subtraction - -* `vec.i8.sub_sat_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.sub_sat_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.sub_sat_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.sub_sat_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.sub_sat_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.sub_sat_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.sub_sat_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.sub_sat_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise saturating subtraction: - -```python -def S.sub_sat_s(a, b): - def subsat(x, y): - return S.SignedSaturate(x - y) - return S.lanewise_binary(subsat, S.AsSigned(a), S.AsSigned(b)) - -def S.sub_sat_u(a, b): - def subsat(x, y): - return S.UnsignedSaturate(x - y) - return S.lanewise_binary(subsat, S.AsUnsigned(a), S.AsUnsigned(b)) -``` - -#### Lane-wise integer minimum - -* `vec.i8.min_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.min_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.min_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.min_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.min_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.min_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.min_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.min_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Compares lane-wise signed/unsigned integers, and returns the minimum of -each pair. - -```python -def S.min(a, b): - return S.lanewise_binary(min, a, b) -``` - -#### Lane-wise integer maximum - -* `vec.i8.max_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.max_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.max_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.max_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.max_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.max_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.max_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.max_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Compares lane-wise signed/unsigned integers, and returns the maximum of -each pair. 
- -```python -def S.max(a, b): - return S.lanewise_binary(max, a, b) -``` - -#### Lane-wise integer rounding average - -* `vec.i8.avgr_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.avgr_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.avgr_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.avgr_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise rounding average: - -```python -def S.RoundingAverage(x, y): - return (x + y + 1) // 2 - -def S.avgr_u(a, b): - return S.lanewise_binary(S.RoundingAverage, S.AsUnsigned(a), S.AsUnsigned(b)) -``` - -#### Lane-wise integer absolute value - -* `vec.i8.abs(a: vec.i8) -> vec.i8` -* `vec.i16.abs(a: vec.i16) -> vec.i16` -* `vec.i32.abs(a: vec.i32) -> vec.i32` -* `vec.i64.abs(a: vec.i64) -> vec.i64` - -Lane-wise wrapping absolute value. - -```python -def S.abs(a): - return S.lanewise_unary(abs, S.AsSigned(a)) -``` - - -### Bit shifts - -#### Left shift by scalar - -* `vec.i8.shl(a: vec.i8, y: i32) -> vec.i8` -* `vec.i16.shl(a: vec.i16, y: i32) -> vec.i16` -* `vec.i32.shl(a: vec.i32, y: i32) -> vec.i32` -* `vec.i64.shl(a: vec.i64, y: i32) -> vec.i64` - -Shift the bits in each lane to the left by the same amount. The shift count is -taken modulo lane width: - -```python -def S.shl(a, y): - # Number of bits to shift: 0 .. S.LaneBits - 1. - amount = y mod S.LaneBits - def shift(x): - return S.Reduce(x << amount) - return S.lanewise_unary(shift, a) -``` - -#### Right shift by scalar - -* `vec.i8.shr_s(a: vec.i8, y: i32) -> vec.i8` -* `vec.i8.shr_u(a: vec.i8, y: i32) -> vec.i8` -* `vec.i16.shr_s(a: vec.i16, y: i32) -> vec.i16` -* `vec.i16.shr_u(a: vec.i16, y: i32) -> vec.i16` -* `vec.i32.shr_s(a: vec.i32, y: i32) -> vec.i32` -* `vec.i32.shr_u(a: vec.i32, y: i32) -> vec.i32` -* `vec.i64.shr_s(a: vec.i64, y: i32) -> vec.i64` -* `vec.i64.shr_u(a: vec.i64, y: i32) -> vec.i64` - -Shift the bits in each lane to the right by the same amount. The shift count is -taken modulo lane width. This is an arithmetic right shift for the `_s` -variants and a logical right shift for the `_u` variants. - -```python -def S.shr_s(a, y): - # Number of bits to shift: 0 .. S.LaneBits - 1. - amount = y mod S.LaneBits - def shift(x): - return x >> amount - return S.lanewise_unary(shift, S.AsSigned(a)) - -def S.shr_u(a, y): - # Number of bits to shift: 0 .. S.LaneBits - 1. - amount = y mod S.LaneBits - def shift(x): - return x >> amount - return S.lanewise_unary(shift, S.AsUnsigned(a)) -``` - - -### Bitwise operations - -#### Bitwise logic - -* `vec.i8.and(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.or(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.xor(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.not(a: vec.i8) -> vec.i8` - -The logical operations defined on the scalar integer types are also available -on the `v128` type where they operate bitwise the same way C's `&`, `|`, `^`, -and `~` operators work on an `unsigned` type. - -#### Bitwise AND-NOT - -* `vec.i8.andnot(a: vec.i8, b: vec.i8) -> vec.i8` - -Bitwise AND of bits of `a` and the logical inverse of bits of `b`. This operation is equivalent to `vec.i8.and(a, vec.i8.not(b))`. - -#### Bitwise select - -* `vec.i8.bitselect(v1: vec.i8, v2: vec.i8, c: vec.i8) -> vec.i8` - -Use the bits in the control mask `c` to select the corresponding bit from `v1` -when 1 and `v2` when 0. -This is the same as `vec.i8.or(vec.i8.and(v1, c), vec.i8.and(v2, vec.i8.not(c)))`. - -Note that the normal WebAssembly `select` instruction also works with vector -types. 
It selects between two whole vectors controlled by a single scalar value, -rather than selecting bits controlled by a control mask vector. - -### Boolean horizontal reductions - -These operations reduce all the lanes of an integer vector to a single scalar -0 or 1 value. A lane is considered "true" if it is non-zero. - -#### Any lane true - -* `vec.i8.any_true(a: vec.i8) -> i32` -* `vec.i16.any_true(a: vec.i16) -> i32` -* `vec.i32.any_true(a: vec.i32) -> i32` - -These functions return 1 if any lane in `a` is non-zero, 0 otherwise. - -```python -def S.any_true(a): - for i in range(S.Lanes): - if a[i] != 0: - return 1 - return 0 -``` - -#### All lanes true - -* `vec.i8.all_true(a: vec.i8) -> i32` -* `vec.i16.all_true(a: vec.i16) -> i32` -* `vec.i32.all_true(a: vec.i32) -> i32` - -These functions return 1 if all lanes in `a` are non-zero, 0 otherwise. - -```python -def S.all_true(a): - for i in range(S.Lanes): - if a[i] == 0: - return 0 - return 1 -``` - -### Comparisons - -The comparison operations all compare two vectors lane-wise, and produce a mask -vector with the same number of lanes as the input interpretation where the bits -in each lane are `0` for `false` and all ones for `true`. - -
- Implementation notes - - Some classes of comparison operations (for example in AVX512 and SVE) return - a mask while others return a vector containing results in its lanes. This - section might need to be tuned. - -
- -#### Equality - -* `vec.i8.eq(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.eq(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.eq(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.eq(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.eq(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.eq(a: vec.f64, b: vec.f64) -> vec.f64` - -Integer equality is independent of the signed/unsigned interpretation. Floating -point equality follows IEEE semantics, so a NaN lane compares not equal with -anything, including itself, and +0.0 is equal to -0.0: - -```python -def S.eq(a, b): - def eq(x, y): - return x == y - return S.lanewise_comparison(eq, a, b) -``` - -#### Non-equality - -* `vec.i8.ne(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.ne(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.ne(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.ne(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.ne(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.ne(a: vec.f64, b: vec.f64) -> vec.f64` - -The `ne` operations produce the inverse of their `eq` counterparts: - -```python -def S.ne(a, b): - def ne(x, y): - return x != y - return S.lanewise_comparison(ne, a, b) -``` - -#### Less than - -* `vec.i8.lt_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.lt_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.lt_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.lt_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.lt_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.lt_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.lt_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.lt_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.lt(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.lt(a: vec.f64, b: vec.f64) -> vec.f64` - -#### Less than or equal - -* `vec.i8.le_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.le_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.le_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.le_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.le_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.le_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.le_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.le_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.le(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.le(a: vec.f64, b: vec.f64) -> vec.f64` - -#### Greater than - -* `vec.i8.gt_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.gt_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.gt_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.gt_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.gt_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.gt_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.gt_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.gt(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.gt(a: vec.f64, b: vec.f64) -> vec.f64` - -#### Greater than or equal - -* `vec.i8.ge_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.ge_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.ge_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.ge_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.ge_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.ge_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.ge_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.ge_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.ge(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.ge(a: vec.f64, b: vec.f64) -> vec.f64` - -#### Load and store - -- `vec.v8.load(memarg) -> vec.v8` -- `vec.v16.load(memarg) -> vec.v16` -- `vec.v32.load(memarg) -> vec.v32` -- `vec.v64.load(memarg) -> vec.v64` - -Load a vector from the given heap address. 
- -- `vec.v8.store(memarg, data:vec.v8)` -- `vec.v16.store(memarg, data:vec.v16)` -- `vec.v32.store(memarg, data:vec.v32)` -- `vec.v64.store(memarg, data:vec.v64)` - -Store a vector to the given heap address. - -### Floating-point sign bit operations - -These floating point operations are simple manipulations of the sign bit. No -changes are made to the exponent or trailing significand bits, even for NaN -inputs. - -#### Negation - -* `vec.f32.neg(a: vec.f32) -> vec.f32` -* `vec.f64.neg(a: vec.f64) -> vec.f64` - -Apply the IEEE `negate(x)` function to each lane. This simply inverts the sign -bit, preserving all other bits. - -```python -def S.neg(a): - return S.lanewise_unary(ieee.negate, a) -``` - -#### Floating-point absolute value - -* `vec.f32.abs(a: vec.f32) -> vec.f32` -* `vec.f64.abs(a: vec.f64) -> vec.f64` - -Apply the IEEE `abs(x)` function to each lane. This simply clears the sign bit, -preserving all other bits. - -```python -def S.abs(a): - return S.lanewise_unary(ieee.abs, a) -``` - -### Floating-point min and max - -#### Pseudo-minimum - -* `vec.f32.pmin(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.pmin(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise minimum value, defined as `b < a ? b : a`. - -#### Pseudo-maximum - -* `vec.f32.pmax(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.pmax(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise maximum value, defined as `a < b ? b : a`. - -### Floating-point arithmetic - -The floating-point arithmetic operations are all lane-wise versions of the -existing scalar WebAssembly operations. - -#### Addition - -- `vec.f32.add(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.add(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `addition`. - -#### Subtraction - -- `vec.f32.sub(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.sub(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `subtraction`. - -#### Division - -- `vec.f32.div(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.div(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `division`. - -#### Multiplication - -- `vec.f32.mul(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.mul(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `multiplication`. - -#### Square root - -- `vec.f32.sqrt(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.sqrt(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `squareRoot`. - - -### Conversions - -#### Integer to floating point - -* `vec.f32.convert_s(a: vec.i32) -> vec.f32` -* `vec.f64.convert_s(a: vec.i64) -> vec.f64` - -Lane-wise conversion from integer to floating point. Some integer values will be -rounded. - -#### Integer to integer narrowing - -* `vec.i16.narrow_s(a: vec.i16, b: vec.i16) -> vec.i8` -* `vec.i16.narrow_u(a: vec.i16, b: vec.i16) -> vec.i8` -* `vec.i32.narrow_s(a: vec.i32, b: vec.i32) -> vec.i16` -* `vec.i32.narrow_u(a: vec.i32, b: vec.i32) -> vec.i16` -* `vec.i64.narrow_s(a: vec.i64, b: vec.i64) -> vec.i32` -* `vec.i64.narrow_u(a: vec.i64, b: vec.i64) -> vec.i32` - -Converts two input vectors into a smaller lane vector by narrowing each lane, -signed or unsigned. The signed narrowing operation will use signed saturation -to handle overflow, 0x7f or 0x80 for i8x16, the unsigned narrowing operation -will use unsigned saturation to handle overflow, 0x00 or 0xff for i8x16. -Regardless of the whether the operation is signed or unsigned, the input lanes -are interpreted as signed integers. 
- -```python -def S.narrow_T_s(a, b): - result = S.New() - for i in range(T.Lanes): - result[i] = S.SignedSaturate(a[i]) - for i in range(T.Lanes): - result[T.Lanes + i] = S.SignedSaturate(b[i]) - return result - -def S.narrow_T_u(a, b): - result = S.New() - for i in range(T.Lanes): - result[i] = S.UnsignedSaturate(a[i]) - for i in range(T.Lanes): - result[T.Lanes + i] = S.UnsignedSaturate(b[i]) - return result -``` - -#### Integer to integer widening - -* `vec.i8.widen_low_s(a: vec.i8) -> vec.i16` -* `vec.i8.widen_high_s(a: vec.i8) -> vec.i16` -* `vec.i8.widen_low_u(a: vec.i8) -> vec.i16` -* `vec.i8.widen_high_u(a: vec.i8) -> vec.i16` -* `vec.i16.widen_low_s(a: vec.i16) -> vec.i32` -* `vec.i16.widen_high_s(a: vec.i16) -> vec.i32` -* `vec.i16.widen_low_u(a: vec.i16) -> vec.i32` -* `vec.i16.widen_high_u(a: vec.i16) -> vec.i32` -* `vec.i32.widen_low_s(a: vec.i32) -> vec.i64` -* `vec.i32.widen_high_s(a: vec.i32) -> vec.i64` -* `vec.i32.widen_low_u(a: vec.i32) -> vec.i64` -* `vec.i32.widen_high_u(a: vec.i32) -> vec.i64` - -Converts low or high half of the smaller lane vector to a larger lane vector, -sign extended or zero (unsigned) extended. - -```python -def S.widen_low_T(ext, a): - result = S.New() - for i in range(S.Lanes): - result[i] = ext(a[i]) - -def S.widen_high_T(ext, a): - result = S.New() - for i in range(S.Lanes): - result[i] = ext(a[S.Lanes + i]) - -def S.widen_low_T_s(a): - return S.widen_low_T(Sext, a) - -def S.widen_high_T_s(a): - return S.widen_high_T(Sext, a) - -def S.widen_low_T_u(a): - return S.widen_low_T(Zext, a) - -def S.widen_high_T_u(a): - return S.widen_high_T(Zext, a) -``` - diff --git a/proposals/flexible-vectors/FlexibleVectorsSecondTier.md b/proposals/flexible-vectors/FlexibleVectorsSecondTier.md deleted file mode 100644 index f3fb220..0000000 --- a/proposals/flexible-vectors/FlexibleVectorsSecondTier.md +++ /dev/null @@ -1,58 +0,0 @@ -Instructions considered conditionally -===================================== - -This document describes instructions considered conditionally pending -performance data. The reasons are listed in "implementation notes" sections. - -## Operations - -### Floating-point min and max - -These operations are not part of the IEEE 754-2008 standard. They are lane-wise -versions of the existing scalar WebAssembly operations. - -
- Implementation notes - - NaN quieting required for these operations is expensive on x86-based platforms. - See [WebAssembly/simd#186](https://github.com/WebAssembly/simd/issues/186). - -
- -#### NaN-propagating minimum - -* `vec.f32.min(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.min(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise minimum value, propagating NaNs. - -#### NaN-propagating maximum - -* `vec.f32.max(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.max(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise maximum value, propagating NaNs. - -### Conversions - -#### Integer to floating point - -* `vec.f32.convert_u(a: vec.i32) -> vec.f32` -* `vec.f64.convert_u(a: vec.i64) -> vec.f64` - -Lane-wise conversion from integer to floating point. Some integer values will be -rounded. - -#### Floating point to integer with saturation - -* `vec.i32.trunc_sat_s(a: vec.f32) -> vec.i32` -* `vec.i32.trunc_sat_u(a: vec.f32) -> vec.i32` -* `vec.i64.trunc_sat_s(a: vec.f64) -> vec.i64` -* `vec.i64.trunc_sat_u(a: vec.f64) -> vec.i64` - -Lane-wise saturating conversion from floating point to integer using the IEEE -`convertToIntegerTowardZero` function. If any input lane is a NaN, the -resulting lane is 0. If the rounded integer value of a lane is outside the -range of the destination type, the result is saturated to the nearest -representable integer value. - diff --git a/proposals/flexible-vectors/FlexibleVectorsThirdTier.md b/proposals/flexible-vectors/FlexibleVectorsThirdTier.md deleted file mode 100644 index 425c195..0000000 --- a/proposals/flexible-vectors/FlexibleVectorsThirdTier.md +++ /dev/null @@ -1,47 +0,0 @@ -Instructions considered conditionally -===================================== - -This document describes instructions considered conditionally pending -performance data with implications for other instructions in the proposal. The -reasons are listed in "implementation notes" sections. - -## Operations - -### Setting vector length - -
- Implementation notes - - -- 8-bit lanes - - `vec.i8.set_length(len: i32) -> i32` - - `vec.i8.set_length_imm(imm: ImmLaneIdx8) -> i32` -- 16-bit lanes - - `vec.i16.set_length(len: i32) -> i32` - - `vec.i16.set_length_imm(imm: ImmLaneIdx16) -> i32` -- 32-bit lanes - - `vec.i32.set_length(len: i32) -> i32` - - `vec.i32.set_length_imm(imm: ImmLaneIdx32) -> i32` - - `vec.f32.set_length(len: i32) -> i32` - - `vec.f32.set_length_imm(imm: ImmLaneIdx32) -> i32` -- 64-bit lanes - - `vec.i64.set_length(len: i32) -> i32` - - `vec.i64.set_length_imm(imm: ImmLaneIdx64) -> i32` - - `vec.f64.set_length(len: i32) -> i32` - - `vec.f64.set_length_imm(imm: ImmLaneIdx64) -> i32` - -The above operations set the number of lanes for corresponding vector type to -the minimum of supported vector length and the requested length. The length is -then returned on the stack. - -This sets number of lanes for vector operations working on corresponding vector -types. Setting vector length to zero turns corresponding vector operations -(aside of set length) into NOPs. - diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 86fd0ce..2a74d54 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -1,7 +1,1126 @@ -## Flexible vectors proposal +# Concrete model for flexible vectors -- Instruction descriptions - - [FlexibleVectors.md](FlexibleVectors.md) - - [FlexibleVectorsSecondTier.md](FlexibleVectorsSecondTier.md) - - [FlexibleVectorsThirdTier.md](FlexibleVectorsThirdTier.md) +## Terms +- (WASM) compiler: A compiler reads a program source code (eg: C) and translates it into WASM bytecode. +- target architecture: the architecture where the WASM bytecode is ran. +- WASM engine: An engine reads WASM bytecode and translates it into native instructions for the architecture it is currently running on, and runs them. +- element width: size in bits of the element (`float` is 32-bit wide). +- SIMD width: size in bits of the hardware SIMD registers. +- SIMD length: number of elements that can fit within a hardware SIMD register (SIMD width / element width). +- lane: element within a vector + +## General goal + +The core of the flexible vectors proposal is to a have a single bytecode that can target multiple architectures with different SIMD width. +This should be as efficient as possible (ie: it should run as fast as possible). +In order to do that, flexible vectors should abstract the SIMD width away. + +List of potential targets: + +| ISA | SIMD width | +|:--------------|:-------------| +| SSE - SSE4.2 | 128 | +| AVX - AVX2 | 256 | +| AVX512 - | 512 | +| Altivec - VSX | 128 | +| Neon | 128 | +| SVE | 128 - 2048 | +| Risc-V V | 128 - 65536? | + +I propose that SIMD width is accessible by WASM bytecode, and have the following property: SIMD width is a runtime constant, and thus cannot change during the execution of the program. +Therefore, all vectors manipulated by the program have the exact same width. +Smaller vectors are handled using masks. +As a consequence, a WASM compiler cannot assume any particular value, but can assume it will not change. +A WASM engine could optimize the bytecode for the target architecture (constant folding) where the actual SIMD width is known. + + +We could impose more constraints on the SIMD width: + +- It should be a multiple of 128 bits. +- It is a power of 2? (might be problematic to target SVE) +- It cannot be wider than 2048? 
(might be problematic to target Risc-V V) + +## SIMD types + +Vector types: + +- `vec.v8`: vector of 8-bit elements +- `vec.v16`: vector of 16-bit elements +- `vec.v32`: vector of 32-bit elements +- `vec.v64`: vector of 64-bit elements +- `vec.v128`: vector of 128-bit elements + +Mask types: + +- `vec.m8`: mask for a vector of 8-bit elements +- `vec.m16`: mask for a vector of 16-bit elements +- `vec.m32`: mask for a vector of 32-bit elements +- `vec.m64`: mask for a vector of 64-bit elements +- `vec.m128`: mask for a vector of 128-bit elements + +Vector types can be interpreted in multiple ways: + +| vector type | interpretations | +|:------------|:---------------------------------------------------| +| `vec.v8` | `vec.i8` | +| `vec.v16` | `vec.i16` | +| `vec.v32` | `vec.i32`, `vec.f32` | +| `vec.v64` | `vec.i64`, `vec.f64` | +| `vec.v128` | `vec.v8x16`, `vec.v16x8`, `vec.v32x4`, `vec.v64x2` | + +Mask types are not vector types, and their actual representation differ depending on the target architecture. +In particular, `vec.m8` and `vec.m16` might have different architectural sizes (or actually be the same). +In a similar fashion, `vec.v8` and `vec.m8` might also have different architectural sizes (or actually be the same). + +## Immediate LaneIdx operands + +As the vector length is not known at compile time, it does not make much sense to specify lane indices as immediates. + +Therefore, the type `i32` will be used to specify a lane index. +The range of a lane index is no further constrained, and is intepreted modulo the actual vector length. + +If we impose an upper bound on the SIMD width, we can further constraint the range of lane indices. +For instance, a maximum width of 2048 would allow to store a lane index in a 8-bit integer. +Similarily, a maximum width of 524288 would allow to store a lane index in a 16-bit integer. +The would not change the instruction encoding as they all take runtime lane indices which are always `i32`. + +## Special Operations + +### Vector length + +Returns the number of elements. +Returned value will always be the same during the whole execution of the program. 
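As a usage sketch of this runtime-constant length (the `length` instructions themselves are listed just below; the addresses, the element count `n`, and the variable names are illustrative assumptions, not part of the proposal):

```python
# Usage sketch: copy n 32-bit elements from addr_src to addr_dst.
# vec.v32.length is a runtime constant, so it can be queried once and hoisted
# out of the loop; an engine that knows the target SIMD width may constant-fold it.
vlen = vec.v32.length
i = 0
while i + vlen <= n:                      # whole vectors
    v = vec.v32.load(addr_src + 4 * i)    # memarg formed from a byte offset
    vec.v32.store(addr_dst + 4 * i, v)
    i += vlen
# the remaining n - i elements form a tail, handled with the mask
# operations and masked memory accesses described later in this document
```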
+ +- `vec.v8.length -> i32` +- `vec.v16.length -> i32` +- `vec.v32.length -> i32` +- `vec.v64.length -> i32` +- `vec.v128.length -> i32` + +### Mask count + +Returns the number of active lanes of a `mask` + +- `vec.m8.count(m: vec.m8) -> i32` +- `vec.m16.count(m: vec.m16) -> i32` +- `vec.m32.count(m: vec.m32) -> i32` +- `vec.m64.count(m: vec.m64) -> i32` +- `vec.m128.count(m: vec.m128) -> i32` + +```python +def mask.S.count(m): + result = 0 + for i in range(mask.S.length): + if m[i]: + result += 1 + return result +``` + +### Mask all active + +Returns a mask where all lanes are active + +- `vec.m8.all -> vec.m8` +- `vec.m16.all -> vec.m16` +- `vec.m32.all -> vec.m32` +- `vec.m64.all -> vec.m64` +- `vec.m128.all -> vec.m128` + +```python +def mask.S.all(): + result = mask.S.New() + for i in range(mask.S.length): + result[i] = 1 + return result +``` + +### Mask None active + +Returns a mask where no lane is active + +- `vec.m8.none -> vec.m8` +- `vec.m16.none -> vec.m16` +- `vec.m32.none -> vec.m32` +- `vec.m64.none -> vec.m64` +- `vec.m128.none -> vec.m128` + +```python +def mask.S.none(): + result = mask.S.New() + for i in range(mask.S.length): + result[i] = 0 + return result +``` + +### Mask index CMP + +Returns a mask whose active lanes satisfy `(x + laneIdx) CMP n` +CMP one of the following: `eq`, `ne`, `lt`, `le`, `gt`, `ge` + +- `vec.m8.index_CMP(x: i32, n: i32) -> vec.m8` +- `vec.m16.index_CMP(x: i32, n: i32) -> vec.m16` +- `vec.m32.index_CMP(x: i32, n: i32) -> vec.m32` +- `vec.m64.index_CMP(x: i32, n: i32) -> vec.m64` +- `vec.m128.index_CMP(x: i32, n: i32) -> vec.m128` + +```python +def vec.S.index_CMP(x, n): + result = vec.S.New() + for i in range(vec.S.length): + if (x + i) CMP n: + result[i] = 1 + else: + result[i] = 0 + return result +``` + +### Mask index first + +Returns the index of the first active lane. +If there is no active lane, the index of the last lane is returned. + +- `vec.m8.index_first(m: vec.m8) -> i32` +- `vec.m16.index_first(m: vec.m16) -> i32` +- `vec.m32.index_first(m: vec.m32) -> i32` +- `vec.m64.index_first(m: vec.m64) -> i32` +- `vec.m128.index_first(m: vec.m128) -> i32` + +```python +def vec.S.index_first(m): + for i in range(vec.S.length): + if m[i]: + return i + return vec.S.length - 1 +``` + +### Mask index last + +Returns the index of the last active lane. +If there is no active lane, the index of the first lane is returned. + +- `vec.m8.index_last(m: vec.m8) -> i32` +- `vec.m16.index_last(m: vec.m16) -> i32` +- `vec.m32.index_last(m: vec.m32) -> i32` +- `vec.m64.index_last(m: vec.m64) -> i32` +- `vec.m128.index_last(m: vec.m128) -> i32` + +```python +def vec.S.index_last(m): + idx = 0 + for i in range(vec.S.length): + if m[i]: + idx = i + return idx +``` + +### Mask first + +Returns a mask with a single active lane that corresponds to the first active lane of the input. +If there is no active lanes, then an empty mask is returned. + +- `vec.m8.first(m: vec.m8) -> vec.m8` +- `vec.m16.first(m: vec.m16) -> vec.m16` +- `vec.m32.first(m: vec.m32) -> vec.m32` +- `vec.m64.first(m: vec.m64) -> vec.m64` +- `vec.m128.first(m: vec.m128) -> vec.m128` + +```python +def vec.S.first(m): + result = mask.S.New() + for i in range(vec.S.length): + if m[i]: + result[i] = 1 + break + return result +``` + +### Mask last + +Returns a mask with a single active lane that corresponds to the last active lane of the input. +If there is no active lanes, then an empty mask is returned. 
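Before the remaining `last` listings below, a usage sketch of the index-based constructors above (`index_lt` denotes `index_CMP` instantiated with `lt`; the loop variables `i` and `n` are illustrative assumptions):

```python
# Usage sketch: build the mask for the final, partial iteration of a loop over
# n elements, where i is the index of the first element left to process.
tail = vec.m32.index_lt(i, n)    # lane j is active iff (i + j) < n
left = vec.m32.count(tail)       # active lane count; n - i when n - i <= vec.v32.length
```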
+ +- `vec.m8.last(m: vec.m8) -> vec.m8` +- `vec.m16.last(m: vec.m16) -> vec.m16` +- `vec.m32.last(m: vec.m32) -> vec.m32` +- `vec.m64.last(m: vec.m64) -> vec.m64` +- `vec.m128.last(m: vec.m128) -> vec.m128` + +```python +def vec.S.last(m): + result = mask.S.New() + idx = -1 + for i in range(vec.S.length): + if m[i]: + idx = i + if idx >= 0: + result[idx] = 1 + return result +``` + + +## Memory operations + +### Vector load + +- `vec.v8.load(a: memarg) -> vec.v8` +- `vec.v16.load(a: memarg) -> vec.v16` +- `vec.v32.load(a: memarg) -> vec.v32` +- `vec.v64.load(a: memarg) -> vec.v64` +- `vec.v128.load(a: memarg) -> vec.v128` + +### Vector load mask zero + +Inactive elements are set to `0` + +- `vec.v8.load_mz(m: vec.m8, a: memarg) -> vec.v8` +- `vec.v16.load_mz(m: vec.m16, a: memarg) -> vec.v16` +- `vec.v32.load_mz(m: vec.m32, a: memarg) -> vec.v32` +- `vec.v64.load_mz(m: vec.m64, a: memarg) -> vec.v64` +- `vec.v128.load_mz(m: vec.m128, a: memarg) -> vec.v128` + +### Vector load mask undefined + +Inactive elements have undefined values + +- `vec.v8.load_mx(m: vec.m8, a: memarg) -> vec.v8` +- `vec.v16.load_mx(m: vec.m16, a: memarg) -> vec.v16` +- `vec.v32.load_mx(m: vec.m32, a: memarg) -> vec.v32` +- `vec.v64.load_mx(m: vec.m64, a: memarg) -> vec.v64` +- `vec.v128.load_mx(m: vec.m128, a: memarg) -> vec.v128` + +### Vector load splat + +- `vec.v8.load_splat(a: memarg) -> vec.v8` +- `vec.v16.load_splat(a: memarg) -> vec.v16` +- `vec.v32.load_splat(a: memarg) -> vec.v32` +- `vec.v64.load_splat(a: memarg) -> vec.v64` +- `vec.v128.load_splat(a: memarg) -> vec.v128` + +### Vector store + +- `vec.v8.store(a: memarg, v: vec.v8)` +- `vec.v16.store(a: memarg, v: vec.v16)` +- `vec.v32.store(a: memarg, v: vec.v32)` +- `vec.v64.store(a: memarg, v: vec.v64)` +- `vec.v128.store(a: memarg, v: vec.v128)` + +### Vector store mask + +Inactive elements are not stored + +- `vec.v8.m_store(m: vec.m8, a: memarg, v: vec.v8)` +- `vec.v16.m_store(m: vec.m16, a: memarg, v: vec.v16)` +- `vec.v32.m_store(m: vec.m32, a: memarg, v: vec.v32)` +- `vec.v64.m_store(m: vec.m64, a: memarg, v: vec.v64)` +- `vec.v128.m_store(m: vec.m128, a: memarg, v: vec.v128)` + +## Lane operations + +### Splat scalar + +`idx` is interpreted modulo the length of the vector. + +- `vec.i8.splat(v: vec.v8, x: i32) -> vec.v8` +- `vec.i16.splat(v: vec.v16, x: i32) -> vec.v16` +- `vec.i32.splat(v: vec.v32, x: i32) -> vec.v32` +- `vec.f32.splat(v: vec.v32, x: f32) -> vec.32` +- `vec.i64.splat(v: vec.v64, x: i64) -> vec.v64` +- `vec.f64.splat(v: vec.v64, x: f64) -> vec.v64` +- `vec.v128.splat(v: vec.v128, x: v128) -> vec.v128` + + +### Extract lane + +`idx` is interpreted modulo the length of the vector. + +- `vec.s8.extract_lane(v: vec.v8, idx: i32) -> i32` +- `vec.u8.extract_lane(v: vec.v8, idx: i32) -> i32` +- `vec.s16.extract_lane(v: vec.v16, idx: i32) -> i32` +- `vec.u16.extract_lane(v: vec.v16, idx: i32) -> i32` +- `vec.i32.extract_lane(v: vec.v32, idx: i32) -> i32` +- `vec.f32.extract_lane(v: vec.v32, idx: i32) -> f32` +- `vec.i64.extract_lane(v: vec.v64, idx: i32) -> i64` +- `vec.f64.extract_lane(v: vec.v64, idx: i32) -> f64` +- `vec.v128.extract_lane(v: vec.v128, idx: i32) -> v128` + +### Replace lane + +`idx` is interpreted modulo the length of the vector. 
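The masked memory accesses and lane accessors above combine naturally with such a tail mask; a small sketch follows (the replace forms are listed just below, `vec.i32.replace_lane` is the spelling assumed for them here, and `tail`, `i`, and the addresses are the illustrative names from the earlier sketches):

```python
# Usage sketch: finish the copy loop's partial tail with masked accesses.
v = vec.v32.load_mz(tail, addr_src + 4 * i)   # inactive lanes are read as 0
vec.v32.m_store(tail, addr_dst + 4 * i, v)    # inactive lanes are not stored

# Lane accessors take runtime indices, interpreted modulo the vector length:
x = vec.i32.extract_lane(v, 0)                       # first lane as a scalar
v = vec.i32.replace_lane(v, vec.v32.length - 1, x)   # assumed replace spelling
```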
+ +- `vec.i8.extract_lane(v: vec.v8, idx: i32, x: i32) -> vec.v8` +- `vec.i16.extract_lane(v: vec.v16, idx: i32, x: i32) -> vec.v16` +- `vec.i32.extract_lane(v: vec.v32, idx: i32, x: i32) -> vec.v32` +- `vec.f32.extract_lane(v: vec.v32, idx: i32, x: f32) -> vec.v32` +- `vec.i64.extract_lane(v: vec.v64, idx: i32, x: i64) -> vec.v64` +- `vec.f64.extract_lane(v: vec.v64, idx: i32, x: f64) -> vec.v64` +- `vec.v128.extract_lane(v: vec.v128, idx: i32, x: v128) -> vec.v128` + +### Load lane + +Loads a single lane into existing vector. +`idx` is interpreted modulo the length of the vector. + +- `vec.v8.load_lane(a: memarg, v: vec.v8, idx: i32) -> vec.v8` +- `vec.v16.load_lane(a: memarg, v: vec.v16, idx: i32) -> vec.v16` +- `vec.v32.load_lane(a: memarg, v: vec.v32, idx: i32) -> vec.v32` +- `vec.v64.load_lane(a: memarg, v: vec.v64, idx: i32) -> vec.v64` +- `vec.v128.load_lane(a: memarg, v: vec.v128, idx: i32) -> vec.v128` + +### Store lane + +Stores a single lane from vector +`idx` is interpreted modulo the length of the vector. + +- `vec.v8.store_lane(a: memarg, v: vec.v8, idx: i32)` +- `vec.v16.store_lane(a: memarg, v: vec.v16, idx: i32)` +- `vec.v32.store_lane(a: memarg, v: vec.v32, idx: i32)` +- `vec.v64.store_lane(a: memarg, v: vec.v64, idx: i32)` +- `vec.v128.store_lane(a: memarg, v: vec.v128, idx: i32)` + + +## Unary Arithmetic operators + +UNOP designates any unary operator (eg: neg, not) + +### UNOP + +- `vec.v8.UNOP(a: vec.v8) -> vec.v8` +- `vec.v16.UNOP(a: vec.v16) -> vec.v16` +- `vec.v32.UNOP(a: vec.v32) -> vec.v32` +- `vec.v64.UNOP(a: vec.v64) -> vec.v64` +- `vec.v128.UNOP(a: vec.v128) -> vec.v128` + +- `vec.m8.UNOP(a: vec.m8) -> vec.m8` +- `vec.m16.UNOP(a: vec.m16) -> vec.m16` +- `vec.m32.UNOP(a: vec.m32) -> vec.m32` +- `vec.m64.UNOP(a: vec.m64) -> vec.m64` +- `vec.m128.UNOP(a: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. + +### UNOP mask zero + +Inactive lanes are set to zero. + +- `vec.v8.UNOP_mz(m: vec.m8, a: vec.v8) -> vec.v8` +- `vec.v16.UNOP_mz(m: vec.m16, a: vec.v16) -> vec.v16` +- `vec.v32.UNOP_mz(m: vec.m32, a: vec.v32) -> vec.v32` +- `vec.v64.UNOP_mz(m: vec.m64, a: vec.v64) -> vec.v64` +- `vec.v128.UNOP_mz(m: vec.m128, a: vec.v128) -> vec.v128` + +### UNOP mask merge + +Inactive lanes are left untouched. + +- `vec.v8.UNOP_mm(m: vec.m8, a: vec.v8) -> vec.v8` +- `vec.v16.UNOP_mm(m: vec.m16, a: vec.v16) -> vec.v16` +- `vec.v32.UNOP_mm(m: vec.m32, a: vec.v32) -> vec.v32` +- `vec.v64.UNOP_mm(m: vec.m64, a: vec.v64) -> vec.v64` +- `vec.v128.UNOP_mm(m: vec.m128, a: vec.v128) -> vec.v128` + +### UNOP mask undefined + +Inactive lanes are undefined. + +- `vec.v8.UNOP_mx(m: vec.m8, a: vec.v8) -> vec.v8` +- `vec.v16.UNOP_mx(m: vec.m16, a: vec.v16) -> vec.v16` +- `vec.v32.UNOP_mx(m: vec.m32, a: vec.v32) -> vec.v32` +- `vec.v64.UNOP_mx(m: vec.m64, a: vec.v64) -> vec.v64` +- `vec.v128.UNOP_mx(m: vec.m128, a: vec.v128) -> vec.v128` + +## Binary Arithmetic operators + +BINOP designates any binary operator that is not a comparison (eg: add, sub, rsub, mul, div, rdiv, and, or, xor...) + +### Select mask + +Selects active elements from `a` and inactive elements from `b`. 
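A semantics sketch for `select`, in the same pseudocode style as the rest of this document and following the one-line description above (the signatures are listed just below):

```python
def vec.S.select(m, a, b):
    result = vec.S.New()
    for i in range(vec.S.length):
        if m[i]:
            result[i] = a[i]   # active lane: taken from a
        else:
            result[i] = b[i]   # inactive lane: taken from b
    return result
```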
+ +- `vec.v8.select(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.select(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.select(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.select(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.select(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + +### BINOP + +- `vec.v8.BINOP(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.BINOP(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.BINOP(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.BINOP(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.BINOP(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.BINOP(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.BINOP(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.BINOP(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.BINOP(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.BINOP(a: vec.m128, b: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. + +### BINOP mask zero + +Inactive elements are set to zero. + +- `vec.v8.BINOP_mz(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.BINOP_mz(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.BINOP_mz(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.BINOP_mz(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.BINOP_mz(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.BINOP_mz(m: vec.m8, a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.BINOP_mz(m: vec.m16, a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.BINOP_mz(m: vec.m32, a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.BINOP_mz(m: vec.m64, a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.BINOP_mz(m: vec.m128, a: vec.m128, b: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. + +### BINOP mask merge + +Inactive elements are forwarded from `a`. + +- `vec.v8.BINOP_mm(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.BINOP_mm(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.BINOP_mm(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.BINOP_mm(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.BINOP_mm(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.BINOP_mm(m: vec.m8, a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.BINOP_mm(m: vec.m16, a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.BINOP_mm(m: vec.m32, a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.BINOP_mm(m: vec.m64, a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.BINOP_mm(m: vec.m128, a: vec.m128, b: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. + +### BINOP mask undefined + +Inactive elements are undefined. + +- `vec.v8.BINOP_mx(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.BINOP_mx(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.BINOP_mx(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.BINOP_mx(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.BINOP_mx(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + + +## Comparisons + +CMP designates any comparison operator (eg: `eq_u`, `ne_s`, `lt_f`, `le_s`, `gt_f`, `ge_u`) + +### CMP + +- `vec.m8.CMP(a: vec.v8, b: vec.v8) -> vec.m8` +- `vec.m16.CMP(a: vec.v16, b: vec.v16) -> vec.m16` +- `vec.m32.CMP(a: vec.v32, b: vec.v32) -> vec.m32` +- `vec.m64.CMP(a: vec.v64, b: vec.v64) -> vec.m64` + +```python +def vec.S.CMP(a, b): + result = mask.S.New() + for i in range(mask.S.length): + if a[i] CMP b[i]: + result[i] = 1 + else: + result[i] = 0 + return result +``` + +### CMP mask + +Inactive elements are set to `0`. 
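Before the masked-comparison listings below, a semantics sketch for the masked binary forms of the previous section, in the same pseudocode style (`BINOP` stands for any of the listed binary operators):

```python
def vec.S.BINOP_mz(m, a, b):
    result = vec.S.New()
    for i in range(vec.S.length):
        if m[i]:
            result[i] = BINOP(a[i], b[i])
        else:
            result[i] = 0            # inactive lanes are set to zero
    return result

def vec.S.BINOP_mm(m, a, b):
    result = vec.S.New()
    for i in range(vec.S.length):
        if m[i]:
            result[i] = BINOP(a[i], b[i])
        else:
            result[i] = a[i]         # inactive lanes are forwarded from a
    return result
```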
+ +- `vec.m8.CMP_m(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.m8` +- `vec.m16.CMP_m(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.m16` +- `vec.m32.CMP_m(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.m32` +- `vec.m64.CMP_m(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.m64` + +```python +def vec.S.CMP_m(m, a, b): + result = mask.S.New() + for i in range(mask.S.length): + if m[i] and a[i] CMP b[i]: + result[i] = 1 + else: + result[i] = 0 + return result +``` + +### Sign to mask + +For each lane, the mask lane is set to `1` if the element is negative (floats are not interpreted), and `0` otherwise. + +- `vec.m8.sign(a: vec.v8) -> vec.m8` +- `vec.m16.sign(a: vec.v16) -> vec.m16` +- `vec.m32.sign(a: vec.v32) -> vec.m32` +- `vec.m64.sign(a: vec.v64) -> vec.m64` + +## Inter-lane operations + +### LUT1 zero + +Elements whose index is out of bounds are set to `0`. + +- `vec.v8.lut1_z(idx: vec.v8, a: vec.v8) -> vec.v8` +- `vec.v16.lut1_z(idx: vec.v16, a: vec.v16) -> vec.v16` +- `vec.v32.lut1_z(idx: vec.v32, a: vec.v32) -> vec.v32` +- `vec.v64.lut1_z(idx: vec.v64, a: vec.v64) -> vec.v64` +- `vec.v128.lut1_z(idx: vec.v128, a: vec.v128) -> vec.v128` + +```python +def vec.S.lut1_z(idx, a): + result = vec.S.New() + for i in range(vec.S.length): + if idx[i] < vec.S.length: + result[i] = a[idx[i]] + else: + result[i] = 0 + return result +``` +### LUT1 merge + +Elements whose index is out of bounds are taken from fallback. + +- `vec.v8.lut1_m(idx: vec.v8, a: vec.v8, fallback: vec.v8) -> vec.v8` +- `vec.v16.lut1_m(idx: vec.v16, a: vec.v16, fallback: vec.v16) -> vec.v16` +- `vec.v32.lut1_m(idx: vec.v32, a: vec.v32, fallback: vec.v32) -> vec.v32` +- `vec.v64.lut1_m(idx: vec.v64, a: vec.v64, fallback: vec.v64) -> vec.v64` + +```python +def vec.S.lut1_m(idx, a, fallback): + result = vec.S.New() + for i in range(vec.S.length): + if idx[i] < vec.S.length: + result[i] = a[idx[i]] + else: + result[i] = fallback[i] + return result +``` + +### V128 shuffle + +Applies shuffle to each v128 of the vector. + +- `vec.i8x16.shuffle(a: vec.v128, b: vec.v128, imm: ImmLaneIdx32[16]) -> vec.v128` + +```python +def vec.i8x16.shuffle(a, b, imm): + result = vec.v128.New() + for i in range(vec.v128.length): + result[i] = i8x16.shuffle(a[i], b[i], imm) + return result +``` + +### V128 swizzle + +Applies swizzle to each v128 of the vector. + +- `vec.i8x16.swizzle(a: vec.v128, s: vec.v128) -> vec.v128` + +```python +def vec.i8x16.swizzle(idx, a, s): + result = vec.v128.New() + for i in range(vec.v128.length): + result[i] = i8x16.swizzle(a[i], s[i], imm) + return result +``` + +### Splat lane + +Gets a single lane from vector and broadcast it to the entire vector. +`idx` is interpreted modulo the cardinal of the vector. + +- `vec.v8.splat_lane(v: vec.v8, idx: i32) -> vec.v8` +- `vec.v16.splat_lane(v: vec.v16, idx: i32) -> vec.v16` +- `vec.v32.splat_lane(v: vec.v32, idx: i32) -> vec.v32` +- `vec.v64.splat_lane(v: vec.v64, idx: i32) -> vec.v64` +- `vec.v128.splat_lane(v: vec.v128, idx: i32) -> vec.v128` + +```python +def vec.S.splat_lane(v, imm): + idx = idx % vec.S.length + result = vec.S.New() + for i in range(vec.S.length): + result[i] = v[idx] + return result +``` + +### Concat + +Copies elements from vector `a` from first active element to last active element. +Inner inactive elements are also copied. +The remaining elements are set from the first elements from `b`. 
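A semantics sketch for the sign-to-mask operation above, in the same pseudocode style, placed here before the concat listings below (`AsSignedInt` is an assumed helper that reinterprets the lane bits as a signed integer):

```python
def vec.S.sign(a):
    result = mask.S.New()
    for i in range(vec.S.length):
        # active iff the element's sign bit is set, i.e. its bits read
        # as a signed integer are negative (float lanes are not interpreted)
        if AsSignedInt(a[i]) < 0:
            result[i] = 1
        else:
            result[i] = 0
    return result
```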
+ +- `vec.v8.concat(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.concat(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.concat(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.concat(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.concat(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + + +```python +def vec.S.concat(m, a, b): + begin = -1 + end = -1 + for i in range(vec.S.length): + if m[i]: + end = i + 1 + if begin < 0: + begin = i + + result = vec.S.New() + i = 0 + for j in range(begin, end): + result[i] = a[j] + i += 1 + for j in range(0, vec.S.length - i): + result[i] = b[j] + i += 1 + return result +``` + +### Lane shift + +Concats the 2 input vector to form a single double-width vector. +Shifts this double-width vector by `n` lane to the left (to LSB). +Extracts the lower half of the shifted vector. +`n` is interpreted modulo the length of the vector. + + +- `vec.v8.lane_shift(a: vec.v8, b: vec.v8, n: i32) -> vec.v8` +- `vec.v16.lane_shift(a: vec.v16, b: vec.v16, n: i32) -> vec.v16` +- `vec.v32.lane_shift(a: vec.v32, b: vec.v32, n: i32) -> vec.v32` +- `vec.v64.lane_shift(a: vec.v64, b: vec.v64, n: i32) -> vec.v64` +- `vec.v128.lane_shift(a: vec.v128, b: vec.v128, n: i32) -> vec.v128` + +```python +def vec.S.lane_shift(a, b, n): + result = vec.S.New() + n = n % vec.S.length + for i in range(0, n): + result[i] = a[i + n] + for i in range(n, vec.S.length): + result[i] = b[i - n] + return result +``` + +### Interleave even + +Extracts even elements from both input and interleaves them. + +- `vec.v8.interleave_even(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.interleave_even(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.interleave_even(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.interleave_even(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.interleave_even(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.interleave_even(a: vec.m8, b: mec.m8) -> vec.m8` +- `vec.m16.interleave_even(a: vec.m16, b: mec.m16) -> vec.m16` +- `vec.m32.interleave_even(a: vec.m32, b: mec.m32) -> vec.m32` +- `vec.m64.interleave_even(a: vec.m64, b: mec.m64) -> vec.m64` +- `vec.m128.interleave_even(a: vec.m128, b: mec.m128) -> vec.m128` + + +```python +def vec.S.interleave_even(a, b): + result = vec.S.New() + for i in range(vec.S.length/2): + result[2*i] = a[2*i] + result[2*i + 1] = b[2*i] + return result +``` + +Note: + +> - can be implemented with `TRN1` on Neon/SVE + +### Interleave odd + +Extracts odd elements from both input and interleaves them. + +- `vec.v8.interleave_odd(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.interleave_odd(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.interleave_odd(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.interleave_odd(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.interleave_odd(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.interleave_odd(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.interleave_odd(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.interleave_odd(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.interleave_odd(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.interleave_odd(a: vec.m128, b: vec.m128) -> vec.m128` + + +```python +def vec.S.interleave_odd(a, b): + result = vec.S.New() + for i in range(vec.S.length/2): + result[2*i] = a[2*i+1] + result[2*i + 1] = b[2*i+1] + return result +``` + +Note: + +> - can be implemented with `TRN2` on Neon/SVE + +### Concat even + +Extracts even elements from both input and concatenate them. 
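A usage sketch for the even/odd extraction operations (`concat_even` is listed just below, `concat_odd` right after it): de-interleaving complex numbers stored as (re, im) pairs. The memory layout and addresses are illustrative assumptions.

```python
# Usage sketch: split interleaved (re, im) f32 pairs into two vectors.
lo = vec.v32.load(addr)                        # re0 im0 re1 im1 ...
hi = vec.v32.load(addr + 4 * vec.v32.length)   # the following pairs
re = vec.v32.concat_even(lo, hi)               # re0 re1 re2 ...
im = vec.v32.concat_odd(lo, hi)                # im0 im1 im2 ...
```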
+ +- `vec.v8.concat_even(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.concat_even(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.concat_even(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.concat_even(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.concat_even(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.concat_even(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.concat_even(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.concat_even(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.concat_even(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.concat_even(a: vec.m128, b: vec.m128) -> vec.m128` + + +```python +def vec.S.concat_even(a, b): + result = vec.S.New() + + for i in range(vec.S.length/2): + result[i] = a[2*i] + for i in range(vec.S.length/2): + result[i + vec.S.length/2] = b[2*i] + return result +``` + +Note: + +> - can be implemented with `UZP1` on Neon/SVE +> - Wrapping narrowing integer conversions could be implemented with this function + +### Concat odd + +Extracts odd elements from both input and concatenate them. + +- `vec.v8.concat_odd(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.concat_odd(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.concat_odd(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.concat_odd(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.concat_odd(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.concat_odd(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.concat_odd(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.concat_odd(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.concat_odd(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.concat_odd(a: vec.m128, b: vec.m128) -> vec.m128` + + +```python +def vec.S.concat_odd(a, b): + result = vec.S.New() + + for i in range(vec.S.length/2): + result[i] = a[2*i+1] + for i in range(vec.S.length/2): + result[i + vec.S.length/2] = b[2*i+1] + return result +``` + +Note: + +> - can be implemented with `UZP2` on Neon/SVE + +### Interleave low + +Extracts the lower half of both input and interleaves their elements. + +- `vec.v8.interleave_low(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.interleave_low(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.interleave_low(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.interleave_low(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.interleave_low(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.interleave_low(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.interleave_low(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.interleave_low(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.interleave_low(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.interleave_low(a: vec.m128, b: vec.m128) -> vec.m128` + + +```python +def vec.S.interleave_low(a, b): + result = vec.S.New() + for i in range(vec.S.length/2): + result[2*i] = a[i] + result[2*i + 1] = b[i] + return result +``` + +Note: + +> - can be implemented with `ZIP1` on Neon/SVE + +### Interleave high + +Extracts the higher half of both input and interleaves their elements. 
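The note under *Concat even* above observes that wrapping narrowing conversions can be built from it; a sketch follows, before the interleave-high listings below (it assumes WebAssembly's little-endian lane ordering, so the even 16-bit lanes of a reinterpreted `vec.v32` hold the low halves of the 32-bit elements):

```python
# Sketch: a wrapping (truncating) i32 -> i16 narrowing built from concat_even
# and the reinterpret casts defined later in this document.
def wrapping_narrow_i32_to_i16(a, b):     # a, b interpreted as vec.i32
    lo_a = vec.v16.cast_v32(a)            # reinterpret as 16-bit lanes
    lo_b = vec.v16.cast_v32(b)
    return vec.v16.concat_even(lo_a, lo_b)
```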
+
+- `vec.v8.interleave_high(a: vec.v8, b: vec.v8) -> vec.v8`
+- `vec.v16.interleave_high(a: vec.v16, b: vec.v16) -> vec.v16`
+- `vec.v32.interleave_high(a: vec.v32, b: vec.v32) -> vec.v32`
+- `vec.v64.interleave_high(a: vec.v64, b: vec.v64) -> vec.v64`
+- `vec.v128.interleave_high(a: vec.v128, b: vec.v128) -> vec.v128`
+
+- `vec.m8.interleave_high(a: vec.m8, b: vec.m8) -> vec.m8`
+- `vec.m16.interleave_high(a: vec.m16, b: vec.m16) -> vec.m16`
+- `vec.m32.interleave_high(a: vec.m32, b: vec.m32) -> vec.m32`
+- `vec.m64.interleave_high(a: vec.m64, b: vec.m64) -> vec.m64`
+- `vec.m128.interleave_high(a: vec.m128, b: vec.m128) -> vec.m128`
+
+
+```python
+def vec.S.interleave_high(a, b):
+    result = vec.S.New()
+    for i in range(vec.S.length/2):
+        result[2*i] = a[i + vec.S.length/2]
+        result[2*i + 1] = b[i + vec.S.length/2]
+    return result
+```
+
+Note:
+
+> - can be implemented with `ZIP2` on Neon/SVE
+
+## Conversions
+
+### Narrowing conversions
+
+Converts each element of both inputs to narrower types using saturation, and concatenates them.
+
+- `vec.i8.narrow_i16_u(a: vec.v16, b: vec.v16) -> vec.v8`
+- `vec.i8.narrow_i16_s(a: vec.v16, b: vec.v16) -> vec.v8`
+- `vec.i16.narrow_i32_u(a: vec.v32, b: vec.v32) -> vec.v16`
+- `vec.i16.narrow_i32_s(a: vec.v32, b: vec.v32) -> vec.v16`
+- `vec.i32.narrow_i64_u(a: vec.v64, b: vec.v64) -> vec.v32`
+- `vec.i32.narrow_i64_s(a: vec.v64, b: vec.v64) -> vec.v32`
+
+### Mask narrowing
+
+Returns a `mask` for a narrower type with the same active lanes.
+
+- `vec.m8.narrow_m16(a: vec.m16, b: vec.m16) -> vec.m8`
+- `vec.m16.narrow_m32(a: vec.m32, b: vec.m32) -> vec.m16`
+- `vec.m32.narrow_m64(a: vec.m64, b: vec.m64) -> vec.m32`
+- `vec.m64.narrow_m128(a: vec.m128, b: vec.m128) -> vec.m64`
+
+```python
+def mask.S.narrow(a, b):
+    result = mask.S.New()
+    for i in range(mask.S.length/2):
+        result[i] = a[i]
+    for i in range(mask.S.length/2):
+        result[i + mask.S.length/2] = b[i]
+    return result
+```
+
+### Widening conversions
+
+- `vec.i16.widen_low_i8_u(a: vec.v8) -> vec.v16`
+- `vec.i16.widen_low_i8_s(a: vec.v8) -> vec.v16`
+- `vec.i32.widen_low_i16_u(a: vec.v16) -> vec.v32`
+- `vec.i32.widen_low_i16_s(a: vec.v16) -> vec.v32`
+- `vec.i64.widen_low_i32_u(a: vec.v32) -> vec.v64`
+- `vec.i64.widen_low_i32_s(a: vec.v32) -> vec.v64`
+
+- `vec.i16.widen_high_i8_u(a: vec.v8) -> vec.v16`
+- `vec.i16.widen_high_i8_s(a: vec.v8) -> vec.v16`
+- `vec.i32.widen_high_i16_u(a: vec.v16) -> vec.v32`
+- `vec.i32.widen_high_i16_s(a: vec.v16) -> vec.v32`
+- `vec.i64.widen_high_i32_u(a: vec.v32) -> vec.v64`
+- `vec.i64.widen_high_i32_s(a: vec.v32) -> vec.v64`
+
+### Mask widening
+
+Returns a `mask` for a wider type with the same active lanes as the lower/higher part of the original `mask`.
+
+- `vec.m16.widen_low_m8(m: vec.m8) -> vec.m16`
+- `vec.m32.widen_low_m16(m: vec.m16) -> vec.m32`
+- `vec.m64.widen_low_m32(m: vec.m32) -> vec.m64`
+- `vec.m128.widen_low_m64(m: vec.m64) -> vec.m128`
+
+- `vec.m16.widen_high_m8(m: vec.m8) -> vec.m16`
+- `vec.m32.widen_high_m16(m: vec.m16) -> vec.m32`
+- `vec.m64.widen_high_m32(m: vec.m32) -> vec.m64`
+- `vec.m128.widen_high_m64(m: vec.m64) -> vec.m128`
+
+```python
+def vec.S.widen_low_T(m):
+    result = vec.S.New()
+    for i in range(vec.S.length):
+        result[i] = m[i]
+    return result
+```
+
+```python
+def vec.S.widen_high_T(m):
+    result = vec.S.New()
+    for i in range(vec.S.length):
+        result[i] = m[i + vec.S.length]
+    return result
+```
+
+### Floating point promotion
+
+- `vec.f64.promote_low_f32(a: vec.v32) -> vec.v64`
+- `vec.f64.promote_high_f32(a: 
vec.v32) -> vec.v64` + +### Floating point demotion + +- `vec.f32.demote_f64(a: vec.v64, b: vec.v64) -> vec.v32` + +### Integer to single-precision floating point + +- `vec.f32.convert_i32_s(a: vec.v32) -> vec.v32` +- `vec.f32.convert_i32_u(a: vec.v32) -> vec.v32` +- `vec.f32.convert_i64_s(a: vec.v64, b: vec.v64) -> vec.v32` +- `vec.f32.convert_i64_u(a: vec.v64, b: vec.v64) -> vec.v32` + +### Integer to double-precision floating point + +- `vec.f64.convert_low_i32_s(a: vec.v32) -> vec.v64` +- `vec.f64.convert_low_i32_u(a: vec.v32) -> vec.v64` +- `vec.f64.convert_high_i32_s(a: vec.v32) -> vec.v64` +- `vec.f64.convert_high_i32_u(a: vec.v32) -> vec.v64` +- `vec.f64.convert_i64_s(a: vec.v64) -> vec.v64` +- `vec.f64.convert_i64_u(a: vec.v64) -> vec.v64` + +### single precision floating point to integer with saturation + +- `vec.i32.trunc_sat_f32_s(a: vec.v32) -> vec.v32` +- `vec.i32.trunc_sat_f32_u(a: vec.v32) -> vec.v32` +- `vec.i64.trunc_sat_low_f32_s(a: vec.v32) -> vec.v64` +- `vec.i64.trunc_sat_low_f32_u(a: vec.v32) -> vec.v64` +- `vec.i64.trunc_sat_high_f32_s(a: vec.v32) -> vec.v64` +- `vec.i64.trunc_sat_high_f32_u(a: vec.v32) -> vec.v64` + +### double precision floating point to integer with saturation + +- `vec.i32.trunc_sat_f64_s(a: vec.v64, b: vec.v64) -> vec.v32` +- `vec.i32.trunc_sat_f64_u(a: vec.v64, b: vec.v64) -> vec.v32` +- `vec.i64.trunc_sat_f64_s(a: vec.v64) -> vec.v64` +- `vec.i64.trunc_sat_f64_u(a: vec.v64) -> vec.v64` + +### Reinterpret casts + +- `vec.v8.cast_v16(a: vec.v16) -> vec.v8` +- `vec.v8.cast_v32(a: vec.v32) -> vec.v8` +- `vec.v8.cast_v64(a: vec.v64) -> vec.v8` +- `vec.v8.cast_v128(a: vec.v128) -> vec.v8` +- `vec.v16.cast_v8(a: vec.v8) -> vec.v16` +- `vec.v16.cast_v32(a: vec.v32) -> vec.v16` +- `vec.v16.cast_v64(a: vec.v64) -> vec.v16` +- `vec.v16.cast_v128(a: vec.v128) -> vec.v16` +- `vec.v32.cast_v8(a: vec.v8) -> vec.v32` +- `vec.v32.cast_v16(a: vec.v16) -> vec.v32` +- `vec.v32.cast_v64(a: vec.v64) -> vec.v32` +- `vec.v32.cast_v128(a: vec.v128) -> vec.v32` +- `vec.v64.cast_v8(a: vec.v8) -> vec.v64` +- `vec.v64.cast_v16(a: vec.v16) -> vec.v64` +- `vec.v64.cast_v32(a: vec.v32) -> vec.v64` +- `vec.v64.cast_v128(a: vec.v128) -> vec.v64` +- `vec.v128.cast_v8(a: vec.v8) -> vec.v128` +- `vec.v128.cast_v16(a: vec.v16) -> vec.v128` +- `vec.v128.cast_v32(a: vec.v32) -> vec.v128` +- `vec.v128.cast_v64(a: vec.v64) -> vec.v128` + +### Mask cast + +- `vec.m8.cast_m16(m: vec.m16) -> vec.m8` +- `vec.m8.cast_m32(m: vec.m32) -> vec.m8` +- `vec.m8.cast_m64(m: vec.m64) -> vec.m8` +- `vec.m8.cast_m128(m: vec.m128) -> vec.m8` +- `vec.m16.cast_m8(m: vec.m8) -> vec.m16` +- `vec.m16.cast_m32(m: vec.m32) -> vec.m16` +- `vec.m16.cast_m64(m: vec.m64) -> vec.m16` +- `vec.m16.cast_m128(m: vec.m128) -> vec.m16` +- `vec.m32.cast_m8(m: vec.m8) -> vec.m32` +- `vec.m32.cast_m16(m: vec.m16) -> vec.m32` +- `vec.m32.cast_m64(m: vec.m64) -> vec.m32` +- `vec.m32.cast_m128(m: vec.m128) -> vec.m32` +- `vec.m64.cast_m8(m: vec.m8) -> vec.m64` +- `vec.m64.cast_m16(m: vec.m16) -> vec.m64` +- `vec.m64.cast_m32(m: vec.m32) -> vec.m64` +- `vec.m64.cast_m128(m: vec.m128) -> vec.m64` +- `vec.m128.cast_m8(m: vec.m8) -> vec.m128` +- `vec.m128.cast_m16(m: vec.m16) -> vec.m128` +- `vec.m128.cast_m32(m: vec.m32) -> vec.m128` +- `vec.m128.cast_m64(m: vec.m64) -> vec.m128` + + +```python +def vec.S.cast_T(m): + result = vec.S.New() + if vec.T.length < vec.S.length: + d = vec.S.length / vec.T.length + for i in range(vec.T.length): + for j in range(d): + result[i*d + j] = m[i] + else: + d = vec.T.length / vec.S.length + 
for i in range(vec.S.length): + result[i] = m[i * d] + return result +``` + +### Mask to vec + +Active lanes are to `-1` (all one bits), and inactive lanes are set to 0. + +- `vec.v8.convert_m8(m: vec.m8) -> vec.v8` +- `vec.v16.convert_m16(m: vec.m16) -> vec.v16` +- `vec.v32.convert_m32(m: vec.m32) -> vec.v32` +- `vec.v64.convert_m64(m: vec.m64) -> vec.v64` +- `vec.v128.convert_m128(m: vec.m128) -> vec.v128` + +## Test masks + +### Test none + +Returns `1` if and only if all lanes are inactive. +Returns `0` otherwise. + +- `vec.m8.test_none(m: vec.m8) -> i32` +- `vec.m16.test_none(m: vec.m16) -> i32` +- `vec.m32.test_none(m: vec.m32) -> i32` +- `vec.m64.test_none(m: vec.m64) -> i32` +- `vec.m128.test_none(m: vec.m128) -> i32` + +### Test any + +Returns `1` if and only if there is at least one active lane. +Returns `0` otherwise. + +- `vec.m8.test_any(m: vec.m8) -> i32` +- `vec.m16.test_any(m: vec.m16) -> i32` +- `vec.m32.test_any(m: vec.m32) -> i32` +- `vec.m64.test_any(m: vec.m64) -> i32` +- `vec.m128.test_any(m: vec.m128) -> i32` + +### Test all + +Returns `1` if and only if all lanes are active. +Returns `0` otherwise. + +- `vec.m8.test_all(m: vec.m8) -> i32` +- `vec.m16.test_all(m: vec.m16) -> i32` +- `vec.m32.test_all(m: vec.m32) -> i32` +- `vec.m64.test_all(m: vec.m64) -> i32` +- `vec.m128.test_all(m: vec.m128) -> i32` From 915c6a5c14e798b62045b9be61a0bbf566657179 Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Sat, 27 Feb 2021 17:32:14 +0100 Subject: [PATCH 2/6] Fix tiny formatting issues with lists --- proposals/flexible-vectors/README.md | 30 +++++++++++++++++----------- 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 2a74d54..49f6f74 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -390,7 +390,6 @@ UNOP designates any unary operator (eg: neg, not) - `vec.v32.UNOP(a: vec.v32) -> vec.v32` - `vec.v64.UNOP(a: vec.v64) -> vec.v64` - `vec.v128.UNOP(a: vec.v128) -> vec.v128` - - `vec.m8.UNOP(a: vec.m8) -> vec.m8` - `vec.m16.UNOP(a: vec.m16) -> vec.m16` - `vec.m32.UNOP(a: vec.m32) -> vec.m32` @@ -410,6 +409,15 @@ Inactive lanes are set to zero. - `vec.v32.UNOP_mz(m: vec.m32, a: vec.v32) -> vec.v32` - `vec.v64.UNOP_mz(m: vec.m64, a: vec.v64) -> vec.v64` - `vec.v128.UNOP_mz(m: vec.m128, a: vec.v128) -> vec.v128` +- `vec.m8.UNOP_mz(m: vec.m8, a: vec.m8) -> vec.m8` +- `vec.m16.UNOP_mz(m: vec.m16, a: vec.m16) -> vec.m16` +- `vec.m32.UNOP_mz(m: vec.m32, a: vec.m32) -> vec.m32` +- `vec.m64.UNOP_mz(m: vec.m64, a: vec.m64) -> vec.m64` +- `vec.m128.UNOP_mz(m: vec.m128, a: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. ### UNOP mask merge @@ -420,6 +428,15 @@ Inactive lanes are left untouched. - `vec.v32.UNOP_mm(m: vec.m32, a: vec.v32) -> vec.v32` - `vec.v64.UNOP_mm(m: vec.m64, a: vec.v64) -> vec.v64` - `vec.v128.UNOP_mm(m: vec.m128, a: vec.v128) -> vec.v128` +- `vec.m8.UNOP_mm(m: vec.m8, a: vec.m8) -> vec.m8` +- `vec.m16.UNOP_mm(m: vec.m16, a: vec.m16) -> vec.m16` +- `vec.m32.UNOP_mm(m: vec.m32, a: vec.m32) -> vec.m32` +- `vec.m64.UNOP_mm(m: vec.m64, a: vec.m64) -> vec.m64` +- `vec.m128.UNOP_mm(m: vec.m128, a: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. ### UNOP mask undefined @@ -452,7 +469,6 @@ Selects active elements from `a` and inactive elements from `b`. 
- `vec.v32.BINOP(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.BINOP(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.BINOP(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.BINOP(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.BINOP(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.BINOP(a: vec.m32, b: vec.m32) -> vec.m32` @@ -472,7 +488,6 @@ Inactive elements are set to zero. - `vec.v32.BINOP_mz(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.BINOP_mz(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.BINOP_mz(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.BINOP_mz(m: vec.m8, a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.BINOP_mz(m: vec.m16, a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.BINOP_mz(m: vec.m32, a: vec.m32, b: vec.m32) -> vec.m32` @@ -492,7 +507,6 @@ Inactive elements are forwarded from `a`. - `vec.v32.BINOP_mm(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.BINOP_mm(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.BINOP_mm(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.BINOP_mm(m: vec.m8, a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.BINOP_mm(m: vec.m16, a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.BINOP_mm(m: vec.m32, a: vec.m32, b: vec.m32) -> vec.m32` @@ -723,7 +737,6 @@ Extracts even elements from both input and interleaves them. - `vec.v32.interleave_even(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_even(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_even(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.interleave_even(a: vec.m8, b: mec.m8) -> vec.m8` - `vec.m16.interleave_even(a: vec.m16, b: mec.m16) -> vec.m16` - `vec.m32.interleave_even(a: vec.m32, b: mec.m32) -> vec.m32` @@ -753,7 +766,6 @@ Extracts odd elements from both input and interleaves them. - `vec.v32.interleave_odd(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_odd(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_odd(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.interleave_odd(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.interleave_odd(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.interleave_odd(a: vec.m32, b: vec.m32) -> vec.m32` @@ -783,7 +795,6 @@ Extracts even elements from both input and concatenate them. - `vec.v32.concat_even(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.concat_even(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.concat_even(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.concat_even(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.concat_even(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.concat_even(a: vec.m32, b: vec.m32) -> vec.m32` @@ -816,7 +827,6 @@ Extracts odd elements from both input and concatenate them. - `vec.v32.concat_odd(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.concat_odd(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.concat_odd(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.concat_odd(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.concat_odd(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.concat_odd(a: vec.m32, b: vec.m32) -> vec.m32` @@ -848,7 +858,6 @@ Extracts the lower half of both input and interleaves their elements. 
- `vec.v32.interleave_low(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_low(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_low(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.interleave_low(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.interleave_low(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.interleave_low(a: vec.m32, b: vec.m32) -> vec.m32` @@ -878,7 +887,6 @@ Extracts the higher half of both input and interleaves their elements. - `vec.v32.interleave_high(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_high(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_high(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.interleave_high(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.interleave_high(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.interleave_high(a: vec.m32, b: vec.m32) -> vec.m32` @@ -939,7 +947,6 @@ def mask.S.narrow(a, b): - `vec.i32.widen_low_i16_s(a: vec.v8) -> vec.v32` - `vec.i64.widen_low_i32_u(a: vec.v8) -> vec.v64` - `vec.i64.widen_low_i32_s(a: vec.v8) -> vec.v64` - - `vec.i16.widen_high_i8_u(a: vec.v8) -> vec.v16` - `vec.i16.widen_high_i8_s(a: vec.v8) -> vec.v16` - `vec.i32.widen_high_i16_u(a: vec.v8) -> vec.v32` @@ -955,7 +962,6 @@ Returns a `mask` for a wider type with the same active lanes as the lower/higher - `vec.m32.widen_low_m16(m: vec.m16) -> vec.m32` - `vec.m64.widen_low_m32(m: vec.m32) -> vec.m64` - `vec.m128.widen_low_m64(m: vec.m64) -> vec.m128` - - `vec.m16.widen_high_m8(m: vec.m8) -> vec.m16` - `vec.m32.widen_high_m16(m: vec.m16) -> vec.m32` - `vec.m64.widen_high_m32(m: vec.m32) -> vec.m64` From 427d8f5c02fa5531430352f9bc5c139bb08c265d Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Sat, 27 Feb 2021 17:32:19 +0100 Subject: [PATCH 3/6] Added LUT2, improved LUT1 description --- proposals/flexible-vectors/README.md | 56 +++++++++++++++++++++++++++- 1 file changed, 54 insertions(+), 2 deletions(-) diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 49f6f74..ff44a05 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -583,6 +583,7 @@ For each lane, the mask lane is set to `1` if the element is negative (floats ar ### LUT1 zero +Gets elements from `a` located at the index specified by `idx`. Elements whose index is out of bounds are set to `0`. - `vec.v8.lut1_z(idx: vec.v8, a: vec.v8) -> vec.v8` @@ -601,9 +602,11 @@ def vec.S.lut1_z(idx, a): result[i] = 0 return result ``` + ### LUT1 merge -Elements whose index is out of bounds are taken from fallback. +Gets elements from `a` located at the index specified by `idx`. +Elements whose index is out of bounds are taken from `fallback`. - `vec.v8.lut1_m(idx: vec.v8, a: vec.v8, fallback: vec.v8) -> vec.v8` - `vec.v16.lut1_m(idx: vec.v16, a: vec.v16, fallback: vec.v16) -> vec.v16` @@ -621,6 +624,55 @@ def vec.S.lut1_m(idx, a, fallback): return result ``` +### LUT2 zero + +Gets elements from `a` and `b` located at the index specified by `idx`. +If the index is lower than length, elements are taken from `a`, if index is between length and 2 * length, elements are taken from `b`. +Elements whose index is out of bounds are set to `0`. 
+
+- `vec.v8.lut2_z(idx: vec.v8, a: vec.v8, b: vec.v8) -> vec.v8`
+- `vec.v16.lut2_z(idx: vec.v16, a: vec.v16, b: vec.v16) -> vec.v16`
+- `vec.v32.lut2_z(idx: vec.v32, a: vec.v32, b: vec.v32) -> vec.v32`
+- `vec.v64.lut2_z(idx: vec.v64, a: vec.v64, b: vec.v64) -> vec.v64`
+- `vec.v128.lut2_z(idx: vec.v128, a: vec.v128, b: vec.v128) -> vec.v128`
+
+```python
+def vec.S.lut2_z(idx, a, b):
+    result = vec.S.New()
+    for i in range(vec.S.length):
+        if idx[i] < vec.S.length:
+            result[i] = a[idx[i]]
+        elif idx[i] < 2*vec.S.length:
+            result[i] = b[idx[i] - vec.S.length]
+        else:
+            result[i] = 0
+    return result
+```
+### LUT2 merge
+
+Gets elements from `a` and `b` located at the index specified by `idx`.
+If the index is lower than length, elements are taken from `a`; if index is between length and 2 * length, elements are taken from `b`.
+Elements whose index is out of bounds are taken from `fallback`.
+
+- `vec.v8.lut2_m(idx: vec.v8, a: vec.v8, b: vec.v8, fallback: vec.v8) -> vec.v8`
+- `vec.v16.lut2_m(idx: vec.v16, a: vec.v16, b: vec.v16, fallback: vec.v16) -> vec.v16`
+- `vec.v32.lut2_m(idx: vec.v32, a: vec.v32, b: vec.v32, fallback: vec.v32) -> vec.v32`
+- `vec.v64.lut2_m(idx: vec.v64, a: vec.v64, b: vec.v64, fallback: vec.v64) -> vec.v64`
+- `vec.v128.lut2_m(idx: vec.v128, a: vec.v128, b: vec.v128, fallback: vec.v128) -> vec.v128`
+
+```python
+def vec.S.lut2_m(idx, a, b, fallback):
+    result = vec.S.New()
+    for i in range(vec.S.length):
+        if idx[i] < vec.S.length:
+            result[i] = a[idx[i]]
+        elif idx[i] < 2*vec.S.length:
+            result[i] = b[idx[i] - vec.S.length]
+        else:
+            result[i] = fallback[i]
+    return result
+```
+
 ### V128 shuffle
 
 Applies shuffle to each v128 of the vector.
@@ -1088,7 +1140,7 @@ def vec.S.cast_T(m):
 
 ### Mask to vec
 
-Active lanes are to `-1` (all one bits), and inactive lanes are set to 0.
+Active lanes are set to `-1` (all one bits), and inactive lanes are set to `0`.
 
 - `vec.v8.convert_m8(m: vec.m8) -> vec.v8`
 - `vec.v16.convert_m16(m: vec.m16) -> vec.v16`

From 764d69c3d88be73602118be44e27a0593b6106a8 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Sat, 12 Jun 2021 14:09:54 +0200
Subject: [PATCH 4/6] Fixed typos + fixed shift

---
 proposals/flexible-vectors/README.md | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md
index ff44a05..7fd073b 100644
--- a/proposals/flexible-vectors/README.md
+++ b/proposals/flexible-vectors/README.md
@@ -178,7 +178,7 @@ def vec.S.index_CMP(x, n):
 ### Mask index first
 
 Returns the index of the first active lane.
-If there is no active lane, the index of the last lane is returned.
+If there is no active lane, the length of the vector is returned.
 
 - `vec.m8.index_first(m: vec.m8) -> i32`
 - `vec.m16.index_first(m: vec.m16) -> i32`
@@ -191,13 +191,13 @@ def vec.S.index_first(m):
     for i in range(vec.S.length):
         if m[i]:
             return i
-    return vec.S.length - 1
+    return vec.S.length
 ```
 
 ### Mask index last
 
 Returns the index of the last active lane.
-If there is no active lane, the index of the first lane is returned.
+If there is no active lane, -1 is returned.
 
 - `vec.m8.index_last(m: vec.m8) -> i32`
 - `vec.m16.index_last(m: vec.m16) -> i32`
@@ -207,7 +207,7 @@ If there is no active lane, the index of the first lane is returned.
```python def vec.S.index_last(m): - idx = 0 + idx = -1 for i in range(vec.S.length): if m[i]: idx = i @@ -758,7 +758,7 @@ def vec.S.concat(m, a, b): ### Lane shift Concats the 2 input vector to form a single double-width vector. -Shifts this double-width vector by `n` lane to the left (to LSB). +Shifts this double-width vector by `n` lane to the right (to LSB). Extracts the lower half of the shifted vector. `n` is interpreted modulo the length of the vector. @@ -773,10 +773,10 @@ Extracts the lower half of the shifted vector. def vec.S.lane_shift(a, b, n): result = vec.S.New() n = n % vec.S.length - for i in range(0, n): + for i in range(0, vec.S.length - n): result[i] = a[i + n] - for i in range(n, vec.S.length): - result[i] = b[i - n] + for i in range(vec.S.length - n, vec.S.length): + result[i] = b[i - (vec.S.length - n)] return result ``` @@ -789,11 +789,11 @@ Extracts even elements from both input and interleaves them. - `vec.v32.interleave_even(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_even(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_even(a: vec.v128, b: vec.v128) -> vec.v128` -- `vec.m8.interleave_even(a: vec.m8, b: mec.m8) -> vec.m8` -- `vec.m16.interleave_even(a: vec.m16, b: mec.m16) -> vec.m16` -- `vec.m32.interleave_even(a: vec.m32, b: mec.m32) -> vec.m32` -- `vec.m64.interleave_even(a: vec.m64, b: mec.m64) -> vec.m64` -- `vec.m128.interleave_even(a: vec.m128, b: mec.m128) -> vec.m128` +- `vec.m8.interleave_even(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.interleave_even(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.interleave_even(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.interleave_even(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.interleave_even(a: vec.m128, b: vec.m128) -> vec.m128` ```python From 78d967de5a751949da2f6213d183bce05a43b47d Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Mon, 14 Jun 2021 19:27:18 +0200 Subject: [PATCH 5/6] Fixed splat --- proposals/flexible-vectors/README.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 7fd073b..76e0434 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -319,16 +319,15 @@ Inactive elements are not stored ### Splat scalar -`idx` is interpreted modulo the length of the vector. - -- `vec.i8.splat(v: vec.v8, x: i32) -> vec.v8` -- `vec.i16.splat(v: vec.v16, x: i32) -> vec.v16` -- `vec.i32.splat(v: vec.v32, x: i32) -> vec.v32` -- `vec.f32.splat(v: vec.v32, x: f32) -> vec.32` -- `vec.i64.splat(v: vec.v64, x: i64) -> vec.v64` -- `vec.f64.splat(v: vec.v64, x: f64) -> vec.v64` -- `vec.v128.splat(v: vec.v128, x: v128) -> vec.v128` - +For `vec.i8.splat` and `vec.i16.splat`, `x` is truncated to 8 and 16 bits respectively. 
+ +- `vec.i8.splat(x: i32) -> vec.v8` +- `vec.i16.splat(x: i32) -> vec.v16` +- `vec.i32.splat(x: i32) -> vec.v32` +- `vec.f32.splat(x: f32) -> vec.v32` +- `vec.i64.splat(x: i64) -> vec.v64` +- `vec.f64.splat(x: f64) -> vec.v64` +- `vec.v128.splat(x: v128) -> vec.v128` ### Extract lane From 61dcf1238f92840e974f86ba73a7569b29c04365 Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Mon, 14 Jun 2021 19:43:50 +0200 Subject: [PATCH 6/6] Clarified addition and CMP in vec.mX.index_CMP --- proposals/flexible-vectors/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 76e0434..b665f71 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -158,6 +158,8 @@ def mask.S.none(): Returns a mask whose active lanes satisfy `(x + laneIdx) CMP n` CMP one of the following: `eq`, `ne`, `lt`, `le`, `gt`, `ge` +The addition and the comparison are done signed, with infinite precision. + - `vec.m8.index_CMP(x: i32, n: i32) -> vec.m8` - `vec.m16.index_CMP(x: i32, n: i32) -> vec.m16` - `vec.m32.index_CMP(x: i32, n: i32) -> vec.m32`
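
To illustrate the clarified `index_CMP` semantics, here is a rough sketch of one way the rule can be read. The helper name `index_cmp_example`, the use of a Python list of booleans to stand in for the mask, the use of Python's unbounded integers to model the signed, infinite-precision arithmetic, and the constants in the final check are illustrative assumptions, not part of the proposal.

```python
# Lane i of the mask is active when (x + i) CMP n, with the addition and the
# comparison evaluated as signed integers of infinite precision (Python ints
# never wrap, so they model this directly).
def index_cmp_example(x, n, length, cmp):
    ops = {
        'eq': lambda a, b: a == b,
        'ne': lambda a, b: a != b,
        'lt': lambda a, b: a < b,
        'le': lambda a, b: a <= b,
        'gt': lambda a, b: a > b,
        'ge': lambda a, b: a >= b,
    }
    return [ops[cmp](x + i, n) for i in range(length)]

# With wrapping 32-bit arithmetic, x = 2147483647 (INT32_MAX) plus a positive
# lane index would overflow to a negative value and satisfy `lt 0`; under the
# infinite-precision rule no lane is active.
assert not any(index_cmp_example(2147483647, 0, 8, 'lt'))
```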