From f510205c735b598af1bdc0a379af13d82d9f77e8 Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Sat, 27 Feb 2021 17:32:07 +0100 Subject: [PATCH 1/6] SVE-like flexible vectors --- proposals/flexible-vectors/FlexibleVectors.md | 832 ------------ .../FlexibleVectorsSecondTier.md | 58 - .../FlexibleVectorsThirdTier.md | 47 - proposals/flexible-vectors/README.md | 1129 ++++++++++++++++- 4 files changed, 1124 insertions(+), 942 deletions(-) delete mode 100644 proposals/flexible-vectors/FlexibleVectors.md delete mode 100644 proposals/flexible-vectors/FlexibleVectorsSecondTier.md delete mode 100644 proposals/flexible-vectors/FlexibleVectorsThirdTier.md diff --git a/proposals/flexible-vectors/FlexibleVectors.md b/proposals/flexible-vectors/FlexibleVectors.md deleted file mode 100644 index 614cd45..0000000 --- a/proposals/flexible-vectors/FlexibleVectors.md +++ /dev/null @@ -1,832 +0,0 @@ -Flexible vectors overview -========================= - -The goal of this proposal is to provide flexible vector instructions for -WebAssembly as a way to bridge the gap between existing SIMD instruction sets -available on various platforms. More specifically, this proposal aims to enable -better use processing capabilities of existing SIMD hardware and bring -performance of vector operaions available in WebAssembly closer to native. -`simd128` proposal already identified operations that would commonly work on -platforms that are important to WebAssembly, this proposal is attempting to -extend the same operations to work with variable vector lengths. - -The rest of this document contains instructions that have uncontroversial -lowering on all platforms. There are two more tiers of instructions: second -tier containing instructions with more complex lowering without effect on other -instructions, and third containing instructions affecting execution semantics -or lowering of other instructions. - -See [FlexibleVectorsSecondTier.md](FlexibleVectorsSecondTier.md) for the second -tier and [FlexibleVectorsThirdTier.md](FlexibleVectorsThirdTier.md) for the -third tier. - -## Types - -Proposal introduces the following vector types: - -- `vec.i8` : 8-bit integer lanes -- `vec.i16`: 16-bit integer lanes -- `vec.i32`: 32-bit integer lanes -- `vec.i64`: 64-bit integer lanes -- `vec.f32`: single precision floating point lanes -- `vec.f64`: double precision floating point lanes - -### Lane division interpretation - -In semantic pseudocode `S` is the particular vector type, `S.LaneBits` is the -size of the lane in bits, `S.Lanes` is the number of lanes, which is dynamic. - -| S | S.LaneBits | -|-----------|-----------:| -| `vec.i8` | 8 | -| `vec.i16` | 16 | -| `vec.i32` | 32 | -| `vec.f32` | 32 | -| `vec.i64` | 64 | -| `vec.f64` | 64 | - -### Restrictions - -Lane values are intended to be handled exactly like in `simd128` proposal, with -the following differences applying to overall types: - -- Runtime sets maximum vector length for every type -- Number of lanes is set separately for different lane sizes -- Vectors with different lane size are not immediately interoperable - -## Immediate operands - -_TBD_ value range, depends on instruction encoding. - -- `ImmLaneIdxV8`: lane index for 8-bit lanes -- `ImmLaneIdxV16`: lane index for 16-bit lanes -- `ImmLaneIdxV32`: lane index for 32-bit lanes -- `ImmLaneIdxV64`: lane index for 64-bit lanes - -## Operations - -Completely new operations introduced in this proposal are the operations that -provide interface to vector length. 
- -### Vector length - -Querying length of supported vector: - -- `vec.i8.length -> i32` -- `vec.i16.length -> i32` -- `vec.i32.length -> i32` -- `vec.i64.length -> i32` -- `vec.f32.length -> i32` -- `vec.f64.length -> i32` - -### Constructing vector values - -Create vector with identical lanes: - -- `vec.i8.splat(x:i32) -> vec.i8` -- `vec.i16.splat(x:i32) -> vec.i16` -- `vec.i32.splat(x:i32) -> vec.i32` -- `vec.i64.splat(x:i64) -> vec.i64` -- `vec.f32.splat(x:f32) -> vec.f32` -- `vec.f64.splat(x:f64) -> vec.f64` - -Construct vector with `x` replicated to all lanes: - -```python -def S.splat(x): - result = S.New() - for i in range(S.Lanes): - result[i] = x - return result -``` - -### Accessing lanes - -#### Extract lane as a scalar - -- `vec.i8.extract_lane_s(a: vec.i8, imm: ImmLaneIdxV8) -> i32` -- `vec.i8.extract_lane_u(a: vec.i8, imm: ImmLaneIdxV8) -> i32` -- `vec.i16.extract_lane_s(a: vec.i16, imm: ImmLaneIdxV16) -> i32` -- `vec.i16.extract_lane_u(a: vec.i16, imm: ImmLaneIdxV16) -> i32` -- `vec.i32.extract_lane(a: vec.i32, imm: ImmLaneIdxV32) -> i32` -- `vec.i64.extract_lane(a: vec.i64, imm: ImmLaneIdxV64) -> i64` -- `vec.f32.extract_lane(a: vec.f32, imm: ImmLaneIdxV32) -> f32` -- `vec.f64.extract_lane(a: vec.f64, imm: ImmLaneIdxV64) -> f64` - -Extract the scalar value of lane specified in the immediate mode operand `imm` -in `a`. The `{interpretation}.extract_lane{_s}{_u}` instructions are encoded -with one immediate byte providing the index of the lane to extract. - -```python -def S.extract_lane(a, i): - return a[i] -``` - -The `_s` and `_u` variants will sign-extend or zero-extend the lane value to -`i32` respectively. - -#### Replace lane value - -- `vec.i8.replace_lane(a: vec.i8, imm: ImmLaneIdxV8, x: i32) -> vec.i8` -- `vec.i16.replace_lane(a: vec.i16, imm: ImmLaneIdxV16, x: i32) -> vec.i16` -- `vec.i32.replace_lane(a: vec.i32, imm: ImmLaneIdxV32, x: i32) -> vec.i32` -- `vec.i64.replace_lane(a: vec.i64, imm: ImmLaneIdxV64, x: i64) -> vec.i64` -- `vec.f32.replace_lane(a: vec.f32, imm: ImmLaneIdxV32, x: f32) -> vec.f32` -- `vec.f64.replace_lane(a: vec.f64, imm: ImmLaneIdxV64, x: f64) -> vec.f64` - -Return a new vector with lanes identical to `a`, except for the lane specified -in the immediate mode operand `imm` which has the value `x`. The -`{interpretation}.replace_lane` instructions are encoded with an immediate byte -providing the index of the lane the value of which is to be replaced. - -```python -def S.replace_lane(a, i, x): - result = S.New() - for j in range(S.Lanes): - result[j] = a[j] - result[i] = x - return result -``` - -The input lane value, `x`, is interpreted the same way as for the splat -instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored. - -### Shuffles - -#### Left lane-wise shift by scalar - -* `vec.i8.lshl(a: vec.i8, x: i32) -> vec.i8` -* `vec.i16.lshl(a: vec.i16, x: i32) -> vec.i16` -* `vec.i32.lshl(a: vec.i32, x: i32) -> vec.i32` -* `vec.i64.lshl(a: vec.i64, x: i32) -> vec.i64` - -Returns a new vector with lanes selected from the lanes of the two input -vectors `a` and `b` by shifting lanes of the original to the left by the amount -specified in the integer argument and shifting zero values in. 
- -```python -def S.lshl(a, x): - result = S.New() - for i in range(S.Lanes): - if i < x: - result[i] = 0 - else: - result[i] = a[i - x] - return result -``` - -#### Right lane-wise shift by scalar - -* `vec.i8.lshr(a: vec.i8, x: i32) -> vec.i8` -* `vec.i16.lshr(a: vec.i16, x: i32) -> vec.i16` -* `vec.i32.lshr(a: vec.i32, x: i32) -> vec.i32` -* `vec.i64.lshr(a: vec.i64, x: i32) -> vec.i64` - -Returns a new vector with lanes selected from the lanes of the two input -vectors `a` and `b` by shifting lanes of the original to the right by the -amount specified in the integer argument and shifting zero values in. - -```python -def S.lshr(a, x): - result = S.New() - for i in range(S.Lanes): - if i < S.Lanes - x: - result[i] = a[i + x] - else: - result[i] = 0 - return result -``` - -### Integer arithmetic - -Wrapping integer arithmetic discards the high bits of the result. - -```python -def S.Reduce(x): - bitmask = (1 << S.LaneBits) - 1 - return x & bitmask -``` - -Integer division operation is omitted to be compatible with 128-bit SIMD. - -#### Integer addition - -- `vec.i8.add(a: vec.i8, b: vec.i8) -> vec.i8` -- `vec.i16.add(a: vec.i16, b: vec.i16) -> vec.i16` -- `vec.i32.add(a: vec.i32, b: vec.i32) -> vec.i32` -- `vec.i64.add(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise wrapping integer addition: - -```python -def S.add(a, b): - def add(x, y): - return S.Reduce(x + y) - return S.lanewise_binary(add, a, b) -``` - -#### Integer subtraction - -- `vec.i8.sub(a: vec.i8, b: vec.i8) -> vec.i8` -- `vec.i16.sub(a: vec.i16, b: vec.i16) -> vec.i16` -- `vec.i32.sub(a: vec.i32, b: vec.i32) -> vec.i32` -- `vec.i64.sub(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise wrapping integer subtraction: - -```python -def S.sub(a, b): - def sub(x, y): - return S.Reduce(x - y) - return S.lanewise_binary(sub, a, b) -``` - -#### Integer multiplication - -- `vec.i8.mul(a: vec.i8, b: vec.i8) -> vec.i8` -- `vec.i16.mul(a: vec.i16, b: vec.i16) -> vec.i16` -- `vec.i32.mul(a: vec.i32, b: vec.i32) -> vec.i32` -- `vec.i64.mul(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise wrapping integer multiplication: - -```python -def S.mul(a, b): - def mul(x, y): - return S.Reduce(x * y) - return S.lanewise_binary(mul, a, b) -``` - -#### Integer negation - -- `vec.i8.neg(a: vec.i8, b: vec.i8) -> vec.i8` -- `vec.i16.neg(a: vec.i16, b: vec.i16) -> vec.i16` -- `vec.i32.neg(a: vec.i32, b: vec.i32) -> vec.i32` -- `vec.i64.neg(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise wrapping integer negation. In wrapping arithmetic, `y = -x` is the -unique value such that `x + y == 0`. - -```python -def S.neg(a): - def neg(x): - return S.Reduce(-x) - return S.lanewise_unary(neg, a) -``` - -### Saturating integer arithmetic - -Saturating integer arithmetic behaves differently on signed and unsigned lanes. 
- -```python -def S.SignedSaturate(x): - if x < S.Smin: - return S.Smin - if x > S.Smax: - return S.Smax - return x - -def S.UnsignedSaturate(x): - if x < 0: - return 0 - if x > S.Umax: - return S.Umax - return x -``` - -#### Saturating integer addition - -* `vec.i8.add_sat_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.add_sat_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.add_sat_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.add_sat_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.add_sat_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.add_sat_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.add_sat_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.add_sat_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise saturating addition: - -```python -def S.add_sat_s(a, b): - def addsat(x, y): - return S.SignedSaturate(x + y) - return S.lanewise_binary(addsat, S.AsSigned(a), S.AsSigned(b)) - -def S.add_sat_u(a, b): - def addsat(x, y): - return S.UnsignedSaturate(x + y) - return S.lanewise_binary(addsat, S.AsUnsigned(a), S.AsUnsigned(b)) -``` - -#### Saturating integer subtraction - -* `vec.i8.sub_sat_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.sub_sat_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.sub_sat_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.sub_sat_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.sub_sat_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.sub_sat_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.sub_sat_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.sub_sat_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise saturating subtraction: - -```python -def S.sub_sat_s(a, b): - def subsat(x, y): - return S.SignedSaturate(x - y) - return S.lanewise_binary(subsat, S.AsSigned(a), S.AsSigned(b)) - -def S.sub_sat_u(a, b): - def subsat(x, y): - return S.UnsignedSaturate(x - y) - return S.lanewise_binary(subsat, S.AsUnsigned(a), S.AsUnsigned(b)) -``` - -#### Lane-wise integer minimum - -* `vec.i8.min_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.min_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.min_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.min_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.min_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.min_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.min_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.min_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Compares lane-wise signed/unsigned integers, and returns the minimum of -each pair. - -```python -def S.min(a, b): - return S.lanewise_binary(min, a, b) -``` - -#### Lane-wise integer maximum - -* `vec.i8.max_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.max_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.max_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.max_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.max_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.max_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.max_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.max_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Compares lane-wise signed/unsigned integers, and returns the maximum of -each pair. 
- -```python -def S.max(a, b): - return S.lanewise_binary(max, a, b) -``` - -#### Lane-wise integer rounding average - -* `vec.i8.avgr_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.avgr_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.avgr_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.avgr_u(a: vec.i64, b: vec.i64) -> vec.i64` - -Lane-wise rounding average: - -```python -def S.RoundingAverage(x, y): - return (x + y + 1) // 2 - -def S.avgr_u(a, b): - return S.lanewise_binary(S.RoundingAverage, S.AsUnsigned(a), S.AsUnsigned(b)) -``` - -#### Lane-wise integer absolute value - -* `vec.i8.abs(a: vec.i8) -> vec.i8` -* `vec.i16.abs(a: vec.i16) -> vec.i16` -* `vec.i32.abs(a: vec.i32) -> vec.i32` -* `vec.i64.abs(a: vec.i64) -> vec.i64` - -Lane-wise wrapping absolute value. - -```python -def S.abs(a): - return S.lanewise_unary(abs, S.AsSigned(a)) -``` - - -### Bit shifts - -#### Left shift by scalar - -* `vec.i8.shl(a: vec.i8, y: i32) -> vec.i8` -* `vec.i16.shl(a: vec.i16, y: i32) -> vec.i16` -* `vec.i32.shl(a: vec.i32, y: i32) -> vec.i32` -* `vec.i64.shl(a: vec.i64, y: i32) -> vec.i64` - -Shift the bits in each lane to the left by the same amount. The shift count is -taken modulo lane width: - -```python -def S.shl(a, y): - # Number of bits to shift: 0 .. S.LaneBits - 1. - amount = y mod S.LaneBits - def shift(x): - return S.Reduce(x << amount) - return S.lanewise_unary(shift, a) -``` - -#### Right shift by scalar - -* `vec.i8.shr_s(a: vec.i8, y: i32) -> vec.i8` -* `vec.i8.shr_u(a: vec.i8, y: i32) -> vec.i8` -* `vec.i16.shr_s(a: vec.i16, y: i32) -> vec.i16` -* `vec.i16.shr_u(a: vec.i16, y: i32) -> vec.i16` -* `vec.i32.shr_s(a: vec.i32, y: i32) -> vec.i32` -* `vec.i32.shr_u(a: vec.i32, y: i32) -> vec.i32` -* `vec.i64.shr_s(a: vec.i64, y: i32) -> vec.i64` -* `vec.i64.shr_u(a: vec.i64, y: i32) -> vec.i64` - -Shift the bits in each lane to the right by the same amount. The shift count is -taken modulo lane width. This is an arithmetic right shift for the `_s` -variants and a logical right shift for the `_u` variants. - -```python -def S.shr_s(a, y): - # Number of bits to shift: 0 .. S.LaneBits - 1. - amount = y mod S.LaneBits - def shift(x): - return x >> amount - return S.lanewise_unary(shift, S.AsSigned(a)) - -def S.shr_u(a, y): - # Number of bits to shift: 0 .. S.LaneBits - 1. - amount = y mod S.LaneBits - def shift(x): - return x >> amount - return S.lanewise_unary(shift, S.AsUnsigned(a)) -``` - - -### Bitwise operations - -#### Bitwise logic - -* `vec.i8.and(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.or(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.xor(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.not(a: vec.i8) -> vec.i8` - -The logical operations defined on the scalar integer types are also available -on the `v128` type where they operate bitwise the same way C's `&`, `|`, `^`, -and `~` operators work on an `unsigned` type. - -#### Bitwise AND-NOT - -* `vec.i8.andnot(a: vec.i8, b: vec.i8) -> vec.i8` - -Bitwise AND of bits of `a` and the logical inverse of bits of `b`. This operation is equivalent to `vec.i8.and(a, vec.i8.not(b))`. - -#### Bitwise select - -* `vec.i8.bitselect(v1: vec.i8, v2: vec.i8, c: vec.i8) -> vec.i8` - -Use the bits in the control mask `c` to select the corresponding bit from `v1` -when 1 and `v2` when 0. -This is the same as `vec.i8.or(vec.i8.and(v1, c), vec.i8.and(v2, vec.i8.not(c)))`. - -Note that the normal WebAssembly `select` instruction also works with vector -types. 
It selects between two whole vectors controlled by a single scalar value, -rather than selecting bits controlled by a control mask vector. - -### Boolean horizontal reductions - -These operations reduce all the lanes of an integer vector to a single scalar -0 or 1 value. A lane is considered "true" if it is non-zero. - -#### Any lane true - -* `vec.i8.any_true(a: vec.i8) -> i32` -* `vec.i16.any_true(a: vec.i16) -> i32` -* `vec.i32.any_true(a: vec.i32) -> i32` - -These functions return 1 if any lane in `a` is non-zero, 0 otherwise. - -```python -def S.any_true(a): - for i in range(S.Lanes): - if a[i] != 0: - return 1 - return 0 -``` - -#### All lanes true - -* `vec.i8.all_true(a: vec.i8) -> i32` -* `vec.i16.all_true(a: vec.i16) -> i32` -* `vec.i32.all_true(a: vec.i32) -> i32` - -These functions return 1 if all lanes in `a` are non-zero, 0 otherwise. - -```python -def S.all_true(a): - for i in range(S.Lanes): - if a[i] == 0: - return 0 - return 1 -``` - -### Comparisons - -The comparison operations all compare two vectors lane-wise, and produce a mask -vector with the same number of lanes as the input interpretation where the bits -in each lane are `0` for `false` and all ones for `true`. - -
- Implementation notes - - Some classes of comparison operations (for example in AVX512 and SVE) return - a mask while others return a vector containing results in its lanes. This - section might need to be tuned. - -
- -#### Equality - -* `vec.i8.eq(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.eq(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.eq(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.eq(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.eq(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.eq(a: vec.f64, b: vec.f64) -> vec.f64` - -Integer equality is independent of the signed/unsigned interpretation. Floating -point equality follows IEEE semantics, so a NaN lane compares not equal with -anything, including itself, and +0.0 is equal to -0.0: - -```python -def S.eq(a, b): - def eq(x, y): - return x == y - return S.lanewise_comparison(eq, a, b) -``` - -#### Non-equality - -* `vec.i8.ne(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.ne(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.ne(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.ne(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.ne(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.ne(a: vec.f64, b: vec.f64) -> vec.f64` - -The `ne` operations produce the inverse of their `eq` counterparts: - -```python -def S.ne(a, b): - def ne(x, y): - return x != y - return S.lanewise_comparison(ne, a, b) -``` - -#### Less than - -* `vec.i8.lt_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.lt_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.lt_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.lt_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.lt_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.lt_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.lt_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.lt_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.lt(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.lt(a: vec.f64, b: vec.f64) -> vec.f64` - -#### Less than or equal - -* `vec.i8.le_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.le_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.le_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.le_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.le_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.le_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.le_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.le_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.le(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.le(a: vec.f64, b: vec.f64) -> vec.f64` - -#### Greater than - -* `vec.i8.gt_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.gt_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.gt_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.gt_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.gt_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.gt_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.gt_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.gt(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.gt(a: vec.f64, b: vec.f64) -> vec.f64` - -#### Greater than or equal - -* `vec.i8.ge_s(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i8.ge_u(a: vec.i8, b: vec.i8) -> vec.i8` -* `vec.i16.ge_s(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i16.ge_u(a: vec.i16, b: vec.i16) -> vec.i16` -* `vec.i32.ge_s(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i32.ge_u(a: vec.i32, b: vec.i32) -> vec.i32` -* `vec.i64.ge_s(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.i64.ge_u(a: vec.i64, b: vec.i64) -> vec.i64` -* `vec.f32.ge(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.ge(a: vec.f64, b: vec.f64) -> vec.f64` - -#### Load and store - -- `vec.v8.load(memarg) -> vec.v8` -- `vec.v16.load(memarg) -> vec.v16` -- `vec.v32.load(memarg) -> vec.v32` -- `vec.v64.load(memarg) -> vec.v64` - -Load a vector from the given heap address. 
- -- `vec.v8.store(memarg, data:vec.v8)` -- `vec.v16.store(memarg, data:vec.v16)` -- `vec.v32.store(memarg, data:vec.v32)` -- `vec.v64.store(memarg, data:vec.v64)` - -Store a vector to the given heap address. - -### Floating-point sign bit operations - -These floating point operations are simple manipulations of the sign bit. No -changes are made to the exponent or trailing significand bits, even for NaN -inputs. - -#### Negation - -* `vec.f32.neg(a: vec.f32) -> vec.f32` -* `vec.f64.neg(a: vec.f64) -> vec.f64` - -Apply the IEEE `negate(x)` function to each lane. This simply inverts the sign -bit, preserving all other bits. - -```python -def S.neg(a): - return S.lanewise_unary(ieee.negate, a) -``` - -#### Floating-point absolute value - -* `vec.f32.abs(a: vec.f32) -> vec.f32` -* `vec.f64.abs(a: vec.f64) -> vec.f64` - -Apply the IEEE `abs(x)` function to each lane. This simply clears the sign bit, -preserving all other bits. - -```python -def S.abs(a): - return S.lanewise_unary(ieee.abs, a) -``` - -### Floating-point min and max - -#### Pseudo-minimum - -* `vec.f32.pmin(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.pmin(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise minimum value, defined as `b < a ? b : a`. - -#### Pseudo-maximum - -* `vec.f32.pmax(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.pmax(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise maximum value, defined as `a < b ? b : a`. - -### Floating-point arithmetic - -The floating-point arithmetic operations are all lane-wise versions of the -existing scalar WebAssembly operations. - -#### Addition - -- `vec.f32.add(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.add(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `addition`. - -#### Subtraction - -- `vec.f32.sub(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.sub(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `subtraction`. - -#### Division - -- `vec.f32.div(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.div(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `division`. - -#### Multiplication - -- `vec.f32.mul(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.mul(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `multiplication`. - -#### Square root - -- `vec.f32.sqrt(a: vec.f32, b: vec.f32) -> vec.f32` -- `vec.f64.sqrt(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise IEEE `squareRoot`. - - -### Conversions - -#### Integer to floating point - -* `vec.f32.convert_s(a: vec.i32) -> vec.f32` -* `vec.f64.convert_s(a: vec.i64) -> vec.f64` - -Lane-wise conversion from integer to floating point. Some integer values will be -rounded. - -#### Integer to integer narrowing - -* `vec.i16.narrow_s(a: vec.i16, b: vec.i16) -> vec.i8` -* `vec.i16.narrow_u(a: vec.i16, b: vec.i16) -> vec.i8` -* `vec.i32.narrow_s(a: vec.i32, b: vec.i32) -> vec.i16` -* `vec.i32.narrow_u(a: vec.i32, b: vec.i32) -> vec.i16` -* `vec.i64.narrow_s(a: vec.i64, b: vec.i64) -> vec.i32` -* `vec.i64.narrow_u(a: vec.i64, b: vec.i64) -> vec.i32` - -Converts two input vectors into a smaller lane vector by narrowing each lane, -signed or unsigned. The signed narrowing operation will use signed saturation -to handle overflow, 0x7f or 0x80 for i8x16, the unsigned narrowing operation -will use unsigned saturation to handle overflow, 0x00 or 0xff for i8x16. -Regardless of the whether the operation is signed or unsigned, the input lanes -are interpreted as signed integers. 
- -```python -def S.narrow_T_s(a, b): - result = S.New() - for i in range(T.Lanes): - result[i] = S.SignedSaturate(a[i]) - for i in range(T.Lanes): - result[T.Lanes + i] = S.SignedSaturate(b[i]) - return result - -def S.narrow_T_u(a, b): - result = S.New() - for i in range(T.Lanes): - result[i] = S.UnsignedSaturate(a[i]) - for i in range(T.Lanes): - result[T.Lanes + i] = S.UnsignedSaturate(b[i]) - return result -``` - -#### Integer to integer widening - -* `vec.i8.widen_low_s(a: vec.i8) -> vec.i16` -* `vec.i8.widen_high_s(a: vec.i8) -> vec.i16` -* `vec.i8.widen_low_u(a: vec.i8) -> vec.i16` -* `vec.i8.widen_high_u(a: vec.i8) -> vec.i16` -* `vec.i16.widen_low_s(a: vec.i16) -> vec.i32` -* `vec.i16.widen_high_s(a: vec.i16) -> vec.i32` -* `vec.i16.widen_low_u(a: vec.i16) -> vec.i32` -* `vec.i16.widen_high_u(a: vec.i16) -> vec.i32` -* `vec.i32.widen_low_s(a: vec.i32) -> vec.i64` -* `vec.i32.widen_high_s(a: vec.i32) -> vec.i64` -* `vec.i32.widen_low_u(a: vec.i32) -> vec.i64` -* `vec.i32.widen_high_u(a: vec.i32) -> vec.i64` - -Converts low or high half of the smaller lane vector to a larger lane vector, -sign extended or zero (unsigned) extended. - -```python -def S.widen_low_T(ext, a): - result = S.New() - for i in range(S.Lanes): - result[i] = ext(a[i]) - -def S.widen_high_T(ext, a): - result = S.New() - for i in range(S.Lanes): - result[i] = ext(a[S.Lanes + i]) - -def S.widen_low_T_s(a): - return S.widen_low_T(Sext, a) - -def S.widen_high_T_s(a): - return S.widen_high_T(Sext, a) - -def S.widen_low_T_u(a): - return S.widen_low_T(Zext, a) - -def S.widen_high_T_u(a): - return S.widen_high_T(Zext, a) -``` - diff --git a/proposals/flexible-vectors/FlexibleVectorsSecondTier.md b/proposals/flexible-vectors/FlexibleVectorsSecondTier.md deleted file mode 100644 index f3fb220..0000000 --- a/proposals/flexible-vectors/FlexibleVectorsSecondTier.md +++ /dev/null @@ -1,58 +0,0 @@ -Instructions considered conditionally -===================================== - -This document describes instructions considered conditionally pending -performance data. The reasons are listed in "implementation notes" sections. - -## Operations - -### Floating-point min and max - -These operations are not part of the IEEE 754-2008 standard. They are lane-wise -versions of the existing scalar WebAssembly operations. - -
- Implementation notes - - NaN quieting required for these operations is expensive on x86-based platforms. - See [WebAssembly/simd#186](https://github.com/WebAssembly/simd/issues/186). - -
- -#### NaN-propagating minimum - -* `vec.f32.min(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.min(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise minimum value, propagating NaNs. - -#### NaN-propagating maximum - -* `vec.f32.max(a: vec.f32, b: vec.f32) -> vec.f32` -* `vec.f64.max(a: vec.f64, b: vec.f64) -> vec.f64` - -Lane-wise maximum value, propagating NaNs. - -### Conversions - -#### Integer to floating point - -* `vec.f32.convert_u(a: vec.i32) -> vec.f32` -* `vec.f64.convert_u(a: vec.i64) -> vec.f64` - -Lane-wise conversion from integer to floating point. Some integer values will be -rounded. - -#### Floating point to integer with saturation - -* `vec.i32.trunc_sat_s(a: vec.f32) -> vec.i32` -* `vec.i32.trunc_sat_u(a: vec.f32) -> vec.i32` -* `vec.i64.trunc_sat_s(a: vec.f64) -> vec.i64` -* `vec.i64.trunc_sat_u(a: vec.f64) -> vec.i64` - -Lane-wise saturating conversion from floating point to integer using the IEEE -`convertToIntegerTowardZero` function. If any input lane is a NaN, the -resulting lane is 0. If the rounded integer value of a lane is outside the -range of the destination type, the result is saturated to the nearest -representable integer value. - diff --git a/proposals/flexible-vectors/FlexibleVectorsThirdTier.md b/proposals/flexible-vectors/FlexibleVectorsThirdTier.md deleted file mode 100644 index 425c195..0000000 --- a/proposals/flexible-vectors/FlexibleVectorsThirdTier.md +++ /dev/null @@ -1,47 +0,0 @@ -Instructions considered conditionally -===================================== - -This document describes instructions considered conditionally pending -performance data with implications for other instructions in the proposal. The -reasons are listed in "implementation notes" sections. - -## Operations - -### Setting vector length - -
- Implementation notes - - -- 8-bit lanes - - `vec.i8.set_length(len: i32) -> i32` - - `vec.i8.set_length_imm(imm: ImmLaneIdx8) -> i32` -- 16-bit lanes - - `vec.i16.set_length(len: i32) -> i32` - - `vec.i16.set_length_imm(imm: ImmLaneIdx16) -> i32` -- 32-bit lanes - - `vec.i32.set_length(len: i32) -> i32` - - `vec.i32.set_length_imm(imm: ImmLaneIdx32) -> i32` - - `vec.f32.set_length(len: i32) -> i32` - - `vec.f32.set_length_imm(imm: ImmLaneIdx32) -> i32` -- 64-bit lanes - - `vec.i64.set_length(len: i32) -> i32` - - `vec.i64.set_length_imm(imm: ImmLaneIdx64) -> i32` - - `vec.f64.set_length(len: i32) -> i32` - - `vec.f64.set_length_imm(imm: ImmLaneIdx64) -> i32` - -The above operations set the number of lanes for corresponding vector type to -the minimum of supported vector length and the requested length. The length is -then returned on the stack. - -This sets number of lanes for vector operations working on corresponding vector -types. Setting vector length to zero turns corresponding vector operations -(aside of set length) into NOPs. - diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 86fd0ce..2a74d54 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -1,7 +1,1126 @@ -## Flexible vectors proposal +# Concrete model for flexible vectors -- Instruction descriptions - - [FlexibleVectors.md](FlexibleVectors.md) - - [FlexibleVectorsSecondTier.md](FlexibleVectorsSecondTier.md) - - [FlexibleVectorsThirdTier.md](FlexibleVectorsThirdTier.md) +## Terms +- (WASM) compiler: A compiler reads a program source code (eg: C) and translates it into WASM bytecode. +- target architecture: the architecture where the WASM bytecode is ran. +- WASM engine: An engine reads WASM bytecode and translates it into native instructions for the architecture it is currently running on, and runs them. +- element width: size in bits of the element (`float` is 32-bit wide). +- SIMD width: size in bits of the hardware SIMD registers. +- SIMD length: number of elements that can fit within a hardware SIMD register (SIMD width / element width). +- lane: element within a vector + +## General goal + +The core of the flexible vectors proposal is to a have a single bytecode that can target multiple architectures with different SIMD width. +This should be as efficient as possible (ie: it should run as fast as possible). +In order to do that, flexible vectors should abstract the SIMD width away. + +List of potential targets: + +| ISA | SIMD width | +|:--------------|:-------------| +| SSE - SSE4.2 | 128 | +| AVX - AVX2 | 256 | +| AVX512 - | 512 | +| Altivec - VSX | 128 | +| Neon | 128 | +| SVE | 128 - 2048 | +| Risc-V V | 128 - 65536? | + +I propose that SIMD width is accessible by WASM bytecode, and have the following property: SIMD width is a runtime constant, and thus cannot change during the execution of the program. +Therefore, all vectors manipulated by the program have the exact same width. +Smaller vectors are handled using masks. +As a consequence, a WASM compiler cannot assume any particular value, but can assume it will not change. +A WASM engine could optimize the bytecode for the target architecture (constant folding) where the actual SIMD width is known. + + +We could impose more constraints on the SIMD width: + +- It should be a multiple of 128 bits. +- It is a power of 2? (might be problematic to target SVE) +- It cannot be wider than 2048? 
(might be problematic to target Risc-V V) + +## SIMD types + +Vector types: + +- `vec.v8`: vector of 8-bit elements +- `vec.v16`: vector of 16-bit elements +- `vec.v32`: vector of 32-bit elements +- `vec.v64`: vector of 64-bit elements +- `vec.v128`: vector of 128-bit elements + +Mask types: + +- `vec.m8`: mask for a vector of 8-bit elements +- `vec.m16`: mask for a vector of 16-bit elements +- `vec.m32`: mask for a vector of 32-bit elements +- `vec.m64`: mask for a vector of 64-bit elements +- `vec.m128`: mask for a vector of 128-bit elements + +Vector types can be interpreted in multiple ways: + +| vector type | interpretations | +|:------------|:---------------------------------------------------| +| `vec.v8` | `vec.i8` | +| `vec.v16` | `vec.i16` | +| `vec.v32` | `vec.i32`, `vec.f32` | +| `vec.v64` | `vec.i64`, `vec.f64` | +| `vec.v128` | `vec.v8x16`, `vec.v16x8`, `vec.v32x4`, `vec.v64x2` | + +Mask types are not vector types, and their actual representation differ depending on the target architecture. +In particular, `vec.m8` and `vec.m16` might have different architectural sizes (or actually be the same). +In a similar fashion, `vec.v8` and `vec.m8` might also have different architectural sizes (or actually be the same). + +## Immediate LaneIdx operands + +As the vector length is not known at compile time, it does not make much sense to specify lane indices as immediates. + +Therefore, the type `i32` will be used to specify a lane index. +The range of a lane index is no further constrained, and is intepreted modulo the actual vector length. + +If we impose an upper bound on the SIMD width, we can further constraint the range of lane indices. +For instance, a maximum width of 2048 would allow to store a lane index in a 8-bit integer. +Similarily, a maximum width of 524288 would allow to store a lane index in a 16-bit integer. +The would not change the instruction encoding as they all take runtime lane indices which are always `i32`. + +## Special Operations + +### Vector length + +Returns the number of elements. +Returned value will always be the same during the whole execution of the program. 
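As a usage sketch of this runtime-constant length (the `length` instructions themselves are listed just below; the addresses, the element count `n`, and the variable names are illustrative assumptions, not part of the proposal):

```python
# Usage sketch: copy n 32-bit elements from addr_src to addr_dst.
# vec.v32.length is a runtime constant, so it can be queried once and hoisted
# out of the loop; an engine that knows the target SIMD width may constant-fold it.
vlen = vec.v32.length
i = 0
while i + vlen <= n:                      # whole vectors
    v = vec.v32.load(addr_src + 4 * i)    # memarg formed from a byte offset
    vec.v32.store(addr_dst + 4 * i, v)
    i += vlen
# the remaining n - i elements form a tail, handled with the mask
# operations and masked memory accesses described later in this document
```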
+ +- `vec.v8.length -> i32` +- `vec.v16.length -> i32` +- `vec.v32.length -> i32` +- `vec.v64.length -> i32` +- `vec.v128.length -> i32` + +### Mask count + +Returns the number of active lanes of a `mask` + +- `vec.m8.count(m: vec.m8) -> i32` +- `vec.m16.count(m: vec.m16) -> i32` +- `vec.m32.count(m: vec.m32) -> i32` +- `vec.m64.count(m: vec.m64) -> i32` +- `vec.m128.count(m: vec.m128) -> i32` + +```python +def mask.S.count(m): + result = 0 + for i in range(mask.S.length): + if m[i]: + result += 1 + return result +``` + +### Mask all active + +Returns a mask where all lanes are active + +- `vec.m8.all -> vec.m8` +- `vec.m16.all -> vec.m16` +- `vec.m32.all -> vec.m32` +- `vec.m64.all -> vec.m64` +- `vec.m128.all -> vec.m128` + +```python +def mask.S.all(): + result = mask.S.New() + for i in range(mask.S.length): + result[i] = 1 + return result +``` + +### Mask None active + +Returns a mask where no lane is active + +- `vec.m8.none -> vec.m8` +- `vec.m16.none -> vec.m16` +- `vec.m32.none -> vec.m32` +- `vec.m64.none -> vec.m64` +- `vec.m128.none -> vec.m128` + +```python +def mask.S.none(): + result = mask.S.New() + for i in range(mask.S.length): + result[i] = 0 + return result +``` + +### Mask index CMP + +Returns a mask whose active lanes satisfy `(x + laneIdx) CMP n` +CMP one of the following: `eq`, `ne`, `lt`, `le`, `gt`, `ge` + +- `vec.m8.index_CMP(x: i32, n: i32) -> vec.m8` +- `vec.m16.index_CMP(x: i32, n: i32) -> vec.m16` +- `vec.m32.index_CMP(x: i32, n: i32) -> vec.m32` +- `vec.m64.index_CMP(x: i32, n: i32) -> vec.m64` +- `vec.m128.index_CMP(x: i32, n: i32) -> vec.m128` + +```python +def vec.S.index_CMP(x, n): + result = vec.S.New() + for i in range(vec.S.length): + if (x + i) CMP n: + result[i] = 1 + else: + result[i] = 0 + return result +``` + +### Mask index first + +Returns the index of the first active lane. +If there is no active lane, the index of the last lane is returned. + +- `vec.m8.index_first(m: vec.m8) -> i32` +- `vec.m16.index_first(m: vec.m16) -> i32` +- `vec.m32.index_first(m: vec.m32) -> i32` +- `vec.m64.index_first(m: vec.m64) -> i32` +- `vec.m128.index_first(m: vec.m128) -> i32` + +```python +def vec.S.index_first(m): + for i in range(vec.S.length): + if m[i]: + return i + return vec.S.length - 1 +``` + +### Mask index last + +Returns the index of the last active lane. +If there is no active lane, the index of the first lane is returned. + +- `vec.m8.index_last(m: vec.m8) -> i32` +- `vec.m16.index_last(m: vec.m16) -> i32` +- `vec.m32.index_last(m: vec.m32) -> i32` +- `vec.m64.index_last(m: vec.m64) -> i32` +- `vec.m128.index_last(m: vec.m128) -> i32` + +```python +def vec.S.index_last(m): + idx = 0 + for i in range(vec.S.length): + if m[i]: + idx = i + return idx +``` + +### Mask first + +Returns a mask with a single active lane that corresponds to the first active lane of the input. +If there is no active lanes, then an empty mask is returned. + +- `vec.m8.first(m: vec.m8) -> vec.m8` +- `vec.m16.first(m: vec.m16) -> vec.m16` +- `vec.m32.first(m: vec.m32) -> vec.m32` +- `vec.m64.first(m: vec.m64) -> vec.m64` +- `vec.m128.first(m: vec.m128) -> vec.m128` + +```python +def vec.S.first(m): + result = mask.S.New() + for i in range(vec.S.length): + if m[i]: + result[i] = 1 + break + return result +``` + +### Mask last + +Returns a mask with a single active lane that corresponds to the last active lane of the input. +If there is no active lanes, then an empty mask is returned. 
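Before the remaining `last` listings below, a usage sketch of the index-based constructors above (`index_lt` denotes `index_CMP` instantiated with `lt`; the loop variables `i` and `n` are illustrative assumptions):

```python
# Usage sketch: build the mask for the final, partial iteration of a loop over
# n elements, where i is the index of the first element left to process.
tail = vec.m32.index_lt(i, n)    # lane j is active iff (i + j) < n
left = vec.m32.count(tail)       # active lane count; n - i when n - i <= vec.v32.length
```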
+ +- `vec.m8.last(m: vec.m8) -> vec.m8` +- `vec.m16.last(m: vec.m16) -> vec.m16` +- `vec.m32.last(m: vec.m32) -> vec.m32` +- `vec.m64.last(m: vec.m64) -> vec.m64` +- `vec.m128.last(m: vec.m128) -> vec.m128` + +```python +def vec.S.last(m): + result = mask.S.New() + idx = -1 + for i in range(vec.S.length): + if m[i]: + idx = i + if idx >= 0: + result[idx] = 1 + return result +``` + + +## Memory operations + +### Vector load + +- `vec.v8.load(a: memarg) -> vec.v8` +- `vec.v16.load(a: memarg) -> vec.v16` +- `vec.v32.load(a: memarg) -> vec.v32` +- `vec.v64.load(a: memarg) -> vec.v64` +- `vec.v128.load(a: memarg) -> vec.v128` + +### Vector load mask zero + +Inactive elements are set to `0` + +- `vec.v8.load_mz(m: vec.m8, a: memarg) -> vec.v8` +- `vec.v16.load_mz(m: vec.m16, a: memarg) -> vec.v16` +- `vec.v32.load_mz(m: vec.m32, a: memarg) -> vec.v32` +- `vec.v64.load_mz(m: vec.m64, a: memarg) -> vec.v64` +- `vec.v128.load_mz(m: vec.m128, a: memarg) -> vec.v128` + +### Vector load mask undefined + +Inactive elements have undefined values + +- `vec.v8.load_mx(m: vec.m8, a: memarg) -> vec.v8` +- `vec.v16.load_mx(m: vec.m16, a: memarg) -> vec.v16` +- `vec.v32.load_mx(m: vec.m32, a: memarg) -> vec.v32` +- `vec.v64.load_mx(m: vec.m64, a: memarg) -> vec.v64` +- `vec.v128.load_mx(m: vec.m128, a: memarg) -> vec.v128` + +### Vector load splat + +- `vec.v8.load_splat(a: memarg) -> vec.v8` +- `vec.v16.load_splat(a: memarg) -> vec.v16` +- `vec.v32.load_splat(a: memarg) -> vec.v32` +- `vec.v64.load_splat(a: memarg) -> vec.v64` +- `vec.v128.load_splat(a: memarg) -> vec.v128` + +### Vector store + +- `vec.v8.store(a: memarg, v: vec.v8)` +- `vec.v16.store(a: memarg, v: vec.v16)` +- `vec.v32.store(a: memarg, v: vec.v32)` +- `vec.v64.store(a: memarg, v: vec.v64)` +- `vec.v128.store(a: memarg, v: vec.v128)` + +### Vector store mask + +Inactive elements are not stored + +- `vec.v8.m_store(m: vec.m8, a: memarg, v: vec.v8)` +- `vec.v16.m_store(m: vec.m16, a: memarg, v: vec.v16)` +- `vec.v32.m_store(m: vec.m32, a: memarg, v: vec.v32)` +- `vec.v64.m_store(m: vec.m64, a: memarg, v: vec.v64)` +- `vec.v128.m_store(m: vec.m128, a: memarg, v: vec.v128)` + +## Lane operations + +### Splat scalar + +`idx` is interpreted modulo the length of the vector. + +- `vec.i8.splat(v: vec.v8, x: i32) -> vec.v8` +- `vec.i16.splat(v: vec.v16, x: i32) -> vec.v16` +- `vec.i32.splat(v: vec.v32, x: i32) -> vec.v32` +- `vec.f32.splat(v: vec.v32, x: f32) -> vec.32` +- `vec.i64.splat(v: vec.v64, x: i64) -> vec.v64` +- `vec.f64.splat(v: vec.v64, x: f64) -> vec.v64` +- `vec.v128.splat(v: vec.v128, x: v128) -> vec.v128` + + +### Extract lane + +`idx` is interpreted modulo the length of the vector. + +- `vec.s8.extract_lane(v: vec.v8, idx: i32) -> i32` +- `vec.u8.extract_lane(v: vec.v8, idx: i32) -> i32` +- `vec.s16.extract_lane(v: vec.v16, idx: i32) -> i32` +- `vec.u16.extract_lane(v: vec.v16, idx: i32) -> i32` +- `vec.i32.extract_lane(v: vec.v32, idx: i32) -> i32` +- `vec.f32.extract_lane(v: vec.v32, idx: i32) -> f32` +- `vec.i64.extract_lane(v: vec.v64, idx: i32) -> i64` +- `vec.f64.extract_lane(v: vec.v64, idx: i32) -> f64` +- `vec.v128.extract_lane(v: vec.v128, idx: i32) -> v128` + +### Replace lane + +`idx` is interpreted modulo the length of the vector. 
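The masked memory accesses and lane accessors above combine naturally with such a tail mask; a small sketch follows (the replace forms are listed just below, `vec.i32.replace_lane` is the spelling assumed for them here, and `tail`, `i`, and the addresses are the illustrative names from the earlier sketches):

```python
# Usage sketch: finish the copy loop's partial tail with masked accesses.
v = vec.v32.load_mz(tail, addr_src + 4 * i)   # inactive lanes are read as 0
vec.v32.m_store(tail, addr_dst + 4 * i, v)    # inactive lanes are not stored

# Lane accessors take runtime indices, interpreted modulo the vector length:
x = vec.i32.extract_lane(v, 0)                       # first lane as a scalar
v = vec.i32.replace_lane(v, vec.v32.length - 1, x)   # assumed replace spelling
```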
+ +- `vec.i8.extract_lane(v: vec.v8, idx: i32, x: i32) -> vec.v8` +- `vec.i16.extract_lane(v: vec.v16, idx: i32, x: i32) -> vec.v16` +- `vec.i32.extract_lane(v: vec.v32, idx: i32, x: i32) -> vec.v32` +- `vec.f32.extract_lane(v: vec.v32, idx: i32, x: f32) -> vec.v32` +- `vec.i64.extract_lane(v: vec.v64, idx: i32, x: i64) -> vec.v64` +- `vec.f64.extract_lane(v: vec.v64, idx: i32, x: f64) -> vec.v64` +- `vec.v128.extract_lane(v: vec.v128, idx: i32, x: v128) -> vec.v128` + +### Load lane + +Loads a single lane into existing vector. +`idx` is interpreted modulo the length of the vector. + +- `vec.v8.load_lane(a: memarg, v: vec.v8, idx: i32) -> vec.v8` +- `vec.v16.load_lane(a: memarg, v: vec.v16, idx: i32) -> vec.v16` +- `vec.v32.load_lane(a: memarg, v: vec.v32, idx: i32) -> vec.v32` +- `vec.v64.load_lane(a: memarg, v: vec.v64, idx: i32) -> vec.v64` +- `vec.v128.load_lane(a: memarg, v: vec.v128, idx: i32) -> vec.v128` + +### Store lane + +Stores a single lane from vector +`idx` is interpreted modulo the length of the vector. + +- `vec.v8.store_lane(a: memarg, v: vec.v8, idx: i32)` +- `vec.v16.store_lane(a: memarg, v: vec.v16, idx: i32)` +- `vec.v32.store_lane(a: memarg, v: vec.v32, idx: i32)` +- `vec.v64.store_lane(a: memarg, v: vec.v64, idx: i32)` +- `vec.v128.store_lane(a: memarg, v: vec.v128, idx: i32)` + + +## Unary Arithmetic operators + +UNOP designates any unary operator (eg: neg, not) + +### UNOP + +- `vec.v8.UNOP(a: vec.v8) -> vec.v8` +- `vec.v16.UNOP(a: vec.v16) -> vec.v16` +- `vec.v32.UNOP(a: vec.v32) -> vec.v32` +- `vec.v64.UNOP(a: vec.v64) -> vec.v64` +- `vec.v128.UNOP(a: vec.v128) -> vec.v128` + +- `vec.m8.UNOP(a: vec.m8) -> vec.m8` +- `vec.m16.UNOP(a: vec.m16) -> vec.m16` +- `vec.m32.UNOP(a: vec.m32) -> vec.m32` +- `vec.m64.UNOP(a: vec.m64) -> vec.m64` +- `vec.m128.UNOP(a: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. + +### UNOP mask zero + +Inactive lanes are set to zero. + +- `vec.v8.UNOP_mz(m: vec.m8, a: vec.v8) -> vec.v8` +- `vec.v16.UNOP_mz(m: vec.m16, a: vec.v16) -> vec.v16` +- `vec.v32.UNOP_mz(m: vec.m32, a: vec.v32) -> vec.v32` +- `vec.v64.UNOP_mz(m: vec.m64, a: vec.v64) -> vec.v64` +- `vec.v128.UNOP_mz(m: vec.m128, a: vec.v128) -> vec.v128` + +### UNOP mask merge + +Inactive lanes are left untouched. + +- `vec.v8.UNOP_mm(m: vec.m8, a: vec.v8) -> vec.v8` +- `vec.v16.UNOP_mm(m: vec.m16, a: vec.v16) -> vec.v16` +- `vec.v32.UNOP_mm(m: vec.m32, a: vec.v32) -> vec.v32` +- `vec.v64.UNOP_mm(m: vec.m64, a: vec.v64) -> vec.v64` +- `vec.v128.UNOP_mm(m: vec.m128, a: vec.v128) -> vec.v128` + +### UNOP mask undefined + +Inactive lanes are undefined. + +- `vec.v8.UNOP_mx(m: vec.m8, a: vec.v8) -> vec.v8` +- `vec.v16.UNOP_mx(m: vec.m16, a: vec.v16) -> vec.v16` +- `vec.v32.UNOP_mx(m: vec.m32, a: vec.v32) -> vec.v32` +- `vec.v64.UNOP_mx(m: vec.m64, a: vec.v64) -> vec.v64` +- `vec.v128.UNOP_mx(m: vec.m128, a: vec.v128) -> vec.v128` + +## Binary Arithmetic operators + +BINOP designates any binary operator that is not a comparison (eg: add, sub, rsub, mul, div, rdiv, and, or, xor...) + +### Select mask + +Selects active elements from `a` and inactive elements from `b`. 
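A semantics sketch for `select`, in the same pseudocode style as the rest of this document and following the one-line description above (the signatures are listed just below):

```python
def vec.S.select(m, a, b):
    result = vec.S.New()
    for i in range(vec.S.length):
        if m[i]:
            result[i] = a[i]   # active lane: taken from a
        else:
            result[i] = b[i]   # inactive lane: taken from b
    return result
```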
+ +- `vec.v8.select(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.select(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.select(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.select(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.select(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + +### BINOP + +- `vec.v8.BINOP(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.BINOP(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.BINOP(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.BINOP(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.BINOP(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.BINOP(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.BINOP(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.BINOP(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.BINOP(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.BINOP(a: vec.m128, b: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. + +### BINOP mask zero + +Inactive elements are set to zero. + +- `vec.v8.BINOP_mz(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.BINOP_mz(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.BINOP_mz(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.BINOP_mz(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.BINOP_mz(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.BINOP_mz(m: vec.m8, a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.BINOP_mz(m: vec.m16, a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.BINOP_mz(m: vec.m32, a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.BINOP_mz(m: vec.m64, a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.BINOP_mz(m: vec.m128, a: vec.m128, b: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. + +### BINOP mask merge + +Inactive elements are forwarded from `a`. + +- `vec.v8.BINOP_mm(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.BINOP_mm(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.BINOP_mm(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.BINOP_mm(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.BINOP_mm(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.BINOP_mm(m: vec.m8, a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.BINOP_mm(m: vec.m16, a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.BINOP_mm(m: vec.m32, a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.BINOP_mm(m: vec.m64, a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.BINOP_mm(m: vec.m128, a: vec.m128, b: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. + +### BINOP mask undefined + +Inactive elements are undefined. + +- `vec.v8.BINOP_mx(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.BINOP_mx(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.BINOP_mx(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.BINOP_mx(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.BINOP_mx(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + + +## Comparisons + +CMP designates any comparison operator (eg: `eq_u`, `ne_s`, `lt_f`, `le_s`, `gt_f`, `ge_u`) + +### CMP + +- `vec.m8.CMP(a: vec.v8, b: vec.v8) -> vec.m8` +- `vec.m16.CMP(a: vec.v16, b: vec.v16) -> vec.m16` +- `vec.m32.CMP(a: vec.v32, b: vec.v32) -> vec.m32` +- `vec.m64.CMP(a: vec.v64, b: vec.v64) -> vec.m64` + +```python +def vec.S.CMP(a, b): + result = mask.S.New() + for i in range(mask.S.length): + if a[i] CMP b[i]: + result[i] = 1 + else: + result[i] = 0 + return result +``` + +### CMP mask + +Inactive elements are set to `0`. 
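Before the masked-comparison listings below, a semantics sketch for the masked binary forms of the previous section, in the same pseudocode style (`BINOP` stands for any of the listed binary operators):

```python
def vec.S.BINOP_mz(m, a, b):
    result = vec.S.New()
    for i in range(vec.S.length):
        if m[i]:
            result[i] = BINOP(a[i], b[i])
        else:
            result[i] = 0            # inactive lanes are set to zero
    return result

def vec.S.BINOP_mm(m, a, b):
    result = vec.S.New()
    for i in range(vec.S.length):
        if m[i]:
            result[i] = BINOP(a[i], b[i])
        else:
            result[i] = a[i]         # inactive lanes are forwarded from a
    return result
```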
+ +- `vec.m8.CMP_m(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.m8` +- `vec.m16.CMP_m(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.m16` +- `vec.m32.CMP_m(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.m32` +- `vec.m64.CMP_m(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.m64` + +```python +def vec.S.CMP_m(m, a, b): + result = mask.S.New() + for i in range(mask.S.length): + if m[i] and a[i] CMP b[i]: + result[i] = 1 + else: + result[i] = 0 + return result +``` + +### Sign to mask + +For each lane, the mask lane is set to `1` if the element is negative (floats are not interpreted), and `0` otherwise. + +- `vec.m8.sign(a: vec.v8) -> vec.m8` +- `vec.m16.sign(a: vec.v16) -> vec.m16` +- `vec.m32.sign(a: vec.v32) -> vec.m32` +- `vec.m64.sign(a: vec.v64) -> vec.m64` + +## Inter-lane operations + +### LUT1 zero + +Elements whose index is out of bounds are set to `0`. + +- `vec.v8.lut1_z(idx: vec.v8, a: vec.v8) -> vec.v8` +- `vec.v16.lut1_z(idx: vec.v16, a: vec.v16) -> vec.v16` +- `vec.v32.lut1_z(idx: vec.v32, a: vec.v32) -> vec.v32` +- `vec.v64.lut1_z(idx: vec.v64, a: vec.v64) -> vec.v64` +- `vec.v128.lut1_z(idx: vec.v128, a: vec.v128) -> vec.v128` + +```python +def vec.S.lut1_z(idx, a): + result = vec.S.New() + for i in range(vec.S.length): + if idx[i] < vec.S.length: + result[i] = a[idx[i]] + else: + result[i] = 0 + return result +``` +### LUT1 merge + +Elements whose index is out of bounds are taken from fallback. + +- `vec.v8.lut1_m(idx: vec.v8, a: vec.v8, fallback: vec.v8) -> vec.v8` +- `vec.v16.lut1_m(idx: vec.v16, a: vec.v16, fallback: vec.v16) -> vec.v16` +- `vec.v32.lut1_m(idx: vec.v32, a: vec.v32, fallback: vec.v32) -> vec.v32` +- `vec.v64.lut1_m(idx: vec.v64, a: vec.v64, fallback: vec.v64) -> vec.v64` + +```python +def vec.S.lut1_m(idx, a, fallback): + result = vec.S.New() + for i in range(vec.S.length): + if idx[i] < vec.S.length: + result[i] = a[idx[i]] + else: + result[i] = fallback[i] + return result +``` + +### V128 shuffle + +Applies shuffle to each v128 of the vector. + +- `vec.i8x16.shuffle(a: vec.v128, b: vec.v128, imm: ImmLaneIdx32[16]) -> vec.v128` + +```python +def vec.i8x16.shuffle(a, b, imm): + result = vec.v128.New() + for i in range(vec.v128.length): + result[i] = i8x16.shuffle(a[i], b[i], imm) + return result +``` + +### V128 swizzle + +Applies swizzle to each v128 of the vector. + +- `vec.i8x16.swizzle(a: vec.v128, s: vec.v128) -> vec.v128` + +```python +def vec.i8x16.swizzle(idx, a, s): + result = vec.v128.New() + for i in range(vec.v128.length): + result[i] = i8x16.swizzle(a[i], s[i], imm) + return result +``` + +### Splat lane + +Gets a single lane from vector and broadcast it to the entire vector. +`idx` is interpreted modulo the cardinal of the vector. + +- `vec.v8.splat_lane(v: vec.v8, idx: i32) -> vec.v8` +- `vec.v16.splat_lane(v: vec.v16, idx: i32) -> vec.v16` +- `vec.v32.splat_lane(v: vec.v32, idx: i32) -> vec.v32` +- `vec.v64.splat_lane(v: vec.v64, idx: i32) -> vec.v64` +- `vec.v128.splat_lane(v: vec.v128, idx: i32) -> vec.v128` + +```python +def vec.S.splat_lane(v, imm): + idx = idx % vec.S.length + result = vec.S.New() + for i in range(vec.S.length): + result[i] = v[idx] + return result +``` + +### Concat + +Copies elements from vector `a` from first active element to last active element. +Inner inactive elements are also copied. +The remaining elements are set from the first elements from `b`. 
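A semantics sketch for the sign-to-mask operation above, in the same pseudocode style, placed here before the concat listings below (`AsSignedInt` is an assumed helper that reinterprets the lane bits as a signed integer):

```python
def vec.S.sign(a):
    result = mask.S.New()
    for i in range(vec.S.length):
        # active iff the element's sign bit is set, i.e. its bits read
        # as a signed integer are negative (float lanes are not interpreted)
        if AsSignedInt(a[i]) < 0:
            result[i] = 1
        else:
            result[i] = 0
    return result
```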
+ +- `vec.v8.concat(m: vec.m8, a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.concat(m: vec.m16, a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.concat(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.concat(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.concat(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` + + +```python +def vec.S.concat(m, a, b): + begin = -1 + end = -1 + for i in range(vec.S.length): + if m[i]: + end = i + 1 + if begin < 0: + begin = i + + result = vec.S.New() + i = 0 + for j in range(begin, end): + result[i] = a[j] + i += 1 + for j in range(0, vec.S.length - i): + result[i] = b[j] + i += 1 + return result +``` + +### Lane shift + +Concats the 2 input vector to form a single double-width vector. +Shifts this double-width vector by `n` lane to the left (to LSB). +Extracts the lower half of the shifted vector. +`n` is interpreted modulo the length of the vector. + + +- `vec.v8.lane_shift(a: vec.v8, b: vec.v8, n: i32) -> vec.v8` +- `vec.v16.lane_shift(a: vec.v16, b: vec.v16, n: i32) -> vec.v16` +- `vec.v32.lane_shift(a: vec.v32, b: vec.v32, n: i32) -> vec.v32` +- `vec.v64.lane_shift(a: vec.v64, b: vec.v64, n: i32) -> vec.v64` +- `vec.v128.lane_shift(a: vec.v128, b: vec.v128, n: i32) -> vec.v128` + +```python +def vec.S.lane_shift(a, b, n): + result = vec.S.New() + n = n % vec.S.length + for i in range(0, n): + result[i] = a[i + n] + for i in range(n, vec.S.length): + result[i] = b[i - n] + return result +``` + +### Interleave even + +Extracts even elements from both input and interleaves them. + +- `vec.v8.interleave_even(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.interleave_even(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.interleave_even(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.interleave_even(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.interleave_even(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.interleave_even(a: vec.m8, b: mec.m8) -> vec.m8` +- `vec.m16.interleave_even(a: vec.m16, b: mec.m16) -> vec.m16` +- `vec.m32.interleave_even(a: vec.m32, b: mec.m32) -> vec.m32` +- `vec.m64.interleave_even(a: vec.m64, b: mec.m64) -> vec.m64` +- `vec.m128.interleave_even(a: vec.m128, b: mec.m128) -> vec.m128` + + +```python +def vec.S.interleave_even(a, b): + result = vec.S.New() + for i in range(vec.S.length/2): + result[2*i] = a[2*i] + result[2*i + 1] = b[2*i] + return result +``` + +Note: + +> - can be implemented with `TRN1` on Neon/SVE + +### Interleave odd + +Extracts odd elements from both input and interleaves them. + +- `vec.v8.interleave_odd(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.interleave_odd(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.interleave_odd(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.interleave_odd(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.interleave_odd(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.interleave_odd(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.interleave_odd(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.interleave_odd(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.interleave_odd(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.interleave_odd(a: vec.m128, b: vec.m128) -> vec.m128` + + +```python +def vec.S.interleave_odd(a, b): + result = vec.S.New() + for i in range(vec.S.length/2): + result[2*i] = a[2*i+1] + result[2*i + 1] = b[2*i+1] + return result +``` + +Note: + +> - can be implemented with `TRN2` on Neon/SVE + +### Concat even + +Extracts even elements from both input and concatenate them. 
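A usage sketch for the even/odd extraction operations (`concat_even` is listed just below, `concat_odd` right after it): de-interleaving complex numbers stored as (re, im) pairs. The memory layout and addresses are illustrative assumptions.

```python
# Usage sketch: split interleaved (re, im) f32 pairs into two vectors.
lo = vec.v32.load(addr)                        # re0 im0 re1 im1 ...
hi = vec.v32.load(addr + 4 * vec.v32.length)   # the following pairs
re = vec.v32.concat_even(lo, hi)               # re0 re1 re2 ...
im = vec.v32.concat_odd(lo, hi)                # im0 im1 im2 ...
```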
+ +- `vec.v8.concat_even(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.concat_even(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.concat_even(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.concat_even(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.concat_even(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.concat_even(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.concat_even(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.concat_even(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.concat_even(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.concat_even(a: vec.m128, b: vec.m128) -> vec.m128` + + +```python +def vec.S.concat_even(a, b): + result = vec.S.New() + + for i in range(vec.S.length/2): + result[i] = a[2*i] + for i in range(vec.S.length/2): + result[i + vec.S.length/2] = b[2*i] + return result +``` + +Note: + +> - can be implemented with `UZP1` on Neon/SVE +> - Wrapping narrowing integer conversions could be implemented with this function + +### Concat odd + +Extracts odd elements from both input and concatenate them. + +- `vec.v8.concat_odd(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.concat_odd(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.concat_odd(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.concat_odd(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.concat_odd(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.concat_odd(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.concat_odd(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.concat_odd(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.concat_odd(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.concat_odd(a: vec.m128, b: vec.m128) -> vec.m128` + + +```python +def vec.S.concat_odd(a, b): + result = vec.S.New() + + for i in range(vec.S.length/2): + result[i] = a[2*i+1] + for i in range(vec.S.length/2): + result[i + vec.S.length/2] = b[2*i+1] + return result +``` + +Note: + +> - can be implemented with `UZP2` on Neon/SVE + +### Interleave low + +Extracts the lower half of both input and interleaves their elements. + +- `vec.v8.interleave_low(a: vec.v8, b: vec.v8) -> vec.v8` +- `vec.v16.interleave_low(a: vec.v16, b: vec.v16) -> vec.v16` +- `vec.v32.interleave_low(a: vec.v32, b: vec.v32) -> vec.v32` +- `vec.v64.interleave_low(a: vec.v64, b: vec.v64) -> vec.v64` +- `vec.v128.interleave_low(a: vec.v128, b: vec.v128) -> vec.v128` + +- `vec.m8.interleave_low(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.interleave_low(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.interleave_low(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.interleave_low(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.interleave_low(a: vec.m128, b: vec.m128) -> vec.m128` + + +```python +def vec.S.interleave_low(a, b): + result = vec.S.New() + for i in range(vec.S.length/2): + result[2*i] = a[i] + result[2*i + 1] = b[i] + return result +``` + +Note: + +> - can be implemented with `ZIP1` on Neon/SVE + +### Interleave high + +Extracts the higher half of both input and interleaves their elements. 
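The note under *Concat even* above observes that wrapping narrowing conversions can be built from it; a sketch follows, before the interleave-high listings below (it assumes WebAssembly's little-endian lane ordering, so the even 16-bit lanes of a reinterpreted `vec.v32` hold the low halves of the 32-bit elements):

```python
# Sketch: a wrapping (truncating) i32 -> i16 narrowing built from concat_even
# and the reinterpret casts defined later in this document.
def wrapping_narrow_i32_to_i16(a, b):     # a, b interpreted as vec.i32
    lo_a = vec.v16.cast_v32(a)            # reinterpret as 16-bit lanes
    lo_b = vec.v16.cast_v32(b)
    return vec.v16.concat_even(lo_a, lo_b)
```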
+
+- `vec.v8.interleave_high(a: vec.v8, b: vec.v8) -> vec.v8`
+- `vec.v16.interleave_high(a: vec.v16, b: vec.v16) -> vec.v16`
+- `vec.v32.interleave_high(a: vec.v32, b: vec.v32) -> vec.v32`
+- `vec.v64.interleave_high(a: vec.v64, b: vec.v64) -> vec.v64`
+- `vec.v128.interleave_high(a: vec.v128, b: vec.v128) -> vec.v128`
+
+- `vec.m8.interleave_high(a: vec.m8, b: vec.m8) -> vec.m8`
+- `vec.m16.interleave_high(a: vec.m16, b: vec.m16) -> vec.m16`
+- `vec.m32.interleave_high(a: vec.m32, b: vec.m32) -> vec.m32`
+- `vec.m64.interleave_high(a: vec.m64, b: vec.m64) -> vec.m64`
+- `vec.m128.interleave_high(a: vec.m128, b: vec.m128) -> vec.m128`
+
+
+```python
+def vec.S.interleave_high(a, b):
+    result = vec.S.New()
+    for i in range(vec.S.length/2):
+        result[2*i] = a[i + vec.S.length/2]
+        result[2*i + 1] = b[i + vec.S.length/2]
+    return result
+```
+
+Note:
+
+> - can be implemented with `ZIP2` on Neon/SVE
+
+## Conversions
+
+### Narrowing conversions
+
+Converts each element of both inputs to narrower types using saturation, and concatenates them.
+
+- `vec.i8.narrow_i16_u(a: vec.v16, b: vec.v16) -> vec.v8`
+- `vec.i8.narrow_i16_s(a: vec.v16, b: vec.v16) -> vec.v8`
+- `vec.i16.narrow_i32_u(a: vec.v32, b: vec.v32) -> vec.v16`
+- `vec.i16.narrow_i32_s(a: vec.v32, b: vec.v32) -> vec.v16`
+- `vec.i32.narrow_i64_u(a: vec.v64, b: vec.v64) -> vec.v32`
+- `vec.i32.narrow_i64_s(a: vec.v64, b: vec.v64) -> vec.v32`
+
+### Mask narrowing
+
+Returns a `mask` for a narrower type with the same active lanes.
+
+- `vec.m8.narrow_m16(a: vec.m16, b: vec.m16) -> vec.m8`
+- `vec.m16.narrow_m32(a: vec.m32, b: vec.m32) -> vec.m16`
+- `vec.m32.narrow_m64(a: vec.m64, b: vec.m64) -> vec.m32`
+- `vec.m64.narrow_m128(a: vec.m128, b: vec.m128) -> vec.m64`
+
+```python
+def mask.S.narrow(a, b):
+    result = mask.S.New()
+    for i in range(mask.S.length/2):
+        result[i] = a[i]
+    for i in range(mask.S.length/2):
+        result[i + mask.S.length/2] = b[i]
+    return result
+```
+
+### Widening conversions
+
+- `vec.i16.widen_low_i8_u(a: vec.v8) -> vec.v16`
+- `vec.i16.widen_low_i8_s(a: vec.v8) -> vec.v16`
+- `vec.i32.widen_low_i16_u(a: vec.v16) -> vec.v32`
+- `vec.i32.widen_low_i16_s(a: vec.v16) -> vec.v32`
+- `vec.i64.widen_low_i32_u(a: vec.v32) -> vec.v64`
+- `vec.i64.widen_low_i32_s(a: vec.v32) -> vec.v64`
+
+- `vec.i16.widen_high_i8_u(a: vec.v8) -> vec.v16`
+- `vec.i16.widen_high_i8_s(a: vec.v8) -> vec.v16`
+- `vec.i32.widen_high_i16_u(a: vec.v16) -> vec.v32`
+- `vec.i32.widen_high_i16_s(a: vec.v16) -> vec.v32`
+- `vec.i64.widen_high_i32_u(a: vec.v32) -> vec.v64`
+- `vec.i64.widen_high_i32_s(a: vec.v32) -> vec.v64`
+
+### Mask widening
+
+Returns a `mask` for a wider type with the same active lanes as the lower/higher part of the original `mask`.
+
+- `vec.m16.widen_low_m8(m: vec.m8) -> vec.m16`
+- `vec.m32.widen_low_m16(m: vec.m16) -> vec.m32`
+- `vec.m64.widen_low_m32(m: vec.m32) -> vec.m64`
+- `vec.m128.widen_low_m64(m: vec.m64) -> vec.m128`
+
+- `vec.m16.widen_high_m8(m: vec.m8) -> vec.m16`
+- `vec.m32.widen_high_m16(m: vec.m16) -> vec.m32`
+- `vec.m64.widen_high_m32(m: vec.m32) -> vec.m64`
+- `vec.m128.widen_high_m64(m: vec.m64) -> vec.m128`
+
+```python
+def vec.S.widen_low_T(m):
+    result = vec.S.New()
+    for i in range(vec.S.length):
+        result[i] = m[i]
+    return result
+```
+
+```python
+def vec.S.widen_high_T(m):
+    result = vec.S.New()
+    for i in range(vec.S.length):
+        result[i] = m[i + vec.S.length]
+    return result
+```
+
+### Floating point promotion
+
+- `vec.f64.promote_low_f32(a: vec.v32) -> vec.v64`
+- `vec.f64.promote_high_f32(a: 
vec.v32) -> vec.v64` + +### Floating point demotion + +- `vec.f32.demote_f64(a: vec.v64, b: vec.v64) -> vec.v32` + +### Integer to single-precision floating point + +- `vec.f32.convert_i32_s(a: vec.v32) -> vec.v32` +- `vec.f32.convert_i32_u(a: vec.v32) -> vec.v32` +- `vec.f32.convert_i64_s(a: vec.v64, b: vec.v64) -> vec.v32` +- `vec.f32.convert_i64_u(a: vec.v64, b: vec.v64) -> vec.v32` + +### Integer to double-precision floating point + +- `vec.f64.convert_low_i32_s(a: vec.v32) -> vec.v64` +- `vec.f64.convert_low_i32_u(a: vec.v32) -> vec.v64` +- `vec.f64.convert_high_i32_s(a: vec.v32) -> vec.v64` +- `vec.f64.convert_high_i32_u(a: vec.v32) -> vec.v64` +- `vec.f64.convert_i64_s(a: vec.v64) -> vec.v64` +- `vec.f64.convert_i64_u(a: vec.v64) -> vec.v64` + +### single precision floating point to integer with saturation + +- `vec.i32.trunc_sat_f32_s(a: vec.v32) -> vec.v32` +- `vec.i32.trunc_sat_f32_u(a: vec.v32) -> vec.v32` +- `vec.i64.trunc_sat_low_f32_s(a: vec.v32) -> vec.v64` +- `vec.i64.trunc_sat_low_f32_u(a: vec.v32) -> vec.v64` +- `vec.i64.trunc_sat_high_f32_s(a: vec.v32) -> vec.v64` +- `vec.i64.trunc_sat_high_f32_u(a: vec.v32) -> vec.v64` + +### double precision floating point to integer with saturation + +- `vec.i32.trunc_sat_f64_s(a: vec.v64, b: vec.v64) -> vec.v32` +- `vec.i32.trunc_sat_f64_u(a: vec.v64, b: vec.v64) -> vec.v32` +- `vec.i64.trunc_sat_f64_s(a: vec.v64) -> vec.v64` +- `vec.i64.trunc_sat_f64_u(a: vec.v64) -> vec.v64` + +### Reinterpret casts + +- `vec.v8.cast_v16(a: vec.v16) -> vec.v8` +- `vec.v8.cast_v32(a: vec.v32) -> vec.v8` +- `vec.v8.cast_v64(a: vec.v64) -> vec.v8` +- `vec.v8.cast_v128(a: vec.v128) -> vec.v8` +- `vec.v16.cast_v8(a: vec.v8) -> vec.v16` +- `vec.v16.cast_v32(a: vec.v32) -> vec.v16` +- `vec.v16.cast_v64(a: vec.v64) -> vec.v16` +- `vec.v16.cast_v128(a: vec.v128) -> vec.v16` +- `vec.v32.cast_v8(a: vec.v8) -> vec.v32` +- `vec.v32.cast_v16(a: vec.v16) -> vec.v32` +- `vec.v32.cast_v64(a: vec.v64) -> vec.v32` +- `vec.v32.cast_v128(a: vec.v128) -> vec.v32` +- `vec.v64.cast_v8(a: vec.v8) -> vec.v64` +- `vec.v64.cast_v16(a: vec.v16) -> vec.v64` +- `vec.v64.cast_v32(a: vec.v32) -> vec.v64` +- `vec.v64.cast_v128(a: vec.v128) -> vec.v64` +- `vec.v128.cast_v8(a: vec.v8) -> vec.v128` +- `vec.v128.cast_v16(a: vec.v16) -> vec.v128` +- `vec.v128.cast_v32(a: vec.v32) -> vec.v128` +- `vec.v128.cast_v64(a: vec.v64) -> vec.v128` + +### Mask cast + +- `vec.m8.cast_m16(m: vec.m16) -> vec.m8` +- `vec.m8.cast_m32(m: vec.m32) -> vec.m8` +- `vec.m8.cast_m64(m: vec.m64) -> vec.m8` +- `vec.m8.cast_m128(m: vec.m128) -> vec.m8` +- `vec.m16.cast_m8(m: vec.m8) -> vec.m16` +- `vec.m16.cast_m32(m: vec.m32) -> vec.m16` +- `vec.m16.cast_m64(m: vec.m64) -> vec.m16` +- `vec.m16.cast_m128(m: vec.m128) -> vec.m16` +- `vec.m32.cast_m8(m: vec.m8) -> vec.m32` +- `vec.m32.cast_m16(m: vec.m16) -> vec.m32` +- `vec.m32.cast_m64(m: vec.m64) -> vec.m32` +- `vec.m32.cast_m128(m: vec.m128) -> vec.m32` +- `vec.m64.cast_m8(m: vec.m8) -> vec.m64` +- `vec.m64.cast_m16(m: vec.m16) -> vec.m64` +- `vec.m64.cast_m32(m: vec.m32) -> vec.m64` +- `vec.m64.cast_m128(m: vec.m128) -> vec.m64` +- `vec.m128.cast_m8(m: vec.m8) -> vec.m128` +- `vec.m128.cast_m16(m: vec.m16) -> vec.m128` +- `vec.m128.cast_m32(m: vec.m32) -> vec.m128` +- `vec.m128.cast_m64(m: vec.m64) -> vec.m128` + + +```python +def vec.S.cast_T(m): + result = vec.S.New() + if vec.T.length < vec.S.length: + d = vec.S.length / vec.T.length + for i in range(vec.T.length): + for j in range(d): + result[i*d + j] = m[i] + else: + d = vec.T.length / vec.S.length + 
for i in range(vec.S.length): + result[i] = m[i * d] + return result +``` + +### Mask to vec + +Active lanes are to `-1` (all one bits), and inactive lanes are set to 0. + +- `vec.v8.convert_m8(m: vec.m8) -> vec.v8` +- `vec.v16.convert_m16(m: vec.m16) -> vec.v16` +- `vec.v32.convert_m32(m: vec.m32) -> vec.v32` +- `vec.v64.convert_m64(m: vec.m64) -> vec.v64` +- `vec.v128.convert_m128(m: vec.m128) -> vec.v128` + +## Test masks + +### Test none + +Returns `1` if and only if all lanes are inactive. +Returns `0` otherwise. + +- `vec.m8.test_none(m: vec.m8) -> i32` +- `vec.m16.test_none(m: vec.m16) -> i32` +- `vec.m32.test_none(m: vec.m32) -> i32` +- `vec.m64.test_none(m: vec.m64) -> i32` +- `vec.m128.test_none(m: vec.m128) -> i32` + +### Test any + +Returns `1` if and only if there is at least one active lane. +Returns `0` otherwise. + +- `vec.m8.test_any(m: vec.m8) -> i32` +- `vec.m16.test_any(m: vec.m16) -> i32` +- `vec.m32.test_any(m: vec.m32) -> i32` +- `vec.m64.test_any(m: vec.m64) -> i32` +- `vec.m128.test_any(m: vec.m128) -> i32` + +### Test all + +Returns `1` if and only if all lanes are active. +Returns `0` otherwise. + +- `vec.m8.test_all(m: vec.m8) -> i32` +- `vec.m16.test_all(m: vec.m16) -> i32` +- `vec.m32.test_all(m: vec.m32) -> i32` +- `vec.m64.test_all(m: vec.m64) -> i32` +- `vec.m128.test_all(m: vec.m128) -> i32` From 915c6a5c14e798b62045b9be61a0bbf566657179 Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Sat, 27 Feb 2021 17:32:14 +0100 Subject: [PATCH 2/6] Fix tiny formatting issues with lists --- proposals/flexible-vectors/README.md | 30 +++++++++++++++++----------- 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 2a74d54..49f6f74 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -390,7 +390,6 @@ UNOP designates any unary operator (eg: neg, not) - `vec.v32.UNOP(a: vec.v32) -> vec.v32` - `vec.v64.UNOP(a: vec.v64) -> vec.v64` - `vec.v128.UNOP(a: vec.v128) -> vec.v128` - - `vec.m8.UNOP(a: vec.m8) -> vec.m8` - `vec.m16.UNOP(a: vec.m16) -> vec.m16` - `vec.m32.UNOP(a: vec.m32) -> vec.m32` @@ -410,6 +409,15 @@ Inactive lanes are set to zero. - `vec.v32.UNOP_mz(m: vec.m32, a: vec.v32) -> vec.v32` - `vec.v64.UNOP_mz(m: vec.m64, a: vec.v64) -> vec.v64` - `vec.v128.UNOP_mz(m: vec.m128, a: vec.v128) -> vec.v128` +- `vec.m8.UNOP_mz(m: vec.m8, a: vec.m8) -> vec.m8` +- `vec.m16.UNOP_mz(m: vec.m16, a: vec.m16) -> vec.m16` +- `vec.m32.UNOP_mz(m: vec.m32, a: vec.m32) -> vec.m32` +- `vec.m64.UNOP_mz(m: vec.m64, a: vec.m64) -> vec.m64` +- `vec.m128.UNOP_mz(m: vec.m128, a: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. ### UNOP mask merge @@ -420,6 +428,15 @@ Inactive lanes are left untouched. - `vec.v32.UNOP_mm(m: vec.m32, a: vec.v32) -> vec.v32` - `vec.v64.UNOP_mm(m: vec.m64, a: vec.v64) -> vec.v64` - `vec.v128.UNOP_mm(m: vec.m128, a: vec.v128) -> vec.v128` +- `vec.m8.UNOP_mm(m: vec.m8, a: vec.m8) -> vec.m8` +- `vec.m16.UNOP_mm(m: vec.m16, a: vec.m16) -> vec.m16` +- `vec.m32.UNOP_mm(m: vec.m32, a: vec.m32) -> vec.m32` +- `vec.m64.UNOP_mm(m: vec.m64, a: vec.m64) -> vec.m64` +- `vec.m128.UNOP_mm(m: vec.m128, a: vec.m128) -> vec.m128` + +Note: + +> - Masks only support bitwise operations. ### UNOP mask undefined @@ -452,7 +469,6 @@ Selects active elements from `a` and inactive elements from `b`. 
- `vec.v32.BINOP(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.BINOP(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.BINOP(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.BINOP(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.BINOP(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.BINOP(a: vec.m32, b: vec.m32) -> vec.m32` @@ -472,7 +488,6 @@ Inactive elements are set to zero. - `vec.v32.BINOP_mz(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.BINOP_mz(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.BINOP_mz(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.BINOP_mz(m: vec.m8, a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.BINOP_mz(m: vec.m16, a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.BINOP_mz(m: vec.m32, a: vec.m32, b: vec.m32) -> vec.m32` @@ -492,7 +507,6 @@ Inactive elements are forwarded from `a`. - `vec.v32.BINOP_mm(m: vec.m32, a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.BINOP_mm(m: vec.m64, a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.BINOP_mm(m: vec.m128, a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.BINOP_mm(m: vec.m8, a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.BINOP_mm(m: vec.m16, a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.BINOP_mm(m: vec.m32, a: vec.m32, b: vec.m32) -> vec.m32` @@ -723,7 +737,6 @@ Extracts even elements from both input and interleaves them. - `vec.v32.interleave_even(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_even(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_even(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.interleave_even(a: vec.m8, b: mec.m8) -> vec.m8` - `vec.m16.interleave_even(a: vec.m16, b: mec.m16) -> vec.m16` - `vec.m32.interleave_even(a: vec.m32, b: mec.m32) -> vec.m32` @@ -753,7 +766,6 @@ Extracts odd elements from both input and interleaves them. - `vec.v32.interleave_odd(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_odd(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_odd(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.interleave_odd(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.interleave_odd(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.interleave_odd(a: vec.m32, b: vec.m32) -> vec.m32` @@ -783,7 +795,6 @@ Extracts even elements from both input and concatenate them. - `vec.v32.concat_even(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.concat_even(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.concat_even(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.concat_even(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.concat_even(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.concat_even(a: vec.m32, b: vec.m32) -> vec.m32` @@ -816,7 +827,6 @@ Extracts odd elements from both input and concatenate them. - `vec.v32.concat_odd(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.concat_odd(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.concat_odd(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.concat_odd(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.concat_odd(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.concat_odd(a: vec.m32, b: vec.m32) -> vec.m32` @@ -848,7 +858,6 @@ Extracts the lower half of both input and interleaves their elements. 
- `vec.v32.interleave_low(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_low(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_low(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.interleave_low(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.interleave_low(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.interleave_low(a: vec.m32, b: vec.m32) -> vec.m32` @@ -878,7 +887,6 @@ Extracts the higher half of both input and interleaves their elements. - `vec.v32.interleave_high(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_high(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_high(a: vec.v128, b: vec.v128) -> vec.v128` - - `vec.m8.interleave_high(a: vec.m8, b: vec.m8) -> vec.m8` - `vec.m16.interleave_high(a: vec.m16, b: vec.m16) -> vec.m16` - `vec.m32.interleave_high(a: vec.m32, b: vec.m32) -> vec.m32` @@ -939,7 +947,6 @@ def mask.S.narrow(a, b): - `vec.i32.widen_low_i16_s(a: vec.v8) -> vec.v32` - `vec.i64.widen_low_i32_u(a: vec.v8) -> vec.v64` - `vec.i64.widen_low_i32_s(a: vec.v8) -> vec.v64` - - `vec.i16.widen_high_i8_u(a: vec.v8) -> vec.v16` - `vec.i16.widen_high_i8_s(a: vec.v8) -> vec.v16` - `vec.i32.widen_high_i16_u(a: vec.v8) -> vec.v32` @@ -955,7 +962,6 @@ Returns a `mask` for a wider type with the same active lanes as the lower/higher - `vec.m32.widen_low_m16(m: vec.m16) -> vec.m32` - `vec.m64.widen_low_m32(m: vec.m32) -> vec.m64` - `vec.m128.widen_low_m64(m: vec.m64) -> vec.m128` - - `vec.m16.widen_high_m8(m: vec.m8) -> vec.m16` - `vec.m32.widen_high_m16(m: vec.m16) -> vec.m32` - `vec.m64.widen_high_m32(m: vec.m32) -> vec.m64` From 427d8f5c02fa5531430352f9bc5c139bb08c265d Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Sat, 27 Feb 2021 17:32:19 +0100 Subject: [PATCH 3/6] Added LUT2, improved LUT1 description --- proposals/flexible-vectors/README.md | 56 +++++++++++++++++++++++++++- 1 file changed, 54 insertions(+), 2 deletions(-) diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 49f6f74..ff44a05 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -583,6 +583,7 @@ For each lane, the mask lane is set to `1` if the element is negative (floats ar ### LUT1 zero +Gets elements from `a` located at the index specified by `idx`. Elements whose index is out of bounds are set to `0`. - `vec.v8.lut1_z(idx: vec.v8, a: vec.v8) -> vec.v8` @@ -601,9 +602,11 @@ def vec.S.lut1_z(idx, a): result[i] = 0 return result ``` + ### LUT1 merge -Elements whose index is out of bounds are taken from fallback. +Gets elements from `a` located at the index specified by `idx`. +Elements whose index is out of bounds are taken from `fallback`. - `vec.v8.lut1_m(idx: vec.v8, a: vec.v8, fallback: vec.v8) -> vec.v8` - `vec.v16.lut1_m(idx: vec.v16, a: vec.v16, fallback: vec.v16) -> vec.v16` @@ -621,6 +624,55 @@ def vec.S.lut1_m(idx, a, fallback): return result ``` +### LUT2 zero + +Gets elements from `a` and `b` located at the index specified by `idx`. +If the index is lower than length, elements are taken from `a`, if index is between length and 2 * length, elements are taken from `b`. +Elements whose index is out of bounds are set to `0`. 
+
+- `vec.v8.lut2_z(idx: vec.v8, a: vec.v8, b: vec.v8) -> vec.v8`
+- `vec.v16.lut2_z(idx: vec.v16, a: vec.v16, b: vec.v16) -> vec.v16`
+- `vec.v32.lut2_z(idx: vec.v32, a: vec.v32, b: vec.v32) -> vec.v32`
+- `vec.v64.lut2_z(idx: vec.v64, a: vec.v64, b: vec.v64) -> vec.v64`
+- `vec.v128.lut2_z(idx: vec.v128, a: vec.v128, b: vec.v128) -> vec.v128`
+
+```python
+def vec.S.lut2_z(idx, a, b):
+    result = vec.S.New()
+    for i in range(vec.S.length):
+        if idx[i] < vec.S.length:
+            result[i] = a[idx[i]]
+        elif idx[i] < 2*vec.S.length:
+            result[i] = b[idx[i] - vec.S.length]
+        else:
+            result[i] = 0
+    return result
+```
+### LUT2 merge
+
+Gets elements from `a` and `b` located at the index specified by `idx`.
+If the index is lower than length, elements are taken from `a`; if index is between length and 2 * length, elements are taken from `b`.
+Elements whose index is out of bounds are taken from `fallback`.
+
+- `vec.v8.lut2_m(idx: vec.v8, a: vec.v8, b: vec.v8, fallback: vec.v8) -> vec.v8`
+- `vec.v16.lut2_m(idx: vec.v16, a: vec.v16, b: vec.v16, fallback: vec.v16) -> vec.v16`
+- `vec.v32.lut2_m(idx: vec.v32, a: vec.v32, b: vec.v32, fallback: vec.v32) -> vec.v32`
+- `vec.v64.lut2_m(idx: vec.v64, a: vec.v64, b: vec.v64, fallback: vec.v64) -> vec.v64`
+- `vec.v128.lut2_m(idx: vec.v128, a: vec.v128, b: vec.v128, fallback: vec.v128) -> vec.v128`
+
+```python
+def vec.S.lut2_m(idx, a, b, fallback):
+    result = vec.S.New()
+    for i in range(vec.S.length):
+        if idx[i] < vec.S.length:
+            result[i] = a[idx[i]]
+        elif idx[i] < 2*vec.S.length:
+            result[i] = b[idx[i] - vec.S.length]
+        else:
+            result[i] = fallback[i]
+    return result
+```
+
 ### V128 shuffle
 
 Applies shuffle to each v128 of the vector.
@@ -1088,7 +1140,7 @@ def vec.S.cast_T(m):
 
 ### Mask to vec
 
-Active lanes are to `-1` (all one bits), and inactive lanes are set to 0.
+Active lanes are set to `-1` (all one bits), and inactive lanes are set to `0`.
 
 - `vec.v8.convert_m8(m: vec.m8) -> vec.v8`
 - `vec.v16.convert_m16(m: vec.m16) -> vec.v16`

From 764d69c3d88be73602118be44e27a0593b6106a8 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Sat, 12 Jun 2021 14:09:54 +0200
Subject: [PATCH 4/6] Fixed typos + fixed shift

---
 proposals/flexible-vectors/README.md | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md
index ff44a05..7fd073b 100644
--- a/proposals/flexible-vectors/README.md
+++ b/proposals/flexible-vectors/README.md
@@ -178,7 +178,7 @@ def vec.S.index_CMP(x, n):
 ### Mask index first
 
 Returns the index of the first active lane.
-If there is no active lane, the index of the last lane is returned.
+If there is no active lane, the length of the vector is returned.
 
 - `vec.m8.index_first(m: vec.m8) -> i32`
 - `vec.m16.index_first(m: vec.m16) -> i32`
@@ -191,13 +191,13 @@ def vec.S.index_first(m):
     for i in range(vec.S.length):
         if m[i]:
             return i
-    return vec.S.length - 1
+    return vec.S.length
 ```
 
 ### Mask index last
 
 Returns the index of the last active lane.
-If there is no active lane, the index of the first lane is returned.
+If there is no active lane, -1 is returned.
 
 - `vec.m8.index_last(m: vec.m8) -> i32`
 - `vec.m16.index_last(m: vec.m16) -> i32`
@@ -207,7 +207,7 @@ If there is no active lane, the index of the first lane is returned.
```python def vec.S.index_last(m): - idx = 0 + idx = -1 for i in range(vec.S.length): if m[i]: idx = i @@ -758,7 +758,7 @@ def vec.S.concat(m, a, b): ### Lane shift Concats the 2 input vector to form a single double-width vector. -Shifts this double-width vector by `n` lane to the left (to LSB). +Shifts this double-width vector by `n` lane to the right (to LSB). Extracts the lower half of the shifted vector. `n` is interpreted modulo the length of the vector. @@ -773,10 +773,10 @@ Extracts the lower half of the shifted vector. def vec.S.lane_shift(a, b, n): result = vec.S.New() n = n % vec.S.length - for i in range(0, n): + for i in range(0, vec.S.length - n): result[i] = a[i + n] - for i in range(n, vec.S.length): - result[i] = b[i - n] + for i in range(vec.S.length - n, vec.S.length): + result[i] = b[i - (vec.S.length - n)] return result ``` @@ -789,11 +789,11 @@ Extracts even elements from both input and interleaves them. - `vec.v32.interleave_even(a: vec.v32, b: vec.v32) -> vec.v32` - `vec.v64.interleave_even(a: vec.v64, b: vec.v64) -> vec.v64` - `vec.v128.interleave_even(a: vec.v128, b: vec.v128) -> vec.v128` -- `vec.m8.interleave_even(a: vec.m8, b: mec.m8) -> vec.m8` -- `vec.m16.interleave_even(a: vec.m16, b: mec.m16) -> vec.m16` -- `vec.m32.interleave_even(a: vec.m32, b: mec.m32) -> vec.m32` -- `vec.m64.interleave_even(a: vec.m64, b: mec.m64) -> vec.m64` -- `vec.m128.interleave_even(a: vec.m128, b: mec.m128) -> vec.m128` +- `vec.m8.interleave_even(a: vec.m8, b: vec.m8) -> vec.m8` +- `vec.m16.interleave_even(a: vec.m16, b: vec.m16) -> vec.m16` +- `vec.m32.interleave_even(a: vec.m32, b: vec.m32) -> vec.m32` +- `vec.m64.interleave_even(a: vec.m64, b: vec.m64) -> vec.m64` +- `vec.m128.interleave_even(a: vec.m128, b: vec.m128) -> vec.m128` ```python From 78d967de5a751949da2f6213d183bce05a43b47d Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Mon, 14 Jun 2021 19:27:18 +0200 Subject: [PATCH 5/6] Fixed splat --- proposals/flexible-vectors/README.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 7fd073b..76e0434 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -319,16 +319,15 @@ Inactive elements are not stored ### Splat scalar -`idx` is interpreted modulo the length of the vector. - -- `vec.i8.splat(v: vec.v8, x: i32) -> vec.v8` -- `vec.i16.splat(v: vec.v16, x: i32) -> vec.v16` -- `vec.i32.splat(v: vec.v32, x: i32) -> vec.v32` -- `vec.f32.splat(v: vec.v32, x: f32) -> vec.32` -- `vec.i64.splat(v: vec.v64, x: i64) -> vec.v64` -- `vec.f64.splat(v: vec.v64, x: f64) -> vec.v64` -- `vec.v128.splat(v: vec.v128, x: v128) -> vec.v128` - +For `vec.i8.splat` and `vec.i16.splat`, `x` is truncated to 8 and 16 bits respectively. 
+ +- `vec.i8.splat(x: i32) -> vec.v8` +- `vec.i16.splat(x: i32) -> vec.v16` +- `vec.i32.splat(x: i32) -> vec.v32` +- `vec.f32.splat(x: f32) -> vec.v32` +- `vec.i64.splat(x: i64) -> vec.v64` +- `vec.f64.splat(x: f64) -> vec.v64` +- `vec.v128.splat(x: v128) -> vec.v128` ### Extract lane From 61dcf1238f92840e974f86ba73a7569b29c04365 Mon Sep 17 00:00:00 2001 From: Florian Lemaitre Date: Mon, 14 Jun 2021 19:43:50 +0200 Subject: [PATCH 6/6] Clarified addition and CMP in vec.mX.index_CMP --- proposals/flexible-vectors/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/proposals/flexible-vectors/README.md b/proposals/flexible-vectors/README.md index 76e0434..b665f71 100644 --- a/proposals/flexible-vectors/README.md +++ b/proposals/flexible-vectors/README.md @@ -158,6 +158,8 @@ def mask.S.none(): Returns a mask whose active lanes satisfy `(x + laneIdx) CMP n` CMP one of the following: `eq`, `ne`, `lt`, `le`, `gt`, `ge` +The addition and the comparison are done signed, with infinite precision. + - `vec.m8.index_CMP(x: i32, n: i32) -> vec.m8` - `vec.m16.index_CMP(x: i32, n: i32) -> vec.m16` - `vec.m32.index_CMP(x: i32, n: i32) -> vec.m32`
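
To illustrate the clarified `index_CMP` semantics, here is a rough sketch of one way the rule can be read. The helper name `index_cmp_example`, the use of a Python list of booleans to stand in for the mask, the use of Python's unbounded integers to model the signed, infinite-precision arithmetic, and the constants in the final check are illustrative assumptions, not part of the proposal.

```python
# Lane i of the mask is active when (x + i) CMP n, with the addition and the
# comparison evaluated as signed integers of infinite precision (Python ints
# never wrap, so they model this directly).
def index_cmp_example(x, n, length, cmp):
    ops = {
        'eq': lambda a, b: a == b,
        'ne': lambda a, b: a != b,
        'lt': lambda a, b: a < b,
        'le': lambda a, b: a <= b,
        'gt': lambda a, b: a > b,
        'ge': lambda a, b: a >= b,
    }
    return [ops[cmp](x + i, n) for i in range(length)]

# With wrapping 32-bit arithmetic, x = 2147483647 (INT32_MAX) plus a positive
# lane index would overflow to a negative value and satisfy `lt 0`; under the
# infinite-precision rule no lane is active.
assert not any(index_cmp_example(2147483647, 0, 8, 'lt'))
```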