Issue automatically imported from old repo: EmbarkStudios/rust-gpu#249
Old labels: t: design,mcp: rfc needed
Originally created by charles-r-earp on 2020-11-18T05:30:29Z
Summary
This proposal attempts to address #232, #180, and #8. StorageBuffers, aka Buffer Blocks, are the primary inputs / outputs of compute shaders, though they can also be used in other stages. Iterators are one of Rust's highlights, and I propose emulating them in GPU code.
Example: Scaled Add aka Saxpy
// Rust CPU
fn scaled_add(x: &[f32], y: &mut [f32], alpha: f32) {
    for (x, y) in x.iter().copied().zip(y.iter_mut()) {
        *y += alpha * x;
    }
}
// OpenCL
__kernel void scaled_add(__global const float* x, __global float* y, float alpha, uint n) {
    uint gid = get_global_id(0);
    if (gid < n) {
        y[gid] += alpha * x[gid];
    }
}
// GLSL
#version 450
layout(local_size_x = 64) in; // compute shaders must declare a local size
layout(set=0, binding=0) buffer Input {
    float x[];
};
layout(set=0, binding=1) buffer Output {
    float y[];
};
layout(push_constant) uniform PushConsts {
    float alpha;
    uint n;
};
void main() {
    uint gid = gl_GlobalInvocationID.x;
    if (gid < n) {
        y[gid] += alpha * x[gid];
    }
}
Saxpy is trivially parallel: it can be separated into n independent operations. Plenty of similar operations are common in CUDA / OpenCL code and require no synchronization / barriers. In fact, this is probably the most common case.
Possible Naive Implementation in rust-gpu
#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: Buffer<[f32]>, mut y: BufferMut<[f32]>, alpha: f32, n: u32) {
    let gid = spirv_std::global_x();
    if gid < n {
        unsafe {
            *y.get_mut_unchecked(gid) += alpha * *x.get_unchecked(gid);
        }
    }
}
This is the most straightforward translation of the above into rust-gpu code. If we neglect concerns about aliasing with other shaders, the only potential failure mode is that the user-provided n exceeds the bounds of either x or y. By writing unsafe, the programmer takes responsibility for ensuring there is no aliasing within the shader.
The rayon crate allows for easy parallel processing on CPUs via its ParallelIterator trait. It looks like this:
// rayon
fn scaled_add(x: &[f32], y: &mut [f32], alpha: f32) {
    use rayon::prelude::*;
    y.par_iter_mut()
        .zip(x.par_iter().copied())
        .for_each(|(y, x)| *y += alpha * x);
}
Rayon divides the work into n parts, where n is the number of workers. It knows that partitioning a slice, even a mutable one, is safe. I propose a similar API for rust-gpu.
Proposal
Buffers: Runtime Arrays of T: Copy
- GlobalBuffer: StorageBuffer
- GlobalBufferMut: StorageBuffer, mutable
- GroupBufferMut: Workgroup, mutable
Arrays: Like buffers, but with a const size, expressed either via const generics, i.e. <T = f32, const N: usize = 1024>, or via an array type, i.e. <T = [f32; 1024]> (see the sketch after this list)
- GlobalArray: StorageBuffer
- GlobalArrayMut: StorageBuffer, mutable
- GroupArrayMut: Workgroup, mutable
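To make the const-size idea concrete, here is a minimal CPU-side sketch of a GlobalArray with const generics. The names and the plain-array backing are placeholders for illustration; a real version would wrap a StorageBuffer-decorated block:

pub struct GlobalArray<T: Copy, const N: usize> {
    data: [T; N], // stand-in for a StorageBuffer-backed block
}

impl<T: Copy, const N: usize> GlobalArray<T, N> {
    // The length is a compile-time constant, so the runtime can validate
    // dispatch parameters against it before launch.
    pub const fn len(&self) -> usize {
        N
    }

    pub fn get(&self, index: usize) -> Option<T> {
        self.data.get(index).copied()
    }
}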
Matrices!?? Would probably want const generics, i.e. Matrix2<T, D1 = 100, D2 = 64>
- An abstraction over Array, maybe even Buffer
- Useful for iterating over dimensions, see ndarray
- If we express the access as an iterator, then we can prevent aliases or perform appropriate synchronization
- Long term idea
Blocks: A single T struct item, T: Copy
- GlobalBlock: StorageBuffer
- load() -> T
- Potentially borrow() / as_ref() -> &T
If necessary, mutable Block fields could be accomplished via some sort of Mutex or ArcCell equivalent, but that would require relaxing the Copy requirement.
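As a rough CPU-side mock of the Block API, assuming T: Copy (GlobalBlock and its methods are the names proposed above; the plain field stands in for a Block-decorated SPIR-V struct):

pub struct GlobalBlock<T: Copy> {
    value: T, // stand-in for a Block-decorated StorageBuffer struct
}

impl<T: Copy> GlobalBlock<T> {
    // Copies the whole item out of the buffer.
    pub fn load(&self) -> T {
        self.value
    }

    // The potential borrow() / as_ref() variant.
    pub fn as_ref(&self) -> &T {
        &self.value
    }
}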
Arrays and Blocks are safer because the runtime can validate the inputs prior to launch. Start with Blocks, then Arrays, then Buffers. Note that all of these require a special "Block"-decorated struct wrapper, at least per the SPIR-V specification. Slices are also tricky / not allowed in exports because of Rust's unstable ABI.
Iterators
GlobalIterator trait
- Like rayon::iter::ParallelIterator
- Divides the work into global size pieces, and each invocation works on one piece
- fn for_each(self, f: impl Fn(Self::Item))
- fn enumerate()
- fn enumerate_xyz(): same as enumerate, but provides the coordinates rather than the index
- fn zip() for iterating over multiple buffers
- others from Iterator if possible
- some sort of hidden drive / next function
- unsafe to implement, may even be sealed
GroupIterator trait
- Like GlobalIterator, but iterates over local size
*Iters are like slice::Iter: they iterate over a borrow.
IntoGlobalIterator / IntoGroupIterator
*IntoIters consume their container but still yield borrowed values. This allows them to mutate the output while removing the container from the shader's scope.
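A rough sketch of the GlobalIterator surface described above, with hypothetical combinator types (the hidden drive / next machinery is deliberately left out behind the required for_each):

pub struct Enumerate<I>(I);
pub struct Zip<A, B>(A, B);

// Would be unsafe to implement, or sealed, in a real design.
pub trait GlobalIterator: Sized {
    type Item;

    // Each invocation runs `f` on its piece of the work.
    fn for_each(self, f: impl Fn(Self::Item));

    // Pairs each item with its global index; enumerate_xyz would
    // yield coordinates instead.
    fn enumerate(self) -> Enumerate<Self> {
        Enumerate(self)
    }

    // Iterates over multiple buffers in lockstep.
    fn zip<B: GlobalIterator>(self, other: B) -> Zip<Self, B> {
        Zip(self, other)
    }
}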
Unsafe
Buffers and Arrays may offer unsafe access to their internal slice. It may not always be possible to prove that a program is safe, and it will take time to implement enough safe wrappers to fit every need.
Likewise, access to invocation-specific values, like the global xyz, must be unsafe, or even disallowed entirely, to ensure that the safe GlobalIterator construct is in fact safe. This means that the shader cannot read the u32 value, but could, say, manipulate it mathematically and index a Buffer / slice / pointer with it (this would require unsafe). The key thing is that it can't be read and it can't be used in control flow outside of the inner closure of for_each. This could be implemented as a wrapper, i.e. Idx(u32), which implements the appropriate traits.
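A sketch of that wrapper, assuming safe code can offset the index arithmetically but only unsafe code can extract the raw value:

#[derive(Copy, Clone)]
pub struct Idx(u32); // field kept private so safe code cannot read it

impl core::ops::Add<u32> for Idx {
    type Output = Idx;
    fn add(self, rhs: u32) -> Idx {
        Idx(self.0 + rhs)
    }
}

impl Idx {
    // Escape hatch for indexing a raw Buffer / slice / pointer.
    pub unsafe fn to_raw(self) -> u32 {
        self.0
    }
}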
Barriers
Barriers are emitted by *Iter and *IterMut iterators as required. IntoIters should be able to omit barriers: since they consume their inputs, those inputs cannot be read or written again within the shader.
Globally Const vs Per Invocation
For trivial cases like axpy, there is no need for the shader to access non-const memory (outside of the closure passed to for_each). The closure cannot mutate its environment, or even borrow it, and nothing is returned from it. This prevents non-static control flow. Only the push constants would be copied into the closure (via move).
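A small CPU-side illustration of that capture rule, with alpha standing in for a push constant; it is copied into the closure via move, and nothing is borrowed from the environment:

fn illustrate_capture(alpha: f32) {
    // `move` copies `alpha` (f32 is Copy); no borrows escape into the closure.
    let f = move |(y, x): (&mut f32, f32)| *y += alpha * x;

    let mut y = 1.0;
    f((&mut y, 2.0));
    assert_eq!(y, 1.0 + alpha * 2.0);
}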
Single Invocation operations
In some cases, it may be necessary to have only one invocation perform some work. This could potentially be handled with a special SingleGlobalBufferIntoIter or the like.
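For instance, a hypothetical wrapper along those lines might run its closure only in the invocation with global id zero:

pub struct SingleGlobalBufferIntoIter<T>(T);

impl<T> SingleGlobalBufferIntoIter<T> {
    // `global_id` would come from the hidden invocation id in practice.
    pub fn for_each(self, global_id: u32, f: impl FnOnce(T)) {
        if global_id == 0 {
            f(self.0);
        }
    }
}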
Putting it all together
#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: GlobalArray<f32, 100>, y: GlobalArrayMut<f32, 100>, alpha: f32) {
    y.into_global_iter_mut()
        .zip(x)
        .for_each(|(y, x)| *y += alpha * x);
}