[Migrated] API Design: StorageBuffersΒ #138

@rust-gpu-bot

Description

Issue automatically imported from old repo: EmbarkStudios/rust-gpu#249
Old labels: t: design,mcp: rfc needed
Originally created by charles-r-earp on 2020-11-18T05:30:29Z


Summary

This proposal attempts to address #232, #180, and #8. Storage buffers, aka buffer blocks, are the primary inputs and outputs of compute shaders, though they can also be used in other stages. Iterators are one of Rust's highlights, and I propose emulating them in GPU code.

Example: Scaled Add aka Saxpy

// Rust CPU
fn scaled_add(x: &[f32], y: &mut [f32], alpha: f32) {
    for (x, y) in x.iter().copied().zip(y.iter_mut()) {
        *y += alpha * x;
    }
}

// OpenCL
__kernel void scaled_add(__global const float* x, __global float* y, float alpha, uint n) {
    uint gid = get_global_id(0);
    if (gid < n) {
        y[gid] += alpha * x[gid];
    }
}

// GLSL
#version 450

layout(local_size_x = 64) in;

layout(set=0, binding=0) buffer Input {
    float x[];
};

layout(set=0, binding=1) buffer Output {
    float y[];
};

layout(push_constant) uniform PushConsts {
    float alpha;
    uint n;
};

void main() {
    uint gid = gl_GlobalInvocationID.x;
    if (gid < n) {
        y[gid] += alpha * x[gid];
    }
}

Saxpy is trivially parallel, that is, it can be separated into n independent operations. There are plenty of other similar kinds of operations common in CUDA / OpenCL code, which do not require any synchronization / barriers. In fact, this is probably the most common case.

Possible Naive Implementation in rust-gpu

#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: Buffer<[f32]>, mut y: BufferMut<[f32]>, alpha: f32, n: u32) {
    let gid = spirv_std::global_x();
    if gid < n {
        unsafe {
            *y.get_mut_unchecked(gid) += alpha * *x.get_unchecked(gid);
        }
    }
}

This is the most straightforward translation of the above into rust-gpu code. If we neglect concerns about aliasing with other shaders, the only potential failure mode is that the user-provided n is outside the bounds of x or y. By using unsafe, the programmer takes responsibility for ensuring there is no aliasing within the shader.

The rayon crate enables easy parallel processing on the CPU via its ParallelIterator trait. It looks like this:

// rayon
fn scaled_add(x: &[f32], y: &mut [f32], alpha: f32) {
    use rayon::prelude::*;
    y.par_iter_mut()
        .zip(x.par_iter().copied())
        .for_each(|(y, x)| *y += alpha * x);
}

Rayon divides the work into n parts, where n is the number of workers. It knows that partitioning a slice, even a mutable one, is safe. I propose a similar API for rust-gpu.
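Rayon's guarantee can be demonstrated on the CPU: split_at_mut hands out two disjoint &mut halves of a slice, so independent workers can write without aliasing. A minimal sketch (the function name is hypothetical, and the two sequential loops stand in for two concurrent workers):

```rust
// Hypothetical illustration: partitioning a mutable slice is safe because
// `split_at_mut` yields two non-overlapping `&mut` sub-slices.
fn scaled_add_two_parts(x: &[f32], y: &mut [f32], alpha: f32) {
    let mid = y.len() / 2;
    let (x_lo, x_hi) = x.split_at(mid);
    let (y_lo, y_hi) = y.split_at_mut(mid); // disjoint mutable borrows

    // Each loop below could run on its own worker; neither can alias the
    // other's half, so no synchronization is required.
    for (y, x) in y_lo.iter_mut().zip(x_lo.iter().copied()) {
        *y += alpha * x;
    }
    for (y, x) in y_hi.iter_mut().zip(x_hi.iter().copied()) {
        *y += alpha * x;
    }
}
```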

Proposal

Buffers: Runtime Arrays of T: Copy

  • GlobalBuffer: StorageBuffer
  • GlobalBufferMut: StorageBuffer, mutable
  • GroupBufferMut: Workgroup, mutable

Arrays: Like buffers, but with a const size, via either const generics or an array type, i.e. <T = f32, const N: usize = 1024> or <T = [f32; 1024]>

  • GlobalArray: StorageBuffer
  • GlobalArrayMut: StorageBuffer, mutable
  • GroupArrayMut: Workgroup, mutable
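A minimal CPU sketch of what a const-generic GlobalArray could look like (the names and methods here are assumptions for illustration, not the actual rust-gpu API):

```rust
// Hypothetical sketch: on the GPU the data would live in a StorageBuffer
// rather than an inline array, but the const size is the key idea.
pub struct GlobalArray<T: Copy, const N: usize> {
    data: [T; N],
}

impl<T: Copy, const N: usize> GlobalArray<T, N> {
    pub fn new(data: [T; N]) -> Self {
        GlobalArray { data }
    }

    // The size is known at compile time, so the runtime can validate the
    // bound buffer's length before dispatching the shader.
    pub const fn len(&self) -> usize {
        N
    }

    // Unchecked access as an unsafe escape hatch.
    pub unsafe fn get_unchecked(&self, i: usize) -> &T {
        unsafe { self.data.get_unchecked(i) }
    }
}
```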

Matrices!?? Would probably want const generics, i.e. Matrix2<T, D1 = 100, D2 = 64>

  • An abstraction over Array, maybe even Buffer
  • Useful for iterating over dimensions, see ndarray
  • If we express the access as an iterator, then we can prevent aliases or perform appropriate synchronization
  • Long term idea

Blocks: A single T struct item, T: Copy

  • GlobalBlock: StorageBuffer
  • load() -> T
  • Potentially borrow() / as_ref() -> &T

If necessary, mutable Block fields could be accomplished via some sort of Mutex or ArcCell equivalent, but that would require relaxing the Copy requirement.

Arrays and Blocks are safer because the runtime can validate the inputs prior to launching. Start with blocks, then arrays, then buffers. Note that all of these require a special "Block"-decorated struct wrapper, at least per the SPIR-V specification. Slices are also tricky / not allowed in exports because of Rust's unstable ABI.
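A minimal CPU sketch of the Block idea (names are assumptions): a GlobalBlock wraps a single T: Copy and only hands out copies or shared borrows, so no synchronization is needed:

```rust
// Hypothetical sketch of the proposed GlobalBlock API.
#[derive(Clone, Copy)]
pub struct GlobalBlock<T: Copy> {
    value: T,
}

impl<T: Copy> GlobalBlock<T> {
    pub fn new(value: T) -> Self {
        GlobalBlock { value }
    }

    // `load` copies the value out; T: Copy makes this trivially safe.
    pub fn load(&self) -> T {
        self.value
    }

    // The potential borrowed access mentioned above.
    pub fn as_ref(&self) -> &T {
        &self.value
    }
}
```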

Iterators

GlobalIterator trait

  • Like rayon::iter::ParallelIterator
  • Divides the work into global size pieces, and each invocation works on one piece
  • fn for_each(self, f: impl Fn(Self::Item))
  • fn enumerate()
  • fn enumerate_xyz() Same as enumerate, but provides the coordinates rather than the index
  • fn zip() for iterating over multiple buffers
  • others from Iterator if possible
  • some sort of hidden drive / next function
  • unsafe to implement, may even be sealed
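The shape of the trait can be sketched on the CPU (all names are assumptions; the loop in for_each stands in for the GPU dispatch, where each invocation would receive exactly one item):

```rust
// Hypothetical CPU sketch of the proposed GlobalIterator trait.
pub trait GlobalIterator {
    type Item;

    // The hidden drive/next function; unsafe to implement, possibly sealed.
    unsafe fn drive(&mut self) -> Option<Self::Item>;

    // Matches the proposed signature: the closure is Fn, so it cannot
    // mutate its environment.
    fn for_each(mut self, f: impl Fn(Self::Item))
    where
        Self: Sized,
    {
        // CPU stand-in: drain every item sequentially. On the GPU, each
        // invocation would process only the item at its global index.
        while let Some(item) = unsafe { self.drive() } {
            f(item);
        }
    }
}

// A trivial implementor over a Vec, for illustration only.
pub struct VecGlobalIter<T>(std::vec::IntoIter<T>);

impl<T> GlobalIterator for VecGlobalIter<T> {
    type Item = T;
    unsafe fn drive(&mut self) -> Option<T> {
        self.0.next()
    }
}
```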

GroupIterator trait

  • Like GlobalIterator, but iterates over local size

*Iters are like slice::Iter: they iterate over a borrow.

IntoGlobalIterator / IntoGroupIterator

*IntoIters consume their container but still yield borrowed values. This allows them to mutate the output, while the container itself is removed from the shader's scope.

Unsafe

Buffers and Arrays may offer unsafe access to their internal slice. It may not always be possible to prove that a program is safe, and it will take time to implement enough safe wrappers to fit every need.

Likewise, access to invocation-specific values, like the global xyz, must be unsafe, or even disallowed entirely, to ensure that the safe GlobalIterator construct is in fact safe. This means the shader cannot read the u32 value, but could, say, manipulate it mathematically and index a Buffer / slice / pointer with it (this would require unsafe). The key point is that it can't be read and can't be used in control flow outside of the inner closure of for_each. This could be implemented as a wrapper, i.e. Idx(u32), which implements the appropriate traits.
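A hypothetical CPU-side sketch of that opaque wrapper: the inner u32 is private, so code outside the module can do arithmetic on an Idx and index with it, but can never read the raw value or branch on it:

```rust
use std::ops::{Add, Index, Mul};

// The field is private: outside this module, an Idx can be combined
// arithmetically and used for indexing, but its value cannot be observed.
#[derive(Clone, Copy)]
pub struct Idx(u32);

impl Add<u32> for Idx {
    type Output = Idx;
    fn add(self, rhs: u32) -> Idx {
        Idx(self.0 + rhs)
    }
}

impl Mul<u32> for Idx {
    type Output = Idx;
    fn mul(self, rhs: u32) -> Idx {
        Idx(self.0 * rhs)
    }
}

// Indexing a slice is the only safe way to "consume" an Idx.
impl Index<Idx> for [f32] {
    type Output = f32;
    fn index(&self, idx: Idx) -> &f32 {
        &self[idx.0 as usize]
    }
}
```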

Barriers

Barriers are emitted by *Iter and *IterMut iterators as required. IntoIters should be able to omit barriers, since they consume their inputs, which therefore cannot be read or written again within the shader.

Globally Const vs Per Invocation

For trivial cases like axpy, there is no need for the shader to access non-const memory (outside of the closure passed to for_each). The closure cannot mutate its environment, or even borrow it, and nothing is returned from it. This prevents non-static control flow. Only the push constants would be copied into the closure (via move).

Single Invocation operations

In some cases, it may be necessary to have only one invocation perform some work. This could potentially be handled with a special SingleGlobalBufferIntoIter or the like.

Putting it all together

#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: GlobalArray<f32, 100>, y: GlobalArrayMut<f32, 100>, alpha: f32) {
    y.into_global_iter_mut()
        .zip(x)
        .for_each(|(y, x)| *y += alpha * x);
}
