Issue automatically imported from old repo: EmbarkStudios/rust-gpu#249
Old labels: t: design,mcp: rfc needed
Originally created by charles-r-earp on 2020-11-18T05:30:29Z
Summary
This proposal attempts to address #232, #180, and #8. StorageBuffers, aka Buffer Blocks, are the primary inputs / outputs of compute shaders, though they can also be used in other stages. Iterators are one of Rust's highlights, and I propose emulating them in GPU code.
Example: Scaled Add aka Saxpy
// Rust CPU
fn scaled_add(x: &[f32], y: &mut [f32], alpha: f32) {
    for (x, y) in x.iter().copied().zip(y.iter_mut()) {
        *y += alpha * x;
    }
}
// OpenCL
__kernel void scaled_add(__global const float* x, __global float* y, float alpha, uint n) {
    uint gid = get_global_id(0);
    if (gid < n) {
        y[gid] += alpha * x[gid];
    }
}
// GLSL
#version 450
layout(local_size_x = 64) in; // compute shaders must declare a local size
layout(set=0, binding=0) buffer Input {
    float x[];
};
layout(set=0, binding=1) buffer Output {
    float y[];
};
layout(push_constant) uniform PushConsts {
    float alpha;
    uint n;
};
void main() {
    uint gid = gl_GlobalInvocationID.x;
    if (gid < n) {
        y[gid] += alpha * x[gid];
    }
}
Saxpy is trivially parallel: it can be separated into n independent operations. Plenty of similar operations are common in CUDA / OpenCL code and require no synchronization / barriers. In fact, this is probably the most common case.
Possible Naive Implementation in rust-gpu
#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: Buffer<[f32]>, mut y: BufferMut<[f32]>, alpha: f32, n: u32) {
    let gid = spirv_std::global_x();
    if gid < n {
        unsafe {
            *y.get_mut_unchecked(gid) += alpha * *x.get_unchecked(gid);
        }
    }
}
This is the most straightforward translation of the above into rust-gpu code. If we neglect concerns about aliasing with other shaders, the only potential failure mode is that the user-provided n exceeds the bounds of either x or y. By writing unsafe, the programmer takes responsibility for ensuring there is no aliasing within the shader.
The rayon crate allows for easy parallel processing on CPUs via its ParallelIterator trait. It looks like this:
// rayon
fn scaled_add(x: &[f32], y: &mut [f32], alpha: f32) {
    use rayon::prelude::*;
    y.par_iter_mut()
        .zip(x.par_iter().copied())
        .for_each(|(y, x)| *y += alpha * x);
}
Rayon divides the work into n parts, where n is the number of workers. It knows that partitioning a slice, even a mutable one, is safe. I propose a similar API for rust-gpu.
Proposal
Buffers: Runtime Arrays of T: Copy
- GlobalBuffer: StorageBuffer
- GlobalBufferMut: StorageBuffer, mutable
- GroupBufferMut: Workgroup, mutable
Arrays: Like buffers, but with a const size, expressed either via const generics, i.e. <T = f32, const N: usize = 1024>, or via an array type, i.e. <T = [f32; 1024]> (see the sketch after this list)
- GlobalArray: StorageBuffer
- GlobalArrayMut: StorageBuffer, mutable
- GroupArrayMut: Workgroup, mutable
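To make the const-size idea concrete, here is a minimal CPU-side sketch of a GlobalArray with const generics. The names and the plain-array backing are placeholders for illustration; a real version would wrap a StorageBuffer-decorated block:

pub struct GlobalArray<T: Copy, const N: usize> {
    data: [T; N], // stand-in for a StorageBuffer-backed block
}

impl<T: Copy, const N: usize> GlobalArray<T, N> {
    // The length is a compile-time constant, so the runtime can validate
    // dispatch parameters against it before launch.
    pub const fn len(&self) -> usize {
        N
    }

    pub fn get(&self, index: usize) -> Option<T> {
        self.data.get(index).copied()
    }
}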
Matrices!?? Would probably want const generics, i.e. Matrix2<T, D1 = 100, D2 = 64>
- An abstraction over Array, maybe even Buffer
- Useful for iterating over dimensions, see ndarray
- If we express the access as an iterator, then we can prevent aliases or perform appropriate synchronization
- Long term idea
Blocks: A single T struct item, T: Copy
- GlobalBlock: StorageBuffer
- load() -> T
- Potentially borrow() / as_ref() -> &T
If necessary, mutable Block fields could be accomplished via some sort of Mutex or ArcCell equivalent, but that would require relaxing the Copy requirement.
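As a rough CPU-side mock of the Block API, assuming T: Copy (GlobalBlock and its methods are the names proposed above; the plain field stands in for a Block-decorated SPIR-V struct):

pub struct GlobalBlock<T: Copy> {
    value: T, // stand-in for a Block-decorated StorageBuffer struct
}

impl<T: Copy> GlobalBlock<T> {
    // Copies the whole item out of the buffer.
    pub fn load(&self) -> T {
        self.value
    }

    // The potential borrow() / as_ref() variant.
    pub fn as_ref(&self) -> &T {
        &self.value
    }
}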
Arrays and Blocks are safer because the runtime can validate the inputs prior to launch. Start with Blocks, then Arrays, then Buffers. Note that all of these require a special "Block"-decorated struct wrapper, at least per the SPIR-V specification. Slices are also tricky / not allowed in exports because of Rust's unstable ABI.
Iterators
GlobalIterator trait
- Like rayon::iter::ParallelIterator
- Divides the work into global size pieces, and each invocation works on one piece
- fn for_each(self, f: impl Fn(Self::Item))
- fn enumerate()
- fn enumerate_xyz(): same as enumerate, but provides the coordinates rather than the index
- fn zip() for iterating over multiple buffers
- others from Iterator if possible
- some sort of hidden drive / next function
- unsafe to implement, may even be sealed
GroupIterator trait
- Like GlobalIterator, but iterates over local size
*Iters are like slice::Iter: they iterate over a borrow.
IntoGlobalIterator / IntoGroupIterator
*IntoIters consume their container but still yield borrowed values. This allows them to mutate the output while removing the container from the shader's scope.
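A rough sketch of the GlobalIterator surface described above, with hypothetical combinator types (the hidden drive / next machinery is deliberately left out behind the required for_each):

pub struct Enumerate<I>(I);
pub struct Zip<A, B>(A, B);

// Would be unsafe to implement, or sealed, in a real design.
pub trait GlobalIterator: Sized {
    type Item;

    // Each invocation runs `f` on its piece of the work.
    fn for_each(self, f: impl Fn(Self::Item));

    // Pairs each item with its global index; enumerate_xyz would
    // yield coordinates instead.
    fn enumerate(self) -> Enumerate<Self> {
        Enumerate(self)
    }

    // Iterates over multiple buffers in lockstep.
    fn zip<B: GlobalIterator>(self, other: B) -> Zip<Self, B> {
        Zip(self, other)
    }
}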
Unsafe
Buffers and Arrays may offer unsafe access to their internal slice. It may not always be possible to prove that a program is safe, and it will take time to implement enough safe wrappers to fit every need.
Likewise, access to invocation-specific values, like the global xyz, must be unsafe, or even disallowed entirely, to ensure that the safe GlobalIterator construct is in fact safe. This means that the shader cannot read the u32 value, but could, say, manipulate it mathematically and index a Buffer / slice / pointer with it (this would require unsafe). The key thing is that it can't be read and it can't be used in control flow outside of the inner closure of for_each. This could be implemented as a wrapper, i.e. Idx(u32), which implements the appropriate traits.
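A sketch of that wrapper, assuming safe code can offset the index arithmetically but only unsafe code can extract the raw value:

#[derive(Copy, Clone)]
pub struct Idx(u32); // field kept private so safe code cannot read it

impl core::ops::Add<u32> for Idx {
    type Output = Idx;
    fn add(self, rhs: u32) -> Idx {
        Idx(self.0 + rhs)
    }
}

impl Idx {
    // Escape hatch for indexing a raw Buffer / slice / pointer.
    pub unsafe fn to_raw(self) -> u32 {
        self.0
    }
}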
Barriers
Barriers are emitted by *Iter and *IterMut iterators as required. IntoIters should be able to omit barriers: since they consume their inputs, those inputs cannot be read or written again within the shader.
Globally Const vs Per Invocation
For trivial cases like axpy, there is no need for the shader to access non-const memory (outside of the closure passed to for_each). The closure cannot mutate its environment, or even borrow it, and nothing is returned from it. This prevents non-static control flow. Only the push constants would be copied into the closure (via move).
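A small CPU-side illustration of that capture rule, with alpha standing in for a push constant; it is copied into the closure via move, and nothing is borrowed from the environment:

fn illustrate_capture(alpha: f32) {
    // `move` copies `alpha` (f32 is Copy); no borrows escape into the closure.
    let f = move |(y, x): (&mut f32, f32)| *y += alpha * x;

    let mut y = 1.0;
    f((&mut y, 2.0));
    assert_eq!(y, 1.0 + alpha * 2.0);
}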
Single Invocation operations
In some cases, it may be necessary to have only one invocation perform some work. This could potentially be handled with a special SingleGlobalBufferIntoIter or the like.
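For instance, a hypothetical wrapper along those lines might run its closure only in the invocation with global id zero:

pub struct SingleGlobalBufferIntoIter<T>(T);

impl<T> SingleGlobalBufferIntoIter<T> {
    // `global_id` would come from the hidden invocation id in practice.
    pub fn for_each(self, global_id: u32, f: impl FnOnce(T)) {
        if global_id == 0 {
            f(self.0);
        }
    }
}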
Putting it all together
#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: GlobalArray<f32, 100>, y: GlobalArrayMut<f32, 100>, alpha: f32) {
    y.into_global_iter_mut()
        .zip(x)
        .for_each(|(y, x)| *y += alpha * x);
}