Similar to `time_lower` and (soon) `batch_lower`, another operation, `num_lower`, could be added for the case where the coords are `Float32`s, to make sure that all numbers in a given kernel stay `Float32`. This is needed for Metal.jl, which does not support 64-bit floats. It would also give the best performance for the SIMD and CUDA backends, since the kernel would no longer convert between 32-bit and 64-bit floats: a `Float64` takes twice the space of a `Float32` in a vector register, so mixing them means fewer values per register.

Some work will also be needed to handle constants in kernels that are `Float64`.
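A minimal sketch of what such a pass might do, written in Python over a toy expression tree rather than in Julia against the package's real kernel representation. The `Const`/`Var`/`BinOp` node types and this `num_lower` function are hypothetical illustrations, not the package's API. The promotion rule mirrors Julia's (combining `Float32` with `Float64` yields `Float64`), which is why a single `Float64` literal is enough to pull a whole kernel expression out of 32-bit arithmetic.

```python
import struct
from dataclasses import dataclass
from typing import Union


def round_to_f32(x: float) -> float:
    # Round a 64-bit Python float to the nearest representable 32-bit float.
    # (Raises OverflowError for values outside the Float32 range.)
    return struct.unpack("<f", struct.pack("<f", x))[0]


@dataclass(frozen=True)
class Const:
    value: float
    dtype: str  # "Float32" or "Float64"


@dataclass(frozen=True)
class Var:
    name: str
    dtype: str


@dataclass(frozen=True)
class BinOp:
    op: str  # e.g. "+", "*"
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Const, Var, BinOp]


def result_dtype(expr: Expr) -> str:
    # Julia-style promotion: mixing Float32 with Float64 yields Float64.
    if isinstance(expr, (Const, Var)):
        return expr.dtype
    lhs, rhs = result_dtype(expr.lhs), result_dtype(expr.rhs)
    return "Float64" if "Float64" in (lhs, rhs) else "Float32"


def num_lower(expr: Expr) -> Expr:
    # Hypothetical pass: rewrite every Float64 constant as a Float32 constant
    # so that the whole expression stays in 32-bit arithmetic.
    if isinstance(expr, Const):
        if expr.dtype == "Float64":
            return Const(round_to_f32(expr.value), "Float32")
        return expr
    if isinstance(expr, Var):
        return expr
    if isinstance(expr, BinOp):
        return BinOp(expr.op, num_lower(expr.lhs), num_lower(expr.rhs))
    raise TypeError(f"unknown node: {expr!r}")


# Illustrative kernel body u_i * 0.5 + dx, with Float32 coords/arrays
# and a Float64 literal (variable names are made up for this example).
kernel = BinOp(
    "+",
    BinOp("*", Var("u_i", "Float32"), Const(0.5, "Float64")),
    Var("dx", "Float32"),
)
print(result_dtype(kernel))             # Float64
print(result_dtype(num_lower(kernel)))  # Float32
```

Rounding each constant to the nearest `Float32` (rather than rejecting the kernel) matches the goal of keeping everything 32-bit, at the cost of the constant's extra precision; a real pass would also need to decide what to do with constants that overflow `Float32`.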