### Describe the feature request
Thank you for a great library! We use ONNX Runtime as part of a larger machine learning workflow, of which multiple other parts run on the GPU as well. In order to maximize GPU utilization, we reuse the scratch space in GPU memory across these different steps. Would it be possible to augment the CUDA/TensorRT EP APIs with `void* ptr, size_t size` (or `std::span<std::byte>`) arguments so that the user could provide the GPU memory for the EP to use?
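For concreteness, here is a minimal sketch of how such an option could be surfaced through the existing CUDA provider options. The `external_scratch_ptr`/`external_scratch_size` fields are hypothetical names for this proposal and do not exist in ONNX Runtime today; the surrounding calls are the current API:

```cpp
#include <cstddef>
#include <span>

#include <onnxruntime_cxx_api.h>

// Hypothetical: hand the CUDA EP a caller-owned GPU buffer to use as
// scratch space instead of letting it allocate its own arena.
void ConfigureExternalScratch(Ort::SessionOptions& options,
                              std::span<std::byte> gpu_scratch) {
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(Ort::GetApi().CreateCUDAProviderOptions(&cuda_options));

  // Proposed additions (illustrative names, not part of the current API):
  // cuda_options->external_scratch_ptr  = gpu_scratch.data();
  // cuda_options->external_scratch_size = gpu_scratch.size();

  options.AppendExecutionProvider_CUDA_V2(*cuda_options);
  Ort::GetApi().ReleaseCUDAProviderOptions(cuda_options);
}
```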
### Describe scenario use case
Let's say we have the following pipeline:

[A] --> [some ONNX Runtime workflow B] --> [C]

All of these stages run on the GPU. `A` and `C` currently use the same GPU memory for their computation, whereas the ONNX Runtime stage `B` allocates its own GPU memory that it does not share. By reusing GPU memory across all three stages, we can process more data in a single stage, which improves overall data throughput; a sketch of the intended pattern follows below.
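A rough sketch of the sharing pattern we are after, assuming the hypothetical external-scratch API above. `RunStageA`/`RunStageC` are placeholders for our own kernels, and the ONNX Runtime session would draw its scratch memory from the same buffer:

```cpp
#include <cstddef>
#include <span>

#include <cuda_runtime.h>
#include <onnxruntime_cxx_api.h>

// Placeholders for our own GPU stages [A] and [C].
void RunStageA(std::span<std::byte> scratch);
void RunStageC(std::span<std::byte> scratch);

void RunPipeline(Ort::Session& session, size_t scratch_bytes) {
  // One buffer, reused by every stage of the pipeline.
  void* ptr = nullptr;
  cudaMalloc(&ptr, scratch_bytes);
  std::span<std::byte> scratch{static_cast<std::byte*>(ptr), scratch_bytes};

  RunStageA(scratch);  // [A] computes in the shared buffer.
  // [B] would run inference with its scratch space drawn from the same
  // buffer rather than from an EP-private arena:
  // session.Run(...);
  RunStageC(scratch);  // [C] computes in the shared buffer.

  cudaFree(ptr);
}
```

With this in place, the buffer only needs to be sized once for the largest stage, instead of the EP holding a second arena that stays idle while `A` and `C` run.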