feat: NVIDIA GPU-Accelerated ML-KEM-768 Offload for OpenSSL 3.5 #1
Meghakoranga wants to merge 8 commits into ngkore:main from
Conversation
```
+ if (env && strcmp(env, "1") == 0) {
+
+     // Register callbacks ONCE (Lazy Init)
+     if (!cupqc_callbacks_registered) {
```
issue (blocking): Callback registration has a check-then-act race condition. In a multi-threaded server, two threads can simultaneously observe cupqc_callbacks_registered == 0 and both call cupqc_set_callbacks(). There is no mutex protecting this block. Use pthread_once() or CRYPTO_THREAD_write_lock() to ensure the registration happens exactly once.
```
+
+ // Register callbacks ONCE (Lazy Init)
+ if (!cupqc_callbacks_registered) {
+     cupqc_set_callbacks(wrapper_pause,NULL, wrapper_get_job);
```
issue (blocking): The wake callback is passed as NULL, which causes a deadlock. The runtime uses cb_wake_job to resume a paused OpenSSL async job after GPU completion. With it NULL, any job that calls ASYNC_pause_job() will pause indefinitely and never be woken up. A valid wake callback must be provided here. Review ASYNC_WAIT_CTX_set_wait_fd or the callback mechanism described in the QAT Engine async_job documentation as a reference for how job wakeup should be implemented.
```
      return ossl_ml_kem_encap_seed(ctext, clen, shared_secret, slen,
                                    r, sizeof(r), key);
  }
+ /* int ossl_ml_kem_encap_rand(uint8_t *ctext, size_t clen,
```
nitpick (non-blocking): Dead commented-out code should not be present in a merged patch. Remove the commented-out original function body before requesting review. If you need it for reference during development, keep it in your local branch only.
|
issue (blocking): There are two files implementing the same GPU encapsulation
```
if (count <= 0 || count > MAX_CAPACITY) return;

// A. ALLOCATION
if (g_d_pk == nullptr) {
```
issue (blocking): The lazy GPU buffer allocation has a race condition. If two
CPU threads call this function simultaneously for the first time, both will
observe g_d_pk == nullptr and both will call cudaMalloc on the same global
pointer. This results in a memory leak and undefined behaviour. Protect this
block with a mutex or use pthread_once() for one-time initialisation.
```
for (int i = 0; i < count; i++) {
    if (pk_ptrs[i] && rnd_ptrs[i]) {
        memcpy(g_h_pk + (i * Encaps768::public_key_size), pk_ptrs[i], Encaps768::public_key_size);
        memcpy(g_h_entropy + (i * Encaps768::entropy_size), rnd_ptrs[i], Encaps768::entropy_size);
```
issue (blocking): The runtime copies exactly 32 bytes of randomness per job
into randomness_storage[32], but the shim uses Encaps768::entropy_size as
the stride for the device entropy buffer. If Encaps768::entropy_size != 32
— which is possible depending on the cuPQC SDK version — the gather loop
writes 32 bytes but the device layout expects more, resulting in an
out-of-bounds write into the pinned host buffer. Verify the exact value of
Encaps768::entropy_size against the cuPQC SDK and align the runtime storage
size to match.
```
cudaMalloc(&g_d_pk, MAX_CAPACITY * Encaps768::public_key_size);
cudaMalloc(&g_d_ct, MAX_CAPACITY * Encaps768::ciphertext_size);
cudaMalloc(&g_d_ss, MAX_CAPACITY * Encaps768::shared_secret_size);
cudaMalloc(&g_d_entropy, MAX_CAPACITY * Encaps768::entropy_size);
cudaMalloc(&g_d_workspace, MAX_CAPACITY * Encaps768::workspace_size);
```
issue (blocking): No CUDA API call in this file checks its return value. If
the GPU runs out of memory, if the kernel fails, or if the stream errors,
all failures are silent and the output buffers contain garbage or
uninitialised data. This data then gets returned to OpenSSL as a valid
shared secret or ciphertext. Every CUDA call must check its cudaError_t
return value and propagate failures back to the caller.
```
cudaMalloc(&g_d_ct, MAX_CAPACITY * Encaps768::ciphertext_size);
cudaMalloc(&g_d_ss, MAX_CAPACITY * Encaps768::shared_secret_size);
cudaMalloc(&g_d_entropy, MAX_CAPACITY * Encaps768::entropy_size);
cudaMalloc(&g_d_workspace, MAX_CAPACITY * Encaps768::workspace_size);
```
suggestion (non-blocking): g_d_workspace is allocated via cudaMalloc which
returns uninitialised device memory. Some cuPQC operations may require a
zeroed workspace buffer. Add a cudaMemset(g_d_workspace, 0, ...) call
immediately after allocation to be safe.
|
|
```
// C. COPY & LAUNCH
cudaStream_t stream;
cudaStreamCreate(&stream);
```
suggestion (non-blocking): Creating and destroying a CUDA stream on every
batch call is expensive. The stream should be created once during
initialisation alongside the buffer allocation and reused across all batch
calls. Destroying it per-call adds unnecessary overhead on every GPU
dispatch.
```
using namespace cupqc;

/* --- 1. DEFINE THE ALGORITHM --- */
using Encaps768 = decltype(ML_KEM_768{} + Function<function::Encaps>() + Block() + BlockDim<256>());
```
question (non-blocking): The cuPQC documentation example uses BlockDim<128>
for ML-KEM operations. This implementation uses BlockDim<256>. Has this been
validated against the cuPQC SDK documentation for ML-KEM-768 specifically?
Not all BlockDim values may be supported — please confirm this is an
explicitly supported configuration.
```
    return NULL;
}

static void cupqc_lazy_init(void) {
```
issue (blocking): cupqc_lazy_init() is not thread-safe. Multiple threads
calling cupqc_submit_encap_job() simultaneously for the first time will all
pass the if (!cupqc_initialized) check and all call pthread_create(),
spawning multiple worker threads. Replace the manual flag check with
pthread_once() to guarantee exactly one initialisation.
```
void *current_job = (cb_get_curr_job) ? cb_get_curr_job() : NULL;

if (current_job != NULL && cb_pause_job != NULL) {
    cb_pause_job();
} else {
    pthread_mutex_lock(&global_queue.lock);
    while (global_queue.jobs[slot].status == 0) {
        /* FIX #1: Removed redundant signal. Just wait. */
        pthread_cond_wait(&global_queue.cond_done, &global_queue.lock);
    }
    pthread_mutex_unlock(&global_queue.lock);
}
```
issue (blocking): The slot-based coordination has a use-after-reuse bug.
After pthread_mutex_unlock(), the worker thread can process the batch, reset
global_queue.count = 0, and immediately accept new jobs into slot 0. A new
submission can overwrite jobs[slot] — including ciphertext_out and
shared_secret_out pointers — while the original submitter is still waiting
on jobs[slot].status. This can cause the wrong output pointers to be written
to and produces incorrect ciphertext or shared secret data. The slot
coordination scheme needs a per-job generation counter or a per-job
condition variable to be correct.
```
    }
}

pthread_cond_broadcast(&global_queue.cond_done);
```
issue (blocking): Broadcasting on cond_done wakes all waiting threads
simultaneously. Because slots are reused after global_queue.count is reset
to 0, some threads will read stale status values from slots that have been
overwritten by new jobs. This compounds the slot reuse bug above. Each
waiting thread needs an unambiguous way to know its specific job completed,
not just that some batch finished.
```
if (global_queue.count >= 1) {
    pthread_cond_signal(&global_queue.cond);
}
```
issue (blocking): The batch fires as soon as a single job arrives, which
defeats the entire purpose of batching. The GPU is being invoked with
count=1 on every call, making the per-call overhead of stream creation,
H2D transfer, kernel launch, D2H transfer, and stream destruction worse than
the CPU fallback for any realistic single-connection workload. The README
acknowledges this as a known limitation. A configurable flush threshold or a
timer-based flush is needed for the batching to provide any throughput
benefit.
In cupqc_runtime.c: suggestion (non-blocking): There is no cupqc_shutdown() function, no atexit()
Co-authored-by: AddyTiv <adityakoranga2004@gmail.com>
|
Any updates here, @Meghakoranga?
Overview
This PR introduces cuSSL, a GPU-accelerated ML-KEM-768 offload integration for OpenSSL 3.5.0 using NVIDIA cuPQC.
The goal is to evaluate ML-KEM performance under high TLS handshake workloads by leveraging GPU parallelism, while preserving OpenSSL compatibility and CPU fallback behavior.
The integration is patch-based and does not modify OpenSSL public APIs.
Feedback and review suggestions are welcome.