feat: NVIDIA GPU-Accelerated ML-KEM-768 Offload for OpenSSL 3.5 #1
Meghakoranga wants to merge 8 commits into ngkore:main from
Conversation
```
+ if (env && strcmp(env, "1") == 0) {
+
+     // Register callbacks ONCE (Lazy Init)
+     if (!cupqc_callbacks_registered) {
```
issue (blocking): Callback registration has a check-then-act race condition. In a multi-threaded server, two threads can simultaneously observe cupqc_callbacks_registered == 0 and both call cupqc_set_callbacks(). There is no mutex protecting this block. Use pthread_once() or CRYPTO_THREAD_write_lock() to ensure the registration happens exactly once.
```
+
+ // Register callbacks ONCE (Lazy Init)
+ if (!cupqc_callbacks_registered) {
+     cupqc_set_callbacks(wrapper_pause,NULL, wrapper_get_job);
```
issue (blocking): The wake callback is passed as NULL, which causes a deadlock. The runtime uses cb_wake_job to resume a paused OpenSSL async job after GPU completion. With it NULL, any job that calls ASYNC_pause_job() will pause indefinitely and never be woken up. A valid wake callback must be provided here. Review ASYNC_WAIT_CTX_set_wait_fd or the callback mechanism described in the QAT Engine async_job documentation as a reference for how job wakeup should be implemented.
```
      return ossl_ml_kem_encap_seed(ctext, clen, shared_secret, slen,
                                    r, sizeof(r), key);
  }
+ /* int ossl_ml_kem_encap_rand(uint8_t *ctext, size_t clen,
```
nitpick (non-blocking): Dead commented-out code should not be present in a merged patch. Remove the commented-out original function body before requesting review. If you need it for reference during development, keep it in your local branch only.
|
issue (blocking): There are two files implementing the same GPU encapsulation
```
if (count <= 0 || count > MAX_CAPACITY) return;

// A. ALLOCATION
if (g_d_pk == nullptr) {
```
issue (blocking): The lazy GPU buffer allocation has a race condition. If two
CPU threads call this function simultaneously for the first time, both will
observe g_d_pk == nullptr and both will call cudaMalloc on the same global
pointer. This results in a memory leak and undefined behaviour. Protect this
block with a mutex or use pthread_once() for one-time initialisation.
```
for (int i = 0; i < count; i++) {
    if (pk_ptrs[i] && rnd_ptrs[i]) {
        memcpy(g_h_pk + (i * Encaps768::public_key_size), pk_ptrs[i], Encaps768::public_key_size);
        memcpy(g_h_entropy + (i * Encaps768::entropy_size), rnd_ptrs[i], Encaps768::entropy_size);
```
issue (blocking): The runtime copies exactly 32 bytes of randomness per job
into randomness_storage[32], but the shim uses Encaps768::entropy_size as
the stride for the device entropy buffer. If Encaps768::entropy_size != 32
— which is possible depending on the cuPQC SDK version — the gather loop
writes 32 bytes but the device layout expects more, resulting in an
out-of-bounds write into the pinned host buffer. Verify the exact value of
Encaps768::entropy_size against the cuPQC SDK and align the runtime storage
size to match.
```
cudaMalloc(&g_d_pk, MAX_CAPACITY * Encaps768::public_key_size);
cudaMalloc(&g_d_ct, MAX_CAPACITY * Encaps768::ciphertext_size);
cudaMalloc(&g_d_ss, MAX_CAPACITY * Encaps768::shared_secret_size);
cudaMalloc(&g_d_entropy, MAX_CAPACITY * Encaps768::entropy_size);
cudaMalloc(&g_d_workspace, MAX_CAPACITY * Encaps768::workspace_size);
```
issue (blocking): No CUDA API call in this file checks its return value. If
the GPU runs out of memory, if the kernel fails, or if the stream errors,
all failures are silent and the output buffers contain garbage or
uninitialised data. This data then gets returned to OpenSSL as a valid
shared secret or ciphertext. Every CUDA call must check its cudaError_t
return value and propagate failures back to the caller.
```
cudaMalloc(&g_d_ct, MAX_CAPACITY * Encaps768::ciphertext_size);
cudaMalloc(&g_d_ss, MAX_CAPACITY * Encaps768::shared_secret_size);
cudaMalloc(&g_d_entropy, MAX_CAPACITY * Encaps768::entropy_size);
cudaMalloc(&g_d_workspace, MAX_CAPACITY * Encaps768::workspace_size);
```
suggestion (non-blocking): g_d_workspace is allocated via cudaMalloc which
returns uninitialised device memory. Some cuPQC operations may require a
zeroed workspace buffer. Add a cudaMemset(g_d_workspace, 0, ...) call
immediately after allocation to be safe.
|
|
```
// C. COPY & LAUNCH
cudaStream_t stream;
cudaStreamCreate(&stream);
```
suggestion (non-blocking): Creating and destroying a CUDA stream on every
batch call is expensive. The stream should be created once during
initialisation alongside the buffer allocation and reused across all batch
calls. Destroying it per-call adds unnecessary overhead on every GPU
dispatch.
```
using namespace cupqc;

/* --- 1. DEFINE THE ALGORITHM --- */
using Encaps768 = decltype(ML_KEM_768{} + Function<function::Encaps>() + Block() + BlockDim<256>());
```
question (non-blocking): The cuPQC documentation example uses BlockDim<128>
for ML-KEM operations. This implementation uses BlockDim<256>. Has this been
validated against the cuPQC SDK documentation for ML-KEM-768 specifically?
Not all BlockDim values may be supported — please confirm this is an
explicitly supported configuration.
```
    return NULL;
}

static void cupqc_lazy_init(void) {
```
issue (blocking): cupqc_lazy_init() is not thread-safe. Multiple threads
calling cupqc_submit_encap_job() simultaneously for the first time will all
pass the if (!cupqc_initialized) check and all call pthread_create(),
spawning multiple worker threads. Replace the manual flag check with
pthread_once() to guarantee exactly one initialisation.
```
void *current_job = (cb_get_curr_job) ? cb_get_curr_job() : NULL;

if (current_job != NULL && cb_pause_job != NULL) {
    cb_pause_job();
} else {
    pthread_mutex_lock(&global_queue.lock);
    while (global_queue.jobs[slot].status == 0) {
        /* FIX #1: Removed redundant signal. Just wait. */
        pthread_cond_wait(&global_queue.cond_done, &global_queue.lock);
    }
    pthread_mutex_unlock(&global_queue.lock);
}
```
issue (blocking): The slot-based coordination has a use-after-reuse bug.
After pthread_mutex_unlock(), the worker thread can process the batch, reset
global_queue.count = 0, and immediately accept new jobs into slot 0. A new
submission can overwrite jobs[slot] — including ciphertext_out and
shared_secret_out pointers — while the original submitter is still waiting
on jobs[slot].status. This can cause the wrong output pointers to be written
to and produces incorrect ciphertext or shared secret data. The slot
coordination scheme needs a per-job generation counter or a per-job
condition variable to be correct.
```
    }
}

pthread_cond_broadcast(&global_queue.cond_done);
```
issue (blocking): Broadcasting on cond_done wakes all waiting threads
simultaneously. Because slots are reused after global_queue.count is reset
to 0, some threads will read stale status values from slots that have been
overwritten by new jobs. This compounds the slot reuse bug above. Each
waiting thread needs an unambiguous way to know its specific job completed,
not just that some batch finished.
```
if (global_queue.count >= 1) {
    pthread_cond_signal(&global_queue.cond);
}
```
issue (blocking): The batch fires as soon as a single job arrives, which
defeats the entire purpose of batching. The GPU is being invoked with
count=1 on every call, making the per-call overhead of stream creation,
H2D transfer, kernel launch, D2H transfer, and stream destruction worse than
the CPU fallback for any realistic single-connection workload. The README
acknowledges this as a known limitation. A configurable flush threshold or a
timer-based flush is needed for the batching to provide any throughput
benefit.
In cupqc_runtime.c: suggestion (non-blocking): There is no cupqc_shutdown() function, no atexit()
Co-authored-by: AddyTiv <adityakoranga2004@gmail.com>
|
Any updates here, @Meghakoranga?
Overview
This PR introduces cuSSL, a GPU-accelerated ML-KEM-768 offload integration for OpenSSL 3.5.0 using NVIDIA cuPQC.
The goal is to evaluate ML-KEM performance under high TLS handshake workloads by leveraging GPU parallelism, while preserving OpenSSL compatibility and CPU fallback behavior.
The integration is patch-based and does not modify OpenSSL public APIs.
Feedback and review suggestions are welcome.