feat: NVIDIA GPU-Accelerated ML-KEM-768 Offload for OpenSSL 3.5 #1

Open
Meghakoranga wants to merge 8 commits into ngkore:main from Meghakoranga:main

Conversation


@Meghakoranga Meghakoranga commented Mar 1, 2026

Overview

This PR introduces cuSSL, a GPU-accelerated ML-KEM-768 offload integration for OpenSSL 3.5.0 using NVIDIA cuPQC.

The goal is to evaluate ML-KEM-768 performance under high TLS handshake workloads by leveraging GPU parallelism, while preserving OpenSSL compatibility and CPU fallback behavior.

The integration is patch-based and does not modify OpenSSL public APIs.


Feedback and review suggestions are welcome.

@AdityaKoranga changed the title from "Add NVIDIA GPU-Accelerated ML-KEM-768 Offload for OpenSSL 3.5" to "feat: NVIDIA GPU-Accelerated ML-KEM-768 Offload for OpenSSL 3.5" on Mar 1, 2026
@AdityaKoranga AdityaKoranga self-requested a review March 9, 2026 06:42
Comment thread openssl/patches/openssl-3.5.0-mlkem-cupqc.patch
+ if (env && strcmp(env, "1") == 0) {
+
+ // Register callbacks ONCE (Lazy Init)
+ if (!cupqc_callbacks_registered) {

@AdityaKoranga AdityaKoranga Mar 30, 2026


issue (blocking): Callback registration has a check-then-act race condition. In a multi-threaded server, two threads can simultaneously observe cupqc_callbacks_registered == 0 and both call cupqc_set_callbacks(). There is no mutex protecting this block. Use pthread_once() or CRYPTO_THREAD_write_lock() to ensure the registration happens exactly once.
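A minimal sketch of the suggested fix, moving the flag check under a plain pthread mutex. The stub registration call and the demo counter are stand-ins for the patch's real cupqc_set_callbacks(), kept only so the locking pattern is shown self-contained:

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-ins for the patch's real globals and registration call. */
static bool cupqc_callbacks_registered = false;
static pthread_mutex_t cupqc_reg_lock = PTHREAD_MUTEX_INITIALIZER;
static int registration_count = 0;               /* demo-only counter */

static void cupqc_set_callbacks_stub(void) { registration_count++; }

void cupqc_register_callbacks_once(void)
{
    /* Take the lock BEFORE testing the flag, so two threads can never
     * both observe "not registered" and register twice. */
    pthread_mutex_lock(&cupqc_reg_lock);
    if (!cupqc_callbacks_registered) {
        cupqc_set_callbacks_stub();
        cupqc_callbacks_registered = true;
    }
    pthread_mutex_unlock(&cupqc_reg_lock);
}
```

pthread_once() works equally well here; the mutex variant is shown because the same lock can later protect the rest of the lazy-init block.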

+
+ // Register callbacks ONCE (Lazy Init)
+ if (!cupqc_callbacks_registered) {
+ cupqc_set_callbacks(wrapper_pause,NULL, wrapper_get_job);

@AdityaKoranga AdityaKoranga Mar 30, 2026


issue (blocking): The wake callback is passed as NULL, which causes a deadlock. The runtime uses cb_wake_job to resume a paused OpenSSL async job after GPU completion. With it NULL, any job that calls ASYNC_pause_job() will pause indefinitely and never be woken up. A valid wake callback must be provided here. Review ASYNC_WAIT_CTX_set_wait_fd or the callback mechanism described in the QAT Engine async_job documentation as a reference for how job wakeup should be implemented.

return ossl_ml_kem_encap_seed(ctext, clen, shared_secret, slen,
r, sizeof(r), key);
}
+/* int ossl_ml_kem_encap_rand(uint8_t *ctext, size_t clen,

@AdityaKoranga AdityaKoranga Mar 30, 2026


nitpick (non-blocking): Dead commented-out code should not be present in a merged patch. Remove the commented-out original function body before requesting review. If you need it for reference during development, keep it in your local branch only.

@AdityaKoranga

issue (blocking): There are two files implementing the same GPU encapsulation batch function with slightly different names: cupqc_encaps_mlkem768_batch in mlkem_cupqc.cu and cupqc_encap_mlkem768_batch in cupqc_shim.cu. If both are compiled together, this is a duplicate symbol error; if only one is compiled, the other is dead code. Clarify which file is authoritative, remove the other, and ensure the function name matches exactly what cupqc_runtime.c calls.

Comment thread src/cupqc_shim.cu
if (count <= 0 || count > MAX_CAPACITY) return;

// A. ALLOCATION
if (g_d_pk == nullptr) {

issue (blocking): The lazy GPU buffer allocation has a race condition. If two CPU threads call this function simultaneously for the first time, both will observe g_d_pk == nullptr and both will call cudaMalloc on the same global pointer. This results in a memory leak and undefined behaviour. Protect this block with a mutex or use pthread_once() for one-time initialisation.

Comment thread src/cupqc_shim.cu
for (int i = 0; i < count; i++) {
if (pk_ptrs[i] && rnd_ptrs[i]) {
memcpy(g_h_pk + (i * Encaps768::public_key_size), pk_ptrs[i], Encaps768::public_key_size);
memcpy(g_h_entropy + (i * Encaps768::entropy_size), rnd_ptrs[i], Encaps768::entropy_size);

issue (blocking): The runtime copies exactly 32 bytes of randomness per job into randomness_storage[32], but the shim uses Encaps768::entropy_size as the stride for the device entropy buffer. If Encaps768::entropy_size != 32, which is possible depending on the cuPQC SDK version, the gather loop writes 32 bytes but the device layout expects more, resulting in an out-of-bounds write into the pinned host buffer. Verify the exact value of Encaps768::entropy_size against the cuPQC SDK and align the runtime storage size to match.
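One cheap way to lock in the alignment once the real value is confirmed is a compile-time guard. ENTROPY_SIZE below is only a placeholder for Encaps768::entropy_size, which would have to be exported to the C side (e.g. via a hypothetical shim header); the point is that a mismatch should fail the build rather than corrupt memory at runtime:

```c
/* Placeholder values: MLKEM768_RAND_BYTES is the runtime's fixed storage,
 * ENTROPY_SIZE stands in for Encaps768::entropy_size from the shim. */
#define MLKEM768_RAND_BYTES 32
#define ENTROPY_SIZE 32

_Static_assert(MLKEM768_RAND_BYTES == ENTROPY_SIZE,
               "runtime randomness storage must match cuPQC entropy stride");
```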

Comment thread src/cupqc_shim.cu
Comment on lines +62 to +67
cudaMalloc(&g_d_pk, MAX_CAPACITY * Encaps768::public_key_size);
cudaMalloc(&g_d_ct, MAX_CAPACITY * Encaps768::ciphertext_size);
cudaMalloc(&g_d_ss, MAX_CAPACITY * Encaps768::shared_secret_size);
cudaMalloc(&g_d_entropy, MAX_CAPACITY * Encaps768::entropy_size);
cudaMalloc(&g_d_workspace, MAX_CAPACITY * Encaps768::workspace_size);


issue (blocking): No CUDA API call in this file checks its return value. If the GPU runs out of memory, if the kernel fails, or if the stream errors, all failures are silent and the output buffers contain garbage or uninitialised data. This data then gets returned to OpenSSL as a valid shared secret or ciphertext. Every CUDA call must check its cudaError_t return value and propagate failures back to the caller.
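The usual shape of the fix is a check macro that turns any non-success status into an error return. The sketch below uses stubs in place of the real CUDA runtime so the propagation pattern can be shown standalone; in the shim, every cudaMalloc/cudaMemcpyAsync/kernel launch would go through the same check, with cudaError_t and cudaGetErrorString instead of the stubs:

```c
#include <stdio.h>

/* Stub error type and allocator standing in for cudaError_t/cudaMalloc. */
typedef int cuda_err_t;
#define CUDA_OK 0

static cuda_err_t fake_cuda_malloc(void **p, unsigned long n, int fail)
{
    static char backing[4096];
    (void)n;
    if (fail) { *p = NULL; return 2; }   /* e.g. cudaErrorMemoryAllocation */
    *p = backing;
    return CUDA_OK;
}

/* Check every call; on failure, log and propagate instead of continuing
 * with garbage buffers. */
#define CUPQC_CHECK(call)                                        \
    do {                                                         \
        cuda_err_t err_ = (call);                                \
        if (err_ != CUDA_OK) {                                   \
            fprintf(stderr, "cuda call failed: %d\n", err_);     \
            return -1;                                           \
        }                                                        \
    } while (0)

int allocate_buffers(int simulate_oom)
{
    void *pk = NULL, *ct = NULL;
    CUPQC_CHECK(fake_cuda_malloc(&pk, 1184, 0));
    CUPQC_CHECK(fake_cuda_malloc(&ct, 1088, simulate_oom));
    return 0;
}
```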

Comment thread src/cupqc_shim.cu
cudaMalloc(&g_d_ct, MAX_CAPACITY * Encaps768::ciphertext_size);
cudaMalloc(&g_d_ss, MAX_CAPACITY * Encaps768::shared_secret_size);
cudaMalloc(&g_d_entropy, MAX_CAPACITY * Encaps768::entropy_size);
cudaMalloc(&g_d_workspace, MAX_CAPACITY * Encaps768::workspace_size);

suggestion (non-blocking): g_d_workspace is allocated via cudaMalloc, which returns uninitialised device memory. Some cuPQC operations may require a zeroed workspace buffer. Add a cudaMemset(g_d_workspace, 0, ...) call immediately after allocation to be safe.

Comment thread src/cupqc_shim.cu

// C. COPY & LAUNCH
cudaStream_t stream;
cudaStreamCreate(&stream);

suggestion (non-blocking): Creating and destroying a CUDA stream on every batch call is expensive. The stream should be created once during initialisation, alongside the buffer allocation, and reused across all batch calls. Destroying it per-call adds unnecessary overhead on every GPU dispatch.

Comment thread src/cupqc_shim.cu
using namespace cupqc;

/* --- 1. DEFINE THE ALGORITHM --- */
using Encaps768 = decltype(ML_KEM_768{} + Function<function::Encaps>() + Block() + BlockDim<256>());

question (non-blocking): The cuPQC documentation example uses BlockDim<128> for ML-KEM operations, but this implementation uses BlockDim<256>. Has this been validated against the cuPQC SDK documentation for ML-KEM-768 specifically? Not all BlockDim values may be supported; please confirm this is an explicitly supported configuration.

Comment thread src/cupqc_runtime.c
return NULL;
}

static void cupqc_lazy_init(void) {

issue (blocking): cupqc_lazy_init() is not thread-safe. Multiple threads calling cupqc_submit_encap_job() simultaneously for the first time will all pass the if (!cupqc_initialized) check and all call pthread_create(), spawning multiple worker threads. Replace the manual flag check with pthread_once() to guarantee exactly one initialisation.
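A minimal sketch of the pthread_once() replacement; the init body and counter are stand-ins for the real work of spawning the worker thread:

```c
#include <pthread.h>

static pthread_once_t cupqc_once = PTHREAD_ONCE_INIT;
static int init_calls = 0;   /* demo-only: counts how often init ran */

/* Runs exactly once, no matter how many threads race into lazy init;
 * the real body would create the worker thread and GPU state here. */
static void cupqc_do_init(void)
{
    init_calls++;
}

void cupqc_lazy_init(void)
{
    pthread_once(&cupqc_once, cupqc_do_init);
}
```

Unlike the manual flag, pthread_once also makes later callers wait until initialisation has fully completed before they proceed.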

Comment thread src/cupqc_runtime.c
Comment on lines +160 to +171
void *current_job = (cb_get_curr_job) ? cb_get_curr_job() : NULL;

if (current_job != NULL && cb_pause_job != NULL) {
cb_pause_job();
} else {
pthread_mutex_lock(&global_queue.lock);
while (global_queue.jobs[slot].status == 0) {
/* FIX #1: Removed redundant signal. Just wait. */
pthread_cond_wait(&global_queue.cond_done, &global_queue.lock);
}
pthread_mutex_unlock(&global_queue.lock);
}

issue (blocking): The slot-based coordination has a use-after-reuse bug. After pthread_mutex_unlock(), the worker thread can process the batch, reset global_queue.count = 0, and immediately accept new jobs into slot 0. A new submission can overwrite jobs[slot], including the ciphertext_out and shared_secret_out pointers, while the original submitter is still waiting on jobs[slot].status. This can cause the wrong output pointers to be written to and produces incorrect ciphertext or shared secret data. The slot coordination scheme needs a per-job generation counter or a per-job condition variable to be correct.
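A sketch of the generation-counter variant (names are illustrative, not from the patch): the submitter snapshots the slot's generation at enqueue time, and completion is signalled by bumping it, so a reused slot can never be mistaken for the original job's completion:

```c
#include <pthread.h>

struct job_slot {
    unsigned long gen;   /* bumped by the worker when this job completes */
};

struct job_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cond_done;
    struct job_slot jobs[8];
};

/* Submitter: wait until THIS job's generation has advanced, immune to the
 * slot being reset and refilled with someone else's job. */
void wait_for_job(struct job_queue *q, int slot, unsigned long my_gen)
{
    pthread_mutex_lock(&q->lock);
    while (q->jobs[slot].gen == my_gen)
        pthread_cond_wait(&q->cond_done, &q->lock);
    pthread_mutex_unlock(&q->lock);
}

/* Worker: copy results out under the lock, then bump and broadcast. */
void complete_job(struct job_queue *q, int slot)
{
    pthread_mutex_lock(&q->lock);
    q->jobs[slot].gen++;
    pthread_cond_broadcast(&q->cond_done);
    pthread_mutex_unlock(&q->lock);
}
```

Because each waiter compares against its own snapshot, a broadcast that wakes everyone is harmless: threads whose generation has not moved simply go back to sleep.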

Comment thread src/cupqc_runtime.c
}
}

pthread_cond_broadcast(&global_queue.cond_done);

issue (blocking): Broadcasting on cond_done wakes all waiting threads simultaneously. Because slots are reused after global_queue.count is reset to 0, some threads will read stale status values from slots that have been overwritten by new jobs. This compounds the slot-reuse bug above. Each waiting thread needs an unambiguous way to know that its specific job completed, not just that some batch finished.

Comment thread src/cupqc_runtime.c
Comment on lines +153 to +155
if (global_queue.count >= 1) {
pthread_cond_signal(&global_queue.cond);
}

issue (blocking): The batch fires as soon as a single job arrives, which defeats the entire purpose of batching. The GPU is being invoked with count=1 on every call, making the per-call overhead of stream creation, H2D transfer, kernel launch, D2H transfer, and stream destruction worse than the CPU fallback for any realistic single-connection workload. The README acknowledges this as a known limitation. A configurable flush threshold or a timer-based flush is needed for the batching to provide any throughput benefit.
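A sketch of the flush decision (threshold and timeout values are illustrative, not from the patch); in the worker, this predicate would sit inside a pthread_cond_timedwait loop so a lone job is still dispatched within a bounded latency:

```c
/* Illustrative tuning knobs. */
#define BATCH_THRESHOLD  32   /* dispatch as soon as this many jobs queue */
#define FLUSH_TIMEOUT_MS  2   /* ...or after this long with >= 1 job */

/* Decide whether the worker should dispatch the batch now. */
int should_flush(int queued, long waited_ms)
{
    if (queued >= BATCH_THRESHOLD)
        return 1;                       /* batch is full */
    if (queued > 0 && waited_ms >= FLUSH_TIMEOUT_MS)
        return 1;                       /* latency bound reached */
    return 0;                           /* keep accumulating */
}
```

This keeps single-connection latency bounded by FLUSH_TIMEOUT_MS while letting high-concurrency workloads fill batches and amortise the per-dispatch GPU overhead.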

@AdityaKoranga

In cupqc_runtime.c:

suggestion (non-blocking): There is no cupqc_shutdown() function, no atexit() handler, and no way to set global_queue.shutdown = 1 from outside the worker. The worker thread leaks on process exit and the GPU buffers are never freed. For a production library this matters: add a cleanup function and register it via atexit() or expose it for the caller to invoke.
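A sketch of what such a cleanup path could look like (names are illustrative; the real version would also signal the job-queue condvars so a blocked worker wakes, and free the device/pinned buffers after the join):

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int g_shutdown = 0;
static pthread_t  g_worker;
static int        g_worker_started = 0;

/* Demo worker loop polls the shutdown flag between batches (the real
 * worker would block on the job-queue condvar instead of yielding). */
static void *worker_main(void *arg)
{
    (void)arg;
    while (!atomic_load(&g_shutdown))
        sched_yield();
    return NULL;
}

int cupqc_start(void)
{
    if (g_worker_started)
        return 0;
    if (pthread_create(&g_worker, NULL, worker_main, NULL) != 0)
        return -1;
    g_worker_started = 1;
    return 0;
}

/* Suitable for atexit(cupqc_shutdown) or an explicit caller-facing hook. */
void cupqc_shutdown(void)
{
    if (!g_worker_started)
        return;
    atomic_store(&g_shutdown, 1);     /* worker observes flag and exits */
    pthread_join(g_worker, NULL);
    g_worker_started = 0;
    /* cudaFree()/cudaFreeHost() of the global buffers would go here. */
}
```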

Comment thread README.md Outdated
Co-authored-by: AddyTiv <adityakoranga2004@gmail.com>
@AdityaKoranga

Any updates here, @Meghakoranga?
