9 changes: 9 additions & 0 deletions .gitignore
@@ -0,0 +1,9 @@
*.o
*.a
*.so
build/
cupqc_sdk/
*.pem
*.crt
*.key
*.txt
220 changes: 218 additions & 2 deletions README.md
@@ -1,2 +1,218 @@
# cuSSL
cuPQC provider for OpenSSL
# cuSSL: GPU-Accelerated ML-KEM-768 Integration for OpenSSL 3.5

**Hardware-Accelerated Post-Quantum Cryptography using NVIDIA cuPQC and CUDA**

cuSSL is a high-performance runtime and backend that offloads **ML-KEM-768** Key Encapsulation operations from OpenSSL 3.5 to NVIDIA GPUs. It integrates directly into the OpenSSL cryptographic core and enables high-throughput TLS 1.3 Post-Quantum handshakes.

cuSSL implements a **Split-Stack Architecture**, cleanly separating the OpenSSL cryptographic core (CPU/C) from the GPU execution backend (CUDA/C++), ensuring ABI stability, thread safety, and memory isolation.

---

## Features

* GPU-accelerated ML-KEM-768 encapsulation using NVIDIA cuPQC
* Asynchronous batching runtime (up to 512 concurrent operations)
* Direct integration into OpenSSL 3.5 cryptographic core
* Thread-safe job queue and runtime scheduler
* Secure memory isolation between OpenSSL and GPU runtime
* Automatic CPU fallback when GPU offload is disabled
* Clean patch-based integration (no OpenSSL source redistribution)

---

## Architecture

cuSSL operates in three layers:

### 1. OpenSSL Integration Layer (Client)

**File:** `crypto/ml_kem/ml_kem.c` (patched)

Responsibilities:

* Intercepts ML-KEM encapsulation requests
* Submits jobs via cuSSL runtime API
* Uses OpenSSL Async Job framework (`ASYNC_pause_job`)
* Maintains full compatibility with OpenSSL execution model
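
Below is a hedged sketch of how this layer might submit work and yield. `ASYNC_get_current_job()` and `ASYNC_pause_job()` are real OpenSSL APIs; `cussl_submit_encaps()`, `cussl_job_done()`, and `ml_kem_encaps_cpu()` are illustrative names, not the actual symbols in the patch.

```
/* Illustrative only: everything except the OpenSSL ASYNC_* calls is an
 * assumption, not the real patched ml_kem.c. */
#include <stdint.h>
#include <stdlib.h>
#include <openssl/async.h>

/* Hypothetical runtime/fallback hooks (declarations only, for illustration). */
int cussl_submit_encaps(const uint8_t *ek, uint8_t *ct, uint8_t *ss);
int cussl_job_done(int ticket);
int ml_kem_encaps_cpu(const uint8_t *ek, uint8_t *ct, uint8_t *ss);

static int ml_kem_768_encaps(const uint8_t *ek, uint8_t *ct, uint8_t *ss)
{
    ASYNC_JOB *job = ASYNC_get_current_job();

    /* No async job or offload disabled: take the stock CPU path. */
    if (job == NULL || getenv("ENABLE_CUPQC") == NULL)
        return ml_kem_encaps_cpu(ek, ct, ss);

    int ticket = cussl_submit_encaps(ek, ct, ss);   /* enqueue on the runtime */
    if (ticket < 0)
        return ml_kem_encaps_cpu(ek, ct, ss);       /* queue full: fall back  */

    while (!cussl_job_done(ticket))
        ASYNC_pause_job();   /* yield to OpenSSL until the GPU batch completes */

    return 1;
}
```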

---

### 2. cuSSL Runtime Layer (Manager)

**File:** `src/cupqc_runtime.c`

Responsibilities:

* Thread-safe batching queue
* Job scheduling and worker thread management
* Memory isolation between OpenSSL and CUDA
* Async job coordination

This layer acts as the bridge between OpenSSL and GPU backend.
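
As an illustration of the batching queue, here is a minimal sketch; the struct layout, names, and submit logic are assumptions based on the description above, not the actual `cupqc_runtime.c` internals. Only the 512-slot capacity comes from the feature list.

```
/* Minimal sketch of a thread-safe batch queue; names and layout are
 * assumptions, not the real cupqc_runtime.c data structures. */
#include <pthread.h>
#include <stdint.h>

#define CUSSL_MAX_BATCH 512            /* matches the 512-slot queue above */

typedef struct {
    const uint8_t *ek;                 /* ML-KEM-768 public key (1184 bytes) */
    uint8_t *ct;                       /* ciphertext out (1088 bytes)        */
    uint8_t *ss;                       /* shared secret out (32 bytes)       */
    int done;                          /* set by the worker on completion    */
} cussl_job;

typedef struct {
    cussl_job jobs[CUSSL_MAX_BATCH];
    int count;                         /* jobs queued since the last flush   */
    pthread_mutex_t lock;
    pthread_cond_t wake;               /* wakes the GPU worker thread        */
} cussl_queue;

/* Called from the OpenSSL side; returns a slot index ("ticket") or -1. */
static int queue_submit(cussl_queue *q, const uint8_t *ek, uint8_t *ct, uint8_t *ss)
{
    pthread_mutex_lock(&q->lock);
    if (q->count == CUSSL_MAX_BATCH) {
        pthread_mutex_unlock(&q->lock);
        return -1;                     /* queue full: caller falls back to CPU */
    }
    int slot = q->count++;
    q->jobs[slot] = (cussl_job){ .ek = ek, .ct = ct, .ss = ss, .done = 0 };
    pthread_cond_signal(&q->wake);     /* wake threshold of 1 (see Performance) */
    pthread_mutex_unlock(&q->lock);
    return slot;
}
```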

---

### 3. CUDA Backend Layer (Worker)

**File:** `src/cupqc_shim.cu`

Responsibilities:

* Executes batched ML-KEM-768 encapsulation
* Launches cuPQC CUDA kernels
* Manages persistent GPU memory buffers
* Performs host/device memory transfers

Uses `cupqc::ML_KEM_768` from the NVIDIA cuPQC SDK.
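
The shim exposes a plain C boundary to the runtime so the CUDA/C++ side stays isolated. The declarations below are a hypothetical sketch of that boundary, not the actual exports of `cupqc_shim.cu`.

```
/* Hypothetical host-facing interface of the CUDA shim; the real shim wraps
 * cupqc::ML_KEM_768 kernel launches behind extern "C" entry points. */
#include <stddef.h>
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

/* One-time setup/teardown of persistent device buffers and CUDA streams. */
int  cussl_gpu_init(size_t max_batch);
void cussl_gpu_shutdown(void);

/* Encapsulate `batch` keys in one launch. Buffers are contiguous,
 * batch-major host arrays staged to and from persistent device memory:
 *   ek: batch * 1184 bytes, ct: batch * 1088 bytes, ss: batch * 32 bytes. */
int cussl_gpu_encaps_batch(const uint8_t *ek, uint8_t *ct, uint8_t *ss, size_t batch);

#ifdef __cplusplus
}
#endif
```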

---

## Prerequisites

Hardware:

* NVIDIA GPU (Turing / Ampere / Ada or newer)
* Compute Capability ≥ 7.5

Software:

* Linux (Ubuntu 20.04 / 22.04 recommended)
* OpenSSL 3.5.0 source
* NVIDIA CUDA Toolkit (12+)
* NVIDIA cuPQC SDK
* GCC 9+
* NVCC compiler

---

## Build Instructions

### 1. Set Environment Variables

```
export CUPQC_HOME=/path/to/cupqc_sdk
export OPENSSL_ROOT=/path/to/openssl-3.5.0
```

---

### 2. Compile cuSSL Runtime and CUDA Backend

Compile runtime:
```
gcc -c src/cupqc_runtime.c -o cupqc_runtime.o -fPIC \
    -I${OPENSSL_ROOT}/include \
    -I${OPENSSL_ROOT}/crypto/ml_kem
```

Compile CUDA backend:

```
nvcc -c src/cupqc_shim.cu -o cupqc_shim.o \
    -rdc=true -dlto -std=c++17 \
    -I${CUPQC_HOME}/include \
    -Xcompiler -fPIC
```

Device link:
```
nvcc -dlink cupqc_shim.o -o cupqc_shim_dlink.o \
    -rdc=true -dlto \
    -L${CUPQC_HOME}/lib -lcupqc-pk
```

Final shared library:
```
g++ -shared -o libcussl.so \
    cupqc_runtime.o cupqc_shim.o cupqc_shim_dlink.o \
    -L${CUPQC_HOME}/lib -lcupqc-pk \
    -L/usr/local/cuda/lib64 -lcudart -lpthread
```

---

### 3. Apply OpenSSL Patch

From OpenSSL root:

```
patch -p1 < /path/to/cuSSL/openssl/patches/openssl-3.5.0-mlkem-cupqc.patch
```

Rebuild OpenSSL:

```
make -j$(nproc)
```
---

## Usage

Enable GPU offload:

```
export ENABLE_CUPQC=1
```

Run OpenSSL TLS server:

```
openssl s_server -accept 4433 -cert cert.pem -key key.pem -tls1_3 -groups mlkem768
```

## Verify Offload

Use `nvitop` or `nvidia-smi` to confirm GPU utilization during handshakes, for example while a client such as `openssl s_client -connect localhost:4433 -groups mlkem768` connects repeatedly to the server above.

---
Disable GPU offload:

```
unset ENABLE_CUPQC
```

OpenSSL will fall back to the CPU implementation automatically.

---

## Performance & Scaling

This engine offloads the heavy post-quantum math to the GPU. However, overall throughput depends heavily on the web server's architecture.

### Current Benchmark (Standard Nginx)
* **Rate:** ~500 Handshakes/Second
* **Architecture Limit:** Standard Nginx uses a multi-processing model (e.g., 32 isolated worker processes). Because memory is not shared between these workers, the engine's internal batch queue cannot easily aggregate hundreds of connections at once. To prevent deadlocks, the GPU wake threshold is set to `1`, meaning the GPU processes very small batches, causing high CPU overhead from frequent kernel launches.

### How to Scale to 2,000+ HS/s
To fully saturate the GPU and achieve maximum throughput, the engine needs to fill its 512-slot batch queue. This can be achieved through two potential upgrades:

1. **Async-Enabled Server:** Use a web server that supports OpenSSL's asynchronous features (like Intel's Async Nginx). This allows a *single* worker process to handle thousands of concurrent connections, naturally filling a single, massive GPU queue without blocking.
2. **Background Flush Timer:** Implement a POSIX timer thread inside `cupqc_runtime.c` that forces a queue flush every few milliseconds, ensuring that "leftover" connections do not deadlock when using larger batch thresholds across multiple Nginx workers (a minimal sketch follows below).
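
A minimal sketch of the flush-timer idea in option 2, reusing the hypothetical `cussl_queue` from the runtime sketch above; the 2 ms period and `cussl_flush_locked()` are illustrative assumptions, not existing code.

```
/* Illustrative flush thread for option 2; not part of the current code. */
#include <pthread.h>
#include <time.h>

/* Hypothetical helper that launches whatever is queued, even a partial
 * batch, while q->lock is held. cussl_queue is the type sketched above. */
void cussl_flush_locked(cussl_queue *q);

static void *cussl_flush_thread(void *arg)
{
    cussl_queue *q = arg;
    const struct timespec period = { .tv_sec = 0, .tv_nsec = 2 * 1000 * 1000 };

    for (;;) {
        nanosleep(&period, NULL);      /* wait ~2 ms between flushes */
        pthread_mutex_lock(&q->lock);
        if (q->count > 0)
            cussl_flush_locked(q);     /* drain leftovers so no worker stalls */
        pthread_mutex_unlock(&q->lock);
    }
    return NULL;
}
```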

## Security and Compatibility

cuSSL:

* Preserves OpenSSL security model
* Does not modify public OpenSSL APIs
* Uses isolated runtime
* Supports CPU fallback

Patch-based integration ensures maintainability across OpenSSL versions.

---

## Licensing

This repository contains only integration code.

It does NOT include:

* OpenSSL source code
* NVIDIA cuPQC SDK
* CUDA Toolkit

Users must obtain those separately under their respective licenses.

---

## Project Status

The engine is fully functional and architecturally stable. It successfully performs hardware-offloaded ML-KEM-768 key encapsulation for standard OpenSSL TLS 1.3 connections.

**Core achievements include:**
* **Correctness:** Validated bit-exact key exchange and successful handshake completion.
* **Architecture:** Strict separation of the OpenSSL API and GPU runtime for full library compliance.
* **Performance:** Asynchronous batching logic is implemented and operational, ready for multi-threaded deployment.
87 changes: 87 additions & 0 deletions benchmarks/benchmark_cpu.c
@@ -0,0 +1,87 @@
/* benchmark_cpu.c - Multi-Threaded CPU Benchmark */
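/* Assumed build command (adjust paths to the patched OpenSSL 3.5 tree):
 *   gcc benchmark_cpu.c -o benchmark_cpu \
 *       -I${OPENSSL_ROOT}/include -L${OPENSSL_ROOT} -lcrypto -lpthread */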
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <pthread.h>
#include <openssl/evp.h>
#include <openssl/err.h>

#define ALGO_NAME "ML-KEM-768"

// Shared benchmark state: initialized once in main(), read by all workers
int g_iterations_per_thread = 0;
EVP_PKEY *g_pkey = NULL;

// The Worker Thread Function
void *cpu_worker(void *arg) {
EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new(g_pkey, NULL);
if (!ctx) return NULL;

// We do NOT call Async init, so this runs on CPU Software path
EVP_PKEY_encapsulate_init(ctx, NULL);

unsigned char *secret = malloc(32);
unsigned char *ciphertext = malloc(1088);
size_t secret_len = 32;
size_t ciphertext_len = 1088;

for (int i = 0; i < g_iterations_per_thread; i++) {
secret_len = 32;
ciphertext_len = 1088;
EVP_PKEY_encapsulate(ctx, ciphertext, &ciphertext_len, secret, &secret_len);
}

free(secret);
free(ciphertext);
EVP_PKEY_CTX_free(ctx);
return NULL;
}

int main(int argc, char **argv) {
int num_threads = 4; // Default to 4 cores
int total_iters = 100000;

if (argc > 1) num_threads = atoi(argv[1]);
if (argc > 2) total_iters = atoi(argv[2]);

g_iterations_per_thread = total_iters / num_threads;

printf("Benchmarking Multi-Core CPU Performance\n");
printf("Algorithm: %s\n", ALGO_NAME);
printf("Threads: %d\n", num_threads);
printf("Total Ops: %d\n", num_threads * g_iterations_per_thread);

// Generate Key (Once)
EVP_PKEY_CTX *kctx = EVP_PKEY_CTX_new_from_name(NULL, ALGO_NAME, NULL);
EVP_PKEY_keygen_init(kctx);
EVP_PKEY_keygen(kctx, &g_pkey);
EVP_PKEY_CTX_free(kctx);

// Launch Threads
pthread_t threads[num_threads];
struct timespec ts_start, ts_end;
clock_gettime(CLOCK_MONOTONIC, &ts_start);

for (int i = 0; i < num_threads; i++) {
pthread_create(&threads[i], NULL, cpu_worker, NULL);
}

// Wait for Threads
for (int i = 0; i < num_threads; i++) {
pthread_join(threads[i], NULL);
}

clock_gettime(CLOCK_MONOTONIC, &ts_end);

double time_spent = (ts_end.tv_sec - ts_start.tv_sec) +
(ts_end.tv_nsec - ts_start.tv_nsec) / 1e9;

printf("\n--- CPU Results ---\n");
printf("Total Time: %.2f seconds\n", time_spent);
printf("Ops/Sec: %.2f\n", (double)(num_threads * g_iterations_per_thread) / time_spent);

EVP_PKEY_free(g_pkey);
return 0;
}