9 changes: 9 additions & 0 deletions .gitignore
@@ -0,0 +1,9 @@
*.o
*.a
*.so
build/
cupqc_sdk/
*.pem
*.crt
*.key
*.txt
220 changes: 218 additions & 2 deletions README.md
@@ -1,2 +1,218 @@
# cuSSL
cuPQC provider for OpenSSL
# cuSSL: GPU-Accelerated ML-KEM-768 Integration for OpenSSL 3.5

**Hardware-Accelerated Post-Quantum Cryptography using NVIDIA cuPQC and CUDA**

cuSSL is a high-performance runtime and backend that offloads **ML-KEM-768** Key Encapsulation operations from OpenSSL 3.5 to NVIDIA GPUs. It integrates directly into the OpenSSL cryptographic core and enables high-throughput TLS 1.3 Post-Quantum handshakes.

cuSSL implements a **Split-Stack Architecture**, cleanly separating the OpenSSL cryptographic core (CPU/C) from the GPU execution backend (CUDA/C++), ensuring ABI stability, thread safety, and memory isolation.

---

## Features

* GPU-accelerated ML-KEM-768 encapsulation using NVIDIA cuPQC
* Asynchronous batching runtime (up to 512 concurrent operations)
* Direct integration into OpenSSL 3.5 cryptographic core
* Thread-safe job queue and runtime scheduler
* Secure memory isolation between OpenSSL and GPU runtime
* Automatic CPU fallback when GPU offload is disabled
* Clean patch-based integration (no OpenSSL source redistribution)

---

## Architecture

cuSSL operates in three layers:

### 1. OpenSSL Integration Layer (Client)

**File:** `crypto/ml_kem/ml_kem.c` (patched)

Responsibilities:

* Intercepts ML-KEM encapsulation requests
* Submits jobs via cuSSL runtime API
* Uses OpenSSL Async Job framework (`ASYNC_pause_job`)
* Maintains full compatibility with OpenSSL execution model
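
Below is a hedged sketch of how this layer might submit work and yield. `ASYNC_get_current_job()` and `ASYNC_pause_job()` are real OpenSSL APIs; `cussl_submit_encaps()`, `cussl_job_done()`, and `ml_kem_encaps_cpu()` are illustrative names, not the actual symbols in the patch.

```
/* Illustrative only: everything except the OpenSSL ASYNC_* calls is an
 * assumption, not the real patched ml_kem.c. */
#include <stdint.h>
#include <stdlib.h>
#include <openssl/async.h>

/* Hypothetical runtime/fallback hooks (declarations only, for illustration). */
int cussl_submit_encaps(const uint8_t *ek, uint8_t *ct, uint8_t *ss);
int cussl_job_done(int ticket);
int ml_kem_encaps_cpu(const uint8_t *ek, uint8_t *ct, uint8_t *ss);

static int ml_kem_768_encaps(const uint8_t *ek, uint8_t *ct, uint8_t *ss)
{
    ASYNC_JOB *job = ASYNC_get_current_job();

    /* No async job or offload disabled: take the stock CPU path. */
    if (job == NULL || getenv("ENABLE_CUPQC") == NULL)
        return ml_kem_encaps_cpu(ek, ct, ss);

    int ticket = cussl_submit_encaps(ek, ct, ss);   /* enqueue on the runtime */
    if (ticket < 0)
        return ml_kem_encaps_cpu(ek, ct, ss);       /* queue full: fall back  */

    while (!cussl_job_done(ticket))
        ASYNC_pause_job();   /* yield to OpenSSL until the GPU batch completes */

    return 1;
}
```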

---

### 2. cuSSL Runtime Layer (Manager)

**File:** `src/cupqc_runtime.c`

Responsibilities:

* Thread-safe batching queue
* Job scheduling and worker thread management
* Memory isolation between OpenSSL and CUDA
* Async job coordination

This layer acts as the bridge between OpenSSL and GPU backend.
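
As an illustration of the batching queue, here is a minimal sketch; the struct layout, names, and submit logic are assumptions based on the description above, not the actual `cupqc_runtime.c` internals. Only the 512-slot capacity comes from the feature list.

```
/* Minimal sketch of a thread-safe batch queue; names and layout are
 * assumptions, not the real cupqc_runtime.c data structures. */
#include <pthread.h>
#include <stdint.h>

#define CUSSL_MAX_BATCH 512            /* matches the 512-slot queue above */

typedef struct {
    const uint8_t *ek;                 /* ML-KEM-768 public key (1184 bytes) */
    uint8_t *ct;                       /* ciphertext out (1088 bytes)        */
    uint8_t *ss;                       /* shared secret out (32 bytes)       */
    int done;                          /* set by the worker on completion    */
} cussl_job;

typedef struct {
    cussl_job jobs[CUSSL_MAX_BATCH];
    int count;                         /* jobs queued since the last flush   */
    pthread_mutex_t lock;
    pthread_cond_t wake;               /* wakes the GPU worker thread        */
} cussl_queue;

/* Called from the OpenSSL side; returns a slot index ("ticket") or -1. */
static int queue_submit(cussl_queue *q, const uint8_t *ek, uint8_t *ct, uint8_t *ss)
{
    pthread_mutex_lock(&q->lock);
    if (q->count == CUSSL_MAX_BATCH) {
        pthread_mutex_unlock(&q->lock);
        return -1;                     /* queue full: caller falls back to CPU */
    }
    int slot = q->count++;
    q->jobs[slot] = (cussl_job){ .ek = ek, .ct = ct, .ss = ss, .done = 0 };
    pthread_cond_signal(&q->wake);     /* wake threshold of 1 (see Performance) */
    pthread_mutex_unlock(&q->lock);
    return slot;
}
```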

---

### 3. CUDA Backend Layer (Worker)

**File:** `src/cupqc_shim.cu`

Responsibilities:

* Executes batched ML-KEM-768 encapsulation
* Launches cuPQC CUDA kernels
* Manages persistent GPU memory buffers
* Performs host/device memory transfers

Uses `cupqc::ML_KEM_768` from the NVIDIA cuPQC SDK.
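
The shim exposes a plain C boundary to the runtime so the CUDA/C++ side stays isolated. The declarations below are a hypothetical sketch of that boundary, not the actual exports of `cupqc_shim.cu`.

```
/* Hypothetical host-facing interface of the CUDA shim; the real shim wraps
 * cupqc::ML_KEM_768 kernel launches behind extern "C" entry points. */
#include <stddef.h>
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

/* One-time setup/teardown of persistent device buffers and CUDA streams. */
int  cussl_gpu_init(size_t max_batch);
void cussl_gpu_shutdown(void);

/* Encapsulate `batch` keys in one launch. Buffers are contiguous,
 * batch-major host arrays staged to and from persistent device memory:
 *   ek: batch * 1184 bytes, ct: batch * 1088 bytes, ss: batch * 32 bytes. */
int cussl_gpu_encaps_batch(const uint8_t *ek, uint8_t *ct, uint8_t *ss, size_t batch);

#ifdef __cplusplus
}
#endif
```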

---

## Prerequisites

Hardware:

* NVIDIA GPU (Turing / Ampere / Ada or newer)
* Compute Capability ≥ 7.5

Software:

* Linux (Ubuntu 20.04 / 22.04 recommended)
* OpenSSL 3.5.0 source
* NVIDIA CUDA Toolkit (12+)
* NVIDIA cuPQC SDK
* GCC 9+
* NVCC compiler

---

## Build Instructions

### 1. Set Environment Variables

```
export CUPQC_HOME=/path/to/cupqc_sdk
export OPENSSL_ROOT=/path/to/openssl-3.5.0
```

---

### 2. Compile cuSSL Runtime and CUDA Backend

Compile runtime:
```
gcc -c src/cupqc_runtime.c -o cupqc_runtime.o -fPIC \
    -I${OPENSSL_ROOT}/include \
    -I${OPENSSL_ROOT}/crypto/ml_kem
```

Compile CUDA backend:

```
nvcc -c src/cupqc_shim.cu -o cupqc_shim.o \
    -rdc=true -dlto -std=c++17 \
    -I${CUPQC_HOME}/include \
    -Xcompiler -fPIC
```

Device link:
```
nvcc -dlink cupqc_shim.o -o cupqc_shim_dlink.o \
    -rdc=true -dlto \
    -L${CUPQC_HOME}/lib -lcupqc-pk
```

Final shared library:
```
g++ -shared -o libcussl.so \
    cupqc_runtime.o cupqc_shim.o cupqc_shim_dlink.o \
    -L${CUPQC_HOME}/lib -lcupqc-pk \
    -L/usr/local/cuda/lib64 -lcudart -lpthread
```

---

### 3. Apply OpenSSL Patch

From OpenSSL root:

```
patch -p1 < /path/to/cuSSL/openssl/patches/openssl-3.5.0-mlkem-cupqc.patch
```

Rebuild OpenSSL:

```
make -j$(nproc)
```
---

## Usage

Enable GPU offload:

```
export ENABLE_CUPQC=1
```

Run OpenSSL TLS server:

```
openssl s_server -accept 4433 -cert cert.pem -key key.pem -tls1_3 -groups mlkem768
```

## Verify Offload

Use `nvitop` or `nvidia-smi` to confirm GPU utilization during handshakes, for example while a client such as `openssl s_client -connect localhost:4433 -groups mlkem768` connects repeatedly to the server above.

---
Disable GPU offload:

```
unset ENABLE_CUPQC
```

OpenSSL will fall back to the CPU implementation automatically.

---

## Performance & Scaling

This engine offloads the heavy post-quantum math to the GPU. However, overall throughput depends heavily on the web server's architecture.

### Current Benchmark (Standard Nginx)
* **Rate:** ~500 Handshakes/Second
* **Architecture Limit:** Standard Nginx uses a multi-processing model (e.g., 32 isolated worker processes). Because memory is not shared between these workers, the engine's internal batch queue cannot easily aggregate hundreds of connections at once. To prevent deadlocks, the GPU wake threshold is set to `1`, meaning the GPU processes very small batches, causing high CPU overhead from frequent kernel launches.

### How to Scale to 2,000+ HS/s
To fully saturate the GPU and achieve maximum throughput, the engine needs to fill its 512-slot batch queue. This can be achieved through two potential upgrades:

1. **Async-Enabled Server:** Use a web server that supports OpenSSL's asynchronous features (like Intel's Async Nginx). This allows a *single* worker process to handle thousands of concurrent connections, naturally filling a single, massive GPU queue without blocking.
2. **Background Flush Timer:** Implement a POSIX timer thread inside `cupqc_runtime.c` that forces a queue flush every few milliseconds, ensuring that "leftover" connections do not deadlock when using larger batch thresholds across multiple Nginx workers (a minimal sketch follows below).
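
A minimal sketch of the flush-timer idea in option 2, reusing the hypothetical `cussl_queue` from the runtime sketch above; the 2 ms period and `cussl_flush_locked()` are illustrative assumptions, not existing code.

```
/* Illustrative flush thread for option 2; not part of the current code. */
#include <pthread.h>
#include <time.h>

/* Hypothetical helper that launches whatever is queued, even a partial
 * batch, while q->lock is held. cussl_queue is the type sketched above. */
void cussl_flush_locked(cussl_queue *q);

static void *cussl_flush_thread(void *arg)
{
    cussl_queue *q = arg;
    const struct timespec period = { .tv_sec = 0, .tv_nsec = 2 * 1000 * 1000 };

    for (;;) {
        nanosleep(&period, NULL);      /* wait ~2 ms between flushes */
        pthread_mutex_lock(&q->lock);
        if (q->count > 0)
            cussl_flush_locked(q);     /* drain leftovers so no worker stalls */
        pthread_mutex_unlock(&q->lock);
    }
    return NULL;
}
```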

## Security and Compatibility

cuSSL:

* Preserves OpenSSL security model
* Does not modify public OpenSSL APIs
* Uses isolated runtime
* Supports CPU fallback

Patch-based integration ensures maintainability across OpenSSL versions.

---

## Licensing

This repository contains only integration code.

It does NOT include:

* OpenSSL source code
* NVIDIA cuPQC SDK
* CUDA Toolkit

Users must obtain those separately under their respective licenses.

---

## Project Status

The engine is fully functional and architecturally stable. It successfully performs hardware-offloaded ML-KEM-768 key encapsulation for standard OpenSSL TLS 1.3 connections.

**Core achievements include:**
* **Correctness:** Validated bit-exact key exchange and successful handshake completion.
* **Architecture:** Strict separation of the OpenSSL API and GPU runtime for full library compliance.
* **Performance:** Asynchronous batching logic is implemented and operational, ready for multi-threaded deployment.
87 changes: 87 additions & 0 deletions benchmarks/benchmark_cpu.c
@@ -0,0 +1,87 @@
/* benchmark_cpu.c - Multi-Threaded CPU Benchmark */
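/* Assumed build command (adjust paths to the patched OpenSSL 3.5 tree):
 *   gcc benchmark_cpu.c -o benchmark_cpu \
 *       -I${OPENSSL_ROOT}/include -L${OPENSSL_ROOT} -lcrypto -lpthread */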
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <pthread.h>
#include <openssl/evp.h>
#include <openssl/err.h>

#define ALGO_NAME "ML-KEM-768"

// Shared benchmark state: initialized once in main(), read by all workers
int g_iterations_per_thread = 0;
EVP_PKEY *g_pkey = NULL;

// The Worker Thread Function
void *cpu_worker(void *arg) {
EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new(g_pkey, NULL);
if (!ctx) return NULL;

// We do NOT call Async init, so this runs on CPU Software path
EVP_PKEY_encapsulate_init(ctx, NULL);

unsigned char *secret = malloc(32);
unsigned char *ciphertext = malloc(1088);
size_t secret_len = 32;
size_t ciphertext_len = 1088;

for (int i = 0; i < g_iterations_per_thread; i++) {
secret_len = 32;
ciphertext_len = 1088;
EVP_PKEY_encapsulate(ctx, ciphertext, &ciphertext_len, secret, &secret_len);
}

free(secret);
free(ciphertext);
EVP_PKEY_CTX_free(ctx);
return NULL;
}

int main(int argc, char **argv) {
int num_threads = 4; // Default to 4 cores
int total_iters = 100000;

if (argc > 1) num_threads = atoi(argv[1]);
if (argc > 2) total_iters = atoi(argv[2]);

g_iterations_per_thread = total_iters / num_threads;

printf("Benchmarking Multi-Core CPU Performance\n");
printf("Algorithm: %s\n", ALGO_NAME);
printf("Threads: %d\n", num_threads);
printf("Total Ops: %d\n", num_threads * g_iterations_per_thread);

// Generate Key (Once)
EVP_PKEY_CTX *kctx = EVP_PKEY_CTX_new_from_name(NULL, ALGO_NAME, NULL);
EVP_PKEY_keygen_init(kctx);
EVP_PKEY_keygen(kctx, &g_pkey);
EVP_PKEY_CTX_free(kctx);

// Launch Threads
pthread_t threads[num_threads];
struct timespec ts_start, ts_end;
clock_gettime(CLOCK_MONOTONIC, &ts_start);

for (int i = 0; i < num_threads; i++) {
pthread_create(&threads[i], NULL, cpu_worker, NULL);
}

// Wait for Threads
for (int i = 0; i < num_threads; i++) {
pthread_join(threads[i], NULL);
}

clock_gettime(CLOCK_MONOTONIC, &ts_end);

double time_spent = (ts_end.tv_sec - ts_start.tv_sec) +
(ts_end.tv_nsec - ts_start.tv_nsec) / 1e9;

printf("\n--- CPU Results ---\n");
printf("Total Time: %.2f seconds\n", time_spent);
printf("Ops/Sec: %.2f\n", (double)(num_threads * g_iterations_per_thread) / time_spent);

EVP_PKEY_free(g_pkey);
return 0;
}