A transparent, high-performance C/C++ layer designed to radically accelerate CPU-based Large Language Model inference (Ollama, llama.cpp, vLLM) without any code changes or API modifications.
It functions as a native runtime interceptor on Linux, macOS, and Windows, transparently optimizing critical bottlenecks under the hood.
- Zero-Config Integration: Works as a drop-in wrapper. No codebase changes required.
- Smart Memory Pooling: Intercepts tensor allocations dynamically, aligning them to 64-byte cache lines (`posix_memalign`/`VirtualAlloc`) to eliminate CPU false sharing.
- Core Pinning & NUMA Awareness: Hooks `pthread_create` to prevent kernel thread migration, keeping L1/L2 caches hot.
- KV-Cache Optimization: Groundwork for detecting and mapping KV-cache buffers cleanly.
- Cross-Platform: Fully native hooking on Linux (`LD_PRELOAD`), macOS (`DYLD_INTERPOSE`), and Windows (custom IAT injector).
Prerequisites:

- CMake (v3.10+)
- C Compiler (GCC, Clang, or MSVC)
1. Clone the repository:
```sh
git clone https://github.com/overseek944/CoreFlux.git
cd CoreFlux
```

2. Compile using the provided scripts:
On Windows:
Simply double-click the `build.bat` file in your folder, or run it from the terminal:

```bat
build.bat
```

On Linux / macOS: run the provided shell script:
```sh
chmod +x build.sh
./build.sh
```

You can use CoreFlux transparently with any existing LLM application (e.g., llama.cpp, Ollama, vLLM). No configuration or flags are needed.
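This transparency comes from the dynamic loader: a preloaded library is mapped into the target process, and its constructors run before `main()` even starts. A minimal sketch of the mechanism (the file names `hook.c`/`libhook.so` are hypothetical, not part of CoreFlux):

```c
/* hook.c - build as a shared object and preload it into any program:
 *   cc -shared -fPIC hook.c -o libhook.so
 *   LD_PRELOAD=./libhook.so ls
 * The constructor below fires inside the target process before its
 * main() runs - that is the entire injection mechanism on Linux. */
#include <stdio.h>

static int hook_loaded = 0;

__attribute__((constructor))
static void hook_init(void) {
    hook_loaded = 1;
    fprintf(stderr, "[hook] injected before main()\n");
}
```

From there, any symbol the library defines (e.g., `malloc`) shadows the libc version for the whole process, which is how call interception works without code changes.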
On Linux: pre-load the shared library before launching your target executable.
```sh
LD_PRELOAD=/path/to/CoreFlux/build/libcpu_llm_accel.so ./llama.cpp -m model.gguf
```

On macOS: inject the dynamic library using Apple's preloading mechanism.
```sh
DYLD_INSERT_LIBRARIES=/path/to/CoreFlux/build/libcpu_llm_accel.dylib ./llama.cpp -m model.gguf
```

On Windows: use the built-in injector wrapper `accel_run.exe` to launch your target application. Ensure `cpu_llm_accel.dll` is in the same directory as the injector.
```bat
accel_run.exe llama.cpp -m model.gguf
```

This library operates similarly to jemalloc or OpenBLAS: it wraps standard OS memory and threading APIs. When tools like llama.cpp request heap space or background threads, the layer intercepts these calls, maps the CPU topology, and enforces strict physical-core affinity and memory boundaries tailored to LLM tensor workloads.
Contributions are welcome! Please read CONTRIBUTING.md before submitting pull requests.
Distributed under the MIT License. See LICENSE for more information.