Cascade Engine

This repository provides the code for our framework Cascade Engine: A Multi-Tier, Intelligent Routing Framework for Cost-Effective LLM Inference. In this README, we guide you through installing the library, using the intelligent cascade routing effectively, and reproducing our results.

Installation

Follow these steps to set up the environment and install the necessary dependencies.

Clone the repository:

git clone https://github.com/your-username/cascade-engine.git
cd cascade-engine

Create and activate the environment:

python -m venv venv
source venv/bin/activate

Install the package and dependencies:

pip install -r python_core/requirements.txt

Download required models (for intelligent layers):

python -m spacy download en_core_web_lg

Getting Started

This library allows you to run multi-tier cascading strategies with intelligent preprocessing layers for any downstream LLM application.

Step 1. Initialize the Intelligent Layers

Cascade Engine relies on preprocessing layers to aggressively filter and cache queries before they reach expensive models.

from python_core.router.intelligent_layers import SemanticCache, PrivacyFilter, Gatekeeper

# 1. Initialize Privacy Filter (Presidio) to redact PII
privacy_filter = PrivacyFilter()

# 2. Initialize Semantic Cache (FAISS) for exact/close matches
cache = SemanticCache(threshold=0.85)

# 3. Initialize Gatekeeper (DistilBERT + VADER) for complexity routing
gatekeeper = Gatekeeper()

Step 2. Initialize the Engines

Define the tiers of models you want to use. We typically use a 3-tier system: Local, Mid-Cloud, and Premium-Cloud.

from python_core.engines.local_engine import OllamaEngine
from python_core.engines.cloud_engine import OpenAIEngine

tier1 = OllamaEngine(model_name="llama3.2:3b")
tier2 = OpenAIEngine(model_name="gpt-4o-mini", api_key="YOUR_KEY")
tier3 = OpenAIEngine(model_name="gpt-4o", api_key="YOUR_KEY")

engines = [tier1, tier2, tier3]

Step 3. Run Queries through the Router

You can now use the CascadeRouter or FrugalRouter to process queries dynamically based on complexity and confidence.

from python_core.router.cascade_router import FrugalRouter

router = FrugalRouter(
    engines=engines,
    privacy_filter=privacy_filter,
    cache=cache,
    gatekeeper=gatekeeper,
    confidence_threshold=0.9
)

query = "Write a python script to reverse a linked list."

# The router will automatically triage through cache, privacy, and the required model tier
response, metadata = router.predict(query)

print(response)
print(f"Routed to: {metadata['engine_used']}")
print(f"Cost: ${metadata['cost']}")

Architecture & Request Flow

When a user submits a query, it undergoes a sequential triage process designed to minimize costs while maximizing safety and speed. This ensures that expensive premium models are only called when absolutely necessary.

flowchart TD
    %% Define styles
    classDef request fill:#f9f9f9,stroke:#333,stroke-width:2px;
    classDef filter fill:#e1f5fe,stroke:#0288d1,stroke-width:2px;
    classDef router fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    classDef tier1 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px;
    classDef tier2 fill:#fff8e1,stroke:#fbc02d,stroke-width:2px;
    classDef tier3 fill:#ffebee,stroke:#d32f2f,stroke-width:2px;
    
    A(["User Request"]):::request --> B["PrivacyFilter <br/>(Presidio)"]:::filter
    B --> C{"SemanticCache <br/>(FAISS)"}:::filter
    
    C -- "Cache Hit" --> D(["Return Cached Response"]):::request
    
    C -- "Cache Miss" --> E["Gatekeeper & Intent <br/>(DistilBERT + VADER)"]:::filter
    E --> F{"Routing Decision"}:::router
    
    F -- "Simple / Factual" --> G["Tier 1: Local Model <br/>llama3.2:3b"]:::tier1
    F -- "Moderate" --> H["Tier 2: Mid-Cloud <br/>gpt-4o-mini"]:::tier2
    F -- "Complex Reasoning" --> I["Tier 3: Premium <br/>gpt-4o"]:::tier3
    
    G --> J[("Update Cache")]
    H --> J
    I --> J
    J --> K(["Final Response"]):::request
    D --> K

Reproducing Results

To reproduce the results presented in the paper, including the Pareto frontier evaluations on Alpaca-Eval and the ablation studies:

1. Run the Test Suite

Ensure all components are functioning correctly:

# Fast tests (skips loading heavy models)
pytest -m "not heavy"

# Full test suite
pytest

2. Run Benchmarks

Run the experiment script to execute the inference pipeline against the baseline models (e.g., RouteLLM):

python python_core/scripts/run_experiment.py

This will create a timestamped folder inside the results/ directory containing pareto.csv and manifest.json.

3. Generate Paper Figures

Once your experiments have finished, you can generate the exact PDF plots used in the manuscript:

python -m python_core.scripts.make_figures results/<YOUR_TIMESTAMP_DIR>

Code Structure

Below is a high-level overview of the code in this repository:

python_core/engines/: Connectors to the underlying LLMs. local_engine.py handles local open-source models (via Ollama), while cloud_engine.py handles standard APIs (OpenAI).
python_core/router/: The core logic of the framework.
- cascade_router.py: The Frugal and base routing logic.
- learned_router.py: Implementation of Contextual Discounted Thompson Sampling (CD-TS).
- intelligent_layers.py: The preprocessing modules (SemanticCache, PrivacyFilter, Gatekeeper).
- benchmark.py: Evaluates the routers against Alpaca-Eval.
python_core/scripts/: Experiment execution (run_experiment.py) and visualization generation (make_figures.py).
paper/: The LaTeX source files and a compiled Markdown version of our academic manuscript.

Citation

If you use this codebase or find our framework useful, please cite our paper:

@article{cascadeengine2026,
  title={Cascade Engine: A Multi-Tier, Intelligent Routing Framework for Cost-Effective LLM Inference},
  author={Nabin Prasad Dev},
  year={2026},
  journal={arXiv preprint}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.github/workflows		.github/workflows
notebooks		notebooks
paper		paper
python_core		python_core
typescript_api		typescript_api
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
STATUS.md		STATUS.md
docker-compose.yaml		docker-compose.yaml
pytest.ini		pytest.ini
run_dev.sh		run_dev.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cascade Engine

Installation

Getting Started

Step 1. Initialize the Intelligent Layers

Step 2. Initialize the Engines

Step 3. Run Queries through the Router

Architecture & Request Flow

Reproducing Results

1. Run the Test Suite

2. Run Benchmarks

3. Generate Paper Figures

Code Structure

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Cascade Engine

Installation

Getting Started

Step 1. Initialize the Intelligent Layers

Step 2. Initialize the Engines

Step 3. Run Queries through the Router

Architecture & Request Flow

Reproducing Results

1. Run the Test Suite

2. Run Benchmarks

3. Generate Paper Figures

Code Structure

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages