
cascadeflow Quick Start Guide

Get started with cascadeflow in 5 minutes. This guide walks you through the basics of intelligent model cascading.


What is cascadeflow?

cascadeflow is an intelligent model router that saves you 40-60% on AI costs by automatically using cheaper models when possible and only escalating to expensive models when needed.

The Problem

Using GPT-4o for everything is expensive:

10,000 queries/month × $0.005/query = $50/month

But using GPT-4o-mini for everything sacrifices quality.

The Solution

cascadeflow tries the cheap model first, checks quality, and only uses the expensive model if needed:

Simple query → GPT-4o-mini ✅ (draft accepted) → Cost: $0.0004
Complex query → GPT-4o-mini ❌ (draft rejected) → GPT-4o ✅ → Cost: $0.006

Result: 40-60% savings while maintaining quality!
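
Where does that range come from? A back-of-the-envelope estimate, using the illustrative per-query costs above and an assumed 70% draft acceptance rate:

# Blended cost estimate (illustrative costs from above; the 70%
# acceptance rate is an assumption, not a measurement)
accepted, escalated, baseline = 0.0004, 0.006, 0.005
acceptance_rate = 0.70

blended = acceptance_rate * accepted + (1 - acceptance_rate) * escalated
print(f"Cost per query: ${blended:.5f}")                        # $0.00208
print(f"Savings vs GPT-4o only: {1 - blended / baseline:.0%}")  # 58%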


Installation

Step 1: Install cascadeflow

pip install "cascadeflow[all]"

Step 2: Set Up API Key

# OpenAI
export OPENAI_API_KEY="sk-..."

# Or add to your .env file
echo "OPENAI_API_KEY=sk-..." >> .env

Step 3: Verify Installation

python -c "import cascadeflow; print(cascadeflow.__version__)"

Your First Cascade

Create a file called my_first_cascade.py:

import asyncio
from cascadeflow import CascadeAgent, ModelConfig

async def main():
    # Configure cascade with two tiers
    agent = CascadeAgent(models=[
        # Tier 1: Cheap model (tries first)
        ModelConfig(
            name="gpt-4o-mini",
            provider="openai",
            cost=0.000375,  # $0.375 per 1M tokens (blended)
        ),

        # Tier 2: Expensive model (only if needed)
        ModelConfig(
            name="gpt-4o",
            provider="openai",
            cost=0.00625,  # $6.25 per 1M tokens (blended)
        ),
    ])
    # Quality validation uses default cascade-optimized config (0.7 threshold)
    # See "Quality Configuration" section below to customize

    # Try a simple query
    result = await agent.run("What color is the sky?")

    print(f"Response: {result.content}")
    print(f"Model used: {result.model_used}")
    print(f"Cost: ${result.total_cost:.6f}")
    print(f"Draft accepted: {result.draft_accepted}")

if __name__ == "__main__":
    asyncio.run(main())

Run it:

python my_first_cascade.py

Expected output:

Response: The sky is typically blue during the day.
Model used: gpt-4o-mini
Cost: $0.000014
Draft accepted: True

What happened?

  1. Query sent to GPT-4o-mini (cheap)
  2. Response passed quality check
  3. GPT-4o was NOT called (saved money!)

How It Works

The Cascade Process

┌─────────────────┐
│  Your Query     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Complexity     │ ─────► Simple/Moderate/Complex
│  Detection      │
└────────┬────────┘
         │
         ▼
   ┌─────────────┐
   │ Direct to   │ ───► Very simple → GPT-4o-mini only
   │ GPT-4o-mini?│ ───► Very complex → GPT-4o directly
   └──────┬──────┘
          │ Maybe cascade
          ▼
┌─────────────────┐
│ GPT-4o-mini     │ ────► Generate response
│ Draft           │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Quality Check   │ ────► Confidence > threshold?
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
  PASS      FAIL
    │         │
    │    ┌────────────────┐
    │    │  GPT-4o Verify │
    │    └────────┬───────┘
    │             │
    └─────────────┘
         │
         ▼
   ┌──────────────┐
   │  Final       │
   │  Response    │
   └──────────────┘

Key Concepts

1. Draft Model

  • Purpose: Try to answer with cheap model
  • Cost: Low (~$0.000375 per 1K tokens)
  • Speed: Fast
  • Quality: Good for simple queries

2. Verifier Model

  • Purpose: Verify draft or handle complex queries
  • Cost: Higher (~$0.00625 per 1K tokens)
  • Speed: Slower
  • Quality: Best quality

3. Quality Check

  • Checks: Confidence score, alignment, coherence
  • Threshold: Configurable (default: 0.7)
  • Result: Pass → use draft, Fail → use verifier

4. Draft Accepted vs Rejected

Draft Accepted

  • Cheap model response is good enough
  • Verifier is NOT called
  • Cost = cheap model only
  • This is where you save money!

Draft Rejected

  • Cheap model response not good enough
  • BOTH models are called
  • Cost = cheap + expensive
  • Quality is ensured
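
You can see which path a query took from the result object; a minimal sketch using the result fields from the first example:

result = await agent.run(query)

if result.draft_accepted:
    # Cheap path: only the draft model was billed
    print(f"Draft accepted ({result.model_used}): ${result.total_cost:.6f}")
else:
    # Escalated path: both models were billed
    print(f"Escalated to {result.model_used}: ${result.total_cost:.6f}")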

Understanding Costs

Token-Based Pricing

cascadeflow uses actual token-based pricing, not flat rates:

# Your query
query = "What is Python?"  # ~4 input tokens

# Model's response
response = "Python is a programming language..."  # ~50 output tokens

# Total tokens: 4 input + 50 output = 54
input_tokens, output_tokens = 4, 50

# Cost calculation (GPT-4o-mini: $0.00015/1K input, $0.0006/1K output)
input_cost  = (input_tokens / 1000) * 0.00015   # $0.0000006
output_cost = (output_tokens / 1000) * 0.0006   # $0.0000300
total_cost  = input_cost + output_cost          # $0.0000306

Cost Breakdown by Scenario

Scenario 1: Draft Accepted (Best Case)

Query → GPT-4o-mini ✅ (accepted)

Costs:
  GPT-4o-mini: $0.000031
  GPT-4o:      $0.000000 (not called)
  ─────────────────
  Total:       $0.000031

Savings: ~95% vs GPT-4o only

Scenario 2: Draft Rejected (Worst Case)

Query → GPT-4o-mini ❌ (rejected) → GPT-4o ✅

Costs:
  GPT-4o-mini: $0.000031
  GPT-4o:      $0.000650
  ─────────────────
  Total:       $0.000681

Savings: -5% vs GPT-4o only (paid extra for GPT-4o-mini)

Scenario 3: Direct Routing

Query → GPT-4o directly (complex query)

Costs:
  GPT-4o-mini: $0.000000 (not called)
  GPT-4o:      $0.000650
  ─────────────────
  Total:       $0.000650

Savings: 0% (same as GPT-4o only)

Expected Savings

Your savings depend on your query mix:

Query Mix                  Draft Acceptance Rate   Expected Savings
80% simple, 20% complex    80%                     60-70%
50% simple, 50% complex    50%                     40-50%
20% simple, 80% complex    20%                     10-20%

Rule of thumb: The more simple queries in your mix, the more you save!
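
The table's estimates can be roughly reproduced from the scenario costs above (this sketch ignores direct routing, so it will not match the table exactly):

# Blended savings from draft acceptance rate, using the illustrative
# per-query costs from the scenarios above
def expected_savings(acceptance_rate, accepted=0.000031,
                     rejected=0.000681, baseline=0.000650):
    blended = acceptance_rate * accepted + (1 - acceptance_rate) * rejected
    return 1 - blended / baseline

for rate in (0.8, 0.5, 0.2):
    print(f"{rate:.0%} acceptance → ~{expected_savings(rate):.0%} savings")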


Configuration Options

Model Configuration

ModelConfig(
    name="gpt-4o-mini",             # Model name
    provider="openai",              # Provider (openai, anthropic, groq, ollama)
    cost=0.000375,                  # Cost per 1K tokens (blended estimate)
    speed_ms=500,                   # Expected latency (optional)
    supports_tools=True,            # Whether model supports tool calling (optional)
)
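
The provider field also accepts anthropic, groq, and ollama. A maximum-savings pairing (Llama draft, GPT-4o verifier, as recommended under Best Practices below) might look like the sketch here; the Groq model name and cost figure are assumptions you should adapt to your account:

agent = CascadeAgent(models=[
    # Tier 1: very cheap, very fast draft
    # (hypothetical Groq model name and blended cost estimate)
    ModelConfig(
        name="llama-3.1-8b-instant",
        provider="groq",
        cost=0.000065,  # rough blended estimate per 1K tokens
    ),
    # Tier 2: strong verifier
    ModelConfig(
        name="gpt-4o",
        provider="openai",
        cost=0.00625,
    ),
])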

Agent Configuration

agent = CascadeAgent(
    models=[tier1, tier2],          # List of models (ordered by cost)
    verbose=True,                   # Enable logging
    enable_cascade=True,            # Enable cascade system
)

Quality Configuration

Quality validation is controlled via QualityConfig, not individual models:

from cascadeflow import CascadeAgent, ModelConfig, QualityConfig

# Option 1: Use preset configurations
agent = CascadeAgent(
    models=[...],
    quality_config=QualityConfig.for_cascade()     # Optimized for cascading (default)
)

# Option 2: Use other presets
quality_config = QualityConfig.for_production()    # Balanced (0.80 threshold)
quality_config = QualityConfig.for_development()   # Lenient (0.65 threshold)
quality_config = QualityConfig.strict()            # Rigorous (0.95 threshold)

# Option 3: Customize thresholds by complexity
quality_config = QualityConfig(
    confidence_thresholds={
        'trivial': 0.6,    # Very simple queries
        'simple': 0.7,     # Simple queries
        'moderate': 0.75,  # Moderate complexity
        'hard': 0.8,       # Hard queries
        'expert': 0.85     # Expert-level queries
    }
)

agent = CascadeAgent(models=[...], quality_config=quality_config)

Quality Threshold Trade-offs:

  • Higher threshold (0.8+) → Better quality, fewer drafts accepted, lower savings
  • Medium threshold (0.7) → Balanced quality and savings (recommended)
  • Lower threshold (0.6 or below) → More drafts accepted, higher savings, occasional quality issues

Best Practices

1. Choose the Right Models

Good Combinations:

  • GPT-4o-mini → GPT-4o (balanced, recommended)
  • GPT-4o-mini → GPT-4 Turbo (quality-focused)
  • Llama 3.1 8B → GPT-4o (maximum savings)

Avoid:

  • Similar-tier models (GPT-4o-mini → GPT-3.5 Turbo)
  • Reverse ordering (GPT-4o → GPT-4o-mini)

2. Tune Quality Thresholds

Start with default (0.7) and adjust based on your needs:

# Track acceptance rates
results = []
for query in your_queries:
    result = await agent.run(query)
    results.append(result.draft_accepted)

acceptance_rate = sum(results) / len(results)
print(f"Draft acceptance rate: {acceptance_rate:.1%}")

If acceptance rate is:

  • < 30% → Lower threshold (0.6) or use a better draft model
  • 30-70% → Perfect! (balanced)
  • > 70% → Can raise threshold (0.75) for better quality (see the sketch below)
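
A minimal tuning sketch along those lines, assuming acceptance_rate was measured as above:

from cascadeflow import CascadeAgent, QualityConfig

# Pick a 'moderate' threshold based on the measured acceptance rate
# (the 0.6 / 0.75 values mirror the guidance above)
if acceptance_rate < 0.30:
    quality_config = QualityConfig(confidence_thresholds={'moderate': 0.6})
elif acceptance_rate > 0.70:
    quality_config = QualityConfig(confidence_thresholds={'moderate': 0.75})
else:
    quality_config = QualityConfig.for_cascade()  # default is fine

agent = CascadeAgent(models=[...], quality_config=quality_config)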

3. Monitor Costs

# Track costs over time
total_cost = 0
for query in your_queries:
    result = await agent.run(query)
    total_cost += result.total_cost

print(f"Total cost: ${total_cost:.6f}")
print(f"Average per query: ${total_cost/len(your_queries):.6f}")

4. Handle Failures Gracefully

try:
    result = await agent.run(query)
except Exception as e:
    print(f"Error: {e}")
    # Fallback logic here
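
One possible shape for that fallback logic, retrying once before giving up (a sketch, not a prescribed pattern):

import asyncio

async def run_with_fallback(agent, query, retries=1):
    """Retry transient failures, then return None so the caller can degrade gracefully."""
    for attempt in range(retries + 1):
        try:
            return await agent.run(query)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            await asyncio.sleep(1)  # brief backoff before retrying
    return None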

5. Use Appropriate Max Tokens

# Short responses (save cost)
result = await agent.run(query, max_tokens=50)

# Medium responses (balanced)
result = await agent.run(query, max_tokens=150)

# Long responses (quality)
result = await agent.run(query, max_tokens=500)

Troubleshooting

Issue: All Queries Go to Expensive Model

Symptoms:

  • Draft acceptance rate < 10%
  • Costs almost same as GPT-4 only

Solutions:

  1. Lower quality threshold via QualityConfig:
    quality_config = QualityConfig(confidence_thresholds={'moderate': 0.6})
    agent = CascadeAgent(models=[...], quality_config=quality_config)
  2. Use a better draft model: Try GPT-4o-mini (already the recommended default)
  3. Check query complexity: Ensure you have simple queries in your mix

Issue: Poor Quality Responses

Symptoms:

  • Draft acceptance rate > 80%
  • Responses are incorrect or low quality

Solutions:

  1. Raise quality threshold via QualityConfig:
    quality_config = QualityConfig(confidence_thresholds={'moderate': 0.75})
    agent = CascadeAgent(models=[...], quality_config=quality_config)
  2. Use a better verifier model: Try GPT-4o instead of GPT-4
  3. Enable verbose mode to see quality scores: verbose=True

Issue: High Latency

Symptoms:

  • Responses take too long
  • Users complaining about wait times

Solutions:

  1. Use faster models: Groq Llama for draft, GPT-4o-mini for verifier
  2. Enable streaming: enable_streaming=True
  3. Reduce max_tokens: max_tokens=100
  4. Skip cascade for time-critical queries

Issue: Costs Higher Than Expected

Symptoms:

  • Savings < 30%
  • Many drafts rejected

Possible Causes:

  1. Query mix too complex (mostly hard queries)
  2. Quality threshold too high (rejecting good drafts)
  3. Token estimates inaccurate

Solutions:

  1. Analyze your query complexity distribution
  2. Lower quality threshold slightly
  3. Use cheaper draft model (Groq Llama, Ollama)

Next Steps

1. Run the Basic Example

python examples/basic_usage.py

2. Customize for Your Use Case

  • Modify models
  • Adjust thresholds
  • Add your queries

3. Read Advanced Guides

4. Deploy to Production

  • Set up monitoring
  • Configure logging
  • Implement fallbacks
  • Track costs

5. Join the Community


Quick Reference

Common Commands

# Install
pip install "cascadeflow[all]"

# Run example
python examples/basic_usage.py

# Check version
python -c "import cascadeflow; print(cascadeflow.__version__)"

# Run with verbose logging
python examples/basic_usage.py --verbose

Code Snippets

Basic Usage:

from cascadeflow import CascadeAgent, ModelConfig

agent = CascadeAgent(models=[
    ModelConfig("gpt-4o-mini", "openai", cost=0.000375),
    ModelConfig("gpt-4o", "openai", cost=0.00625),
])

result = await agent.run("Your query here")

Check Result:

print(f"Response: {result.content}")
print(f"Model: {result.model_used}")
print(f"Cost: ${result.total_cost:.6f}")
print(f"Draft accepted: {result.draft_accepted}")

Track Costs:

total = sum(r.total_cost for r in results)
print(f"Total: ${total:.6f}")

Support

Need help?


Happy Cascading! 🌊