Complete guide to AI providers supported by cascadeflow and how to mix them effectively.
- Overview
- Supported Providers
- Provider Comparison
- Mixing Providers
- Cost Analysis
- Setup Guide
- Best Practices
- Troubleshooting
- LiteLLM Integration
## Overview

cascadeflow supports 7 AI providers, each with unique strengths. You can mix any combination of providers in a single cascade for optimal cost, speed, and quality.

Why mix providers?
- Cost Optimization - Start with free/cheap providers
- Quality Specialization - Use best provider for each task
- High Availability - Fallback if provider is down
- Speed Optimization - Fast drafts, accurate verification
- Compliance - Some providers better for regulated industries
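For example, mixing a free Groq drafter with a premium OpenAI verifier takes only a few lines (a minimal sketch using the CascadeAgent and ModelConfig APIs shown throughout this guide; the query is illustrative):

```python
from cascadeflow import CascadeAgent, ModelConfig

# Free Groq model drafts; the premium OpenAI model handles what it can't
agent = CascadeAgent(models=[
    ModelConfig("llama-3.1-8b-instant", provider="groq", cost=0),
    ModelConfig("gpt-4o", provider="openai", cost=0.00625),
])

result = await agent.run("Explain the difference between TCP and UDP")
print(f"Served by {result.model_used} for ${result.total_cost:.6f}")
```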
## Supported Providers

### OpenAI

Models: GPT-4o, GPT-4o-mini, GPT-4 Turbo
Strengths:
- ✅ Best overall quality
- ✅ Excellent tool/function calling
- ✅ Wide model selection
- ✅ Best for technical tasks
- ✅ 128K token context (GPT-4o)
Weaknesses:
- ❌ Most expensive ($0.00625/request for GPT-4o)
- ❌ Rate limits can be strict
- ❌ Slower than Groq
Best For: Code generation, technical Q&A, tool calling, general intelligence
Setup:
```bash
export OPENAI_API_KEY="sk-..."
```

```python
from cascadeflow import ModelConfig

model = ModelConfig(
    name="gpt-4o",
    provider="openai",
    cost=0.00625,  # Cost per request
)
```

### Anthropic (Claude)

Models: Claude Sonnet 4.5, Claude Haiku 4.5, Claude Opus 4.1
Strengths:
- ✅ Excellent reasoning ability
- ✅ Best for long context (200K tokens)
- ✅ Strong at analysis and writing
- ✅ Good for complex workflows
- ✅ More affordable than GPT-4o
Weaknesses:
- ❌ Mid-high cost ($0.003/request)
- ❌ Fewer model options
- ❌ Slower than Groq
Best For: Long document analysis, complex reasoning, writing tasks, research
Setup:
```bash
export ANTHROPIC_API_KEY="sk-ant-..."
```

```python
model = ModelConfig(
    name="claude-sonnet-4-5-20250929",
    provider="anthropic",
    cost=0.003,
)
```

### Groq

Models: Llama 3.1 (8B, 70B), Llama 3.3 70B, Mixtral, DeepSeek, Qwen
Strengths:
- ✅ Extremely fast (8x faster than others)
- ✅ FREE tier available
- ✅ Good for simple queries
- ✅ Low latency (200-300ms)
- ✅ Multiple model options
Weaknesses:
- ❌ Limited context (8K tokens)
- ❌ Lower quality on complex tasks
- ❌ No logprobs support
Best For: Simple queries, high-volume applications, fast responses, cost-sensitive workloads
Setup:
```bash
export GROQ_API_KEY="gsk_..."
```

```python
model = ModelConfig(
    name="llama-3.1-8b-instant",
    provider="groq",
    cost=0.0,  # FREE!
)
```

### Ollama

Models: Llama 3.2, Llama 3.1, Mistral, Phi, Qwen, etc.
Strengths:
- ✅ FREE (self-hosted)
- ✅ Privacy (data never leaves your machine)
- ✅ Works offline
- ✅ No rate limits
- ✅ Runs on consumer hardware
Weaknesses:
- ❌ Requires local setup
- ❌ Lower quality than cloud models
- ❌ Slower on CPU
- ❌ Limited context
Best For: Privacy-sensitive data, offline applications, development/testing, edge devices
Setup:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2

# Start the server
ollama serve
```

```python
model = ModelConfig(
    name="llama3.2:1b",
    provider="ollama",
    cost=0.0,  # FREE (self-hosted)
)
```

Auto-Discovery:
```python
# List all installed models dynamically
from cascadeflow.providers.ollama import OllamaProvider

provider = OllamaProvider()
models = await provider.list_models()
# Returns: ['llama3.2:1b', 'mistral:7b', 'codellama:latest', ...]

# Use the discovered models in a cascade
for model_name in models:
    agent.models.append(ModelConfig(
        name=model_name,
        provider="ollama",
        cost=0.0,
    ))
```

### vLLM

Models: Any HuggingFace model (Llama, Mistral, Qwen, etc.)
Strengths:
- ✅ Self-hosted (full control)
- ✅ Very cost-effective at scale
- ✅ High throughput
- ✅ Flexible model selection
Weaknesses:
- ❌ Requires infrastructure setup
- ❌ Needs GPU for good performance
- ❌ Maintenance overhead
Best For: High-volume production, cost optimization at scale, custom models
Setup:
```bash
# Start the vLLM server
vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8000
```

```python
model = ModelConfig(
    name="meta-llama/Llama-3.2-3B-Instruct",
    provider="vllm",
    cost=0.0001,  # Infrastructure costs
)
```

Auto-Discovery:
```python
# List all models served by vLLM
from cascadeflow.providers.vllm import VLLMProvider

provider = VLLMProvider(base_url="http://localhost:8000/v1")
models = await provider.list_models()
# Returns: ['meta-llama/Llama-3.2-3B-Instruct', ...]

# Dynamically configure the cascade from available models
for model_name in models:
    agent.models.append(ModelConfig(
        name=model_name,
        provider="vllm",
        cost=0.0001,
    ))
```

### HuggingFace

Models: 100,000+ open-source models
Strengths:
- ✅ Massive model selection
- ✅ Free tier available
- ✅ Easy to try new models
- ✅ Community support
Weaknesses:
- ❌ Variable quality
- ❌ Slower inference
- ❌ Rate limits on free tier
Best For: Experimentation, specialized models, research
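Setup (a hedged sketch mirroring the other providers; cascadeflow lists HuggingFace as a native provider, but the env var name, provider string, and example model here are assumptions):

```bash
export HF_TOKEN="hf_..."
```

```python
model = ModelConfig(
    name="mistralai/Mistral-7B-Instruct-v0.3",  # any HuggingFace model ID (example)
    provider="huggingface",  # assumed provider string
    cost=0.0005,  # typical cost from the comparison table below
)
```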
### Together AI

Models: Llama 3.1, Mixtral, DeepSeek, Qwen, etc.
Strengths:
- ✅ Good performance
- ✅ Competitive pricing
- ✅ Multiple model options
- ✅ Fast inference
Weaknesses:
- ❌ Less well known than other providers
- ❌ Smaller ecosystem
Best For: Cost-effective cloud inference, alternative to Groq
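Setup (a hedged sketch; the example model ID and per-request cost are illustrative):

```bash
export TOGETHER_API_KEY="..."
```

```python
model = ModelConfig(
    name="meta-llama/Llama-3.1-70B-Instruct-Turbo",  # example Together model ID
    provider="together",
    cost=0.0009,  # within the $0.0002-0.001 range from the comparison table
)
```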
## Provider Comparison

### Speed

| Rank | Provider | Typical Latency | Notes |
|---|---|---|---|
| 1 | Groq | 200-300ms | 8x faster than others |
| 2 | Together | 400-600ms | Good speed |
| 3 | OpenAI | 600-1500ms | Varies by model |
| 4 | Anthropic | 800-1200ms | Consistent |
| 5 | HuggingFace | 1000-3000ms | Variable |
| 6 | Ollama | 500-5000ms | Depends on hardware |
| 7 | vLLM | 300-2000ms | Depends on setup |
### Cost

| Rank | Provider | Typical Cost | Notes |
|---|---|---|---|
| 1 | Groq | $0.00 | FREE tier |
| 2 | Ollama | $0.00 | Self-hosted |
| 3 | vLLM | $0.0001 | Infrastructure costs |
| 4 | OpenAI Mini | $0.00015 | GPT-4o-mini |
| 5 | Together | $0.0002-0.001 | Competitive |
| 6 | HuggingFace | $0.0005 | Free tier available |
| 7 | Anthropic | $0.001-0.003 | Claude models |
| 8 | OpenAI | $0.00625 | GPT-4o premium |
### Quality

| Rank | Provider | Quality Score | Best For |
|---|---|---|---|
| 1 | OpenAI GPT-4o | 0.95 | Complex tasks |
| 2 | Anthropic Claude Sonnet 4.5 | 0.92 | Reasoning |
| 3 | OpenAI GPT-4o-mini | 0.88 | General tasks |
| 4 | Groq Llama 3.3 70B | 0.85 | Simple tasks |
| 5 | Together | 0.82 | Basic queries |
| 6 | Groq Llama 3.1 8B | 0.78 | Very simple |
| 7 | Ollama | 0.70-0.80 | Development |
### Context Windows

| Provider | Model | Max Context | Best For |
|---|---|---|---|
| Anthropic | Claude Sonnet 4.5 | 200K tokens | Long documents |
| OpenAI | GPT-4o | 128K tokens | Large context |
| Groq | Llama 3.3 | 32K tokens | Medium context |
| Groq | Llama 3.1 | 8K tokens | Short context |
| Ollama | Varies | 2K-32K tokens | Local processing |
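Context limits matter when mixing providers: a long document should skip models that cannot hold it. A minimal sketch of pre-filtering by context size (the helper and the 4-characters-per-token estimate are illustrative, not a cascadeflow API):

```python
# Max context per model, taken from the table above
MAX_CONTEXT = {
    "llama-3.1-8b-instant": 8_000,
    "llama-3.3-70b-versatile": 32_000,
    "gpt-4o": 128_000,
    "claude-sonnet-4-5-20250929": 200_000,
}

def models_that_fit(prompt: str, candidates: list[str]) -> list[str]:
    """Keep only models whose context window can hold the prompt."""
    est_tokens = len(prompt) // 4  # rough heuristic: ~4 characters per token
    return [m for m in candidates if MAX_CONTEXT.get(m, 0) >= est_tokens]
```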
## Mixing Providers

### Free-First Cascade

Goal: Maximum cost savings
```python
agent = CascadeAgent(models=[
    ModelConfig("llama-3.1-8b-instant", provider="groq", cost=0),
    ModelConfig("gpt-4o-mini", provider="openai", cost=0.00015),
    ModelConfig("gpt-4o", provider="openai", cost=0.00625),
])
```

When: High-volume applications (50K+ requests/month)
Savings: 70-98% vs all-premium
### Cross-Provider Verification

Goal: Quality assurance with cost savings
```python
agent = CascadeAgent(models=[
    # Fast drafter (Groq)
    ModelConfig("llama-3.1-70b-versatile", provider="groq", cost=0),
    # Premium verifier (Claude or GPT)
    ModelConfig("claude-sonnet-4-5-20250929", provider="anthropic", cost=0.003),
])
```

When: Quality-critical applications
Savings: 60-80% vs all-premium
### Provider Specialization

Goal: Optimize quality for each task type
```python
# Technical tasks → OpenAI
# Writing tasks → Anthropic
# Simple tasks → Groq
agent = CascadeAgent(models=[
    ModelConfig("llama-3.1-8b", provider="groq", cost=0),
    ModelConfig("gpt-4o", provider="openai", cost=0.00625),
    ModelConfig("claude-sonnet-4-5-20250929", provider="anthropic", cost=0.003),
])
```

When: Diverse workload mix
Savings: 40-70% vs all-premium
### Multi-Provider Redundancy

Goal: High availability with redundancy
```python
# Multiple providers at the same tier
agent = CascadeAgent(models=[
    ModelConfig("gpt-4o-mini", provider="openai", cost=0.00015),
    ModelConfig("claude-3-5-haiku", provider="anthropic", cost=0.001),
    ModelConfig("llama-3.1-70b", provider="groq", cost=0),
])
```

When: Production systems with SLA requirements
Benefit: Automatic fallback if provider down
## Cost Analysis

Estimated monthly cost at 100K requests/month:

| Strategy | Configuration | Monthly Cost | Savings |
|---|---|---|---|
| All Premium | GPT-4o only | $625 | 0% (baseline) |
| Single Provider | GPT-4o-mini only | $15 | 98% |
| Free-First | Groq→Mini→GPT-4o | $32 | 95% |
| Cross-Provider | Groq→Claude | $45 | 93% |
| Specialization | Mixed routing | $85 | 86% |
| All Free | Groq→Ollama | $0 | 100% |
Example breakdown, assuming 100K total requests:

- Simple queries (50%): 50K × $0 (Groq) = $0
- Moderate (30%): 30K × $0.00015 (GPT-4o-mini) = $4.50
- Complex (15%): 15K × $0.00625 (GPT-4o) = $93.75
- Very complex (5%): 5K × $0.00625 (GPT-4o) = $31.25

Total: $129.50/month (79% savings vs all-GPT-4o)
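The same arithmetic as a short script, handy for plugging in your own traffic mix (a minimal sketch; the percentages are the illustrative ones above):

```python
# Blended monthly cost for 100K requests under the mix above
requests = 100_000
mix = [
    (0.50, 0.0),      # simple → Groq (free)
    (0.30, 0.00015),  # moderate → GPT-4o-mini
    (0.15, 0.00625),  # complex → GPT-4o
    (0.05, 0.00625),  # very complex → GPT-4o
]

total = sum(requests * share * cost for share, cost in mix)
baseline = requests * 0.00625  # all GPT-4o

print(f"Blended cost: ${total:.2f}/month")                 # $129.50
print(f"Savings vs baseline: {1 - total / baseline:.0%}")  # 79%
```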
## Setup Guide

```bash
# 1. Install cascadeflow
pip install cascadeflow[all]

# 2. Set API keys (all optional)
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."

# 3. Run the multi-provider example
python examples/multi_provider.py
```

```python
# Works with just OpenAI
agent = CascadeAgent(models=[
    ModelConfig("gpt-4o-mini", provider="openai", cost=0.00015),
    ModelConfig("gpt-4o", provider="openai", cost=0.00625),
])
```

```python
# Groq + OpenAI (best value)
agent = CascadeAgent(models=[
    ModelConfig("llama-3.1-8b-instant", provider="groq", cost=0),
    ModelConfig("gpt-4o", provider="openai", cost=0.00625),
])
```

```python
# Maximum flexibility
agent = CascadeAgent(models=[
    ModelConfig("llama-3.1-8b", provider="groq", cost=0),
    ModelConfig("gpt-4o-mini", provider="openai", cost=0.00015),
    ModelConfig("claude-3-5-sonnet", provider="anthropic", cost=0.003),
    ModelConfig("gpt-4o", provider="openai", cost=0.00625),
])
```
## Best Practices

```python
# 1. Always try free first
import os

models = []
if os.getenv("GROQ_API_KEY"):
    models.append(ModelConfig("llama-3.1-8b", provider="groq", cost=0))

# Then add paid models
if os.getenv("OPENAI_API_KEY"):
    models.append(ModelConfig("gpt-4o", provider="openai", cost=0.00625))
```

```python
# 2. Use multiple providers for redundancy
models = [
    ModelConfig("gpt-4o-mini", provider="openai", cost=0.00015),
    ModelConfig("claude-3-5-haiku", provider="anthropic", cost=0.001),  # Fallback
]
```

```python
# 3. Monitor cost and provider usage per request
result = await agent.run(query)
print(f"Cost: ${result.total_cost:.6f}")
print(f"Provider: {result.model_used}")
```

```python
# 4. Route by task type
# Code → OpenAI, writing → Anthropic, everything else → let the cascade decide
force_provider = None
if "code" in query.lower():
    force_provider = "openai"
elif "write" in query.lower():
    force_provider = "anthropic"
```

```python
# 5. Tune quality thresholds
from cascadeflow import QualityConfig

# Configure quality validation thresholds
quality_config = QualityConfig(confidence_thresholds={'moderate': 0.85})
agent = CascadeAgent(
    models=models,
    quality_config=quality_config,  # Higher thresholds = more cascades
)
```

## Troubleshooting

Issue: Provider not detected
Solution:
```bash
# Check that the API key is set
echo $OPENAI_API_KEY

# Test the provider connection
python -c "from cascadeflow.providers import OpenAIProvider; OpenAIProvider()"
```

Issue: Cascading too much
Solution:
```python
# Lower the quality threshold via QualityConfig
from cascadeflow import QualityConfig

quality_config = QualityConfig(confidence_thresholds={'moderate': 0.75})
agent = CascadeAgent(models=models, quality_config=quality_config)

# Or use force_direct for expensive queries
result = await agent.run(query, force_direct=True)
```

Issue: Too many requests to one provider
Solution:
```python
# Add more providers for load distribution
models.append(ModelConfig("claude-3-5-haiku", provider="anthropic", cost=0.001))
```

Issue: Provider API down or rate limited
Solution:
- Cascade automatically falls back to next provider
- Check logs to see which provider was used
- Configure more fallback options
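To verify fallback behavior in production, log which model served each request (a minimal sketch using the result fields shown earlier; the logger name is illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cascadeflow.fallback")

result = await agent.run(query)
# model_used reveals whether a fallback provider handled the request
logger.info("served_by=%s cost=%.6f", result.model_used, result.total_cost)
```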
## LiteLLM Integration

cascadeflow integrates with LiteLLM for:
- Accurate cost tracking across 100+ models
- Access to additional providers (DeepSeek, Google, and more)
- Automatic pricing updates (no manual maintenance)
- Budget management per user
Through LiteLLM integration, you can access:
| Provider | Value Proposition | Example Models | API Key |
|---|---|---|---|
| DeepSeek | 5-10x cheaper for code tasks | deepseek-coder, deepseek-chat | DEEPSEEK_API_KEY |
| Google (Vertex AI) | Enterprise GCP integration | gemini-pro, gemini-1.5-flash | GOOGLE_API_KEY |
| Azure OpenAI | Enterprise compliance (HIPAA/SOC2) | azure/gpt-4, azure/gpt-4-turbo | AZURE_API_KEY |
| Fireworks AI | Fast open model inference | accounts/fireworks/models/llama-v3-70b | FIREWORKS_API_KEY |
| Cohere | Specialized for search/RAG | command, command-light | COHERE_API_KEY |
```python
from cascadeflow.integrations.litellm import (
    LiteLLMCostProvider,
    calculate_cost,
    get_provider_info,
    SUPPORTED_PROVIDERS,
)

# 1. Check whether a provider is supported
info = get_provider_info("deepseek")
print(info.value_prop)
# Output: "Specialized code models, very cost-effective for coding tasks"

# 2. Calculate costs (use the provider prefix for accurate pricing)
cost = calculate_cost(
    model="deepseek/deepseek-coder",
    input_tokens=1000,
    output_tokens=500,
)
print(f"Cost: ${cost:.6f}")

# 3. List all supported providers
for provider_name, info in SUPPORTED_PROVIDERS.items():
    print(f"{info.display_name}: {info.value_prop}")
```

### DeepSeek

DeepSeek offers extremely cost-effective models specialized for coding tasks:
```bash
# Set up the API key
export DEEPSEEK_API_KEY="sk-..."
```

```python
from cascadeflow import CascadeAgent, ModelConfig
from cascadeflow.integrations.litellm import calculate_cost

# Calculate cost for DeepSeek (use the provider prefix)
deepseek_cost = calculate_cost(
    model="deepseek/deepseek-coder",
    input_tokens=1000,
    output_tokens=1000,
)

# Use in a cascade (DeepSeek exposes an OpenAI-compatible API)
agent = CascadeAgent(models=[
    ModelConfig(
        name="deepseek-coder",
        provider="openai",  # Uses the OpenAI-compatible API
        cost=deepseek_cost,  # Approximate per-request cost (1K in + 1K out)
        base_url="https://api.deepseek.com/v1",  # ✅ base_url IS supported
    ),
    ModelConfig(
        name="gpt-4o",
        provider="openai",
        cost=0.00625,
    ),
])

result = await agent.run("Write a Python function to merge two sorted lists")
print(f"Cost: ${result.total_cost:.6f}")
print(f"Model: {result.model_used}")
```

Cost Savings:
- DeepSeek-Coder: ~$0.00028/1K tokens
- GPT-4: ~$0.03/1K tokens
- Savings: ~99% cheaper for code tasks!
### Google Gemini

Google's Gemini models offer excellent value, especially Gemini Flash:
```bash
# Set up the API key
export GOOGLE_API_KEY="..."
```

```python
from cascadeflow import CascadeAgent, ModelConfig
from cascadeflow.integrations.litellm import calculate_cost

# Calculate cost for Gemini (use the provider prefix)
gemini_cost = calculate_cost(
    model="gemini/gemini-1.5-flash",
    input_tokens=1000,
    output_tokens=1000,
)

# Use in a cascade
agent = CascadeAgent(models=[
    ModelConfig(
        name="gemini-1.5-flash",
        provider="openai",  # Use the generic provider for now
        cost=gemini_cost,  # Approximate per-request cost (1K in + 1K out)
        base_url="https://generativelanguage.googleapis.com/v1beta",
    ),
    ModelConfig(
        name="gpt-4o",
        provider="openai",
        cost=0.00625,
    ),
])

result = await agent.run("Summarize this article: ...")
```

Cost Savings:
- Gemini 1.5 Flash: ~$0.000225/1K tokens
- GPT-4o: ~$0.0075/1K tokens
- Savings: ~97% cheaper for simple tasks!
### Cost Comparison

Here's how different providers compare for a typical task (1K input + 500 output tokens):
```python
from cascadeflow.integrations.litellm import LiteLLMCostProvider

cost_provider = LiteLLMCostProvider()

models = [
    ("gpt-4o", "OpenAI Premium"),
    ("gpt-4o-mini", "OpenAI Budget"),
    ("deepseek/deepseek-coder", "DeepSeek Code"),
    ("gemini/gemini-1.5-flash", "Google Budget"),
    ("anthropic/claude-3-5-sonnet-20241022", "Anthropic Premium"),
]

for model, label in models:
    cost = cost_provider.calculate_cost(
        model=model,
        input_tokens=1000,
        output_tokens=500,
    )
    print(f"{label:20} ${cost:.6f}")
```

Output:
```
OpenAI Premium       $0.007500
OpenAI Budget        $0.000225
DeepSeek Code        $0.000280
Google Budget        $0.000225
Anthropic Premium    $0.010500
```
💡 TIP: Always use provider prefixes (e.g., deepseek/deepseek-coder, anthropic/claude-3-5-sonnet-20241022, gemini/gemini-1.5-flash) for accurate pricing from LiteLLM.
See examples/integrations/litellm_providers.py for a comprehensive example that shows:
- Supported providers - List all LiteLLM-supported providers
- Cost calculation - Compare costs across providers
- Model pricing - Get detailed pricing information
- Cost comparison - Compare across different use cases
- Provider info - Get provider capabilities dynamically
- Convenience functions - Quick cost calculations
- API key status - Check which keys are configured
- Real-world usage - Integrate with cascadeflow agents
✅ Accurate Cost Tracking
- LiteLLM maintains up-to-date pricing for 100+ models
- No manual pricing updates needed
- Includes input/output token pricing
- Handles special pricing (batch, cached tokens)
✅ Access More Providers
- DeepSeek (code specialization, 5-10x cheaper)
- Google/Vertex AI (enterprise, 50-100x cheaper for simple tasks)
- Azure OpenAI (compliance, HIPAA/SOC2)
- Fireworks, Cohere, and more
✅ Budget Management
- Track spending per user
- Set budget limits
- Get alerts at thresholds
- Enforce budgets automatically
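A per-user budget check can be sketched on top of calculate_cost (the BudgetTracker class below is illustrative, not a cascadeflow or LiteLLM API):

```python
from cascadeflow.integrations.litellm import calculate_cost

class BudgetTracker:
    """Illustrative per-user spend tracker built on LiteLLM pricing."""

    def __init__(self, monthly_limit: float):
        self.monthly_limit = monthly_limit
        self.spend: dict[str, float] = {}

    def record(self, user: str, model: str, input_tokens: int, output_tokens: int) -> float:
        # Price the call with LiteLLM, then accumulate it against the user
        cost = calculate_cost(model=model, input_tokens=input_tokens, output_tokens=output_tokens)
        self.spend[user] = self.spend.get(user, 0.0) + cost
        return cost

    def over_budget(self, user: str) -> bool:
        return self.spend.get(user, 0.0) >= self.monthly_limit

tracker = BudgetTracker(monthly_limit=5.00)
tracker.record("alice", "gpt-4o-mini", input_tokens=1200, output_tokens=400)
if tracker.over_budget("alice"):
    print("Budget exceeded: route this user to free models only")
```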
✅ Zero Maintenance
- Pricing automatically updated
- New models supported quickly
- Community-driven updates
Use Native Providers (Recommended):
- OpenAI, Anthropic, Groq, Together, Ollama, vLLM, HuggingFace
- Best performance and feature support
- Direct integration, no extra layer
- Full streaming and tool calling support
Use LiteLLM Integration:
- DeepSeek (code tasks, extreme cost savings)
- Google/Gemini (simple tasks, ultra-cheap)
- Azure OpenAI (enterprise compliance)
- Other providers not yet in native list
- Need accurate cost tracking across providers
```bash
# LiteLLM is included with cascadeflow[all]
pip install cascadeflow[all]

# Or install separately
pip install litellm
```

Resources:

- Example: examples/integrations/litellm_providers.py
- Integration Code: cascadeflow/integrations/litellm.py
- LiteLLM Docs: https://docs.litellm.ai/docs/providers
- Cost Tracking Guide: cost_tracking.md

Next Steps:

- Examples: See examples/multi_provider.py
- LiteLLM Example: See examples/integrations/litellm_providers.py
- Tools: Read the Tool Guide for tool calling with providers
- Cost Tracking: Read the Cost Tracking Guide
- API Reference: Check provider-specific documentation
Questions? Open an issue on GitHub.