Ever wished you had more control over how your applications interact with Large Language Models (LLMs)? cot_proxy is a smart, lightweight proxy that sits between your app and your LLM, giving you powerful control without changing your client code.
Qwen3 models have a "reasoning" mode (activated by /think) and a "normal" mode (activated by /no_think), each requiring different sampling parameters for optimal performance. This makes them difficult to use with applications like Cline or RooCode that don't allow setting these parameters.
With cot_proxy, you can:
- Create simplified model names like `Qwen3-Thinking` and `Qwen3-Normal` that automatically:
  - Apply the right sampling parameters for each mode
  - Append `/think` or `/no_think` to your prompts
  - Strip out `<think>...</think>` tags when needed
- All without changing a single line of code in your client application!
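As a taste of the configuration involved, a single `LLM_PARAMS` entry in the proxy's `.env` file is enough to back such a pseudo-model. The sketch below reuses the Qwen3 thinking-mode values from the example configuration shown later in this README:

```
# Clients request "Qwen3-Thinking"; the proxy forwards "Qwen3-32B" with
# thinking-mode sampling parameters and appends /think to the last user message.
LLM_PARAMS=model=Qwen3-Thinking,upstream_model_name=Qwen3-32B,temperature=0.6,top_k=20,top_p=0.95,append_to_last_user_message=\n\n/think
```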
Case 1: Standardize LLM Interactions Across Tools
- Problem: Your team uses multiple tools (web UIs, CLI tools, custom apps) to interact with LLMs, leading to inconsistent results.
- Solution: Configure cot_proxy with standardized parameters and prompts for each use case, then point all tools to it.
Case 2: Clean Up Verbose Model Outputs
- Problem: Your model includes detailed reasoning in `<think>` tags, but your application only needs the final answer.
- Solution: Enable think tag filtering in cot_proxy to automatically remove the reasoning, delivering clean responses.
Case 3: Simplify Complex Model Management
- Problem: You need to switch between different models and configurations based on the task.
- Solution: Create intuitive "pseudo-models" like `creative-writing` or `factual-qa` that map to the right models with the right parameters (see the sketch below).
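A sketch of what such a mapping could look like in `LLM_PARAMS` (the pseudo-model names come from this use case; the upstream model name and sampling values are illustrative placeholders, not recommendations):

```
# Illustrative only: substitute your real upstream model name and preferred parameters.
LLM_PARAMS=model=creative-writing,upstream_model_name=your-upstream-model,temperature=1.0,top_p=0.95;model=factual-qa,upstream_model_name=your-upstream-model,temperature=0.2,top_p=0.8
```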
- Smart Request Modification: Automatically adjust parameters and append text to prompts
- Response Filtering: Remove thinking/reasoning tags from responses
- Model Name Mapping: Create intuitive pseudo-models that map to actual models
- Streaming Support: Works with both streaming and non-streaming responses
- Easy Deployment: Dockerized for quick setup
```bash
# Clone the repository
git clone https://github.com/bold84/cot_proxy.git
cd cot_proxy

# Copy the example environment file (includes ready-to-use Qwen3 configurations!)
cp .env.example .env

# Edit the .env file to configure your settings (optional)
# nano .env

# Start the service
docker-compose up -d
```

Alternatively, build and run the Docker image directly:

```bash
# Build the Docker image
docker build -t cot-proxy .

# Run the container
docker run -p 3000:5000 cot-proxy
```

The `.env.example` file includes pre-configured settings for Qwen3 models with both thinking and non-thinking modes. To use them:
- Make sure you've copied `.env.example` to `.env`
- Update `TARGET_BASE_URL` in your `.env` file to point to your Qwen3 API
- Make requests to your proxy using the pre-configured model names:
```bash
# For thinking mode with optimal parameters
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "model": "Qwen3-32B-Thinking",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": true
  }'

# For non-thinking mode with optimal parameters
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "model": "Qwen3-32B-Non-Thinking",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "stream": true
  }'
```

```bash
# Verify your proxy is running and can connect to your target API
curl http://localhost:3000/health
```

```bash
# Run with a different target API endpoint
docker run -e TARGET_BASE_URL="http://your-api:8080/" -p 3000:5000 cot-proxy

# Enable debug logging to see detailed request/response information
docker run -e DEBUG=true -p 3000:5000 cot-proxy
```

The proxy provides detailed error responses for various scenarios:
```bash
# Test with invalid API key to see authentication error handling
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer invalid_key" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Error responses include:
- 400: Invalid JSON request
- 502: Connection/forwarding errors
- 503: Health check failures
- 504: Connection timeouts
- Original error codes from target API (401, 403, etc.)
Want to contribute or customize cot_proxy for your specific needs? Here's how to set up a development environment:
```bash
# Clone the repository
git clone https://github.com/bold84/cot_proxy.git
cd cot_proxy

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the development server
python cot_proxy.py
```

cot_proxy includes a comprehensive test suite to ensure everything works as expected:
```bash
# Install test dependencies
pip install pytest pytest-flask pytest-mock responses

# Run all tests
pytest

# Run specific test files
pytest tests/test_stream_buffer.py
pytest tests/test_llm_params.py
```

The test suite covers:
- Stream buffer functionality for think tag filtering
- LLM parameter handling and overrides
- Request/response proxying (both streaming and non-streaming)
- Error handling scenarios
Configuration is primarily managed via a `.env` file in the root of the project. Create a `.env` file by copying the provided `.env.example`.
The `docker-compose.yml` file defines default values for these environment variables using the `${VARIABLE:-default_value}` syntax (illustrated below). This means:
- If a variable is set in your shell environment when you run `docker-compose`, that value takes the highest precedence.
- If not set in the shell, Docker Compose will look for the variable in the `.env` file. If found, that value is used.
- If the variable is not found in the shell or the `.env` file, the `default_value` specified in `docker-compose.yml` (e.g., `false` in `${DEBUG:-false}`) will be used.
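For illustration, the relevant portion of a `docker-compose.yml` using this pattern might look like the following (a minimal sketch of the idea, not a verbatim copy of the file shipped in the repository):

```yaml
services:
  cot-proxy:
    build: .
    ports:
      # Host port is configurable; the container-internal port is fixed at 5000.
      - "${HOST_PORT:-3000}:5000"
    environment:
      - TARGET_BASE_URL=${TARGET_BASE_URL:-http://your-model-server:8080/}
      # Falls back to "false" unless DEBUG is set in the shell or in .env.
      - DEBUG=${DEBUG:-false}
```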
The following variables can be set (values in `.env.example` are illustrative, and defaults are shown from `docker-compose.yml`):
- `HOST_PORT`: The port on the host machine to expose the proxy service (default: `3000`). The container-internal port is fixed at `5000`.
- `TARGET_BASE_URL`: Target API endpoint (default in example: `http://your-model-server:8080/`)
- `DEBUG`: Enable debug logging (`true` or `false`, default in example: `false`)
- `API_REQUEST_TIMEOUT`: Timeout for API requests in seconds (default in example: `3000`)
- `LLM_PARAMS`: Comma-separated parameter overrides in the format `key=value`. Model-specific groups are separated by semicolons.
  - Standard LLM parameters like `temperature`, `top_k`, etc., can be overridden per model.
  - Special parameters:
    - `think_tag_start` and `think_tag_end`: Customize the tags used for stripping thought processes
    - `enable_think_tag_filtering`: Set to `true` to enable filtering of think tags (default: `false`)
    - `upstream_model_name`: Replace the requested model with a different model name when forwarding to the API
    - `append_to_last_user_message`: Add text to the last user message, or create a new one if needed
  - Example in `.env.example`: `LLM_PARAMS=model=default,temperature=0.7,enable_think_tag_filtering=true,think_tag_start=<think>,think_tag_end=</think>`
  - Example for Qwen3 (commented out in `.env.example`): `LLM_PARAMS=model=Qwen3-32B-Non-Thinking,upstream_model_name=Qwen3-32B,temperature=0.7,top_k=20,top_p=0.8,enable_think_tag_filtering=true,think_tag_start=<think>,think_tag_end=</think>,append_to_last_user_message=\n\n/no_think`
- `THINK_TAG`: Global default start tag for stripping thought processes (e.g., `<think>`). Overridden by model-specific `think_tag_start` in `LLM_PARAMS`. Defaults to `<think>` if not set via this variable or `LLM_PARAMS`. (Example in `.env.example`: `<think>`)
- `THINK_END_TAG`: Global default end tag for stripping thought processes (e.g., `</think>`). Overridden by model-specific `think_tag_end` in `LLM_PARAMS`. Defaults to `</think>` if not set via this variable or `LLM_PARAMS`. (Example in `.env.example`: `</think>`)
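Putting these together, a minimal `.env` might look like the following (values are illustrative and taken from the defaults above; the shipped `.env.example` is the authoritative starting point):

```
# .env - illustrative values only
HOST_PORT=3000
TARGET_BASE_URL=http://your-model-server:8080/
DEBUG=false
API_REQUEST_TIMEOUT=3000
LLM_PARAMS=model=default,temperature=0.7,enable_think_tag_filtering=true,think_tag_start=<think>,think_tag_end=</think>
```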
Environment Variable Precedence (for Docker Compose): Docker Compose resolves environment variables in the following order of precedence (highest first):
- Variables set in your shell environment when running `docker-compose up` (e.g., `DEBUG=true docker-compose up`).
- Variables defined in the `.env` file located in the project directory.
- Default values specified using the `${VARIABLE:-default_value}` syntax within the `environment` section of `docker-compose.yml`.
- If a variable is not defined through any of the above methods, it will be unset for the container, and the application (`cot_proxy.py`) might rely on its own internal hardcoded defaults (e.g., for think tags).
Think Tag Configuration Priority (as seen by `cot_proxy.py` after Docker Compose resolution):
- Model-specific `think_tag_start`/`think_tag_end` parsed from the `LLM_PARAMS` environment variable. (The `LLM_PARAMS` variable itself is sourced according to the Docker Compose precedence above.)
- Global `THINK_TAG`/`THINK_END_TAG` environment variables. (These are also sourced according to the Docker Compose precedence.)
- Hardcoded defaults (`<think>` and `</think>`) within `cot_proxy.py` if the corresponding environment variables are not ultimately set.
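As an example of this priority (the `<reasoning>` tags and the `my-model` name below are hypothetical, chosen only to show the override):

```
# Global fallback tags, used for any model without its own think_tag_start/think_tag_end
THINK_TAG=<think>
THINK_END_TAG=</think>

# Model-specific tags take priority for this model only
LLM_PARAMS=model=my-model,enable_think_tag_filtering=true,think_tag_start=<reasoning>,think_tag_end=</reasoning>
```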
Docker Usage:
The `docker-compose.yml` is configured to use this precedence. It loads variables from the `.env` file and applies defaults from its `environment` section if variables are not otherwise set.
Simply run:

```bash
docker-compose up
# or for detached mode
docker-compose up -d
```

If you need to override a variable from the `.env` file for a specific `docker run` command (less common when using Docker Compose), you can still use the `-e` flag:

```bash
# Example: Overriding DEBUG for a single run, assuming you're not using docker-compose here
docker run -e DEBUG=true -p 3000:5000 cot-proxy
```

However, with the current `docker-compose.yml` setup, managing variables through the `.env` file (to override defaults) is the recommended method.
The service uses Gunicorn with the following settings:
- 4 worker processes
- 3000-second timeout for long-running requests
- SSL verification enabled
- Automatic error recovery
- Health check endpoint for monitoring
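For reference, a roughly equivalent Gunicorn invocation is sketched below (this assumes the Flask application object is exposed as `app` in `cot_proxy.py`; the actual command baked into the Docker image may differ):

```bash
# 4 workers, 3000 s timeout, bound to the container-internal port 5000
gunicorn --workers 4 --timeout 3000 --bind 0.0.0.0:5000 cot_proxy:app
```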
The proxy allows you to define different configurations for different models using the `LLM_PARAMS` environment variable. This enables you to:
- Create "pseudo-models" that map to real upstream models with specific parameters
- Apply different parameter sets to different models
- Configure think tag filtering differently per model
Example configuration for multiple models:
```
LLM_PARAMS=model=Qwen3-32B-Non-Thinking,upstream_model_name=Qwen3-32B,temperature=0.7,top_k=20,top_p=0.8,enable_think_tag_filtering=true,think_tag_start=<think>,think_tag_end=</think>,append_to_last_user_message=\n\n/no_think;model=Qwen3-32B-Thinking,upstream_model_name=Qwen3-32B,temperature=0.6,top_k=20,top_p=0.95,append_to_last_user_message=\n\n/think,enable_think_tag_filtering=true,think_tag_start=<think>,think_tag_end=</think>
```

This creates two pseudo-models:
- `Qwen3-32B-Non-Thinking`: Maps to `Qwen3-32B` with parameters optimized for non-thinking mode
- `Qwen3-32B-Thinking`: Maps to `Qwen3-32B` with parameters optimized for thinking mode
The proxy can filter out content enclosed in think tags from model responses. This is useful for:
- Removing internal reasoning/thinking from final outputs
- Cleaning up responses for end users
- Maintaining a clean conversation history
Think tag filtering can be:
- Enabled globally via environment variables
- Configured per model via `LLM_PARAMS`
- Enabled/disabled per model using `enable_think_tag_filtering`
The proxy uses an efficient streaming buffer to handle think tags that span multiple chunks in streaming responses.
You can create "pseudo-models" that map to actual upstream models using the `upstream_model_name` parameter:

```
model=my-custom-model,upstream_model_name=actual-model-name
```
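A client would then request `my-custom-model` as if it were a real model, and the proxy rewrites it to `actual-model-name` before forwarding. A minimal sketch, assuming the proxy runs on `localhost:3000` as in the Quick Start:

```bash
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "model": "my-custom-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```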
This allows you to:
- Create simplified model names for end users
- Hide implementation details of which models you're actually using
- Easily switch underlying models without changing client code
The proxy can automatically append content to the last user message or create a new user message if needed. This is useful for:
- Adding system instructions without changing client code
- Injecting special commands or flags (like `/think` or `/no_think`)
- Standardizing prompts across different clients
Example:
```
append_to_last_user_message=\n\nAlways respond in JSON format.
```
```bash
# In your .env file:
LLM_PARAMS=model=Qwen3-Thinking,upstream_model_name=Qwen3-32B,temperature=0.7,enable_think_tag_filtering=false,append_to_last_user_message=\n\n/think;model=Qwen3-Clean,upstream_model_name=Qwen3-32B,temperature=0.7,enable_think_tag_filtering=true,append_to_last_user_message=\n\n/no_think

# Client can then request either model:
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_API_KEY" \
  -d '{
    "model": "Qwen3-Thinking",
    "messages": [{"role": "user", "content": "Solve: 25 × 13"}],
    "stream": true
  }'
```

- Python 3.9+: The core runtime environment
- Flask 3.0.2: Lightweight web framework for handling HTTP requests
- Requests 2.31.0: HTTP library for forwarding requests to the target API
- Gunicorn 21.2.0: Production-grade WSGI server (used in Docker deployment)
- Testing tools:
- pytest: Python testing framework
- pytest-flask: Flask integration for pytest
- pytest-mock: Mocking support for pytest
- responses: Mock HTTP responses
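Based on the versions listed above, the runtime entries in `requirements.txt` likely look something like this (check the file in the repository for the authoritative, complete list):

```
# Core runtime dependencies; test tooling (pytest etc.) is installed separately
flask==3.0.2
requests==2.31.0
gunicorn==21.2.0
```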
We welcome contributions to make cot_proxy even better! Here's how you can help:
- Star the repository: Show your support and help others discover the project
- Report issues: Found a bug or have a feature request? Open an issue on GitHub
- Submit pull requests: Code improvements and bug fixes are always welcome
- Share your use cases: Let us know how you're using `cot_proxy` in your projects
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or feedback, please open an issue on the GitHub repository: https://github.com/bold84/cot_proxy