Chinese documentation: README_zh.md
A multifunctional AI assistant project based on the OpenAI Agents SDK, integrating speech-to-text (STT), text-to-speech (TTS), and podcast-processing capabilities.
- Multi-Agent Collaboration: Task distribution through agent system
- Speech-to-Text: Audio transcription via the Podcast MCP server
- Text-to-Speech: Speech synthesis via the Minimax API
- Podcast Processing: Support for podcast audio file transcription and analysis
- Real-time Streaming Chat: Gradio-based streaming chat interface
- Asynchronous Processing: Full async support for enhanced performance
- Event-Driven Architecture: Event handler chain pattern for complex message flow processing
- MCP Integration: Support for multiple Model Context Protocol servers
- Multi-Model Support: Configurable AI models (DeepSeek R1, etc.)
- Real-time Monitoring: Complete event tracking and debugging information
Because the OpenAI Agents SDK awaits an asynchronous loop in multi-agent scenarios, users cannot intuitively see which stage an agent is currently processing. This project uses a chain-of-responsibility pattern to map streaming events to agent messages, so users can see at a glance which stage is in progress.
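As a minimal sketch of the idea, with hypothetical handler and event names (the project's actual classes live in `src/event/event_handler.py`):

```python
# Minimal sketch of the chain-of-responsibility mapping; class and event
# names are illustrative, not the project's actual implementation.
from abc import ABC, abstractmethod
from typing import Optional


class EventHandler(ABC):
    """One link in the chain: handle an event or pass it to the next link."""

    def __init__(self, next_handler: Optional["EventHandler"] = None):
        self.next_handler = next_handler

    def handle(self, event: dict) -> Optional[str]:
        message = self.try_handle(event)
        if message is not None:
            return message
        if self.next_handler is not None:
            return self.next_handler.handle(event)
        return None  # no link in the chain recognized this event

    @abstractmethod
    def try_handle(self, event: dict) -> Optional[str]:
        """Return a user-facing message, or None to pass the event on."""


class ToolCallHandler(EventHandler):
    def try_handle(self, event: dict) -> Optional[str]:
        if event.get("type") == "tool_call":
            return f"Calling tool: {event['name']}"
        return None


class AgentSwitchHandler(EventHandler):
    def try_handle(self, event: dict) -> Optional[str]:
        if event.get("type") == "agent_updated":
            return f"Switched to agent: {event['agent']}"
        return None


# Build the chain; unrecognized events fall through and return None.
chain = ToolCallHandler(AgentSwitchHandler())
print(chain.handle({"type": "agent_updated", "agent": "STT Agent"}))
```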
I have also submitted a Gradio example to the OpenAI Agents SDK project: openai/openai-agents-python#888. That example is simpler and clearer; you may want to run it first.
- Python: 3.12+
- UI Framework: Gradio 5.33.1+
- AI SDK: OpenAI Agents SDK with LiteLLM
- Package Manager: UV
- Async Processing: AsyncIO
- Configuration: Python-dotenv
- API Integrations:
- DeepSeek API (reasoning model)
- Minimax API (Text-to-Speech)
- Podcast MCP (Speech-to-Text)
```
podcast-agent/
├── src/
│   ├── app.py                   # Main application entry
│   ├── ai/                      # AI agent modules
│   │   ├── sst_agent.py         # Speech-to-text agent
│   │   ├── tts_agent.py         # Text-to-speech agent
│   │   └── instructions.py      # Dynamic instruction generation
│   ├── ui/                      # User interface modules
│   │   ├── gradio_ui.py         # Basic Gradio interface
│   │   └── gradio_agent_ui.py   # Agent-specific interface
│   ├── model/                   # Model configurations
│   │   └── model_config.py      # Model settings and configurations
│   └── event/                   # Event handling
│       └── event_handler.py     # Event handler chain
├── pyproject.toml               # Project configuration and dependencies
├── README.md                    # Project documentation (English)
└── README_zh.md                 # Project documentation (Chinese)
```
- Python 3.12+
- UV Package Manager
```bash
# Install the UV package manager (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the project
git clone https://github.com/Sucran/podcast-agent.git
cd podcast-agent

# Install dependencies
uv sync
```

Deploy podcast-mcp first; otherwise the speech-to-text agent cannot function properly. See its GitHub repository: https://github.com/Sucran/modal-transcriber-mcp.git
Create a `.env` file and configure the following environment variables:

```bash
# DeepSeek API Configuration
DS_API_KEY=your_deepseek_api_key

# Minimax API Configuration
MINIMAX_API_KEY=your_minimax_api_key
MINIMAX_MCP_BASE_PATH=/path/to/your/files

# Other Configuration
OPENAI_API_KEY=your_openai_api_key_if_needed
```
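For reference, a minimal sketch of reading these variables with python-dotenv; where the real application performs this load may differ:

```python
# Minimal sketch: load the .env file and read the keys defined above.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

DS_API_KEY = os.getenv("DS_API_KEY")
MINIMAX_API_KEY = os.getenv("MINIMAX_API_KEY")
MINIMAX_MCP_BASE_PATH = os.getenv("MINIMAX_MCP_BASE_PATH")

if not DS_API_KEY:
    raise RuntimeError("DS_API_KEY is not set; check your .env file")
```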
Then activate the virtual environment and start the application:

```bash
# Activate the virtual environment
source .venv/bin/activate

# Start the application
python src/app.py
```

The application will start at http://localhost:8000.
- Start Application: Run `python src/app.py`
- Open Browser: Navigate to http://localhost:8000
- Start Chatting: Enter messages in the chat interface
```
User: Help me transcribe this podcast https://www.xiaoyuzhoufm.com/episode/xxxxx
```
The system will automatically:
- Switch to the speech-to-text agent
- Download and process the audio file
- Perform speech transcription (supports async mode)
- Return transcription results
```
User: Help me convert this text to speech: [text content]
```
The system will:
- Switch to the text-to-speech agent
- Call Minimax API to generate audio
- Return a download link for the audio file
The application supports intelligent switching among three agents (sketched after this list):
- Planning Agent: Handles general conversation and task planning
- STT Agent: Specialized for speech-to-text tasks
- TTS Agent: Specialized for text-to-speech tasks
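A minimal sketch of what this handoff setup can look like with the OpenAI Agents SDK; the instructions are illustrative, and the MCP/model wiring is omitted:

```python
# Minimal handoff sketch with the OpenAI Agents SDK; instructions are
# illustrative, and MCP/model wiring is omitted for brevity.
import asyncio

from agents import Agent, Runner

stt_agent = Agent(
    name="STT Agent",
    instructions="Transcribe podcast audio to text.",
)
tts_agent = Agent(
    name="TTS Agent",
    instructions="Convert text to speech.",
)
planning_agent = Agent(
    name="Planning Agent",
    instructions="Handle general conversation; hand off STT/TTS tasks.",
    handoffs=[stt_agent, tts_agent],
)


async def main() -> None:
    result = await Runner.run(planning_agent, "Transcribe this podcast: <url>")
    print(result.final_output)


asyncio.run(main())
```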
Configure in `src/model/model_config.py` (a sketch follows the list):
- Inference model selection (default: DeepSeek R1)
- Model parameters (temperature, top_p, etc.)
- Custom API endpoints
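For illustration, a possible configuration using the SDK's LiteLLM integration; the model identifier and parameter values are assumptions, not the project's exact defaults:

```python
# Illustrative model configuration via LiteLLM; the model id and
# parameter values are assumptions, not this project's exact defaults.
import os

from agents import Agent, ModelSettings
from agents.extensions.models.litellm_model import LitellmModel

planning_agent = Agent(
    name="Planning Agent",
    instructions="Plan tasks and route them to specialist agents.",
    model=LitellmModel(
        model="deepseek/deepseek-reasoner",  # DeepSeek R1 through LiteLLM
        api_key=os.getenv("DS_API_KEY"),
    ),
    model_settings=ModelSettings(temperature=0.7, top_p=0.9),
)
```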
The project supports two types of MCP servers (see the sketch below):
- Podcast MCP: HTTP streaming service for audio processing
- Minimax MCP: Standard IO service for speech synthesis
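A minimal sketch of wiring both server types with the SDK's MCP support; the URL and launch command are assumptions:

```python
# Illustrative MCP wiring; the URL and launch command are assumptions.
from agents import Agent
from agents.mcp import MCPServerStdio, MCPServerStreamableHttp


async def build_mcp_servers() -> tuple:
    # Podcast MCP: reached over streamable HTTP
    podcast_mcp = MCPServerStreamableHttp(
        params={"url": "http://localhost:8080/mcp"},  # assumed endpoint
    )
    # Minimax MCP: launched as a local stdio subprocess
    minimax_mcp = MCPServerStdio(
        params={"command": "uvx", "args": ["minimax-mcp"]},  # assumed command
    )
    await podcast_mcp.connect()
    await minimax_mcp.connect()
    return podcast_mcp, minimax_mcp


def build_stt_agent(podcast_mcp: MCPServerStreamableHttp) -> Agent:
    # The TTS agent would attach the Minimax server the same way.
    return Agent(
        name="STT Agent",
        instructions="Transcribe podcast audio using the Podcast MCP tools.",
        mcp_servers=[podcast_mcp],
    )
```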
The event handler chain supports multiple event types (a consumption sketch follows the list):
- Tool call events
- Agent switching events
- Reasoning process events
- MCP approval events
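As a minimal sketch, assuming the chain sits behind the SDK's streaming API, events can be consumed like this (agent construction omitted):

```python
# Minimal sketch of consuming SDK stream events for the chain to classify;
# agent construction is omitted for brevity.
from agents import Agent, Runner


async def stream_demo(agent: Agent, user_input: str) -> None:
    result = Runner.run_streamed(agent, user_input)
    async for event in result.stream_events():
        if event.type == "agent_updated_stream_event":
            print(f"Switched to agent: {event.new_agent.name}")
        elif event.type == "run_item_stream_event":
            # tool calls, tool outputs, messages, reasoning items, ...
            print(f"New item: {event.item.type}")
```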
- SSL Errors: Tracing is disabled by default to avoid SSL issues
- Large Audio Files: Automatic segmentation to avoid token limits
- Async Processing: Ensure async mode is used for audio transcription
The application starts with debug mode enabled by default and prints detailed event information to the console:
- Agent switching logs
- Tool call details
- MCP service status
Issues and pull requests to improve this project are welcome!
Apache-2.0 License