Clara is a real-time multimodal AI assistant capable of seeing, listening, understanding, and speaking back. It combines voice recognition, computer vision, large language models (Gemini), and real-time speech synthesis into a seamless and natural human-AI interaction system.
- 🎤 Voice Input - Continuous voice capture using speech_recognition + Groq Whisper.
- 📷 Computer Vision - Analyzes live webcam feed with OpenCV and Gemini tools.
- 🧠 AI Reasoning - Gemini-powered LangGraph agent that chooses when to see or speak.
- 🎧 Voice Output - Real-time TTS with ElevenLabs and pydub.
- 💬 Gradio Interface - Live chatbot window + webcam + voice loop.
- LangChain + LangGraph + Gemini
- OpenCV for webcam streaming and vision
- speech_recognition + Groq Whisper for transcription
- ElevenLabs API + pydub for high-quality speech synthesis
- Gradio for user interface
- dotenv, pydub, subprocess, asyncio, ffmpeg
Agentic-Assistant-Clara/
├── main.py # Gradio UI & orchestrator
├── ai_agent.py # LangGraph agent + Gemini config
├── tools.py # Image analysis tool
├── speech_to_text.py # Voice capture + transcription
├── text_to_speech.py # ElevenLabs TTS + playback
├── .env # API keys and config
├── requirements.txt # All dependencies
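The image analysis tool in tools.py has to hand a webcam frame to a multimodal model. One common way to do that is to wrap the JPEG bytes in a base64 data URL. The sketch below shows only that packaging step; the function name `frame_to_data_url` is hypothetical, and whether tools.py works this way is an assumption, not something this README confirms.

```python
import base64


def frame_to_data_url(jpeg_bytes: bytes) -> str:
    """Wrap already-encoded JPEG bytes in a base64 data URL.

    Hypothetical helper, not taken from tools.py. In a real run the bytes
    would typically come from OpenCV, e.g.:
        ok, buf = cv2.imencode(".jpg", frame)
        jpeg_bytes = buf.tobytes()
    """
    if not jpeg_bytes:
        raise ValueError("empty frame: nothing to encode")
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"


if __name__ == "__main__":
    # Three bytes standing in for a real JPEG (the JPEG start-of-image marker).
    url = frame_to_data_url(b"\xff\xd8\xff")
    print(url[:23])
```

Keeping the encoding separate from the capture code makes the helper testable without a camera attached.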
- User speaks → audio is captured + transcribed via Groq Whisper.
- The LangGraph Gemini agent decides whether to answer directly or to call the vision tool first.
- If needed, the agent analyzes the current webcam frame using OpenCV and a custom image analysis tool.
- The response is converted to speech using ElevenLabs and played back to the user.
- All interactions are shown in a chat-style window (Gradio).
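The steps above can be sketched as a single turn of the loop. This is a minimal illustration with plain Python callables standing in for each stage, not the code in main.py or ai_agent.py. All names here (`run_turn`, `needs_vision`, `Turn`, `VISION_HINTS`) are hypothetical, and the keyword check is a crude stand-in for the decision that the real LangGraph agent makes through Gemini tool calling.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: a keyword heuristic replaces the agent's LLM-driven tool choice.
VISION_HINTS = ("see", "look", "webcam", "camera", "frame", "visible")


@dataclass
class Turn:
    user_text: str
    reply: str
    used_vision: bool


def needs_vision(user_text: str) -> bool:
    """Very rough substring check for whether the question is about the camera."""
    lowered = user_text.lower()
    return any(hint in lowered for hint in VISION_HINTS)


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    describe_frame: Callable[[], str],
    answer: Callable[[str, Optional[str]], str],
    speak: Callable[[str], None],
) -> Turn:
    """One pass: transcribe, optionally look at the webcam, answer, speak."""
    user_text = transcribe(audio)
    used_vision = needs_vision(user_text)
    # The frame is only described when the question calls for it.
    visual_context = describe_frame() if used_vision else None
    reply = answer(user_text, visual_context)
    speak(reply)
    return Turn(user_text, reply, used_vision)


if __name__ == "__main__":
    turn = run_turn(
        audio=b"",
        transcribe=lambda _audio: "Describe what you see through the webcam.",
        describe_frame=lambda: "a person at a desk",
        answer=lambda text, ctx: f"I can see {ctx}." if ctx else "No image needed.",
        speak=print,
    )
    print(turn.used_vision)
```

Passing each stage in as a callable keeps the flow readable and lets the speech, vision, and TTS pieces be swapped or stubbed independently.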
- Clone the repo
- Install dependencies
- Create a .env file with your keys
- Launch the app
git clone https://github.com/AbhaySingh71/Agentic-Assistant-Clara
pip install -r requirements.txt
GROQ_API_KEY=your-groq-api-key
ELEVENLABS_API_KEY=your-elevenlabs-key
GOOGLE_API_KEY=your-gemini-api-key
python main.py
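A missing key usually surfaces only when the first API call fails, so a quick check before launch can save time. The sketch below is a hypothetical helper, not part of the repository. It assumes the three variable names shown above and that the .env file has already been loaded into the environment (for example by python-dotenv's `load_dotenv()`, since dotenv is listed in the stack).

```python
import os
from typing import List, Mapping, Optional

# The three keys named in the .env example above.
REQUIRED_KEYS = ("GROQ_API_KEY", "ELEVENLABS_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required keys that are unset or empty in `env`.

    Defaults to the process environment; a plain dict can be passed for testing.
    """
    source = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not source.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```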
- "What’s behind me in the frame?"
- "Do I look sleepy today?"
- "What is the capital of Italy?"
- "Describe what you see through the webcam."
- "How many people are visible in the camera?"
- AI-based visual companions
- Visually aware voice agents for people with disabilities
- Educational AI for children ("Dora AI")
- Live event narrators
- Browser-based deployment (Streamlit, Hugging Face Spaces)
- Better face & emotion analysis
- Multi-language voice support
- Memory retention for contextual conversations

