Clara: An agentic multimodal AI assistant that can see through your webcam, listen to your voice, think with Gemini, and speak back using ElevenLabs. Built with LangGraph, OpenCV, Groq, and Gradio.


👧🏼 Clara - Your Agentic AI Assistant

Clara is a real-time multimodal AI assistant capable of seeing, listening, understanding, and speaking back. It combines voice recognition, computer vision, large language models (Gemini), and real-time speech synthesis into a seamless and natural human-AI interaction system.


🌎 Features

  • 🎤 Voice Input - Continuous voice capture using speech_recognition + Groq Whisper.
  • 📷 Computer Vision - Analyzes live webcam feed with OpenCV and Gemini tools.
  • 🧠 AI Reasoning - Gemini-powered LangGraph agent that chooses when to see or speak.
  • 🎧 Voice Output - Real-time TTS with ElevenLabs and pydub.
  • 💬 Gradio Interface - Live chatbot window + webcam + voice loop.

⚙️ Technologies Used

  • LangChain + LangGraph + Gemini
  • OpenCV for webcam streaming and vision
  • speech_recognition + Groq Whisper for transcription
  • ElevenLabs API + pydub for high-quality speech synthesis
  • Gradio for user interface
  • dotenv, subprocess, asyncio, ffmpeg
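The vision path presumably hands webcam frames to Gemini as base64-encoded inline images, a common pattern for multimodal LangChain messages. A minimal sketch of that encoding step, assuming JPEG bytes already produced by cv2.imencode; the helper names and the exact message shape are illustrative, not Clara's actual code:

```python
import base64

def frame_to_base64(jpeg_bytes: bytes) -> str:
    """Encode raw JPEG bytes (e.g. from cv2.imencode) as base64 text
    suitable for embedding in a multimodal LLM message."""
    return base64.b64encode(jpeg_bytes).decode("utf-8")

def image_message(b64: str) -> dict:
    """Build a LangChain-style inline-image content part (illustrative shape)."""
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
    }
```

A tool like the one in tools.py would call frame_to_base64 on the captured frame and append image_message(...) to the prompt content before invoking the model.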

📂 Folder Structure

Agentic-Assistant-Clara/
├── main.py                   # Gradio UI & orchestrator
├── ai_agent.py              # LangGraph agent + Gemini config
├── tools.py                 # Image analysis tool
├── speech_to_text.py        # Voice capture + transcription
├── text_to_speech.py        # ElevenLabs TTS + playback
├── .env                     # API keys and config
└── requirements.txt         # All dependencies

⚡ How It Works

  1. User speaks → audio is captured + transcribed via Groq Whisper.
  2. The LangGraph-driven Gemini agent decides whether to answer directly or to call the vision tool.
  3. If needed, the agent analyzes the current webcam frame using OpenCV and a custom image analysis tool.
  4. The response is converted to speech using ElevenLabs and played back to the user.
  5. All interactions are shown in a chat-style window (Gradio).
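The decision loop above can be sketched in plain Python. In the real app, LangGraph and Gemini handle the routing and the replies; here a keyword check and stub functions stand in for both, so every name below is illustrative rather than taken from Clara's source:

```python
from typing import Optional

def needs_vision(user_text: str) -> bool:
    """Stand-in for the agent's tool-choice step: the real LangGraph
    agent lets Gemini decide whether to call the vision tool."""
    cues = ("see", "look", "webcam", "camera", "frame", "behind me")
    return any(cue in user_text.lower() for cue in cues)

def analyze_frame() -> str:
    # Stub for the image analysis tool: grab and describe the latest frame.
    return "description of the current webcam frame"

def answer(user_text: str, vision_context: Optional[str]) -> str:
    # Stub for the Gemini call, with optional vision context.
    if vision_context:
        return f"Based on the camera: {vision_context}"
    return f"Answer to: {user_text}"

def handle_turn(user_text: str) -> str:
    """One conversational turn: route through the vision tool if needed,
    then return the reply that would be sent on to TTS."""
    context = analyze_frame() if needs_vision(user_text) else None
    return answer(user_text, context)
```

With this sketch, "What's behind me in the frame?" routes through the vision stub, while "What is the capital of Italy?" is answered directly.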

Setup Instructions

  1. Clone the repo
     git clone https://github.com/AbhaySingh71/Agentic-Assistant-Clara
  2. Install dependencies
     pip install -r requirements.txt
  3. Create a .env file with your keys:
     GROQ_API_KEY=your-groq-api-key
     ELEVENLABS_API_KEY=your-elevenlabs-key
     GOOGLE_API_KEY=your-gemini-api-key
  4. Launch the app
     python main.py
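The app presumably reads these keys with python-dotenv (load_dotenv() followed by os.getenv). If you want to sanity-check your .env before launching, here is a stdlib-only sketch of the same KEY=value parsing; the function names are ours, not part of the project:

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines; blanks and # comments are ignored.
    This mirrors the .env format shown in the setup steps above."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def missing_keys(env: dict) -> list:
    """Return the names of required keys that are absent or empty."""
    required = ("GROQ_API_KEY", "ELEVENLABS_API_KEY", "GOOGLE_API_KEY")
    return [k for k in required if not env.get(k)]
```

If missing_keys(parse_env(open(".env").read())) returns a non-empty list, the corresponding services will fail at runtime.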

💡 Example Questions to Ask

  • "What’s behind me in the frame?"
  • "Do I look sleepy today?"
  • "What is the capital of Italy?"
  • "Describe what you see through the webcam."
  • "How many people are visible in the camera?"

📈 Potential Use Cases

  • AI-based visual companions
  • Vision-aware voice agents for people with visual impairments
  • Educational AI for children ("Dora AI")
  • Live event narrators

🚀 Future Improvements

  • Browser-based deployment (Streamlit, Hugging Face Spaces)
  • Better face & emotion analysis
  • Multi-language voice support
  • Memory retention for contextual conversations

✍️ Author

Made with passion by Abhay Singh
GitHub | LinkedIn | Email
