Docbot is an intelligent document assistant that processes and analyzes documents to provide accurate answers to user queries. It leverages multiple advanced language models and retrieval techniques to ensure high-quality responses.
- Multi-document processing and analysis
- Multi-route retrieval system for improved accuracy
- Advanced reranking mechanism
- Interactive chat interface
- Support for various document formats
- Streaming responses with real-time feedback
- Multiple Embedding Models:
  - GTE-large-zh: Optimized for Chinese text understanding
  - BGE-large-zh: Enhanced semantic comprehension
  - BM25: Classical information retrieval algorithm
 
- Reranking: Uses BGE-reranker-large for context optimization
- Large Language Model: Powered by GPT for natural language generation
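The dense routes above all reduce to the same operation: embed the query, score it against pre-computed chunk embeddings, and keep the top hits. A minimal sketch of that scoring step in pure Python, where `dense_route` and the plain-list vectors are illustrative stand-ins for the GTE/BGE model outputs:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def dense_route(query_vec: list[float],
                chunk_vecs: dict[str, list[float]],
                top_k: int = 3) -> list[str]:
    """Rank chunk ids by embedding similarity to the query."""
    ranked = sorted(chunk_vecs.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:top_k]]
```

Each embedding model gives one such route; BM25 contributes a lexical route alongside them.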
- Document Processing 📄:
  - Documents are loaded and split into manageable chunks
  - Each chunk is processed through multiple embedding models
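The load-and-split step can be sketched as a fixed-size splitter with overlap, so text straddling a chunk boundary appears in both neighbours. The actual splitter settings aren't specified here; `split_into_chunks` and its defaults are illustrative:

```python
def split_into_chunks(text: str, chunk_size: int = 500,
                      overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap by
    `overlap` characters with their predecessor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```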
 
- Query Processing 🔎:
  - User queries are processed through multiple retrieval routes
  - Results are combined and reranked for relevance
  - Most relevant context is selected for the final response
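The README doesn't state how the routes' results are merged before reranking; reciprocal rank fusion is one common choice for combining ranked lists from heterogeneous retrievers. A hedged sketch:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids: each chunk earns
    1 / (k + rank) from every list it appears in, and chunks are
    sorted by their total score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused list would then be re-scored by the BGE reranker, which sees the full query-chunk text rather than just ranks.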
 
- Response Generation 💬:
  - Selected context is combined with the user query
  - LLM generates natural and accurate responses
  - Responses are streamed in real-time
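The generation step can be sketched as a prompt builder plus a generator that yields the growing answer as tokens arrive, which is what lets a UI repaint incrementally. `build_prompt` and `stream_tokens` are illustrative names, not the project's actual API:

```python
from typing import Iterable, Iterator

def build_prompt(context: list[str], question: str) -> str:
    """Combine the selected chunks with the user query."""
    joined = "\n\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield the partial answer after each token, so the caller
    can display it immediately instead of waiting for the end."""
    answer = ""
    for token in tokens:
        answer += token
        yield answer
```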
 
- Python 3.10+
- CUDA-capable GPU (recommended)
- uv package manager (recommended)
- OpenAI API key
- Clone Repository

  ```shell
  git clone https://github.com/AbyssSkb/Docbot
  cd Docbot
  ```

- Install Dependencies

  ```shell
  # Using uv (recommended)
  uv sync

  # Or using pip
  pip install -r requirements.txt
  ```

- Environment Setup
  - Create a `.env` file in the project root:

    ```
    OPENAI_API_KEY=your_api_key
    OPENAI_BASE_URL=your_base_url  # Optional
    OPENAI_LLM_MODEL=your_preferred_model  # Default: gpt-4o
    ```

- Document Setup
  - Create a `doc` folder in the project root
  - Place your documents in the `doc` folder
  - Generate document indexes:

    ```shell
    python create_index.py
    ```

- Launch the app:

  ```shell
  streamlit run main.py
  ```

- Open the provided URL in your web browser
- Enter your questions in the chat interface
- View real-time responses based on your documents
- Language Support 🌐:
  - Primary optimization for Chinese text
  - English support can be enabled by switching to English-language models
  - Consider language-specific requirements for your use case
 
- Text Processing 📝:
  - Jieba tokenizer is optimized for Chinese
  - Basic English tokenization support
  - May require adjustment for other languages
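The tokenizer matters most for the BM25 route, which can only match terms the tokenizer produces. A compact Okapi BM25 sketch with a pluggable tokenizer; swapping `tokenize` for `jieba.lcut` would give the Chinese behaviour described above (an assumption about how jieba is wired in, not confirmed by this README):

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Naive word tokenizer; Chinese text has no word-delimiting
    spaces, so jieba.lcut would replace this for Chinese."""
    return re.findall(r"\w+", text.lower())

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    doc_tokens = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in doc_tokens) / len(doc_tokens)
    n = len(docs)
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for t in doc_tokens if term in t)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            denom = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / denom
        scores.append(score)
    return scores
```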
 
- Document Compatibility 📄:
  - Uses LangChain's DirectoryLoader
  - Some document formats may have compatibility issues
  - Verify support for your specific document types
 
Contributions are welcome! Please feel free to submit pull requests or create issues for bugs and feature requests.
This project is licensed under the MIT License - see the LICENSE file for details.