Binary file added community/UniFriend/1.png
Binary file added community/UniFriend/2.png
Binary file added community/UniFriend/3.png
Binary file added community/UniFriend/4.png
175 changes: 175 additions & 0 deletions community/UniFriend/README.md
@@ -0,0 +1,175 @@
# 📄 UniFriend: PDF Chatbot for Better Seat Selection (v0.0.1)

This project is a full-stack Python application that lets users interact with PDF documents (scorecards, seat matrix PDFs, and merit cutoff PDFs) through a chat interface. The application leverages open-source technologies and integrates with advanced language models to provide insightful, conversational responses based on the content of the uploaded PDFs.

## Why This Problem Statement
Recently my brother took admission to first-year engineering (FE). During his CAP rounds (MHT-CET 2024) we faced a lot of issues with seat selection. By the end of CAP round 3 there were six PDFs of 2000+ pages each: three contained seat matrix information, and the other three contained merit cutoffs for the seats. Cross-checking a candidate's rank against seat availability across these documents is overwhelming for parents and students alike. This application is a small effort to help the next generation find a suitable match.


## Current Version: v0.0.1
The current application takes two PDFs as input (a scorecard and a cutoff PDF) and answers questions based on a comparison of the two: it analyses the scorecard and suggests colleges according to the score.

## Next Steps
- v0.0.2 will parse the marksheet/scorecard and cutoff PDFs more efficiently to find a suitable seat for the candidate.
- v0.0.3 will be able to compare three seat matrix PDFs, three cutoff PDFs, and the scorecard.

## 🎯 Main Features

- **PDF Upload**: Users can upload PDF files directly through the web interface.
- **Text and Table Extraction**: The app extracts text and tables from the uploaded PDFs, handling complex documents with tabular data.
- **Data Processing**: Extracted content is processed and split into manageable chunks for embedding generation.
- **Embeddings Generation**: Utilizes Google's Generative AI Embeddings to convert text chunks into embeddings.
- **Vector Database Storage**: Embeddings are stored in ChromaDB, an open-source vector database for efficient similarity search.
- **Chat Interface**: Users can interact with the PDF content via a chat interface, asking questions and receiving answers generated by a language model.
- **Language Model Integration**: Integrates with a specified OpenAI model to generate human-like responses based on the PDF content.

## 🛠️ Technologies Used

- **[Streamlit](https://streamlit.io/)**: Web application framework for creating interactive frontend interfaces.
- **[LangChain](https://langchain.readthedocs.io/)**: Framework for developing applications powered by language models.
- **[ChromaDB](https://www.trychroma.com/)**: Open-source embedding database for storing and querying vector embeddings.
- **[PyPDFLoader (LangChain)](https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/pdf.html)**: Utility for extracting text from PDF documents.
- **[tabula-py](https://tabula-py.readthedocs.io/en/latest/)**: Python wrapper for Tabula, enabling extraction of tables from PDFs.
- **[Pandas](https://pandas.pydata.org/)**: Data manipulation and analysis library, used here for handling tabular data.
- **[GoogleGenerativeAIEmbeddings](https://pypi.org/project/langchain-google-genai/)**: Generates embeddings using Google's embedding models.
- **[ChatOpenAI (LangChain)](https://langchain.readthedocs.io/en/latest/modules/models/chat/integrations/openai.html)**: Interface to interact with OpenAI's chat models.
- **[Python Dotenv](https://pypi.org/project/python-dotenv/)**: Facilitates the use of environment variables from a `.env` file.


## Glimpse

![Img 1](1.png)
![Img 2](2.png)
![Img 3](3.png)
![Img 4](4.png)


## 🚀 Getting Started

### Prerequisites

- **Python 3.7+**
- **Java**: Required by `tabula-py` for table extraction.
- Install from [Java Downloads](https://www.java.com/en/download/) and ensure it's added to your system's PATH.
- **Tune Studio API Key**: Sign up at [Tune Studio](https://studio.tune.app/playground) to obtain an API key.
- **Google API Key**: Required for Google's Generative AI Embeddings.

### Installation Steps

1. **Clone the Repository** (repository not yet initialized)

```bash
git clone https://github.com/yourusername/pdf-chatbot.git
cd pdf-chatbot
```

2. **Create a Virtual Environment (Optional but Recommended)**

```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```

3. **Install Dependencies**

```bash
pip install -r requirements.txt
```

If a `requirements.txt` file is not provided, install the dependencies manually:

```bash
pip install streamlit langchain chromadb PyPDF2 tabula-py pandas tiktoken python-dotenv openai pypdf
```

Additionally, install the Google Generative AI Embeddings library:

```bash
pip install langchain-google-genai
```

4. **Set Up Environment Variables**

Create a `.env` file in the project root directory and add your API keys:

```env
TUNE_API_KEY=your-tune-api-key
GOOGLE_API_KEY=your-google-api-key
```

Replace `your-tune-api-key` and `your-google-api-key` with your actual API keys.
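
   For reference, `app.py` loads these values at startup with `python-dotenv`; a minimal sketch of that pattern:

   ```python
   import os
   from dotenv import load_dotenv

   load_dotenv()  # reads key-value pairs from .env into the process environment
   tune_api_key = os.getenv('TUNE_API_KEY')
   google_api_key = os.getenv('GOOGLE_API_KEY')
   ```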

5. **Run the Application**

```bash
streamlit run app.py
```

### Usage

- **Access the App**: Open your web browser and navigate to `http://localhost:8501` (or the URL provided in the terminal).
- **Upload the PDFs**: Use the **"Upload the Cutoff Marks PDF"** and **"Upload the Scorecard PDF"** uploaders to select the documents from your computer.
- **Wait for Processing**:
- **Extracting Text**: The app will extract text from your PDF.
- **Extracting Tables**: Tables within the PDF are extracted and processed.
- **Generating Embeddings**: Text and tables are converted into embeddings and stored in ChromaDB.
- **Interact via Chat**:
- Use the chat interface to ask questions about the content of your PDF.
- The application will generate responses based on the information extracted from your document.
- **View Responses**: The assistant's answers will be displayed below the chat input.

## ⚙️ How It Works

1. **File Upload and Saving**: The user uploads a PDF file, which is saved temporarily for processing.

2. **Extraction**:
- **Text Extraction**: Text content is extracted from the PDF using `PyPDFLoader`.
- **Table Extraction**: Tables are extracted using `tabula-py` and converted into CSV text format.

3. **Document Preparation**: Extracted text and tables are encapsulated into `Document` objects provided by LangChain.

4. **Text Splitting**: The documents are split into smaller chunks using `RecursiveCharacterTextSplitter` to optimize embedding generation and similarity search.

5. **Embeddings Generation**:
- Utilizes `GoogleGenerativeAIEmbeddings` to generate embeddings for each text chunk.
- Embeddings capture the semantic meaning of the text, enabling effective similarity searches.

6. **Vector Storage with ChromaDB**:
- Generated embeddings are stored in ChromaDB, a vector database that allows for efficient retrieval based on vector similarity.

7. **Chat Interaction**:
- When a user inputs a question, the application performs a similarity search in ChromaDB to find relevant document chunks.
- The relevant context is compiled and fed into the prompt for the language model.
- `ChatOpenAI` generates a response using the specified OpenAI model.

8. **Response Display**: The assistant's answer is presented to the user, providing insights based on the content of the uploaded PDFs. (A condensed sketch of the whole pipeline follows this list.)
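
The sketch below condenses these steps into a single script. It reuses the same calls as `app.py`; the file paths and the question are placeholders, and table extraction via `tabula-py` is omitted for brevity:

```python
import os
from dotenv import load_dotenv
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

load_dotenv()

# Steps 1-2: load both PDFs ("cutoff.pdf" / "scorecard.pdf" are placeholder paths)
docs = PyPDFLoader("cutoff.pdf").load() + PyPDFLoader("scorecard.pdf").load()

# Step 4: split into overlapping chunks sized for embedding
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Steps 5-6: embed the chunks and persist them in ChromaDB
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001", google_api_key=os.getenv("GOOGLE_API_KEY")
)
vectordb = Chroma.from_documents(chunks, embedding=embeddings, persist_directory="./chroma_db")

# Steps 7-8: retrieve the top matches and feed them to the chat model as context
question = "Which colleges match a 92.5 percentile?"  # hypothetical question
context = "\n\n".join(d.page_content for d in vectordb.similarity_search(question, k=4))
llm = ChatOpenAI(
    openai_api_key=os.getenv("TUNE_API_KEY"),
    openai_api_base="https://proxy.tune.app/",
    model_name="kaushikaakash04/tune-blob",
)
print(llm.predict(f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"))
```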

## 📚 Dependencies and Libraries

- **Language Models and Embeddings**:
- `langchain`
- `langchain-google-genai`
- **Data Processing and Extraction**:
- `PyPDF2`
- `tabula-py`
- `pandas`
- **Web Application**:
- `streamlit`
- `python-dotenv`
- **Vector Database**:
- `chromadb`
- **Others**:
- `tiktoken`: For tokenization processes within language models.

## 📝 Notes

- **Java Installation**: Ensure that Java is installed and properly configured, as it is required by `tabula-py` for table extraction.
- **API Keys Safety**: Keep your API keys secure and do not expose them publicly.
- **Data Privacy**: Be cautious with sensitive documents, as uploading them to the application will process and temporarily store their content.
- **ChromaDB Persistence**: Embeddings are stored in the `./chroma_db` directory. To clear stored embeddings, you can delete this directory.
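
On the last point: a persisted store can also be reopened in a later session without re-embedding. A minimal sketch, assuming the store was built with the same embedding model used in `app.py`:

```python
from langchain.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# The embedding model must match the one used when the store was built,
# otherwise query vectors will not be comparable to the stored vectors.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
print(vectordb.similarity_search("sample query", k=2))  # "sample query" is a placeholder
```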

## 🐛 Troubleshooting

- **Table Extraction Issues**: If tables are not being extracted properly, verify that Java is installed and consider adjusting `tabula-py` settings or using alternative libraries like `camelot-py`.
- **Module Errors**: Ensure all dependencies are installed. If you encounter a `ModuleNotFoundError`, install the missing library using `pip`.
- **API Errors**: Verify that your API keys are correct and have the necessary permissions. Check your internet connection if API requests fail.
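
For the first point, a minimal sketch of the `camelot-py` alternative (an assumption, not something this app currently uses; install with `pip install "camelot-py[cv]"`, and `file.pdf` is a placeholder path):

```python
import camelot

# flavor="lattice" (the default) expects ruled cell borders; "stream" infers
# columns from whitespace. Try both if a table comes out mangled.
tables = camelot.read_pdf("file.pdf", pages="all", flavor="stream")
for idx, table in enumerate(tables):
    # table.df is a pandas DataFrame, so it can feed the same CSV-text
    # pipeline the app uses for tabula output
    print(f"Table {idx + 1}:\n{table.df.to_csv(index=False)}")
```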
137 changes: 137 additions & 0 deletions community/UniFriend/app.py
@@ -0,0 +1,137 @@
import streamlit as st
import os
from dotenv import load_dotenv
import tempfile

from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.docstore.document import Document
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

import tabula
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Load environment variables
load_dotenv()
OPENAI_API_KEY = os.getenv('TUNE_API_KEY')  # Tune Studio key, used through the OpenAI-compatible proxy
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')

# Initialize Streamlit app
st.set_page_config(page_title="UniFriend", layout="wide")
st.title("📄 UniFriend: PDF Chatbot for better Seat Selection")

# Function to extract text from PDF
def extract_text_from_pdf(file_path):
    loader = PyPDFLoader(file_path)
    pages = loader.load()
    return pages

# Function to extract tables from PDF
def extract_tables_from_pdf(file_path):
    try:
        tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
        documents = []
        for idx, df in enumerate(tables):
            # Serialize each table to CSV text so it can be embedded like prose
            table_text = df.to_csv(index=False)
            doc = Document(
                page_content=table_text,
                metadata={"source": f"Table {idx+1}"}
            )
            documents.append(doc)
        return documents
    except Exception as e:
        st.warning(f"Could not extract tables: {e}")
        return []

# File uploaders
cutoff_pdf = st.file_uploader("Upload the Cutoff Marks PDF", type=["pdf"], key='cutoff')
scorecard_pdf = st.file_uploader("Upload the Scorecard PDF", type=["pdf"], key='scorecard')

if cutoff_pdf is not None and scorecard_pdf is not None:
    # Save uploaded files to temporary locations
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_cutoff_file:
        tmp_cutoff_file.write(cutoff_pdf.getvalue())
        tmp_cutoff_path = tmp_cutoff_file.name

    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_scorecard_file:
        tmp_scorecard_file.write(scorecard_pdf.getvalue())
        tmp_scorecard_path = tmp_scorecard_file.name

    # Extract text and tables from both PDFs
    st.info("Extracting text and tables from PDFs...")
    cutoff_documents = extract_text_from_pdf(tmp_cutoff_path) + extract_tables_from_pdf(tmp_cutoff_path)
    scorecard_documents = extract_text_from_pdf(tmp_scorecard_path) + extract_tables_from_pdf(tmp_scorecard_path)

    # Combine documents
    documents = cutoff_documents + scorecard_documents

    # Split text into chunks for embedding
    st.info("Splitting text into chunks...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    split_docs = text_splitter.split_documents(documents)

    # Generate embeddings and store them in ChromaDB.
    # Note: this runs on every Streamlit rerun, so chunks are re-added to
    # the persisted store each time the script executes.
    st.info("Generating embeddings and storing in vector database...")
    embeddings = GoogleGenerativeAIEmbeddings(model='models/embedding-001', google_api_key=GOOGLE_API_KEY)

    persist_directory = './chroma_db'
    vectordb = Chroma.from_documents(
        documents=split_docs,
        embedding=embeddings,
        persist_directory=persist_directory
    )

    st.success("Embeddings generated and stored in ChromaDB!")

    # Initialize chat history
    if 'history' not in st.session_state:
        st.session_state['history'] = []

    # Chat interface
    st.header("🗨️ Chat with your PDFs")
    user_question = st.text_input("Ask a question about the PDFs:", key='input')

    if user_question:
        # Retrieve the chunks most relevant to the question
        docs = vectordb.similarity_search(user_question, k=4)

        # Combine the relevant chunks into a single context string
        context = "\n\n".join([doc.page_content for doc in docs])

        # Initialize the chat model (Tune Studio's OpenAI-compatible proxy)
        chat_model = ChatOpenAI(
            openai_api_key=OPENAI_API_KEY,
            openai_api_base="https://proxy.tune.app/",
            model_name="kaushikaakash04/tune-blob"
        )

        # Generate a response grounded in the retrieved context
        prompt = f"You are a helpful assistant analyzing PDF documents.\n\nContext:\n{context}\n\nQuestion:\n{user_question}\n\nAnswer:"
        try:
            response = chat_model.predict(prompt)
            st.session_state['history'].append((user_question, response))
            st.write("**Answer:**", response)
        except Exception as e:
            st.error(f"Error generating response: {e}")

    # Display chat history
    if st.session_state['history']:
        st.subheader("Chat History")
        for i, (q, a) in enumerate(st.session_state['history']):
            st.write(f"**Q{i+1}:** {q}")
            st.write(f"**A{i+1}:** {a}")

    # Clean up temporary files; these paths only exist inside this branch,
    # so removal must happen here rather than at the top level.
    os.remove(tmp_cutoff_path)
    os.remove(tmp_scorecard_path)
else:
    st.info("Please upload both the Cutoff Marks PDF and the Scorecard PDF to proceed.")