diff --git a/community/UniFriend/1.png b/community/UniFriend/1.png
new file mode 100644
index 0000000..bf954fb
Binary files /dev/null and b/community/UniFriend/1.png differ
diff --git a/community/UniFriend/2.png b/community/UniFriend/2.png
new file mode 100644
index 0000000..16476aa
Binary files /dev/null and b/community/UniFriend/2.png differ
diff --git a/community/UniFriend/3.png b/community/UniFriend/3.png
new file mode 100644
index 0000000..f13caea
Binary files /dev/null and b/community/UniFriend/3.png differ
diff --git a/community/UniFriend/4.png b/community/UniFriend/4.png
new file mode 100644
index 0000000..f0f8a50
Binary files /dev/null and b/community/UniFriend/4.png differ
diff --git a/community/UniFriend/README.md b/community/UniFriend/README.md
new file mode 100644
index 0000000..c37bc3d
--- /dev/null
+++ b/community/UniFriend/README.md
@@ -0,0 +1,175 @@
+# 📄 UniFriend: PDF Chatbot for Better Seat Selection (v0.0.1)
+
+This project is a full-stack Python application that lets users interact with PDF documents (score card, seat matrix, and merit cutoff PDFs) through a chat interface. It leverages open-source technologies and integrates with advanced language models to provide insightful, conversational responses based on the content of the uploaded PDFs.
+
+## Why this problem statement
+My brother recently took admission to first-year engineering. During his CAP rounds (MHT-CET 2024) we faced a lot of issues with seat selection. By the end of CAP round 3 there were six PDFs of 2000+ pages each: three held seat matrix information and the other three held merit cut-offs, making it very overwhelming for parents as well as students to cross-check a rank against seat availability. This application is a small effort to help the next generation find a suitable match.
+
+## Current Version: v0.0.1
+The current application can take two PDFs as input and answer questions based on a comparison of the two: it analyses the score card and suggests colleges according to the score.
+
+## Next Steps
+- v0.0.2 will parse the marksheet/score card and the cutoff PDF more efficiently to find a suitable seat for the candidate.
+- v0.0.3 will be able to compare three seat matrix PDFs, three cutoff PDFs, and the score card.
+
+## 🎯 Main Features
+
+- **PDF Upload**: Users can upload PDF files directly through the web interface.
+- **Text and Table Extraction**: The app extracts text and tables from the uploaded PDFs, handling complex documents with tabular data.
+- **Data Processing**: Extracted content is processed and split into manageable chunks for embedding generation.
+- **Embeddings Generation**: Utilizes Google's Generative AI Embeddings to convert text chunks into embeddings.
+- **Vector Database Storage**: Embeddings are stored in ChromaDB, an open-source vector database for efficient similarity search.
+- **Chat Interface**: Users can interact with the PDF content via a chat interface, asking questions and receiving answers generated by a language model.
+- **Language Model Integration**: Integrates with a specified OpenAI-compatible model to generate human-like responses based on the PDF content.
+
+## 🛠️ Technologies Used
+
+- **[Streamlit](https://streamlit.io/)**: Web application framework for creating interactive frontend interfaces.
+- **[LangChain](https://langchain.readthedocs.io/)**: Framework for developing applications powered by language models.
+- **[ChromaDB](https://www.trychroma.com/)**: Open-source embedding database for storing and querying vector embeddings.
+- **[PyPDFLoader (LangChain)](https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/pdf.html)**: Utility for extracting text from PDF documents.
+- **[tabula-py](https://tabula-py.readthedocs.io/en/latest/)**: Python wrapper for Tabula, enabling extraction of tables from PDFs.
+- **[Pandas](https://pandas.pydata.org/)**: Data manipulation and analysis library, used here for handling tabular data.
+- **[GoogleGenerativeAIEmbeddings](https://github.com/hwchase17/langchain/blob/master/libs/langchain/langchain/embeddings/google_palm.py)**: Generates embeddings using Google's embedding models.
+- **[ChatOpenAI (LangChain)](https://langchain.readthedocs.io/en/latest/modules/models/chat/integrations/openai.html)**: Interface for interacting with OpenAI-compatible chat models.
+- **[Python Dotenv](https://pypi.org/project/python-dotenv/)**: Facilitates the use of environment variables from a `.env` file.
+
+## Glimpse
+
+![Img 1](1.png)
+![Img 2](2.png)
+![Img 3](3.png)
+![Img 4](4.png)
+
+## 🚀 Getting Started
+
+### Prerequisites
+
+- **Python 3.7+**
+- **Java**: Required by `tabula-py` for table extraction.
+  - Install from [Java Downloads](https://www.java.com/en/download/) and ensure it is added to your system's PATH.
+- **Tune Studio API Key**: Sign up at [Tune Studio](https://studio.tune.app/playground) to obtain an API key.
+- **Google API Key**: Required for Google's Generative AI Embeddings.
+
+### Installation Steps
+
+1. **Clone the Repository** (this repository is yet to be initialized)
+
+   ```bash
+   git clone https://github.com/yourusername/pdf-chatbot.git
+   cd pdf-chatbot
+   ```
+
+2. **Create a Virtual Environment (Optional but Recommended)**
+
+   ```bash
+   python -m venv venv
+   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
+   ```
+
+3. **Install Dependencies**
+
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+   If a `requirements.txt` file is not provided, install the dependencies manually:
+
+   ```bash
+   pip install streamlit langchain chromadb PyPDF2 tabula-py pandas tiktoken python-dotenv openai pypdf
+   ```
+
+   Additionally, install the Google Generative AI Embeddings library:
+
+   ```bash
+   pip install langchain-google-genai
+   ```
+
+4. **Set Up Environment Variables**
+
+   Create a `.env` file in the project root directory and add your API keys:
+
+   ```env
+   TUNE_API_KEY=your-tune-api-key
+   GOOGLE_API_KEY=your-google-api-key
+   ```
+
+   Replace `your-tune-api-key` and `your-google-api-key` with your actual API keys.
+
+5. **Run the Application**
+
+   ```bash
+   streamlit run app.py
+   ```
+
+### Usage
+
+- **Access the App**: Open your web browser and navigate to `http://localhost:8501` (or the URL provided in the terminal).
+- **Upload the PDFs**: Use the two upload widgets to select the Cutoff Marks PDF and the Scorecard PDF from your computer.
+- **Wait for Processing**:
+  - **Extracting Text**: The app extracts text from your PDFs.
+  - **Extracting Tables**: Tables within the PDFs are extracted and processed.
+  - **Generating Embeddings**: Text and tables are converted into embeddings and stored in ChromaDB.
+- **Interact via Chat**:
+  - Use the chat interface to ask questions about the content of your PDFs.
+  - The application generates responses based on the information extracted from your documents.
+- **View Responses**: The assistant's answers are displayed below the chat input.
+
+## ⚙️ How It Works
+
+1. **File Upload and Saving**: The user uploads a PDF file, which is saved temporarily for processing.
+
+2. **Extraction**:
+   - **Text Extraction**: Text content is extracted from the PDF using `PyPDFLoader`.
+   - **Table Extraction**: Tables are extracted using `tabula-py` and converted into CSV text format.
+
+3. **Document Preparation**: Extracted text and tables are encapsulated into `Document` objects provided by LangChain.
+
+4. **Text Splitting**: The documents are split into smaller chunks using `RecursiveCharacterTextSplitter` to optimize embedding generation and similarity search.
+
+5. **Embeddings Generation**:
+   - Utilizes `GoogleGenerativeAIEmbeddings` to generate embeddings for each text chunk.
+   - Embeddings capture the semantic meaning of the text, enabling effective similarity searches.
+
+6. **Vector Storage with ChromaDB**:
+   - Generated embeddings are stored in ChromaDB, a vector database that allows efficient retrieval based on vector similarity.
+
+7. **Chat Interaction**:
+   - When a user inputs a question, the application performs a similarity search in ChromaDB to find relevant document chunks.
+   - The relevant context is compiled and fed into the prompt for the language model.
+   - `ChatOpenAI` generates a response using the specified model.
+
+8. **Response Display**: The assistant's answer is presented to the user, providing insights based on the content of the uploaded PDFs.
+
+## 📚 Dependencies and Libraries
+
+- **Language Models and Embeddings**:
+  - `langchain`
+  - `langchain-google-genai`
+- **Data Processing and Extraction**:
+  - `PyPDF2`
+  - `tabula-py`
+  - `pandas`
+- **Web Application**:
+  - `streamlit`
+  - `python-dotenv`
+- **Vector Database**:
+  - `chromadb`
+- **Others**:
+  - `tiktoken`: For tokenization within language models.
+
+## 📝 Notes
+
+- **Java Installation**: Ensure that Java is installed and properly configured, as it is required by `tabula-py` for table extraction.
+- **API Key Safety**: Keep your API keys secure and do not expose them publicly.
+- **Data Privacy**: Be cautious with sensitive documents, as uploading them will process and temporarily store their content.
+- **ChromaDB Persistence**: Embeddings are stored in the `./chroma_db` directory. To clear stored embeddings, delete this directory.
+
+## 🐛 Troubleshooting
+
+- **Table Extraction Issues**: If tables are not extracted properly, verify that Java is installed and consider adjusting `tabula-py` settings or using an alternative library such as `camelot-py`.
+- **Module Errors**: Ensure all dependencies are installed. If you encounter a `ModuleNotFoundError`, install the missing library with `pip`.
+- **API Errors**: Verify that your API keys are correct and have the necessary permissions. Check your internet connection if API requests fail.
diff --git a/community/UniFriend/app.py b/community/UniFriend/app.py
new file mode 100644
index 0000000..3d041a9
--- /dev/null
+++ b/community/UniFriend/app.py
@@ -0,0 +1,137 @@
+import streamlit as st
+import os
+from dotenv import load_dotenv
+import tempfile
+
+from langchain.llms import OpenAI
+from langchain.chat_models import ChatOpenAI
+from langchain.document_loaders import PyPDFLoader
+from langchain.docstore.document import Document
+from langchain.indexes import VectorstoreIndexCreator
+from langchain.vectorstores import Chroma
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain.embeddings import OpenAIEmbeddings
+
+import chromadb
+from chromadb.utils import embedding_functions
+import pandas as pd
+import tabula
+from langchain_google_genai import GoogleGenerativeAIEmbeddings
+
+# Load environment variables
+load_dotenv()
+OPENAI_API_KEY = os.getenv('TUNE_API_KEY')
+GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
+
+# Initialize Streamlit app
+st.set_page_config(page_title="UniFriend", layout="wide")
+st.title("📄 UniFriend: PDF Chatbot for Better Seat Selection")
+
+# Extract the text of a PDF as a list of per-page Documents
+def extract_text_from_pdf(file_path):
+    loader = PyPDFLoader(file_path)
+    pages = loader.load()
+    return pages
+
+# Extract tables from a PDF and wrap each one in a Document as CSV text
+def extract_tables_from_pdf(file_path):
+    try:
+        tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
+        documents = []
+        for idx, df in enumerate(tables):
+            table_text = df.to_csv(index=False)
+            doc = Document(
+                page_content=table_text,
+                metadata={"source": f"Table {idx+1}"}
+            )
+            documents.append(doc)
+        return documents
+    except Exception as e:
+        st.warning(f"Could not extract tables: {e}")
+        return []
+
+# File uploaders
+cutoff_pdf = st.file_uploader("Upload the Cutoff Marks PDF", type=["pdf"], key='cutoff')
+scorecard_pdf = st.file_uploader("Upload the Scorecard PDF", type=["pdf"], key='scorecard')
+
+if cutoff_pdf is not None and scorecard_pdf is not None:
+    # Save uploaded files to temporary locations
+    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_cutoff_file:
+        tmp_cutoff_file.write(cutoff_pdf.getvalue())
+        tmp_cutoff_path = tmp_cutoff_file.name
+
+    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_scorecard_file:
+        tmp_scorecard_file.write(scorecard_pdf.getvalue())
+        tmp_scorecard_path = tmp_scorecard_file.name
+
+    # Extract text and tables from the PDFs
+    st.info("Extracting text and tables from PDFs...")
+    cutoff_documents = extract_text_from_pdf(tmp_cutoff_path) + extract_tables_from_pdf(tmp_cutoff_path)
+    scorecard_documents = extract_text_from_pdf(tmp_scorecard_path) + extract_tables_from_pdf(tmp_scorecard_path)
+
+    # Combine documents
+    documents = cutoff_documents + scorecard_documents
+
+    # Split text for embeddings
+    st.info("Splitting text into chunks...")
+    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
+    split_docs = text_splitter.split_documents(documents)
+
+    # Generate embeddings and store them in ChromaDB
+    st.info("Generating embeddings and storing in vector database...")
+    embeddings = GoogleGenerativeAIEmbeddings(model='models/embedding-001', google_api_key=GOOGLE_API_KEY)
+
+    persist_directory = './chroma_db'
+    vectordb = Chroma.from_documents(
+        documents=split_docs,
+        embedding=embeddings,
+        persist_directory=persist_directory
+    )
+
+    st.success("Embeddings generated and stored in ChromaDB!")
+
+    # Initialize chat history
+    if 'history' not in st.session_state:
+        st.session_state['history'] = []
+
+    # Chat interface
+    st.header("🗨️ Chat with your PDFs")
+    user_question = st.text_input("Ask a question about the PDFs:", key='input')
+
+    if user_question:
+        # Retrieve relevant documents
+        docs = vectordb.similarity_search(user_question, k=4)
+
+        # Combine relevant documents into a single context string
+        context = "\n\n".join([doc.page_content for doc in docs])
+
+        # Initialize the chat model (OpenAI-compatible Tune Studio endpoint)
+        chat_model = ChatOpenAI(
+            openai_api_key=OPENAI_API_KEY,
+            openai_api_base="https://proxy.tune.app/",
+            model_name="kaushikaakash04/tune-blob"
+        )
+
+        # Generate response
+        prompt = f"You are a helpful assistant analyzing PDF documents.\n\nContext:\n{context}\n\nQuestion:\n{user_question}\n\nAnswer:"
+        try:
+            response = chat_model.predict(prompt)
+            st.session_state['history'].append((user_question, response))
+            st.write("**Answer:**", response)
+        except Exception as e:
+            st.error(f"Error generating response: {e}")
+
+    # Display chat history
+    if st.session_state['history']:
+        st.subheader("Chat History")
+        for i, (q, a) in enumerate(st.session_state['history']):
+            st.write(f"**Q{i+1}:** {q}")
+            st.write(f"**A{i+1}:** {a}")
+else:
+    st.info("Please upload both the Cutoff Marks PDF and the Scorecard PDF to proceed.")
+
+# Clean up temporary files (the paths only exist when both PDFs were uploaded,
+# so guard on both to avoid a NameError when only one file is present)
+if cutoff_pdf is not None and scorecard_pdf is not None:
+    os.remove(tmp_cutoff_path)
+    os.remove(tmp_scorecard_path)
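
The retrieve-then-prompt loop at the heart of `app.py` (similarity search in ChromaDB, then stuffing the top chunks into the prompt) can be sketched without any external services. This is a minimal, dependency-free illustration, not the app's real pipeline: the bag-of-words `embed` function is a toy stand-in for `GoogleGenerativeAIEmbeddings`, and the cosine ranking approximates what `vectordb.similarity_search` does internally.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding": a stand-in for a real embedding model
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(chunks, question, k=2):
    # Rank chunks by similarity to the question, like similarity_search(k=...)
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]

# Hypothetical chunks standing in for split cutoff/scorecard documents
chunks = [
    "College A computer engineering cutoff rank 1500",
    "College B mechanical engineering cutoff rank 9000",
    "Candidate score card rank 1400 PCM group",
]
question = "What is the cutoff rank for computer engineering admission?"
context = "\n\n".join(retrieve(chunks, question))
prompt = f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
```

Only the assembled `prompt` is then sent to the chat model; with real embeddings the ranking step is the same idea, just with dense vectors instead of word counts.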