Experimental comparison of Numerical & Generative Techniques for CDQA and IR, conducted as part of DSC 210: Numerical Linear Algebra course (Fall 2023) at UC San Diego.
This project focuses on evaluating Closed Domain Question Answering (CDQA) through a comprehensive exploration of Numerical Linear Algebra (NLA) and state-of-the-art (SOTA) techniques. Leveraging NLA, methods like Latent Semantic Indexing (LSI) were employed to process textual data, reducing dimensionality through Singular Value Decomposition (SVD). The study extensively compared the NLA approach with SOTA methods, including Retrieval Augmented Generation (RAG) and Large Language Models (LLMs) such as GPT-3.5-turbo. Experiments involved a variety of datasets of questions, documents, and ground truth, assessing accuracy and response times. Multiple sub-experiments were also performed to evaluate the technical soundness of the main experimental strategy. Results indicated that while NLA offered a foundational methodology, modern techniques significantly outperformed it, reaffirming the importance of staying abreast of evolving methodologies in CDQA. The project also validated the choice of using RAG with LLMs and established VectorDB as a superior Information Retrieval system. The findings contribute to ongoing efforts in refining and tailoring models for CDQA, aligning them with the dynamic challenges posed by diverse knowledge domains.
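For readers new to the NLA side, the core of LSI is a truncated SVD of the term-document matrix: documents are represented in a low-rank latent space, and queries are folded into the same space for similarity ranking. A minimal NumPy sketch (the matrix and query below are toy placeholders, not project data):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); values are illustrative.
A = np.array([
    [2., 0., 1.],
    [0., 3., 0.],
    [1., 0., 2.],
    [0., 1., 1.],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # keep only the top-k singular values
doc_vecs = Vt[:k].T * s[:k]            # documents in the k-dimensional latent space

# Fold a query (term-count vector) into the same space: q_k = Sigma_k^{-1} U_k^T q
q = np.array([1., 0., 1., 0.])
q_vec = (U[:, :k].T @ q) / s[:k]

# Rank documents by cosine similarity in the reduced space
cos = (doc_vecs @ q_vec) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
best = int(np.argmax(cos))             # index of the most similar document
```

The project applies the same idea to real TF-IDF matrices; here the truncation to `k = 2` is only to keep the example small.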
- Python Requirements:
  - A Python 3.9.6+ kernel is recommended to run this project. Since the project mainly consists of `.ipynb` notebooks, we recommend the installation of Jupyter Notebook.
  - Clone the repository and navigate to the root folder of the project.
  - Optional: Create a virtual environment for this project (use `venv` or `conda`).
  - Install all dependencies from the `requirements.txt` file with the command `pip install -r requirements.txt`.
- OpenAI Token Requirements:
  - A valid OpenAI token is necessary to run the majority of the project (it is used to generate OpenAI embeddings and GPT/Davinci LLM responses).
  - You'll need to add the `OPENAI_API_KEY` to a `.env` file. An `env.example` is given in the root directory of the repo (open it to see instructions).
  - To get the API token, please follow the instructions in the OpenAI API documentation.
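As a quick illustration, the key can be read from the `.env` file before creating the OpenAI client. The minimal loader below is only a sketch (the project itself may rely on a library such as `python-dotenv` instead):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: export KEY=VALUE lines into the environment.

    Skips blank lines and comments; does not overwrite variables already set.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Usage: load_env(); api_key = os.environ["OPENAI_API_KEY"]
```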
 
- LLaMa-7B Model Requirements: To run the LLaMa-7B model locally, please download it using the following commands:

  ```shell
  cd ./src/
  curl -L "https://replicate.fyi/install-llama-cpp" | bash
  wget https://huggingface.co/localmodels/Llama-2-7B-Chat-ggml/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_S.bin -O ./llama.cpp/models/llama-2-7b-chat.ggmlv3.q4_K_S.bin
  ./llama.cpp/convert-llama-ggml-to-gguf.py --eps 1e-5 -i ./llama.cpp/models/llama-2-7b-chat.ggmlv3.q4_K_S.bin -o ./llama.cpp/models/llama-2-7b-chat.ggmlv3.q4_K_S.gguf.bin
  ```

  The model will then be available at the path `src/llama.cpp/models/llama-2-7b-chat.ggmlv3.q4_K_S.gguf.bin`.
Once these steps are complete, please refer to the next section to understand the project layout.
 
- Our project is primarily composed of Jupyter notebooks, each performing a specific subpart of the project. All the experiments are in the `src` folder, organized based on the datasets used; e.g., `src/cisi` has all experiments performed on the CISI dataset.
- Refer to the Project Structure to understand the file organization and hierarchy.
- The Results Notebook has the compiled results of all our experiments and sub-experiments. Running this notebook will read, process, and compute all results from the responses and data pickle files for each dataset using the corresponding models. These results have also been added to our Notion report.
- If you wish to separately run individual experimental sections for a specific dataset:
  - Navigate to the corresponding dataset subdirectory, i.e., `src/{dataset}`, for example, `src/nfcorpus`. `src/{dataset}/notebooks` contains the notebooks explaining every step of processing towards CDQA.
  - The notebook names are self-explanatory. For example, `src/{dataset}/notebooks/preprocess_data.ipynb` has the code to pre-process, clean, and pickle the documents, queries, and ground truths for the corresponding dataset.
  - Similarly, `src/{dataset}/notebooks/generate_embeddings.ipynb` generates all LSI (reduced-vector) and OpenAI embeddings.
  - To run the experiments without RAG, refer to the notebook `src/{dataset}/notebooks/llm_wo_rag.ipynb`.
  - Similarly, if you would like to run experiments with RAG, the NLA (LSI) approach is given in `src/{dataset}/notebooks/llm_w_rag_exact_search.ipynb` and the SOTA VectorDB approach (with FAISS) is given in `src/{dataset}/notebooks/llm_w_rag_faiss.ipynb`.
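The two RAG notebooks differ mainly in the retrieval step: exact search scores the query against every document embedding, while FAISS builds an index to approximate this at scale. A minimal sketch of the exact (brute-force) cosine search, using random placeholder embeddings rather than real ones:

```python
import numpy as np

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(100, 64))   # placeholder document embeddings
query = rng.normal(size=64)             # placeholder query embedding

# Normalize so that dot products equal cosine similarities
doc_unit = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
q_unit = query / np.linalg.norm(query)

scores = doc_unit @ q_unit              # one cosine score per document
top_k = np.argsort(scores)[::-1][:5]    # indices of the 5 most similar documents
```

Exact search is O(number of documents) per query, which is why the FAISS-backed VectorDB variant exists for larger corpora.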
- Datastores & Pickle Files: If you wish to view the intermediate outputs after each stage of the RAG pipeline, you can do so by manually loading and viewing the pickle files corresponding to the steps.
  - `src/{dataset}/dataset/`: Pickled, pre-processed, cleaned documents, queries, and ground truth files for the dataset, i.e., the output of the `preprocess_data.ipynb` notebook.
  - `src/{dataset}/embeddings/`: Pickled OpenAI & LSI embeddings in corresponding folders, i.e., the output of the `generate_embeddings.ipynb` notebook.
  - `src/{dataset}/ir_techniques/`: Index files necessary for IR in the RAG pipeline for each mode of IR, i.e., the NLA technique (output of LSI + Truncated SVD) and the SOTA techniques (FAISS/ANNOY VectorDB index files).
  - `src/{dataset}/responses/`: LLM-specific responses for the queries in the dataset. This includes pickled responses from all LLMs we ran for that dataset. For example, `src/nfcorpus/responses/gpt-3.5-turbo-instruct/llm_w_rag_exact_search.pkl` has the responses returned by `gpt-3.5-turbo` with LSI + SVD (exact search) as the IR technique used for RAG.
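Any of the artifacts above can be inspected with a small helper like the one below (the path in the comment follows the layout just described, but is only illustrative):

```python
import pickle

def load_pickle(path: str):
    """Load one intermediate artifact (documents, embeddings, index, or responses)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# e.g. responses = load_pickle(
#     "src/nfcorpus/responses/gpt-3.5-turbo-instruct/llm_w_rag_exact_search.pkl")
```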
 
- Raise a PR on a separate branch for code updates and request a code owner review.
- Mark TODOs as issues.
 
We appreciate any ideas and contributions from the open-source community.
We aim to make this project accessible and modularized enough to use as a plug-and-play model for evaluating traditional and modern Closed Domain Question Answering approaches.
Please feel free to contact the authors if you are interested in contributing or collaborating: