A lightweight memory-augmented RAG Research Assistant application built with FastAPI, Ollama, ChromaDB, and Transformers. Supports persistent memory, tool calling, and Docker deployment.
- ✅ Switchable LLM backends (Ollama / llama.cpp) via `.env` configuration.
- ✅ Tool-calling support (currently implements Wikipedia search).
- ✅ Streaming responses supported via WebSockets. To test streaming, run the application and open `socket_tester.html` from the root directory.
- ✅ File upload / clear support with embedding generation.
- ✅ Multi-stage Docker builds to reduce image size.
- ✅ Data ingestion with incremental upsert and an option to overwrite.
- ✅ Custom RAG agent orchestration with memory augmentation.
- ✅ `/metrics` API with a Prometheus client.
- ✅ Make commands to run the app in Docker and to generate a deployable Docker image (internally uses Docker Compose).
- ✅ ~850 MB final, fully initialized Docker image size for easy deployment and stable usage even on low-spec machines.
- ✅ Custom embedder support. **Important:** a valid ONNX encoder bundle must be present in the `/encoders` directory under the same model name as set in `.env`. By default, *all-MiniLM-L6-v2-onnx* is present in the encoders directory; the bundle must include the `.onnx` file, `config.json`, etc.
- ✅ Fast, low-latency responses.
- ✅ Rotating logfile generation.
- ✅ Unit tests with 85% coverage!
Click to view latest coverage table
| File | Statements | Missing | Excluded | Coverage |
|---|---:|---:|---:|---:|
| src/__init__.py | 0 | 0 | 0 | 100% |
| src/agent/__init__.py | 0 | 0 | 0 | 100% |
| src/agent/rag_agent.py | 42 | 0 | 0 | 100% |
| src/agent/tools/__init__.py | 0 | 0 | 0 | 100% |
| src/agent/tools/tool_base.py | 5 | 1 | 0 | 80% |
| src/agent/tools/wikipedia_search_tool.py | 15 | 2 | 0 | 87% |
| src/api/__init__.py | 0 | 0 | 0 | 100% |
| src/api/app.py | 17 | 2 | 0 | 88% |
| src/api/constants.py | 12 | 0 | 0 | 100% |
| src/api/dependencies/__init__.py | 0 | 0 | 0 | 100% |
| src/api/dependencies/services.py | 20 | 3 | 0 | 85% |
| src/api/routes/__init__.py | 0 | 0 | 0 | 100% |
| src/api/routes/ask.py | 47 | 9 | 0 | 81% |
| src/api/routes/file_io.py | 37 | 3 | 0 | 92% |
| src/api/routes/ingest.py | 42 | 10 | 0 | 76% |
| src/api/routes/metrics.py | 6 | 0 | 0 | 100% |
| src/api/schemas/__init__.py | 0 | 0 | 0 | 100% |
| src/api/schemas/ask.py | 7 | 0 | 0 | 100% |
| src/config/__init__.py | 0 | 0 | 0 | 100% |
| src/config/config.py | 36 | 0 | 0 | 100% |
| src/ingestion/__init__.py | 0 | 0 | 0 | 100% |
| src/ingestion/document_ingestor.py | 65 | 1 | 0 | 98% |
| src/llm/llama_cpp_client.py | 64 | 11 | 0 | 83% |
| src/llm/llm_client_base.py | 5 | 1 | 0 | 80% |
| src/llm/ollama_client.py | 40 | 0 | 0 | 100% |
| src/store/__init__.py | 0 | 0 | 0 | 100% |
| src/store/memory_buffer.py | 13 | 0 | 0 | 100% |
| src/store/vector_store_client.py | 33 | 1 | 0 | 97% |
| src/utils/__init__.py | 0 | 0 | 0 | 100% |
| src/utils/embedding_generator.py | 39 | 2 | 0 | 95% |
| src/utils/initializer.py | 24 | 24 | 0 | 0% |
| src/utils/logger_config.py | 22 | 22 | 0 | 0% |
| src/utils/metrics.py | 4 | 0 | 0 | 100% |
| **Total** | **595** | **92** | **0** | **85%** |

- ✅ Ensured code quality with mypy, ruff/pylint, and consistent formatting with black.
- Features
- Running The App
- Configurations
- System Architecture
- Known Issues 🐞
Make configurations are available for running the app in Docker.
- Docker (Needed for using Make commands)
- Make
- Ollama (Optional)
- Python 3 (Optional)
- LLM model (`.gguf`) (Needed to run using llama.cpp)
That's it. Build it and it can run offline!
`make dev` - Initiates Docker Compose to run the app based on the `LLM_BACKEND` `.env` flag. (See `.env.example`.)
`make dev-with-ollama-container` - Creates separate containers hosting Ollama and the app; they communicate internally. The Phi3 mini model is set up by default, and entrypoint scripts make the app wait for the Ollama container to be ready before starting.
Note
`make stop` - Removes Docker containers and any startup scripts
`make clean` - Stops and removes Docker containers and images, including any volume mounts
`make create` - Creates a Docker image `.tar` file for deployment
`make test` - Generates an interactive version of the test coverage report. This requires dev dependencies, which can be installed with `pip install -r dev-requirements.txt`.
Note
To see the latest test results, open http://localhost:8002 after running: `cd ./htmlcov && python -m http.server 8002`
Caution
If you are using the "run" command, ensure the app container can access the Ollama URL (if using the Ollama backend) or supply a `.gguf` LLM model in the `/models` directory (if using the llama.cpp backend). On Linux systems, this can be done by creating a `config.toml` at `$HOME_DIR/.ollama/config.toml` and adding the following:

```toml
[server]
listen = "0.0.0.0:11434"
```
You can also host the application locally. The project source contains a `requirements.txt` listing the dependencies needed by the project.
- Create a virtual env: `python -m venv .venv && source ./.venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
- Tests are written with pytest; install dev dependencies to run them: `pip install -r dev-requirements.txt`
- Run the main module: `python -m main`, or `python ./main.py` if you are in the ROOT_DIR
- System logs are available in the `/logs` folder.
To avoid permission issues, remove any stale `/models` (llama.cpp models), `/chroma_db` (persistent vector storage), `/logs`, or `/sample_data` (documents to be searched) directories.
The project depends on user configuration. Use a `.env` file to set configurations when running locally. Makefile commands internally use Docker Compose to set up containers; for those, the environment variables can be set in `docker-compose.base.yaml`.
- `LLM_BACKEND` - Switches between backends.
- `LLAMA_CPP_MODEL_FILE` - A compatible LLM model for llama.cpp, if using this backend.
- `OLLAMA_CLIENT_URL` - If the Ollama backend is hosted at a non-default URL, point to it directly via this flag.
These flags are enough to switch backends and run the app. There are other configurations (request timeout, context window, etc.); check `.env.example` for the available configs.
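For reference, a minimal `.env` sketch with the three flags above. The values shown are illustrative only; the exact accepted values and any additional flags are defined in `.env.example`:

```env
# Illustrative values -- consult .env.example for the real options
LLM_BACKEND=ollama
LLAMA_CPP_MODEL_FILE=model.gguf
OLLAMA_CLIENT_URL=http://localhost:11434
```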
This document outlines the system architecture for the application, including its main components and their interactions.
```mermaid
graph TB
    A[Main] --> B[System]
    A --> C[FastAPI App]
    B --> D[Configs]
    B --> E[Logger]
    B --> F[Metrics]
    C --> H[Routes]
    H --> I[File I/O]
    H --> J[Ask]
    H --> K[Metrics]
    H --> L[Ingest]
    M[RAG Agent] <--> N[Memory]
    M <--> R[Embedding Generator]
    M <--> O[Document Ingestor]
    M <--> S[LLM Client]
    O <--> P[Vector DB Client]
    P --> Q[Chroma DB]
    B <--> M
    C <--> B
```
- The central entry point of the application. It can be run as a module: `python -m main`. If using the Makefile, it is invoked by Docker Compose.
- The API server (`src/api/app.py`), triggered by `main.py`. Exposes REST endpoints via routes.
- View the Swagger UI for the exact endpoints and their schemas.
- Route files include:
  - `Ask` - Accepts natural-language questions, calls the RAG agent, and returns responses.
  - `Ingest` - Triggers background ingestion of documents into the vector DB.
  - `File I/O` - Handles uploaded documents, which are used by the agent during ask.
  - `Metrics` - Returns usage stats.
Warning
- Core utilities used across the app, along with the custom RAG agent that orchestrates the flow.
- Includes:
  - Configs (`config.py`): Loads env vars, defaults, and model paths. Initialized prior to the API.
  - Logger: File-size-based rotating logs implemented with the `logging` module. See `logger_config.py` for adjustments.
  - Metrics: Time/performance tracking using Prometheus hooks via `prometheus-client`. See `utils/metrics.py`.
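As an illustration of the file-size-based rotating-logs approach described above, here is a minimal sketch using the standard library's `RotatingFileHandler`. The function name and size limits here are hypothetical, not taken from `logger_config.py`:

```python
import logging
from logging.handlers import RotatingFileHandler

def get_logger(log_file: str, name: str = "rag_app") -> logging.Logger:
    """Return a logger that rotates its file at ~1 MB, keeping 3 backups."""
    handler = RotatingFileHandler(log_file, maxBytes=1_000_000, backupCount=3)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```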
```mermaid
flowchart TD
    %% User Interaction
    A[User Query] --> B[**/ask** Endpoint]
    %% RAG Agent
    B --> C[RAG Agent]
    %% Memory Lookup
    C --> D[Load Short-Term Memory]
    %% Embedding + Retrieval
    C --> E[Embed Query<br>]
    E --> F[Search ChromaDB<br>]
    F --> G[Retrieved Chunks]
    %% Tool Usage
    C --> H[Run Tools<br>]
    H --> I[Tool Results]
    %% Combine Context
    G --> J[Merge Memory + Docs + Tools]
    I --> J
    D --> J
    %% LLM Generation
    J --> K[Prompt LLM<br>]
    %% Store and Respond
    K --> L[Save QA to MemoryBuffer]
    L --> M[Return Response to User]
    %% Cycle back
    M --> A
```
The `RAGAgent` class (see `rag_agent.py`) orchestrates the full pipeline for answering user queries using contextually enriched retrieval-augmented generation. It integrates multiple components:
- A vector store (`VectorStoreClient`) to retrieve relevant documents
- An LLM client (`LLMClientBase`) to generate answers
- A memory buffer to maintain short-term conversation context
- An embedding generator to convert queries into vector representations
- An optional tool runner (Wikipedia search implemented here) for low-confidence results
Important
- Embed incoming queries using an ONNX-based encoder
- Retrieve the top-k relevant chunks from ChromaDB
- Optionally fall back to a tool (Wikipedia) if confidence is low
- Construct a prompt using retrieved context plus prior memory
- Call the LLM and return structured results (answer + sources)
- Store the turn in memory for continuity
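The steps above can be sketched end-to-end as follows. All component interfaces here (`embed`, `query_by_vector`, the chunk dict shape, the confidence threshold) are illustrative assumptions, not the project's actual signatures:

```python
CONFIDENCE_THRESHOLD = 0.5  # assumed cutoff for tool fallback

def ask(query, embedder, store, llm, memory, tool=None):
    vector = embedder.embed(query)                # 1. embed the query
    chunks = store.query_by_vector(vector, k=3)   # 2. retrieve top-k chunks
    # 3. fall back to a tool (e.g. Wikipedia) when retrieval confidence is low
    best = max((c["score"] for c in chunks), default=0.0)
    if tool and best < CONFIDENCE_THRESHOLD:
        chunks.append({"text": tool.run(query), "source": tool.name, "score": 1.0})
    # 4. build the prompt from prior memory + retrieved context
    context = "\n".join(c["text"] for c in chunks)
    prompt = f"{memory.format_history()}\nContext:\n{context}\nQuestion: {query}"
    answer = llm.generate(prompt=prompt)          # 5. call the LLM
    memory.add_turn(query, answer)                # 6. store the turn for continuity
    return {"answer": answer, "sources": [c["source"] for c in chunks]}
```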
While frameworks offer graph-based orchestration and plugin support, this project uses a custom implementation because it:

- ✅ Stays dependency-free and avoids over-engineering
- ✅ Showcases an understanding of the underlying RAG principles rather than abstracting them away behind libraries and frameworks
- ✅ Enables rapid prototyping
- ✅ Gives full control and testability via a custom class-based architecture
- `LangGraph` could be integrated to define complex decision logic (e.g. finer control over when to use tools, multi-hop reasoning)
- `LangChain` may be used to chain retrievers, summarizers, and rankers
- These will be evaluated as feature requirements grow
The `MemoryBuffer` class holds a short-term history of conversation turns and is used to maintain context across user queries. (See `memory_buffer.py`.)
- Configurable number of turns (`max_turns`)
- Trims the oldest interactions to respect the buffer size
- Provides formatted history for prompt injection
```
User: What is AI?
Assistant: AI stands for Artificial Intelligence...
User: And what about AGI?
Assistant: AGI refers to Artificial General Intelligence...
```
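A minimal sketch of this buffer behaviour, assuming a deque-backed implementation (the real class in `memory_buffer.py` may differ in detail):

```python
from collections import deque

class MemoryBuffer:
    def __init__(self, max_turns: int = 5):
        # deque drops the oldest turn automatically once max_turns is exceeded
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def format_history(self) -> str:
        # Formatted history for prompt injection, as in the example above
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)
```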
`EmbeddingGenerator` is a singleton class that loads an ONNX model and tokenizer from disk to generate vector embeddings, which are used for document retrieval. (See `embedding_generator.py`.)
- Loads the embedding model (`MiniLM`, `all-MiniLM-L6-v2`, etc.) from a local ONNX model folder
- Uses a HuggingFace tokenizer and an ONNXRuntime session
- Performs mean pooling with attention mask to ignore padding
- Normalizes embeddings to unit length for cosine similarity
- High-performance inference even on CPU-only systems
- Small binary size (~90MB for MiniLM)
- Reusable session instance across requests (singleton pattern)
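The mean-pooling and normalization steps listed above can be sketched in pure Python for clarity (the real code operates on ONNXRuntime output arrays, not lists):

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for i in range(dim):
                total[i] += vec[i]
    return [t / max(count, 1) for t in total]

def normalize(vec):
    """Scale to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```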
Important
Embedding generation was originally done with the Sentence Transformers module, but that pushed the final Docker image to ~6.76 GB. With this HuggingFace embedder and the Transformers library, the final image size was reduced by almost 9 times!
The `VectorStoreClient` class wraps access to the ChromaDB vector store, providing an interface to insert, query, and delete vectorized document chunks. It supports incremental upserts (adjustable to overwrite instead via a query param; see `vector_store_client.py`).
- Connects to ChromaDB via `PersistentClient`
- Creates/retrieves a named collection
- Supports vector-based retrieval, manual metadata filtering, and ID-based deletion
- `add_documents`: Ingests vectorized documents into the store
- `query`: Performs a text-based retrieval
- `query_by_vector`: Vector similarity search
- `delete_by_source`: Deletes all chunks for a given document filename
- Add support for document tagging or versioning
- Extend with custom `EmbeddingFunction` adapters for hybrid retrieval
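The incremental-upsert vs. overwrite semantics can be illustrated with a small in-memory sketch. This is not the actual ChromaDB-backed client; the method names mirror the ones listed above but only mimic the described behaviour:

```python
class InMemoryStore:
    def __init__(self):
        self.docs = {}  # id -> document text

    def add_documents(self, ids, documents, overwrite=False):
        if overwrite:
            # Overwrite mode: drop everything and start fresh
            self.docs.clear()
        # Incremental upsert: new ids are inserted, existing ids updated in place
        self.docs.update(zip(ids, documents))

    def delete_by_source(self, source):
        # Delete all chunks whose id is prefixed by the source filename
        self.docs = {i: d for i, d in self.docs.items() if not i.startswith(source)}
```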
The LLM client controls the actual communication with an underlying LLM model such as `Phi3`. The `LLMClientBase` abstract base class defines the unified interface for any LLM client backend. (See `llm_client_base.py`.)
The project currently supports an Ollama client and llama.cpp. The app supports switchable backends, and each client supports streaming responses.
Any specific LLM client must inherit from this ABC. It enforces the function:

- `generate(**kwargs) -> str`: Accepts prompts and returns a response (streaming or not)
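A sketch of what such an ABC looks like, with a toy backend to show the contract; the signatures are illustrative, and the real definitions live in `llm_client_base.py`:

```python
from abc import ABC, abstractmethod

class LLMClientBase(ABC):
    @abstractmethod
    def generate(self, **kwargs) -> str:
        """Accept a prompt and return the model response."""

class EchoClient(LLMClientBase):
    # Toy backend: any concrete client only needs to implement generate()
    def generate(self, **kwargs) -> str:
        return f"echo: {kwargs.get('prompt', '')}"
```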
`OllamaClient` is a client wrapper that connects to a local Ollama server via its `/api/generate` HTTP API. (See `ollama_client.py`.)
- Supports both streaming and non-streaming responses
- Automatically appends `system_prompt` if provided
- Gracefully handles JSON decoding issues
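Ollama's `/api/generate` streams newline-delimited JSON objects, each carrying a `response` fragment. Graceful handling of decoding issues can be sketched as below; the function name and the skip-on-error policy are illustrative, not the project's actual code:

```python
import json

def collect_stream(lines):
    """Assemble a streamed answer, skipping lines that fail to decode."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        try:
            chunk = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a malformed line instead of failing the stream
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```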
LlamaCppClient wraps the native llama.cpp library (via llama-cpp-python) for ultra-efficient local inference using GGUF models.
- Supports streaming and non-streaming responses
- Can run fully offline using mmap/mlock memory access
- Add quantization-aware memory profiling
- Use the `n_batch` config for large-batch inferencing
- Add support for context persistence (e.g. KV cache reuse)
The app is effectively issue-free when running locally, but there are some issues when running in containerized setups.
- **Ollama Container Initialization Delay** - Ollama can take several seconds to load and initialize on startup, especially with large models. Newly created containers also suffer from a cold-start problem, causing the first request from the app to incur a significant delay.
- **Timeouts and Slow Responses in Containerized Setups** - Communication between the RAG app and Ollama inside Docker containers is slower than calling a locally installed Ollama server.
- **Memory Constraints on Low-RAM Systems** - Although the app itself runs within manageable memory, running Ollama with larger models requires significant RAM (≥5-6 GB); the Phi3 model alone uses ~2.6 GB. Systems with less memory (e.g. 4 GB) may fail to start Ollama properly or run into out-of-memory errors.
- **Platform-Specific Networking Issues** - Accessing Ollama running on the host machine from a container is tricky. Using `localhost` inside containers will not work; instead, `host.docker.internal` or Docker's `extra_hosts` mapping must be configured, which varies across OSes.
- **Container Permission Issues** - To make containers deployable, directories in the containers are mounted to host directories. This can be trickier still and can cause permission issues when stale directories are reused. It is suggested to always remove the mounted directories (or at least run Docker Compose via the make commands).