jahidul-arafat/codecite

Code Comprehension RAG CLI

A citation-grounded code comprehension system using hybrid retrieval (BM25 + dense embeddings), graph-based expansion, query expansion (RM3), cross-encoder reranking, and submodular/greedy context packing.

Features

  • Hybrid Retrieval: Combines BM25 sparse retrieval with dense semantic embeddings
  • Query Expansion (RM3): Pseudo-relevance feedback for improved recall
  • Cross-Encoder Reranking: Neural reranking for improved precision
  • Graph-RAG: Neo4j-based code graph expansion for cross-file reasoning
  • Submodular Packing: Diversity-aware context selection
  • Citation Verification: Strict/loose citation validation against source code
  • Developer Brief: Structured output with strategy, evidence, and citations
  • Interactive Selection: Select repos/models interactively (e.g., "1-3,5" or "all")
  • RQ Presets: Pre-configured experiments for research questions
  • Progress Tracking: Real-time progress with ETA estimates
  • Verbose Logging: Comprehensive colored output showing what's happening
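
The hybrid retrieval and weighting flags (--alpha for sparse, --beta for dense) described above can be sketched as a simple score-fusion step. This is an illustrative sketch of the general technique, not the project's actual implementation: each ranker's scores are min-max normalized, then combined as alpha * sparse + beta * dense.

```python
# Illustrative hybrid score fusion (assumed semantics of --alpha / --beta).

def _normalize(scores):
    """Min-max normalize a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_fuse(sparse_scores, dense_scores, alpha=0.3, beta=0.7):
    """Combine BM25 and dense scores: alpha * sparse + beta * dense."""
    sparse_n = _normalize(sparse_scores)
    dense_n = _normalize(dense_scores)
    docs = set(sparse_n) | set(dense_n)
    fused = {d: alpha * sparse_n.get(d, 0.0) + beta * dense_n.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

A document missing from one ranker simply contributes 0 for that component, which is why alpha/beta sweeps (see RQ5 below) shift the balance between lexical and semantic matches.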

Prerequisites

  • Python 3.10+

  • LM Studio running at http://localhost:1234/v1 (OpenAI-compatible).

  • Optional Neo4j 5 (for Graph-RAG):

    docker run -d -p 7474:7474 -p 7687:7687 \
      -e NEO4J_AUTH=neo4j/VeryStrongPass123 neo4j:5

Installation

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Quick Start

1. Clone and Index a Repository

# Clone sample repos (Flask + Werkzeug)
python driver.py --faiss \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use

Or manually:

# Create staging directory and clone repos
mkdir -p third_party/preset_fw
git clone --depth 1 https://github.com/pallets/flask third_party/preset_fw/flask
git clone --depth 1 https://github.com/pallets/werkzeug third_party/preset_fw/werkzeug

# Build index
python cc_cli.py index --repo third_party/preset_fw --out fw_index.json

# Create dense embeddings
python cc_cli.py embed --index fw_index.json --out fw_dense.pkl

# Build FAISS index
python cc_cli.py faiss-build --dense fw_dense.pkl --out-dir fw_index_faiss

2. Load Graph (Optional, requires Neo4j)

python cc_cli.py graph-load \
  --index fw_index.json \
  --repo third_party/preset_fw \
  --neo4j-uri bolt://localhost:7687 \
  --neo4j-user neo4j --neo4j-pass VeryStrongPass123 \
  --wipe

3. Ask Questions

Basic Query (Hybrid Retrieval):

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid \
  --question "Where are HTTP method checks performed before routing? Cite [path.py:start-end]." \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
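
The --verify-citations strict flag checks citations of the form [path.py:start-end] against the indexed source. A minimal sketch of what such a check could look like (the function and dict names here are hypothetical, not the project's API):

```python
# Hypothetical citation verifier: parse "[path.py:start-end]" spans from an
# answer and validate each against known per-file line counts.
import re

CITE_RE = re.compile(r"\[([\w./-]+\.py):(\d+)-(\d+)\]")

def verify_citations(answer, file_line_counts):
    """Return (valid, invalid) lists of (path, start, end) citations."""
    valid, invalid = [], []
    for path, start, end in CITE_RE.findall(answer):
        start, end = int(start), int(end)
        n_lines = file_line_counts.get(path)
        if n_lines is not None and 1 <= start <= end <= n_lines:
            valid.append((path, start, end))
        else:
            invalid.append((path, start, end))
    return valid, invalid
```

Under strict verification, an answer citing a file not in the index, or a line range outside the file, would fail; a loose mode would presumably relax some of these checks.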

With Developer Brief and Auto-Citation:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.35 --beta 0.65 \
  --k 24 --per-chunk-lines 80 --max-context-chars 10000 \
  --strict-style --dev-brief --auto-cite-first \
  --llm-timeout 90 --verbose --show-top 8 \
  --question "Where are HTTP method checks handled before routing? Brief + bullets, then CITATION line." \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict

With Path and Function Filters:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.35 --beta 0.65 \
  --k 24 --per-chunk-lines 100 --max-context-chars 11000 \
  --path-filter "werkzeug/src/werkzeug/utils.py$" \
  --function-filter "(?i)append_slash_redirect" \
  --strict-style --dev-brief --auto-cite-first --verbose \
  --question "Show the exact lines that construct a trailing-slash canonical redirect (method-preserving). Brief + bullets, then CITATION." \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict

With Graph Expansion:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.45 --beta 0.55 \
  --k 28 --per-chunk-lines 80 --max-context-chars 11000 \
  --graph-expand --graph-hops 1 --graph-seeds 4 --graph-neighbors 8 \
  --graph-bonus 0.25 --graph-decay 0.6 \
  --graph-timeout 3 --graph-exclude-regex "/tests?/|^tests?/|/test_|/docs?/|\\.rst$|/examples?/" \
  --strict-style --dev-brief --auto-cite-first --verbose --show-top 10 \
  --llm-timeout 90 \
  --question "Where are HTTP method checks handled before routing? Include a brief + bullets, then CITATION line." \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
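
One plausible reading of how --graph-bonus and --graph-decay interact: a chunk reached through the code graph gets an additive score bonus that decays geometrically with hop distance. The exact formula used by cc_cli.py may differ; this is only a sketch of the assumed semantics.

```python
# Assumed graph-expansion scoring: bonus * decay ** (hops - 1) added to the
# neighbor's base score. Matches the defaults shown in the command above.

def graph_boosted_score(base_score, hops, bonus=0.25, decay=0.6):
    """Additive hop-decayed bonus for a graph-expanded chunk."""
    if hops < 1:
        return base_score  # the seed chunk itself gets no bonus
    return base_score + bonus * decay ** (hops - 1)
```

With the defaults above, a 1-hop neighbor gains 0.25 and a 2-hop neighbor 0.15, so direct callers/callees of a seed chunk outrank more distant graph neighbors.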

With RM3 Query Expansion:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.35 --beta 0.65 \
  --k 24 --per-chunk-lines 80 --max-context-chars 10000 \
  --qe rm3 --qe-fb-docs 10 --qe-fb-terms 10 \
  --strict-style --dev-brief --auto-cite-first \
  --question "Where are HTTP method checks handled before routing?" \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
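
RM3 builds a feedback term distribution from the top-ranked documents of a first retrieval pass (--qe-fb-docs), keeps the strongest terms (--qe-fb-terms), and mixes them with the original query. A minimal sketch of that idea, not the project's exact weighting:

```python
# Illustrative RM3-style expansion: feedback term frequencies from top docs,
# interpolated with the original query terms.
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=10, orig_weight=0.5):
    """Return {term: weight} mixing the original query with feedback terms."""
    counts = Counter()
    for doc_tokens in feedback_docs:
        counts.update(doc_tokens)
    total = sum(counts.values()) or 1
    top = counts.most_common(fb_terms)
    expanded = {t: (1 - orig_weight) * c / total for t, c in top}
    # keep the original query terms with uniform weight orig_weight
    for t in query_terms:
        expanded[t] = expanded.get(t, 0.0) + orig_weight / len(query_terms)
    return expanded
```

The expanded weighted query is then re-issued against the index, which typically improves recall for vocabulary-mismatch questions at some cost in precision.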

With Cross-Encoder Reranking:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.35 --beta 0.65 \
  --k 24 --per-chunk-lines 80 --max-context-chars 10000 \
  --rerank --rerank-depth 100 \
  --strict-style --dev-brief --auto-cite-first \
  --question "Where are HTTP method checks handled before routing?" \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
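
The --rerank-depth flag bounds how many candidates the cross-encoder scores: only the head of the ranked list is re-ordered, the tail is left untouched. A sketch of that pipeline, with score_fn standing in for a real cross-encoder (e.g. sentence-transformers' CrossEncoder, which scores query–passage pairs jointly):

```python
# Sketch of depth-limited reranking; score_fn is a stand-in for a real
# cross-encoder's pairwise (query, candidate) scoring.

def rerank(candidates, query, score_fn, depth=100):
    """Re-order the first `depth` candidates by cross-encoder score."""
    head, tail = candidates[:depth], candidates[depth:]
    scored = sorted(head, key=lambda c: score_fn(query, c), reverse=True)
    return scored + tail
```

Capping the depth keeps the expensive neural scoring to a fixed number of pairs per query, which is why reranking stays tractable even with large --k.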

With Greedy Packer:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid \
  --packer greedy --per-chunk-lines 60 --max-context-chars 9000 \
  --strict-style --dev-brief --auto-cite-first \
  --question "Where are HTTP method checks handled before routing?" \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
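
The two packers differ in how they spend the context budget: greedy takes chunks in score order until --max-context-chars is exhausted, while a submodular (diversity-aware) packer discounts chunks that overlap what was already chosen. The following contrast is illustrative only, with token overlap as a crude stand-in for whatever similarity measure the real packer uses:

```python
# Illustrative greedy vs. diversity-aware (submodular-style) context packing.

def greedy_pack(chunks, max_chars):
    """chunks: [(score, text)] sorted by score desc. Fill until budget runs out."""
    picked, used = [], 0
    for score, text in chunks:
        if used + len(text) <= max_chars:
            picked.append(text)
            used += len(text)
    return picked

def submodular_pack(chunks, max_chars, penalty=0.5):
    """Pick by marginal gain: score discounted by token overlap with picks."""
    remaining, picked, used = list(chunks), [], 0
    seen_tokens = set()
    while remaining:
        def gain(item):
            score, text = item
            toks = set(text.split())
            overlap = len(toks & seen_tokens) / (len(toks) or 1)
            return score * (1 - penalty * overlap)
        best = max(remaining, key=gain)
        remaining.remove(best)
        if used + len(best[1]) <= max_chars:
            picked.append(best[1])
            used += len(best[1])
            seen_tokens |= set(best[1].split())
    return picked
```

The practical effect: greedy can fill the budget with near-duplicate top hits, while the submodular variant trades a little raw score for coverage of distinct code regions, which matters for cross-file questions.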

RQ-Specific Experiment Playbooks

List Available RQ Presets

python run_experiment.py --list-rqs

Interactive RQ Mode (Recommended for Exploration)

Select repos and models interactively:

# RQ1: Hybrid vs Sparse vs Dense
python run_experiment.py --rq RQ1 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq1 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ2: Graph-RAG Lift
python run_experiment.py --rq RQ2 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq2 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ3: Packer & Budget Sensitivity
python run_experiment.py --rq RQ3 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq3 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ4: Model Comparisons
python run_experiment.py --rq RQ4 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq4 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ5: α/β & k Sweeps
python run_experiment.py --rq RQ5 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq5 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ6: RM3 & Re-rank Ablation
python run_experiment.py --rq RQ6 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq6 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ7: Evaluation Suite
python run_experiment.py --rq RQ7 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq7 --base-url http://localhost:1234/v1 --api-key lm-studio

When prompted, you can select:

  • all - Use all items
  • 1-3,5 - Range plus individual (repos 1,2,3 and 5)
  • 1,2,3 - Specific items
  • Press Enter for default selection
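
The selection syntax above can be parsed into zero-based indices roughly as follows. This is a sketch of the prompt's documented behavior, not necessarily the exact implementation; for simplicity it treats empty input as "all", whereas the real prompt falls back to a preset default.

```python
# Hypothetical parser for interactive selections like "all", "1-3,5", "1,2,3".

def parse_selection(text, n_items):
    """Return zero-based indices into the offered list."""
    text = text.strip().lower()
    if not text or text == "all":
        # simplification: empty input selects everything here
        return list(range(n_items))
    indices = []
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            indices.extend(range(int(lo) - 1, int(hi)))
        else:
            indices.append(int(part) - 1)
    # de-duplicate, preserve order, drop out-of-range entries
    seen, out = set(), []
    for i in indices:
        if 0 <= i < n_items and i not in seen:
            seen.add(i)
            out.append(i)
    return out
```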

Quick RQ Mode (Uses Preset Defaults)

Run with minimal preset defaults (faster):

# RQ1 Quick: 3 repos, 1 model
python run_experiment.py --rq RQ1 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq1 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ2 Quick: 2 repos, 1 model
python run_experiment.py --rq RQ2 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq2 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ3 Quick: 2 repos, 1 model
python run_experiment.py --rq RQ3 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq3 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ4 Quick: 1 repo, all models
python run_experiment.py --rq RQ4 --quick \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq4 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ5 Quick: 2 repos, 1 model
python run_experiment.py --rq RQ5 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq5 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ6 Quick: 2 repos, 1 model
python run_experiment.py --rq RQ6 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq6 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ7 Quick: 1 repo, 1 model
python run_experiment.py --rq RQ7 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq7 --base-url http://localhost:1234/v1 --api-key lm-studio

Available RQ Presets

RQ    Name                       Description                         Runs
RQ1   Hybrid vs Sparse vs Dense  Compare retrieval modes             3
RQ2   Graph-RAG Lift             With/without graph expansion        2
RQ3   Packer & Budget            Packing strategies + context sizes  4
RQ4   Model Comparisons          Compare LLM models                  1
RQ5   α/β & k Sweeps             Sensitivity analysis                5
RQ6   RM3 & Re-rank Ablation     Query expansion + reranking         4
RQ7   Auto vs Curated            Question suite comparison           1

RQ1 — Hybrid vs Sparse vs Dense

# Sparse retrieval
python run_experiment.py --retrieval sparse --faiss --rerank --qe none --graph \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_sparse --base-url http://localhost:1234/v1 --api-key lm-studio

# Dense retrieval
python run_experiment.py --retrieval dense --faiss --rerank --qe none --graph \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_dense --base-url http://localhost:1234/v1 --api-key lm-studio

# Hybrid retrieval
python run_experiment.py --retrieval hybrid --faiss --rerank --qe none --graph \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_hybrid --base-url http://localhost:1234/v1 --api-key lm-studio

RQ2 — Graph-RAG Lift on Cross-File

# With graph expansion
python run_experiment.py --graph --graph-bonus 0.2 --graph-hops 2 --graph-decay 0.6 \
  --retrieval hybrid --faiss --rerank --qe none \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_graph --base-url http://localhost:1234/v1 --api-key lm-studio

# Without graph expansion
python run_experiment.py --retrieval hybrid --faiss --rerank --qe none \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_nograph --base-url http://localhost:1234/v1 --api-key lm-studio

RQ3 — Packer & Budget Sensitivity

# Greedy packer with small budget
python run_experiment.py --retrieval hybrid --faiss --rerank --qe none --graph \
  --packer greedy --per-chunk-lines 40 --max-context-chars 7000 \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_greedy_40_7k --base-url http://localhost:1234/v1 --api-key lm-studio

# Submodular packer with larger budget
python run_experiment.py --retrieval hybrid --faiss --rerank --qe none --graph \
  --packer submodular --per-chunk-lines 80 --max-context-chars 12000 \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_submod_80_12k --base-url http://localhost:1234/v1 --api-key lm-studio

RQ4 — Model Comparisons (10 LM Studio Models)

python run_experiment.py --retrieval hybrid --faiss --rerank --qe none --graph \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_models --base-url http://localhost:1234/v1 --api-key lm-studio

RQ5 — α/β & k Sweeps

# Run grid from experiments.yaml
python run_grid.py --config experiments.yaml

# Aggregate results
python aggregate_results.py --workdir work --out-csv all_sensitivity.csv

RQ6 — RM3 & Re-rank Ablation

# No QE, no rerank (baseline)
python run_experiment.py --retrieval hybrid --faiss --graph --qe none \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_noqe_norerank --base-url http://localhost:1234/v1 --api-key lm-studio

# With RM3 query expansion only
python run_experiment.py --retrieval hybrid --faiss --graph --qe rm3 \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_qe --base-url http://localhost:1234/v1 --api-key lm-studio

# With reranking only
python run_experiment.py --retrieval hybrid --faiss --graph --rerank --qe none \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rerank --base-url http://localhost:1234/v1 --api-key lm-studio

# With both RM3 and reranking
python run_experiment.py --retrieval hybrid --faiss --graph --qe rm3 --rerank \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_qe_rerank --base-url http://localhost:1234/v1 --api-key lm-studio

RQ7 — Evaluation Suite

python eval_plus.py --index fw_index.json --suite suites/flask_qas.jsonl \
  --retrieval hybrid --use-faiss --faiss-dir fw_index_faiss --use-dense --dense fw_dense.pkl \
  --base-url http://localhost:1234/v1 --api-key lm-studio --model llama-3-groq-8b-tool-use \
  --out-csv curated_vs_auto.csv

Aggregation & Analysis

# Aggregate results from all runs
python aggregate_results.py --workdir work --out-csv all_results.csv

# Generate appendix with methods and hyperparameters
python generate_appendix.py \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --experiments-config experiments.yaml --out appendix.md

# Generate plots
python plots.py --csv work/all_results.csv --out figs/plot.png

CLI Reference

cc_cli.py Commands

Command      Description
index        Build an index.json from a repo
embed        Create dense vectors for the index
faiss-build  Build FAISS index from dense.pkl
ask          Ask a question against the index
graph-load   Load a Neo4j code graph from index + repo

Key Arguments for ask

Argument           Description                         Default
--retrieval        sparse, dense, or hybrid            hybrid
--alpha            Weight for sparse in hybrid         0.3
--beta             Weight for dense in hybrid          0.7
--k                Candidates to retrieve              10
--packer           submodular or greedy                submodular
--qe               Query expansion: none or rm3        none
--rerank           Enable cross-encoder reranking      false
--graph-expand     Enable Neo4j graph expansion        false
--strict-style     Strict system prompt for citations  false
--dev-brief        Print developer brief               false
--auto-cite-first  Auto-append citation if missing     false

Key Arguments for run_experiment.py

Argument       Description                     Default
--repos-file   File with repository URLs       -
--models-file  File with model names           -
--workdir      Output directory                -
--retrieval    sparse, dense, or hybrid        hybrid
--qe           Query expansion: none or rm3    none
--packer       submodular or greedy            submodular
--rerank       Enable cross-encoder reranking  false
--graph        Enable graph expansion          false
--faiss        Enable FAISS indexing           false

Project Structure

.
├── cc_cli.py              # Main CLI tool
├── cc_index.py            # Code indexing
├── cc_graph.py            # Neo4j graph loading
├── run_experiment.py      # Experiment runner (direct + YAML modes)
├── run_experiments.py     # Wrapper for run_experiment.py
├── run_grid.py            # Grid search runner
├── run_grid.sh            # Shell-based grid runner
├── eval_plus.py           # Evaluation suite runner
├── aggregate_results.py   # Results aggregation
├── generate_questions.py  # Question generation
├── generate_appendix.py   # Appendix generation
├── plots.py               # Visualization
├── multi_index.py         # Multi-repo indexing
├── driver.py              # Quick-start driver
├── experiments.yaml       # Grid configuration
├── repos_500.txt          # Repository list
├── models_lmstudio_10.txt # Model list
├── requirements.txt       # Dependencies
├── suites/                # Question suites
│   └── flask_qas.jsonl
└── README.md

Troubleshooting

No citations / wrong file

  • Tighten with --path-filter and/or --function-filter
  • Add --auto-cite-first
  • Increase --k for more candidates

Hangs on graph

  • Set --graph-timeout 3
  • Reduce --graph-neighbors
  • Add --graph-exclude-regex "/tests?/|/docs?/|/examples?/"

LM timeout

  • Use --llm-timeout 120
  • Confirm LM Studio model context length and server health

Tokenizers fork warning (macOS)

export TOKENIZERS_PARALLELISM=false

Neo4j connection issues

  • Ensure Neo4j is running (for the Docker setup above, check with docker ps; for a local install, neo4j start)
  • Verify the credentials passed via --neo4j-uri, --neo4j-user, and --neo4j-pass

License

MIT License
