jahidul-arafat/codecite

Code Comprehension RAG CLI

A citation-grounded code comprehension system using hybrid retrieval (BM25 + dense embeddings), graph-based expansion, query expansion (RM3), cross-encoder reranking, and submodular/greedy context packing.

Features

  • Hybrid Retrieval: Combines BM25 sparse retrieval with dense semantic embeddings
  • Query Expansion (RM3): Pseudo-relevance feedback for improved recall
  • Cross-Encoder Reranking: Neural reranking for improved precision
  • Graph-RAG: Neo4j-based code graph expansion for cross-file reasoning
  • Submodular Packing: Diversity-aware context selection
  • Citation Verification: Strict/loose citation validation against source code
  • Developer Brief: Structured output with strategy, evidence, and citations
  • Interactive Selection: Select repos/models interactively (e.g., "1-3,5" or "all")
  • RQ Presets: Pre-configured experiments for research questions
  • Progress Tracking: Real-time progress with ETA estimates
  • Verbose Logging: Comprehensive colored output showing what's happening
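
The hybrid retrieval and weighting flags (--alpha for sparse, --beta for dense) described above can be sketched as a simple score-fusion step. This is an illustrative sketch of the general technique, not the project's actual implementation: each ranker's scores are min-max normalized, then combined as alpha * sparse + beta * dense.

```python
# Illustrative hybrid score fusion (assumed semantics of --alpha / --beta).

def _normalize(scores):
    """Min-max normalize a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_fuse(sparse_scores, dense_scores, alpha=0.3, beta=0.7):
    """Combine BM25 and dense scores: alpha * sparse + beta * dense."""
    sparse_n = _normalize(sparse_scores)
    dense_n = _normalize(dense_scores)
    docs = set(sparse_n) | set(dense_n)
    fused = {d: alpha * sparse_n.get(d, 0.0) + beta * dense_n.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

A document missing from one ranker simply contributes 0 for that component, which is why alpha/beta sweeps (see RQ5 below) shift the balance between lexical and semantic matches.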

Prerequisites

  • Python 3.10+

  • LM Studio running at http://localhost:1234/v1 (OpenAI-compatible).

  • Optional Neo4j 5 (for Graph-RAG):

    docker run -d -p 7474:7474 -p 7687:7687 \
      -e NEO4J_AUTH=neo4j/VeryStrongPass123 neo4j:5

Installation

# Create and activate virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Quick Start

1. Clone and Index a Repository

# Clone sample repos (Flask + Werkzeug)
python driver.py --faiss \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use

Or manually:

# Create staging directory and clone repos
mkdir -p third_party/preset_fw
git clone --depth 1 https://github.com/pallets/flask third_party/preset_fw/flask
git clone --depth 1 https://github.com/pallets/werkzeug third_party/preset_fw/werkzeug

# Build index
python cc_cli.py index --repo third_party/preset_fw --out fw_index.json

# Create dense embeddings
python cc_cli.py embed --index fw_index.json --out fw_dense.pkl

# Build FAISS index
python cc_cli.py faiss-build --dense fw_dense.pkl --out-dir fw_index_faiss

2. Load Graph (Optional, requires Neo4j)

python cc_cli.py graph-load \
  --index fw_index.json \
  --repo third_party/preset_fw \
  --neo4j-uri bolt://localhost:7687 \
  --neo4j-user neo4j --neo4j-pass VeryStrongPass123 \
  --wipe

3. Ask Questions

Basic Query (Hybrid Retrieval):

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid \
  --question "Where are HTTP method checks performed before routing? Cite [path.py:start-end]." \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
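
The --verify-citations strict flag checks citations of the form [path.py:start-end] against the indexed source. A minimal sketch of what such a check could look like (the function and dict names here are hypothetical, not the project's API):

```python
# Hypothetical citation verifier: parse "[path.py:start-end]" spans from an
# answer and validate each against known per-file line counts.
import re

CITE_RE = re.compile(r"\[([\w./-]+\.py):(\d+)-(\d+)\]")

def verify_citations(answer, file_line_counts):
    """Return (valid, invalid) lists of (path, start, end) citations."""
    valid, invalid = [], []
    for path, start, end in CITE_RE.findall(answer):
        start, end = int(start), int(end)
        n_lines = file_line_counts.get(path)
        if n_lines is not None and 1 <= start <= end <= n_lines:
            valid.append((path, start, end))
        else:
            invalid.append((path, start, end))
    return valid, invalid
```

Under strict verification, an answer citing a file not in the index, or a line range outside the file, would fail; a loose mode would presumably relax some of these checks.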

With Developer Brief and Auto-Citation:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.35 --beta 0.65 \
  --k 24 --per-chunk-lines 80 --max-context-chars 10000 \
  --strict-style --dev-brief --auto-cite-first \
  --llm-timeout 90 --verbose --show-top 8 \
  --question "Where are HTTP method checks handled before routing? Brief + bullets, then CITATION line." \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict

With Path and Function Filters:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.35 --beta 0.65 \
  --k 24 --per-chunk-lines 100 --max-context-chars 11000 \
  --path-filter "werkzeug/src/werkzeug/utils.py$" \
  --function-filter "(?i)append_slash_redirect" \
  --strict-style --dev-brief --auto-cite-first --verbose \
  --question "Show the exact lines that construct a trailing-slash canonical redirect (method-preserving). Brief + bullets, then CITATION." \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict

With Graph Expansion:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.45 --beta 0.55 \
  --k 28 --per-chunk-lines 80 --max-context-chars 11000 \
  --graph-expand --graph-hops 1 --graph-seeds 4 --graph-neighbors 8 \
  --graph-bonus 0.25 --graph-decay 0.6 \
  --graph-timeout 3 --graph-exclude-regex "/tests?/|^tests?/|/test_|/docs?/|\\.rst$|/examples?/" \
  --strict-style --dev-brief --auto-cite-first --verbose --show-top 10 \
  --llm-timeout 90 \
  --question "Where are HTTP method checks handled before routing? Include a brief + bullets, then CITATION line." \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
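
One plausible reading of how --graph-bonus and --graph-decay interact: a chunk reached through the code graph gets an additive score bonus that decays geometrically with hop distance. The exact formula used by cc_cli.py may differ; this is only a sketch of the assumed semantics.

```python
# Assumed graph-expansion scoring: bonus * decay ** (hops - 1) added to the
# neighbor's base score. Matches the defaults shown in the command above.

def graph_boosted_score(base_score, hops, bonus=0.25, decay=0.6):
    """Additive hop-decayed bonus for a graph-expanded chunk."""
    if hops < 1:
        return base_score  # the seed chunk itself gets no bonus
    return base_score + bonus * decay ** (hops - 1)
```

With the defaults above, a 1-hop neighbor gains 0.25 and a 2-hop neighbor 0.15, so direct callers/callees of a seed chunk outrank more distant graph neighbors.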

With RM3 Query Expansion:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.35 --beta 0.65 \
  --k 24 --per-chunk-lines 80 --max-context-chars 10000 \
  --qe rm3 --qe-fb-docs 10 --qe-fb-terms 10 \
  --strict-style --dev-brief --auto-cite-first \
  --question "Where are HTTP method checks handled before routing?" \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
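
RM3 builds a feedback term distribution from the top-ranked documents of a first retrieval pass (--qe-fb-docs), keeps the strongest terms (--qe-fb-terms), and mixes them with the original query. A minimal sketch of that idea, not the project's exact weighting:

```python
# Illustrative RM3-style expansion: feedback term frequencies from top docs,
# interpolated with the original query terms.
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=10, orig_weight=0.5):
    """Return {term: weight} mixing the original query with feedback terms."""
    counts = Counter()
    for doc_tokens in feedback_docs:
        counts.update(doc_tokens)
    total = sum(counts.values()) or 1
    top = counts.most_common(fb_terms)
    expanded = {t: (1 - orig_weight) * c / total for t, c in top}
    # keep the original query terms with uniform weight orig_weight
    for t in query_terms:
        expanded[t] = expanded.get(t, 0.0) + orig_weight / len(query_terms)
    return expanded
```

The expanded weighted query is then re-issued against the index, which typically improves recall for vocabulary-mismatch questions at some cost in precision.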

With Cross-Encoder Reranking:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid --alpha 0.35 --beta 0.65 \
  --k 24 --per-chunk-lines 80 --max-context-chars 10000 \
  --rerank --rerank-depth 100 \
  --strict-style --dev-brief --auto-cite-first \
  --question "Where are HTTP method checks handled before routing?" \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
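
The --rerank-depth flag bounds how many candidates the cross-encoder scores: only the head of the ranked list is re-ordered, the tail is left untouched. A sketch of that pipeline, with score_fn standing in for a real cross-encoder (e.g. sentence-transformers' CrossEncoder, which scores query–passage pairs jointly):

```python
# Sketch of depth-limited reranking; score_fn is a stand-in for a real
# cross-encoder's pairwise (query, candidate) scoring.

def rerank(candidates, query, score_fn, depth=100):
    """Re-order the first `depth` candidates by cross-encoder score."""
    head, tail = candidates[:depth], candidates[depth:]
    scored = sorted(head, key=lambda c: score_fn(query, c), reverse=True)
    return scored + tail
```

Capping the depth keeps the expensive neural scoring to a fixed number of pairs per query, which is why reranking stays tractable even with large --k.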

With Greedy Packer:

python cc_cli.py ask \
  --index fw_index.json --dense fw_dense.pkl --faiss-dir fw_index_faiss \
  --retrieval hybrid \
  --packer greedy --per-chunk-lines 60 --max-context-chars 9000 \
  --strict-style --dev-brief --auto-cite-first \
  --question "Where are HTTP method checks handled before routing?" \
  --base-url http://localhost:1234/v1 --api-key lm-studio \
  --model llama-3-groq-8b-tool-use --verify-citations strict
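
The two packers differ in how they spend the context budget: greedy takes chunks in score order until --max-context-chars is exhausted, while a submodular (diversity-aware) packer discounts chunks that overlap what was already chosen. The following contrast is illustrative only, with token overlap as a crude stand-in for whatever similarity measure the real packer uses:

```python
# Illustrative greedy vs. diversity-aware (submodular-style) context packing.

def greedy_pack(chunks, max_chars):
    """chunks: [(score, text)] sorted by score desc. Fill until budget runs out."""
    picked, used = [], 0
    for score, text in chunks:
        if used + len(text) <= max_chars:
            picked.append(text)
            used += len(text)
    return picked

def submodular_pack(chunks, max_chars, penalty=0.5):
    """Pick by marginal gain: score discounted by token overlap with picks."""
    remaining, picked, used = list(chunks), [], 0
    seen_tokens = set()
    while remaining:
        def gain(item):
            score, text = item
            toks = set(text.split())
            overlap = len(toks & seen_tokens) / (len(toks) or 1)
            return score * (1 - penalty * overlap)
        best = max(remaining, key=gain)
        remaining.remove(best)
        if used + len(best[1]) <= max_chars:
            picked.append(best[1])
            used += len(best[1])
            seen_tokens |= set(best[1].split())
    return picked
```

The practical effect: greedy can fill the budget with near-duplicate top hits, while the submodular variant trades a little raw score for coverage of distinct code regions, which matters for cross-file questions.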

RQ-Specific Experiment Playbooks

List Available RQ Presets

python run_experiment.py --list-rqs

Interactive RQ Mode (Recommended for Exploration)

Select repos and models interactively:

# RQ1: Hybrid vs Sparse vs Dense
python run_experiment.py --rq RQ1 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq1 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ2: Graph-RAG Lift
python run_experiment.py --rq RQ2 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq2 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ3: Packer & Budget Sensitivity
python run_experiment.py --rq RQ3 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq3 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ4: Model Comparisons
python run_experiment.py --rq RQ4 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq4 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ5: α/β & k Sweeps
python run_experiment.py --rq RQ5 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq5 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ6: RM3 & Re-rank Ablation
python run_experiment.py --rq RQ6 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq6 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ7: Evaluation Suite
python run_experiment.py --rq RQ7 --interactive \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq7 --base-url http://localhost:1234/v1 --api-key lm-studio

When prompted, you can select:

  • all - Use all items
  • 1-3,5 - Range plus individual (repos 1,2,3 and 5)
  • 1,2,3 - Specific items
  • Press Enter for default selection
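
The selection syntax above can be parsed into zero-based indices roughly as follows. This is a sketch of the prompt's documented behavior, not necessarily the exact implementation; for simplicity it treats empty input as "all", whereas the real prompt falls back to a preset default.

```python
# Hypothetical parser for interactive selections like "all", "1-3,5", "1,2,3".

def parse_selection(text, n_items):
    """Return zero-based indices into the offered list."""
    text = text.strip().lower()
    if not text or text == "all":
        # simplification: empty input selects everything here
        return list(range(n_items))
    indices = []
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            indices.extend(range(int(lo) - 1, int(hi)))
        else:
            indices.append(int(part) - 1)
    # de-duplicate, preserve order, drop out-of-range entries
    seen, out = set(), []
    for i in indices:
        if 0 <= i < n_items and i not in seen:
            seen.add(i)
            out.append(i)
    return out
```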

Quick RQ Mode (Uses Preset Defaults)

Run with minimal preset defaults (faster):

# RQ1 Quick: 3 repos, 1 model
python run_experiment.py --rq RQ1 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq1 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ2 Quick: 2 repos, 1 model
python run_experiment.py --rq RQ2 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq2 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ3 Quick: 2 repos, 1 model
python run_experiment.py --rq RQ3 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq3 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ4 Quick: 1 repo, all models
python run_experiment.py --rq RQ4 --quick \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq4 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ5 Quick: 2 repos, 1 model
python run_experiment.py --rq RQ5 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq5 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ6 Quick: 2 repos, 1 model
python run_experiment.py --rq RQ6 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq6 --base-url http://localhost:1234/v1 --api-key lm-studio

# RQ7 Quick: 1 repo, 1 model
python run_experiment.py --rq RQ7 --quick \
  --repos-file repos_500.txt --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rq7 --base-url http://localhost:1234/v1 --api-key lm-studio

Available RQ Presets

RQ    Name                       Description                         Runs
RQ1   Hybrid vs Sparse vs Dense  Compare retrieval modes             3
RQ2   Graph-RAG Lift             With/without graph expansion        2
RQ3   Packer & Budget            Packing strategies + context sizes  4
RQ4   Model Comparisons          Compare LLM models                  1
RQ5   α/β & k Sweeps             Sensitivity analysis                5
RQ6   RM3 & Re-rank Ablation     Query expansion + reranking         4
RQ7   Auto vs Curated            Question suite comparison           1

RQ1 — Hybrid vs Sparse vs Dense

# Sparse retrieval
python run_experiment.py --retrieval sparse --faiss --rerank --qe none --graph \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_sparse --base-url http://localhost:1234/v1 --api-key lm-studio

# Dense retrieval
python run_experiment.py --retrieval dense --faiss --rerank --qe none --graph \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_dense --base-url http://localhost:1234/v1 --api-key lm-studio

# Hybrid retrieval
python run_experiment.py --retrieval hybrid --faiss --rerank --qe none --graph \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_hybrid --base-url http://localhost:1234/v1 --api-key lm-studio

RQ2 — Graph-RAG Lift on Cross-File

# With graph expansion
python run_experiment.py --graph --graph-bonus 0.2 --graph-hops 2 --graph-decay 0.6 \
  --retrieval hybrid --faiss --rerank --qe none \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_graph --base-url http://localhost:1234/v1 --api-key lm-studio

# Without graph expansion
python run_experiment.py --retrieval hybrid --faiss --rerank --qe none \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_nograph --base-url http://localhost:1234/v1 --api-key lm-studio

RQ3 — Packer & Budget Sensitivity

# Greedy packer with small budget
python run_experiment.py --retrieval hybrid --faiss --rerank --qe none --graph \
  --packer greedy --per-chunk-lines 40 --max-context-chars 7000 \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_greedy_40_7k --base-url http://localhost:1234/v1 --api-key lm-studio

# Submodular packer with larger budget
python run_experiment.py --retrieval hybrid --faiss --rerank --qe none --graph \
  --packer submodular --per-chunk-lines 80 --max-context-chars 12000 \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_submod_80_12k --base-url http://localhost:1234/v1 --api-key lm-studio

RQ4 — Model Comparisons (10 LM Studio Models)

python run_experiment.py --retrieval hybrid --faiss --rerank --qe none --graph \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_models --base-url http://localhost:1234/v1 --api-key lm-studio

RQ5 — α/β & k Sweeps

# Run grid from experiments.yaml
python run_grid.py --config experiments.yaml

# Aggregate results
python aggregate_results.py --workdir work --out-csv all_sensitivity.csv

RQ6 — RM3 & Re-rank Ablation

# No QE, no rerank (baseline)
python run_experiment.py --retrieval hybrid --faiss --graph --qe none \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_noqe_norerank --base-url http://localhost:1234/v1 --api-key lm-studio

# With RM3 query expansion only
python run_experiment.py --retrieval hybrid --faiss --graph --qe rm3 \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_qe --base-url http://localhost:1234/v1 --api-key lm-studio

# With reranking only
python run_experiment.py --retrieval hybrid --faiss --graph --rerank --qe none \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_rerank --base-url http://localhost:1234/v1 --api-key lm-studio

# With both RM3 and reranking
python run_experiment.py --retrieval hybrid --faiss --graph --qe rm3 --rerank \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --per-repo-questions suites/flask_qas.jsonl \
  --workdir work_qe_rerank --base-url http://localhost:1234/v1 --api-key lm-studio

RQ7 — Evaluation Suite

python eval_plus.py --index fw_index.json --suite suites/flask_qas.jsonl \
  --retrieval hybrid --use-faiss --faiss-dir fw_index_faiss --use-dense --dense fw_dense.pkl \
  --base-url http://localhost:1234/v1 --api-key lm-studio --model llama-3-groq-8b-tool-use \
  --out-csv curated_vs_auto.csv

Aggregation & Analysis

# Aggregate results from all runs
python aggregate_results.py --workdir work --out-csv all_results.csv

# Generate appendix with methods and hyperparameters
python generate_appendix.py \
  --repos-file repos_500.txt --models-file models_lmstudio_10.txt \
  --experiments-config experiments.yaml --out appendix.md

# Generate plots
python plots.py --csv work/all_results.csv --out figs/plot.png

CLI Reference

cc_cli.py Commands

Command      Description
index        Build an index.json from a repo
embed        Create dense vectors for the index
faiss-build  Build FAISS index from dense.pkl
ask          Ask a question against the index
graph-load   Load a Neo4j code graph from index + repo

Key Arguments for ask

Argument           Description                         Default
--retrieval        sparse, dense, or hybrid            hybrid
--alpha            Weight for sparse in hybrid         0.3
--beta             Weight for dense in hybrid          0.7
--k                Candidates to retrieve              10
--packer           submodular or greedy                submodular
--qe               Query expansion: none or rm3        none
--rerank           Enable cross-encoder reranking      false
--graph-expand     Enable Neo4j graph expansion        false
--strict-style     Strict system prompt for citations  false
--dev-brief        Print developer brief               false
--auto-cite-first  Auto-append citation if missing     false

Key Arguments for run_experiment.py

Argument       Description                     Default
--repos-file   File with repository URLs       -
--models-file  File with model names           -
--workdir      Output directory                -
--retrieval    sparse, dense, or hybrid        hybrid
--qe           Query expansion: none or rm3    none
--packer       submodular or greedy            submodular
--rerank       Enable cross-encoder reranking  false
--graph        Enable graph expansion          false
--faiss        Enable FAISS indexing           false

Project Structure

.
├── cc_cli.py              # Main CLI tool
├── cc_index.py            # Code indexing
├── cc_graph.py            # Neo4j graph loading
├── run_experiment.py      # Experiment runner (direct + YAML modes)
├── run_experiments.py     # Wrapper for run_experiment.py
├── run_grid.py            # Grid search runner
├── run_grid.sh            # Shell-based grid runner
├── eval_plus.py           # Evaluation suite runner
├── aggregate_results.py   # Results aggregation
├── generate_questions.py  # Question generation
├── generate_appendix.py   # Appendix generation
├── plots.py               # Visualization
├── multi_index.py         # Multi-repo indexing
├── driver.py              # Quick-start driver
├── experiments.yaml       # Grid configuration
├── repos_500.txt          # Repository list
├── models_lmstudio_10.txt # Model list
├── requirements.txt       # Dependencies
├── suites/                # Question suites
│   └── flask_qas.jsonl
└── README.md

Troubleshooting

No citations / wrong file

  • Tighten with --path-filter and/or --function-filter
  • Add --auto-cite-first
  • Increase --k for more candidates

Hangs on graph

  • Set --graph-timeout 3
  • Reduce --graph-neighbors
  • Add --graph-exclude-regex "/tests?/|/docs?/|/examples?/"

LM timeout

  • Use --llm-timeout 120
  • Confirm LM Studio model context length and server health

Tokenizers fork warning (macOS)

export TOKENIZERS_PARALLELISM=false

Neo4j connection issues

  • Ensure Neo4j is running (for the Docker setup above, check with docker ps; for a local install, neo4j start)
  • Verify the credentials passed via --neo4j-uri, --neo4j-user, and --neo4j-pass

License

MIT License
