Cross-Project Vulnerability Intelligence System
VulnRAG is a vulnerability intelligence platform that aggregates known CVEs from multiple public databases, stores them with their actual source code (vulnerable and fixed versions), and makes them searchable by code similarity, natural language, attack classification, or any combination. It uses vector embeddings and multi-strategy retrieval to surface patterns across projects and languages -- turning scattered vulnerability data into a structured, queryable knowledge base for security research.
Collects vulnerability data from 7 sources: GitHub Security Advisories, CVEfixes, NVD, MegaVul, BigVul, MoreFixes, and Security DPO. Each vulnerability is stored with its CVE ID, severity score, CWE classification, affected project, and -- when available -- the actual vulnerable and fixed source code.
Indexes the collected data into searchable vector embeddings. Code snippets are embedded using UnixCoder (a model trained on source code across languages). Descriptions are embedded using BGE (a text embedding model). Both are stored in ChromaDB for fast similarity search.
Retrieves relevant vulnerabilities using four strategies that run in parallel and get merged into a single ranked result set:
- Code similarity: "Does this code look like known vulnerable code?"
- Pattern matching: "What CVEs involve this type of dangerous function in this protocol component?"
- Cross-project search: "This pattern caused a CVE in project A -- does project B have something similar?"
- Conceptual search: "Find everything related to HTTP request smuggling via chunked encoding"
Generates Semgrep rules and CodeQL queries from matched vulnerabilities, so you can scan codebases for the same patterns.
- Architecture
- Data Pipeline
- Retrieval Strategies
- Data Sources
- Technology Stack
- Installation
- Configuration
- Usage
- Rule Generation
- API Endpoints
- Canonical Schema
- Project Structure
- Deployment
- Examples
- License
                        +------------------+
                        |   Data Sources   |
                        | GitHub | CVEfixes|
                        | NVD    | MegaVul |
                        | BigVul | MoreFix |
                        +--------+---------+
                                 |
                        +--------v---------+
                        |    Collectors    |
                        |    (7 source-    |
                        |     specific     |
                        |     adapters)    |
                        +--------+---------+
                                 |
                        +--------v---------+
                        |   Normalizers    |
                        | Canonical Schema |
                        |   Transformers   |
                        +--------+---------+
                                 |
                   +-------------+-------------+
                   |                           |
          +--------v---------+        +--------v---------+
          |   SQLite Store   |        |     ChromaDB     |
          | Vulnerabilities  |        | Code Embeddings  |
          | Code Samples     |        | Desc Embeddings  |
          | Versions         |        | Pattern Index    |
          | References       |        | Dimension Index  |
          +--------+---------+        +--------+---------+
                   |                           |
                   +-------------+-------------+
                                 |
                        +--------v---------+
                        | Retrieval Fusion |
                        | Code Similarity  |
                        | Pattern-Based    |
                        | Cross-Project    |
                        | Conceptual       |
                        +--------+---------+
                                 |
             +-------------------+-------------------+
             |                   |                   |
    +--------v------+   +--------v------+  +---------v-----+
    |      CLI      |   |   REST API    |  |    Web UI     |
    |  vulnrag ...  |   | FastAPI v2/v3 |  | Search/Browse |
    +---------------+   +---------------+  +---------------+
- Collectors pull raw vulnerability data from external sources (GitHub GraphQL API, SQLite databases, HuggingFace datasets, NVD REST API)
- Transformers normalize each record to a canonical Pydantic schema, classifying attack type, protocol component, and dangerous operations
- SQLite Store persists the relational data (vulnerabilities, code samples, version info, references) with composite indexes for fast filtering
- Vector Indexer generates embeddings and populates ChromaDB collections with filterable metadata
- Retrieval Fusion executes multiple search strategies in parallel, deduplicates, applies multi-strategy boosting, and returns ranked results
- Consumers (CLI, API, Web UI) present results with CVE references, code context, and cross-project citations
Each data source has a dedicated collector that knows how to pull and parse its format:
| Source | What it provides | How it's collected |
|---|---|---|
| GitHub Security Advisories | Real-time advisories with commit references | GraphQL API queries per repository |
| CVEfixes | Historical vulnerable/fixed code pairs | SQLite database import |
| MoreFixes | Extended fix dataset | PostgreSQL dump import |
| NVD | CVSS scores, CWE IDs, version ranges | REST API queries |
| MegaVul | 43,690 vulnerable function samples | HuggingFace dataset download |
| BigVul | Large-scale vulnerability dataset | HuggingFace dataset download |
| Security DPO | Multi-language vulnerability samples | HuggingFace dataset download |
Every record is converted to a single canonical schema (see Canonical Schema) regardless of which source it came from. This means a CVE imported from NVD and the same CVE imported from CVEfixes get merged into one record with combined metadata.
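The merge behavior described above can be pictured with a small sketch. It assumes records are plain dictionaries with hypothetical `cve_id` and `source` keys, and that the first source to supply a field wins while later sources only fill gaps. VulnRAG's actual merge policy and field names may differ.

```python
from typing import Any


def merge_records(records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Merge normalized records that share a CVE ID (illustrative policy)."""
    merged: dict[str, dict[str, Any]] = {}
    for record in records:
        target = merged.setdefault(record["cve_id"], {"sources": []})
        target["sources"].append(record["source"])
        for key, value in record.items():
            if key == "source":
                continue
            # Fill a field only if no earlier source provided a value for it.
            if target.get(key) in (None, "", []):
                target[key] = value
    return merged


# Placeholder records standing in for the same CVE seen by two collectors.
nvd = {"cve_id": "CVE-2023-XXXXX", "source": "nvd", "cvss_score": 7.5, "cwe_id": "CWE-400"}
cvefixes = {"cve_id": "CVE-2023-XXXXX", "source": "cvefixes", "vulnerable_code": "// vulnerable code"}

merged = merge_records([nvd, cvefixes])
print(merged["CVE-2023-XXXXX"])
```

The result is a single record that carries the NVD scoring fields, the CVEfixes code sample, and a list of the sources that contributed.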
The normalized records are then embedded into three ChromaDB collections:
| Collection | What's embedded | Used for |
|---|---|---|
| vuln_by_code | Vulnerable code snippets (via UnixCoder) | "Show me CVEs with code that looks like this" |
| vuln_by_description | Vulnerability descriptions (via BGE) | "Show me CVEs that match this description" |
| vuln_by_pattern | Structured metadata (attack class, CWE, protocol) | "Show me all buffer overflows in HTTP/2 parsers" |
When you search, four strategies run in parallel and their results are fused:
- Your query code or text gets embedded using the same models
- Each strategy searches its relevant collection(s) with optional filters
- Results are deduplicated across strategies
- Items found by multiple strategies get a confidence boost
- Final results are ranked by combined score
Code similarity embeds a target code snippet using UnixCoder and queries vuln_by_code. It returns vulnerabilities with structurally or semantically similar vulnerable code, even across languages.
When to use: You have a piece of code and want to know if it resembles any known vulnerable pattern.
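To make "similar code" concrete, here is a toy nearest-neighbor ranking over embedding vectors. The three-dimensional vectors and the `CVE-A`/`CVE-B`/`CVE-C` labels are made up for illustration; real UnixCoder embeddings are much larger, and the search itself is delegated to ChromaDB. Cosine similarity is assumed as the metric here, which depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query: list[float], index: dict[str, list[float]], limit: int = 3):
    """Rank indexed vectors by similarity to the query, best first."""
    scored = [(cve_id, cosine_similarity(query, vec)) for cve_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]


# Toy index: hypothetical embeddings of known vulnerable snippets.
toy_index = {
    "CVE-A": [0.9, 0.1, 0.0],
    "CVE-B": [0.0, 1.0, 0.0],
    "CVE-C": [0.7, 0.7, 0.1],
}
ranked = top_matches([1.0, 0.0, 0.0], toy_index, limit=2)
print([cve_id for cve_id, _ in ranked])
```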
Pattern matching filters by extracted characteristics: protocol component, dangerous sink functions, and CWE identifiers. It searches the description collection with metadata constraints.
When to use: You know what kind of vulnerability you're looking for (e.g., "resource exhaustion in connection pool handling").
Cross-project search queries for CVEs in other projects that share the same protocol component or attack class. The current project is explicitly excluded to surface patterns that exist elsewhere but haven't been found locally.
When to use: You're reviewing one project and want to know what bugs similar projects have had in the same area.
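The exclusion logic can be expressed as a metadata filter in the style of ChromaDB's `where` clauses. This is a sketch only: the function name and the metadata keys `project`, `protocol_component`, and `attack_class` are assumptions about how records might be tagged, not VulnRAG's confirmed field names.

```python
from typing import Optional


def build_cross_project_filter(
    current_project: str,
    protocol_component: Optional[str] = None,
    attack_class: Optional[str] = None,
) -> dict:
    """Build a filter that excludes one project and matches shared traits."""
    exclude = {"project": {"$ne": current_project}}
    shared = []
    if protocol_component:
        shared.append({"protocol_component": {"$eq": protocol_component}})
    if attack_class:
        shared.append({"attack_class": {"$eq": attack_class}})
    if not shared:
        return exclude
    # $or needs at least two clauses, so a single trait is used directly.
    match = shared[0] if len(shared) == 1 else {"$or": shared}
    return {"$and": [exclude, match]}


where = build_cross_project_filter("repo-b", "http2_codec", "dos")
print(where)
```

A filter like this would be passed alongside the query embedding so that only records from other projects are considered.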
Conceptual search embeds a natural language query using BGE and searches descriptions without metadata filters. It catches conceptual relationships that don't map to specific code patterns.
When to use: Broad exploratory queries like "rapid reset denial of service variants across implementations".
Results from all strategies are combined using Reciprocal Rank Fusion (RRF):
RRF Score = sum(1 / (k + rank_i)) for each strategy
If the same CVE shows up in multiple strategies, its score gets a 1.2x boost -- the logic being that if both code similarity and cross-project search flag the same vulnerability, it's more likely to be a real match.
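A minimal sketch of this fusion step follows. Two details are assumptions rather than facts about VulnRAG: `k = 60` is a commonly used RRF constant (the value used by the project is not stated), and the 1.2x boost is applied once to any CVE found by more than one strategy.

```python
from collections import defaultdict


def fuse(rankings: dict[str, list[str]], k: int = 60, boost: float = 1.2):
    """Combine per-strategy rankings with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    hits: dict[str, int] = defaultdict(int)
    for ranked_ids in rankings.values():
        # dict.fromkeys drops duplicates within one strategy, keeping order.
        for rank, cve_id in enumerate(dict.fromkeys(ranked_ids), start=1):
            scores[cve_id] += 1.0 / (k + rank)
            hits[cve_id] += 1
    fused = {
        cve_id: score * (boost if hits[cve_id] > 1 else 1.0)
        for cve_id, score in scores.items()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from two strategies, best match first.
results = fuse({
    "code_similarity": ["CVE-A", "CVE-B"],
    "cross_project": ["CVE-B", "CVE-C"],
})
print([cve_id for cve_id, _ in results])
```

`CVE-B` is ranked by both strategies, so it accumulates two reciprocal-rank terms and the boost, which places it ahead of `CVE-A` even though `CVE-A` was the top hit for code similarity.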
Before retrieval, the system analyzes target code to extract searchable characteristics:
- Dangerous sinks: Language-specific functions known to be security-sensitive (e.g., memcpy, sprintf, exec, eval)
- Protocol hints: Detected protocol handling patterns (HTTP/1.x parsing, HTTP/2 frames, WebSocket, TLS)
- Risk estimation: Heuristic risk level based on detected sinks and patterns
These extracted features are added to the query to improve retrieval precision beyond what pure embedding similarity provides.
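The extraction step might look roughly like the following. Everything here is illustrative: the sink lists reuse the examples named above, while the per-language grouping, the regular expressions for protocol hints, and the three-level risk heuristic are assumptions made for this sketch.

```python
import re

# Sink names come from the examples above; the language grouping is assumed.
SINKS = {
    "c": ("memcpy", "sprintf"),
    "python": ("exec", "eval"),
}

# Hypothetical regex hints mapped to protocol component names.
PROTOCOL_HINTS = {
    "http2_codec": r"http/?2|rst_stream",
    "websocket": r"websocket",
    "tls_handler": r"\btls\b|ssl_",
}


def analyze(code: str, language: str) -> dict:
    """Extract dangerous sinks, protocol hints, and a rough risk level."""
    sinks = [
        name for name in SINKS.get(language, ())
        if re.search(rf"\b{re.escape(name)}\s*\(", code)
    ]
    hints = [
        component for component, pattern in PROTOCOL_HINTS.items()
        if re.search(pattern, code, re.IGNORECASE)
    ]
    if sinks and hints:
        risk = "high"
    elif sinks or hints:
        risk = "medium"
    else:
        risk = "low"
    return {"sinks": sinks, "protocol_hints": hints, "risk": risk}


snippet = "void on_rst_stream(char *dst, const char *src, size_t n) { memcpy(dst, src, n); }"
features = analyze(snippet, "c")
print(features)
```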
CVEfixes provides pre-structured vulnerable/fixed code pairs at method-level granularity with CWE mappings, covering approximately 12,000 CVEs across languages. The dataset is available from Zenodo.
GitHub Security Advisories supply real-time advisory data via GitHub's GraphQL API. For each repository, the collector fetches advisory metadata, affected version ranges, and references to fixing commits. When --with-code is enabled, it clones the repository at the pre-fix and post-fix commits to extract actual code changes.
NVD records provide CVSS scoring, CWE classification, and version range information. Used to enrich existing records and fill metadata gaps.
- MegaVul (hitoshura25/megavul): 43,690 vulnerable function samples with rich context
- BigVul (bstee615/bigvul): Large-scale vulnerable function dataset with binary vulnerability labels
- Security DPO (CyberNative/Code_Vulnerability_Security_DPO): Multi-language vulnerability samples
- Published security research (HTTP request smuggling, protocol-level attacks)
- Public bug bounty disclosures
- Academic papers (IEEE S&P, USENIX Security, CCS)
- Security mailing lists (oss-security, full-disclosure)
| Layer | Technology | Purpose |
|---|---|---|
| Language | Python 3.10+ | Core runtime |
| CLI | Typer + Rich | Command-line interface |
| Web Framework | FastAPI + Uvicorn | REST API and web UI |
| Templating | Jinja2 | HTML rendering for web UI |
| ORM | SQLAlchemy 2.0 | Relational data persistence |
| Database | SQLite | Primary vulnerability store |
| Vector Store | ChromaDB 0.4+ | Embedding storage and similarity search |
| Code Embeddings | sentence-transformers (UnixCoder) | Code semantic representations |
| Text Embeddings | sentence-transformers (BGE) | Description semantic representations |
| Deep Learning | PyTorch 2.0+ | Embedding model inference |
| HTTP Client | httpx | Async API calls to external APIs |
| GraphQL | gql | GitHub Security Advisory queries |
| Git | GitPython | Repository cloning and diff extraction |
| Scraping | BeautifulSoup + lxml | Advisory page parsing |
| Validation | Pydantic 2.0+ | Schema validation and settings |
| Containerization | Docker | Production deployment |
- Python 3.10 or higher
- Git
- 4+ GB RAM (for embedding models)
git clone https://github.com/chasingimpact/vulnrag.git
cd vulnrag
# Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # Linux/macOS
# Install the package and dependencies
pip install -e .
# Verify the installation
vulnrag --help
Copy the example environment file and configure:
cp .env.example .env
| Variable | Description |
|---|---|
| GITHUB_TOKEN | GitHub Personal Access Token with public_repo and read:org scopes. Create one under GitHub Settings > Developer settings > Personal access tokens. |

| Variable | Default | Description |
|---|---|---|
| ANTHROPIC_API_KEY | (none) | LLM API key for natural language query agent and code assistant features |
| DATA_DIR | ./data | Directory for databases, vector stores, and repo caches |
| GITHUB_REQUESTS_PER_HOUR | 4000 | GitHub API rate limit (max 5000) |
| CODE_EMBEDDING_MODEL | microsoft/unixcoder-base | Model for code embeddings |
| TEXT_EMBEDDING_MODEL | BAAI/bge-base-en-v1.5 | Model for text embeddings |
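As an illustration of how these variables and defaults fit together, the sketch below loads them from the environment using only the standard library. The project itself lists Pydantic settings in its stack, so this is a stand-in rather than its real loader. Treating GITHUB_TOKEN as mandatory and clamping the rate limit to 5000 are assumptions drawn from the tables above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    github_token: str
    anthropic_api_key: Optional[str]
    data_dir: str
    github_requests_per_hour: int
    code_embedding_model: str
    text_embedding_model: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise ValueError("GITHUB_TOKEN is required")
    # Cap the configured rate at the documented maximum of 5000.
    rate = min(int(env.get("GITHUB_REQUESTS_PER_HOUR", "4000")), 5000)
    return Settings(
        github_token=token,
        anthropic_api_key=env.get("ANTHROPIC_API_KEY"),
        data_dir=env.get("DATA_DIR", "./data"),
        github_requests_per_hour=rate,
        code_embedding_model=env.get("CODE_EMBEDDING_MODEL", "microsoft/unixcoder-base"),
        text_embedding_model=env.get("TEXT_EMBEDDING_MODEL", "BAAI/bge-base-en-v1.5"),
    )


settings = load_settings({"GITHUB_TOKEN": "example-token", "GITHUB_REQUESTS_PER_HOUR": "9000"})
print(settings.data_dir, settings.github_requests_per_hour)
```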
# Initialize project directories and verify configuration
vulnrag init
# Collect from GitHub Security Advisories (metadata only)
vulnrag collect --source github --owner <org> --project <repo>
# Collect with actual vulnerable/fixed code from commits
vulnrag collect --source github --owner <org> --project <repo> --with-code
# Import from CVEfixes database
vulnrag collect --source cvefixes --db-path ./CVEfixes.db
# Import specific language from CVEfixes
vulnrag collect --source cvefixes --language C --limit 500
# Import from MoreFixes PostgreSQL dump
vulnrag collect --source morefixes --db-path ./morefixes.sql
# Collect from NVD
vulnrag collect --source nvd --project <name>
# Import from MegaVul dataset
vulnrag collect --source megavul
# Import from BigVul dataset
vulnrag collect --source bigvul
# Import from Security DPO dataset
vulnrag collect --source security_dpo --language python
# Build or update the vector index
vulnrag index
# Rebuild index from scratch
vulnrag index --rebuild
# Search by code file
vulnrag search --file target.c --language c --project myproject
# Search by natural language
vulnrag search --query "HTTP/2 rapid reset denial of service"
# Search with specific strategy
vulnrag search --query "buffer overflow in header parsing" --strategy code_similarity
# Export a vulnerability record
vulnrag export --id CVE-2023-44487 --format yaml
# Enrich vulnerabilities with code from referenced commits
vulnrag enrich --limit 100
# Generate detection rules from RAG-matched vulnerabilities
vulnrag generate --file target.go --language go --format both --output ./rules/
# View database and index statistics
vulnrag stats
# Start the web server
vulnrag serve --host 127.0.0.1 --port 8000
Start the server and navigate to http://127.0.0.1:8000:
- Home (/) -- Dashboard with database statistics, project distribution, and language breakdown
- Search (/search) -- Full-text and code similarity search interface
- Browse (/browse) -- Filter vulnerabilities by project, attack class, severity, and code availability
- Detail (/vuln/{id}) -- Individual vulnerability view with code samples, metadata, and references
- Research (/research) -- Interactive research dashboard with statistical visualizations
- Clanker (/clanker) -- Code assistant for indexed repositories
The server exposes a comprehensive REST API. Interactive documentation is available at /docs (Swagger UI).
VulnRAG can generate Semgrep rules and CodeQL queries from matched vulnerabilities. Given a target code file, it:
- Analyzes the code to detect dangerous sink functions and protocol patterns
- Retrieves the most similar known vulnerabilities from the index
- Generates detection rules targeting the matched vulnerability patterns
- Includes CWE references, confidence scores, and source CVE citations
# Generate both Semgrep and CodeQL rules
vulnrag generate --file server.go --language go --format both --output ./rules/
# Generate only Semgrep rules with more matches
vulnrag generate --file parser.c --language c --format semgrep --limit 5
Semgrep rules: Go, Java, Python, JavaScript, TypeScript, C, C++, Rust, Ruby, PHP, C#, Kotlin, Scala, Swift
CodeQL queries: Go, Java, Python, JavaScript, C, C++
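To show what a generated rule could contain, the sketch below assembles a Semgrep rule for a single matched sink. The helper name, the rule ID scheme, the message wording, and the confidence value are all hypothetical; only the top-level rule keys (`id`, `languages`, `severity`, `message`, `pattern`, `metadata`) follow Semgrep's rule format. JSON is used for output only to avoid a YAML dependency.

```python
import json


def semgrep_rule_for_sink(sink: str, language: str, cve_id: str, cwe_id: str, confidence: float) -> dict:
    """Build a single Semgrep rule that flags calls to a dangerous sink."""
    return {
        "id": f"vulnrag-{language}-{sink}-{cve_id.lower()}",
        "languages": [language],
        "severity": "WARNING",
        "message": f"Call to {sink}() resembles the pattern behind {cve_id} ({cwe_id}).",
        # The ellipsis is Semgrep syntax for "any arguments".
        "pattern": f"{sink}(...)",
        "metadata": {"cwe": cwe_id, "source_cve": cve_id, "confidence": confidence},
    }


# Placeholder match: CVE ID and confidence are illustrative, not real results.
rule = semgrep_rule_for_sink("memcpy", "c", "CVE-2023-XXXXX", "CWE-120", 0.82)
config = {"rules": [rule]}
print(json.dumps(config, indent=2))
```

A real generator would likely refine the pattern with context from the matched vulnerable code rather than flagging every call to the sink.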
| Method | Path | Description |
|---|---|---|
| GET | /v2/health | System health check (database + vector store status) |
| GET | /v2/stats | Comprehensive statistics with dimension coverage |

| Method | Path | Description |
|---|---|---|
| POST | /v2/search/code | Code similarity search with vulnerable/fixed code pairs |
| POST | /v2/search/pattern | Structured search by attack class, protocol component, CWE, severity |
| POST | /v2/search/dimensions | 5-dimension knowledge search with RRF fusion |
| POST | /v2/hunt/variants | Cross-project variant hunting for unmatched patterns |
| GET | /api/search | General search (text or code) |
| GET | /api/search/attack-vector | Search by attack technique name |
| GET | /api/search/component | Search by project and component |
| GET | /api/search/similar-tech | Cross-project search for same vulnerability class |
| GET | /api/search/chainable | Find CVEs chainable with a given CVE |
| POST | /api/search/patch-pattern | Search for patterns in vulnerability fixes |
| POST | /api/code-similarity | Direct code similarity with enriched results |

| Method | Path | Description |
|---|---|---|
| POST | /v2/submit | Submit a single vulnerability |
| POST | /v2/submit/bulk | Submit up to 100 vulnerabilities in batch |
| GET | /v2/vulnerability/{id} | Retrieve by internal ID or CVE identifier |
| DELETE | /v2/vulnerability/{id} | Delete a vulnerability record |
| POST | /v2/reindex | Trigger vector index rebuild |
| POST | /v2/context/preload | Pre-warm retrieval cache for a project |
| GET | /v2/projects/{project}/cves | List CVEs for a specific project |
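Because /v2/submit/bulk accepts at most 100 vulnerabilities per request, a client with a larger set needs to split it first. The helper below is a sketch of that client-side batching only. It does not send anything, since the request body shape for the bulk endpoint is not documented here, and the record contents are placeholders.

```python
from typing import Iterator

MAX_BULK = 100  # Per-request limit stated for /v2/submit/bulk.


def batches(records: list[dict], size: int = MAX_BULK) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` records."""
    if not 1 <= size <= MAX_BULK:
        raise ValueError(f"size must be between 1 and {MAX_BULK}")
    for start in range(0, len(records), size):
        yield records[start:start + size]


# 250 placeholder records split into requests the endpoint would accept.
records = [{"project": "my-project", "note": f"example-{n}"} for n in range(250)]
sizes = [len(batch) for batch in batches(records)]
print(sizes)
```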

| Method | Path | Description |
|---|---|---|
| GET | /api/research/stats | Language comparisons, attack distributions, trends |
| GET | /api/research/php-deep-dive | PHP ecosystem deep analysis |
| GET | /api/research/language/{lang} | Detailed stats for a specific language |
| GET | /api/research/compare?languages=php,go,c | Side-by-side language comparison |

| Method | Path | Description |
|---|---|---|
| GET | /api/cve/{id}/related-variants | Find regressions and variants of a known CVE |

| Method | Path | Description |
|---|---|---|
| POST | /api/clanker/clone | Clone and index a GitHub repository |
| POST | /api/clanker/ask | Ask questions about an indexed codebase |
| POST | /api/clanker/search | Raw vector search for strategy comparison |
| GET | /api/clanker/repos | List indexed repositories |
Every vulnerability, regardless of source, is normalized to a unified Pydantic model:
vulnerability:
  id: "unified-{source}-{cve_id}"
  source_ids:
    cve: "CVE-2023-XXXXX"
    ghsa: "GHSA-xxxx-yyyy"
  project:
    name: "project-name"
    language: "c"
    repo_url: "https://github.com/org/repo"
    component: "http2_codec"
  classification:
    cwe_id: "CWE-400"
    cwe_name: "Uncontrolled Resource Consumption"
    attack_class: "dos"
    protocol_component: "http2_codec"
  severity:
    cvss_vector: "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H"
    cvss_score: 7.5
    severity_label: "HIGH"
  trigger:
    input_type: "http2_frame"
    attacker_control: "rst_stream_flood"
  code:
    vulnerable:
      file: "src/http2/codec.c"
      function: "on_frame_received"
      snippet: "// vulnerable code"
    fixed:
      file: "src/http2/codec.c"
      function: "on_frame_received"
      snippet: "// fixed code"
    diff: "// unified diff"
  metadata:
    disclosed_date: "2023-10-10"
    fix_commit: "abc123"
    affected_versions: ["< 1.28.0"]
    fixed_versions: ["1.28.0"]
Attack Classes: smuggling, overflow, injection, dos, bypass, disclosure, ssrf, traversal, deserialization, race
Protocol Components: http1_parser, http2_codec, http3_codec, websocket, connect_handler, tls_handler, header_handling, body_handling, connection_pool, config_parser, lua_sandbox, wasm_sandbox, health_check, auth_filter, rate_limit
Dangerous Operations: buffer_alloc, memcpy, state_transition, resource_alloc, eval_execute
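The enumerations above lend themselves to validation at construction time. The sketch below checks the classification fields using a standard-library dataclass in place of the project's Pydantic model, so the class and its error messages are illustrative rather than the real schema code.

```python
import re
from dataclasses import dataclass

ATTACK_CLASSES = frozenset({
    "smuggling", "overflow", "injection", "dos", "bypass",
    "disclosure", "ssrf", "traversal", "deserialization", "race",
})

PROTOCOL_COMPONENTS = frozenset({
    "http1_parser", "http2_codec", "http3_codec", "websocket", "connect_handler",
    "tls_handler", "header_handling", "body_handling", "connection_pool",
    "config_parser", "lua_sandbox", "wasm_sandbox", "health_check",
    "auth_filter", "rate_limit",
})


@dataclass(frozen=True)
class Classification:
    cwe_id: str
    attack_class: str
    protocol_component: str

    def __post_init__(self) -> None:
        # Reject values outside the documented vocabularies.
        if not re.fullmatch(r"CWE-\d+", self.cwe_id):
            raise ValueError(f"malformed CWE identifier: {self.cwe_id!r}")
        if self.attack_class not in ATTACK_CLASSES:
            raise ValueError(f"unknown attack class: {self.attack_class!r}")
        if self.protocol_component not in PROTOCOL_COMPONENTS:
            raise ValueError(f"unknown protocol component: {self.protocol_component!r}")


valid = Classification(cwe_id="CWE-400", attack_class="dos", protocol_component="http2_codec")
print(valid)
```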
src/vulnrag/
├── cli.py # Typer CLI entry point
├── config/
│ └── settings.py # Pydantic settings (env vars, paths)
│
├── collectors/ # Source-specific data collection
│ ├── base.py # Base collector interface
│ ├── github_advisory.py # GitHub Security Advisories (GraphQL + REST)
│ ├── cvefixes.py # CVEfixes SQLite database
│ ├── morefixes.py # MoreFixes PostgreSQL dump importer
│ ├── nvd.py # National Vulnerability Database
│ ├── megavul.py # MegaVul HuggingFace dataset
│ ├── bigvul.py # BigVul HuggingFace dataset
│ ├── security_dpo.py # Security DPO dataset
│ └── code_fetcher.py # GitHub code extraction from commits
│
├── normalizers/ # Schema definition and transformation
│ ├── schema.py # Canonical vulnerability model (Pydantic)
│ ├── schema_v2.py # V2 schema with 5-dimensional knowledge
│ └── transformer.py # Source-to-canonical transformers
│
├── storage/ # Relational persistence
│ ├── models.py # SQLAlchemy ORM models
│ └── sqlite_store.py # CRUD operations
│
├── embeddings/ # Vector embedding generation
│ ├── code_embedder.py # Code embeddings (UnixCoder)
│ └── text_embedder.py # Text embeddings (BGE)
│
├── vectordb/ # Vector store management
│ ├── collections.py # ChromaDB collection manager
│ └── indexer.py # Indexing pipeline
│
├── retrieval/ # Multi-strategy RAG retrieval
│ ├── fusion.py # Weighted strategy fusion + dedup
│ ├── augmenter.py # Query augmentation (sinks, protocols)
│ └── hybrid.py # Hybrid retrieval combining strategies
│
├── api/ # REST API
│ ├── server.py # FastAPI application
│ ├── schemas.py # Request/response Pydantic models
│ ├── v2_endpoints.py # V2 dimension search, submission, variants
│ └── v3_endpoints.py # V3 experimental endpoints
│
├── web/ # Web UI
│ ├── app.py # FastAPI + Jinja2 web application
│ └── templates/ # HTML templates
│
├── generators/ # Detection rule generation
│ ├── base.py # Base generator interface
│ ├── semgrep.py # Semgrep YAML rule generator
│ ├── codeql.py # CodeQL query generator
│ └── mappings.py # Language-specific sink/source mappings
│
├── clanker/ # Code assistant for indexed repos
│ ├── agent.py # RAG-powered Q&A agent
│ ├── collections.py # Repository collection manager
│ ├── indexer.py # Repository code indexer
│ ├── parsers/ # Code parsing strategies
│ │ ├── ast_parser.py # AST-based parsing
│ │ ├── function_parser.py # Function-level extraction
│ │ └── chunk_parser.py # Fixed-size chunk parsing
│ ├── query_planner.py # Multi-query planning for complex questions
│ ├── query_transformer.py # Natural language to code query translation
│ ├── result_reranker.py # Result relevance reranking
│ └── enricher.py # Context enrichment
│
├── agent/ # Natural language query agent
│ └── query_agent.py # Translates questions into API calls
│
├── extractors/ # Knowledge dimension extraction
│ ├── dimension_extractor.py
│ └── batch_extractor.py
│
├── graph/ # Knowledge graph storage
│ ├── store.py
│ └── builder.py
│
└── utils/
    ├── logging.py # Structured logging
    └── rate_limiter.py # GitHub API rate limiting
# Build the image
docker build -t vulnrag:latest .
# Run with persistent data volume
docker run -d \
--name vulnrag \
-p 8001:8001 \
-v vulnrag-data:/app/data \
-e GITHUB_TOKEN=your_token \
  vulnrag:latest
# Start
docker-compose up -d
# View logs
docker-compose logs -f
# Stop
docker-compose down
The Docker container runs as a non-root user, includes health checks, and allocates 2-4 GB memory for embedding model operations. Data persists across restarts via named volumes.
Search for vulnerability patterns from one project that might exist in another:
# Collect from two repositories
vulnrag collect --source github --owner <org-a> --project <repo-a> --with-code
vulnrag collect --source github --owner <org-b> --project <repo-b> --with-code
# Build the index
vulnrag index --rebuild
# Search for cross-project patterns while analyzing a code file
vulnrag search --file http2_handler.c --language c --project <repo-b>
Submit a vulnerability through the REST API:
curl -X POST http://localhost:8000/v2/submit \
  -H "Content-Type: application/json" \
  -d '{
    "cve_id": "CVE-2024-XXXXX",
    "project": "my-project",
    "language": "php",
    "cwe_id": "CWE-434",
    "attack_class": "injection",
    "severity": "MEDIUM",
    "cvss_score": 4.3,
    "vulnerable_code": {
      "snippet": "public function upload(Request $request) {\n $file = $request->file(\"file\");\n $file->move(public_path(\"uploads\"), $file->getClientOriginalName());\n}",
      "file": "src/Controllers/UploadController.php"
    },
    "references": ["https://nvd.nist.gov/vuln/detail/CVE-2024-XXXXX"],
    "auto_index": true
  }'
Find CVEs that could be chained with a given CVE:
curl "http://localhost:8000/api/search/chainable?cve=CVE-2023-XXXXX&limit=10"
This returns CVEs in the same project that could be chained with the given CVE based on complementary attack class relationships (e.g., information disclosure -> auth bypass -> injection).
# Analyze Go code and generate Semgrep rules from similar CVEs
vulnrag generate \
--file internal/proxy/handler.go \
--language go \
--format semgrep \
--output ./rules/ \
  --limit 3
Output: Semgrep YAML rules with pattern matching based on CVE-matched sink functions, confidence scores, and CWE references.
# Compare vulnerability profiles across languages
curl "http://localhost:8000/api/research/compare?languages=c,go,rust,php"
# Get detailed stats for a specific language
curl "http://localhost:8000/api/research/language/c"
# Import all C and C++ vulnerabilities with code
vulnrag collect --source cvefixes \
--db-path ./CVEfixes.db \
--language C
vulnrag collect --source cvefixes \
--db-path ./CVEfixes.db \
--language "C++"
# Rebuild the index
vulnrag index --rebuild
# Check statistics
vulnrag stats
License: MIT