Cross-project vulnerability intelligence system. Aggregates CVEs from multiple databases with source code, indexes them as vector embeddings, and retrieves patterns across projects and languages.

chasingimpact/vulnrag

VulnRAG

Cross-Project Vulnerability Intelligence System

About

VulnRAG is a vulnerability intelligence platform that aggregates known CVEs from multiple public databases, stores them with their actual source code (vulnerable and fixed versions), and makes them searchable by code similarity, natural language, attack classification, or any combination. It uses vector embeddings and multi-strategy retrieval to surface patterns across projects and languages -- turning scattered vulnerability data into a structured, queryable knowledge base for security research.

What it does

  1. Collects vulnerability data from 7 sources: GitHub Security Advisories, CVEfixes, NVD, MegaVul, BigVul, MoreFixes, and Security DPO. Each vulnerability is stored with its CVE ID, severity score, CWE classification, affected project, and -- when available -- the actual vulnerable and fixed source code.

  2. Indexes the collected data into searchable vector embeddings. Code snippets are embedded using UnixCoder (a model trained on source code across languages). Descriptions are embedded using BGE (a text embedding model). Both are stored in ChromaDB for fast similarity search.

  3. Retrieves relevant vulnerabilities using four strategies that run in parallel and get merged into a single ranked result set:

    • Code similarity: "Does this code look like known vulnerable code?"
    • Pattern matching: "What CVEs involve this type of dangerous function in this protocol component?"
    • Cross-project search: "This pattern caused a CVE in project A -- does project B have something similar?"
    • Conceptual search: "Find everything related to HTTP request smuggling via chunked encoding"
  4. Generates Semgrep rules and CodeQL queries from matched vulnerabilities, so you can scan codebases for the same patterns.


Architecture

                         +------------------+
                         |   Data Sources   |
                         | GitHub | CVEfixes|
                         | NVD    | MegaVul |
                         | BigVul | MoreFix |
                         | Security DPO     |
                         +--------+---------+
                                  |
                         +--------v---------+
                         |    Collectors    |
                         |  (7 source-      |
                         |   specific       |
                         |   adapters)      |
                         +--------+---------+
                                  |
                         +--------v---------+
                         |   Normalizers    |
                         | Canonical Schema |
                         | Transformers     |
                         +--------+---------+
                                  |
                    +-------------+-------------+
                    |                           |
           +--------v---------+       +--------v---------+
           |   SQLite Store   |       |    ChromaDB      |
           | Vulnerabilities  |       | Code Embeddings  |
           | Code Samples     |       | Desc Embeddings  |
           | Versions         |       | Pattern Index    |
           | References       |       | Dimension Index  |
           +--------+---------+       +--------+---------+
                    |                           |
                    +-------------+-------------+
                                  |
                         +--------v---------+
                         | Retrieval Fusion |
                         | Code Similarity  |
                         | Pattern-Based    |
                         | Cross-Project    |
                         | Conceptual       |
                         +--------+---------+
                                  |
              +-------------------+-------------------+
              |                   |                   |
     +--------v------+  +--------v------+  +---------v-----+
     |     CLI       |  |   REST API    |  |    Web UI     |
     | vulnrag ...   |  | FastAPI v2/v3 |  | Search/Browse |
     +---------------+  +---------------+  +---------------+

Data Flow

  1. Collectors pull raw vulnerability data from external sources (GitHub GraphQL API, NVD REST API, SQLite and PostgreSQL database dumps, HuggingFace datasets)
  2. Transformers normalize each record to a canonical Pydantic schema, classifying attack type, protocol component, and dangerous operations
  3. SQLite Store persists the relational data (vulnerabilities, code samples, version info, references) with composite indexes for fast filtering
  4. Vector Indexer generates embeddings and populates ChromaDB collections with filterable metadata
  5. Retrieval Fusion executes multiple search strategies in parallel, deduplicates, applies multi-strategy boosting, and returns ranked results
  6. Consumers (CLI, API, Web UI) present results with CVE references, code context, and cross-project citations

Data Pipeline

Phase 1: Collection

Each data source has a dedicated collector that knows how to pull and parse its format:

| Source | What it provides | How it's collected |
| --- | --- | --- |
| GitHub Security Advisories | Real-time advisories with commit references | GraphQL API queries per repository |
| CVEfixes | Historical vulnerable/fixed code pairs | SQLite database import |
| MoreFixes | Extended fix dataset | PostgreSQL dump import |
| NVD | CVSS scores, CWE IDs, version ranges | REST API queries |
| MegaVul | 43,690 vulnerable function samples | HuggingFace dataset download |
| BigVul | Large-scale vulnerability dataset | HuggingFace dataset download |
| Security DPO | Multi-language vulnerability samples | HuggingFace dataset download |

Phase 2: Normalization and Indexing

Every record is converted to a single canonical schema (see Canonical Schema) regardless of which source it came from. This means a CVE imported from NVD and the same CVE imported from CVEfixes get merged into one record with combined metadata.
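
As a rough illustration of that merge step, the sketch below folds per-source records into one record per CVE ID. The field names (`cve_id`, `source`, `cvss_score`, `vulnerable_code`) and the "first non-empty value wins" rule are assumptions made for this example, not a description of VulnRAG's actual transformer logic.

```python
def merge_records(records):
    """Merge per-source records that share a CVE ID into one record each.

    Assumed rule: the first non-empty value seen for a field wins, and every
    contributing source is remembered under "sources".
    """
    merged = {}
    for record in records:
        target = merged.setdefault(
            record["cve_id"], {"cve_id": record["cve_id"], "sources": []}
        )
        if record["source"] not in target["sources"]:
            target["sources"].append(record["source"])
        for field, value in record.items():
            if field in ("cve_id", "source"):
                continue
            # Only fill a field that is still missing or empty.
            if value not in (None, "") and target.get(field) in (None, ""):
                target[field] = value
    return list(merged.values())


# Hypothetical records for the same CVE coming from two sources.
nvd = {"cve_id": "CVE-2023-44487", "source": "nvd",
       "cvss_score": 7.5, "vulnerable_code": None}
cvefixes = {"cve_id": "CVE-2023-44487", "source": "cvefixes",
            "cvss_score": None, "vulnerable_code": "// vulnerable code"}

combined = merge_records([nvd, cvefixes])
```

Here NVD contributes the score and CVEfixes contributes the code, so `combined` holds a single record carrying both.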

The normalized records are then embedded into three ChromaDB collections:

| Collection | What's embedded | Used for |
| --- | --- | --- |
| vuln_by_code | Vulnerable code snippets (via UnixCoder) | "Show me CVEs with code that looks like this" |
| vuln_by_description | Vulnerability descriptions (via BGE) | "Show me CVEs that match this description" |
| vuln_by_pattern | Structured metadata (attack class, CWE, protocol) | "Show me all buffer overflows in HTTP/2 parsers" |
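
To make the collection lookups more concrete, here is a dependency-free sketch of what a filtered similarity query does conceptually: rank stored vectors by cosine similarity to a query vector, optionally restricted by metadata. The three-dimensional vectors, the IDs, and the `query` helper are invented for illustration. In VulnRAG the vectors come from UnixCoder or BGE and the search itself is delegated to ChromaDB.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query(collection, query_vec, where=None, n_results=2):
    """Return the IDs of the n most similar items matching the metadata filter."""
    where = where or {}
    candidates = [
        item for item in collection
        if all(item["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine(query_vec, item["embedding"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:n_results]]


# Toy collection with made-up embeddings and metadata.
collection = [
    {"id": "CVE-A", "embedding": [1.0, 0.0, 0.0], "metadata": {"attack_class": "overflow"}},
    {"id": "CVE-B", "embedding": [0.9, 0.1, 0.0], "metadata": {"attack_class": "dos"}},
    {"id": "CVE-C", "embedding": [0.0, 1.0, 0.0], "metadata": {"attack_class": "overflow"}},
]

top = query(collection, [1.0, 0.0, 0.0])
overflow_only = query(collection, [1.0, 0.0, 0.0], where={"attack_class": "overflow"})
```

Without a filter the two nearest vectors win. With the `attack_class` filter, `CVE-B` is dropped before ranking even though it is closer than `CVE-C`.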

Phase 3: Retrieval

When you search, four strategies run in parallel and their results are fused:

  1. Your query code or text gets embedded using the same models
  2. Each strategy searches its relevant collection(s) with optional filters
  3. Results are deduplicated across strategies
  4. Items found by multiple strategies get a confidence boost
  5. Final results are ranked by combined score
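
The parallel fan-out and deduplication steps above could look roughly like the following. The four strategy functions are stand-ins that return fixed, made-up CVE IDs, and the use of a thread pool is an assumption about one reasonable way to run them concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-in strategies: each returns CVE IDs in ranked order (hypothetical data).
def code_similarity(query):
    return ["CVE-1", "CVE-2"]


def pattern_based(query):
    return ["CVE-2", "CVE-3"]


def cross_project(query):
    return ["CVE-4"]


def conceptual(query):
    return ["CVE-2", "CVE-5"]


STRATEGIES = {
    "code_similarity": code_similarity,
    "pattern_based": pattern_based,
    "cross_project": cross_project,
    "conceptual": conceptual,
}


def run_strategies(query):
    """Run every strategy concurrently and collect its ranked list by name."""
    with ThreadPoolExecutor(max_workers=len(STRATEGIES)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in STRATEGIES.items()}
        return {name: future.result() for name, future in futures.items()}


def found_by(results):
    """Deduplicate: map each CVE ID to the strategies that returned it."""
    hits = {}
    for name, ids in results.items():
        for cve_id in ids:
            hits.setdefault(cve_id, []).append(name)
    return hits


results = run_strategies("HTTP/2 rapid reset")
hits = found_by(results)
```

Entries in `hits` with more than one strategy are the ones that would qualify for the multi-strategy confidence boost.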

Retrieval Strategies

1. Code Similarity (weight: 1.0)

Embeds a target code snippet using UnixCoder and queries vuln_by_code. Returns vulnerabilities with structurally or semantically similar vulnerable code, even across languages.

When to use: You have a piece of code and want to know if it resembles any known vulnerable pattern.

2. Pattern-Based (weight: 0.8)

Filters by extracted characteristics: protocol component, dangerous sink functions, and CWE identifiers. Searches the description collection with metadata constraints.

When to use: You know what kind of vulnerability you're looking for (e.g., "resource exhaustion in connection pool handling").

3. Cross-Project (weight: 0.7)

Queries for CVEs in other projects that share the same protocol component or attack class. The current project is explicitly excluded to surface patterns that exist elsewhere but haven't been found locally.

When to use: You're reviewing one project and want to know what bugs similar projects have had in the same area.
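
A minimal sketch of the cross-project selection rule, assuming each record carries `project`, `protocol_component`, and `attack_class` fields as in the canonical schema. The function name and sample records are hypothetical.

```python
def cross_project_candidates(records, current_project,
                             protocol_component=None, attack_class=None):
    """Return CVE IDs from other projects sharing a component or attack class."""
    matches = []
    for record in records:
        if record["project"] == current_project:
            continue  # explicitly exclude the project under review
        same_component = (protocol_component is not None
                          and record["protocol_component"] == protocol_component)
        same_class = (attack_class is not None
                      and record["attack_class"] == attack_class)
        if same_component or same_class:
            matches.append(record["cve_id"])
    return matches


# Made-up records for illustration.
records = [
    {"cve_id": "CVE-1", "project": "proxy-a", "protocol_component": "http2_codec", "attack_class": "dos"},
    {"cve_id": "CVE-2", "project": "proxy-b", "protocol_component": "http2_codec", "attack_class": "dos"},
    {"cve_id": "CVE-3", "project": "proxy-c", "protocol_component": "tls_handler", "attack_class": "bypass"},
]

candidates = cross_project_candidates(
    records, current_project="proxy-a", protocol_component="http2_codec"
)
```

Only `CVE-2` is returned: `CVE-1` belongs to the project being reviewed and `CVE-3` touches a different component.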

4. Conceptual (weight: 0.5)

Embeds a natural language query using BGE and searches descriptions without metadata filters. Catches conceptual relationships that don't map to specific code patterns.

When to use: Broad exploratory queries like "rapid reset denial of service variants across implementations".

Score Fusion

Results from all strategies are combined using Reciprocal Rank Fusion (RRF):

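RRF Score = sum(1 / (k + rank_i)) for each strategy

Here rank_i is the result's rank in strategy i (1 for the top hit) and k is a smoothing constant that dampens the influence of any single top-ranked result.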

If the same CVE shows up in multiple strategies, its score gets a 1.2x boost -- the logic being that if both code similarity and cross-project search flag the same vulnerability, it's more likely to be a real match.
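
The sketch below shows one way the pieces described in this section might fit together. It is an assumption, not the project's confirmed implementation: the README does not say how the per-strategy weights enter the RRF sum, so each term is simply multiplied by its strategy weight, and `K = 60` is the value commonly used for RRF rather than a documented VulnRAG setting.

```python
STRATEGY_WEIGHTS = {
    "code_similarity": 1.0,
    "pattern_based": 0.8,
    "cross_project": 0.7,
    "conceptual": 0.5,
}
K = 60                      # assumed smoothing constant
MULTI_STRATEGY_BOOST = 1.2  # boost for CVEs found by more than one strategy


def fuse(ranked_lists):
    """Combine per-strategy ranked CVE lists into one scored, sorted list."""
    scores, seen_in = {}, {}
    for strategy, ids in ranked_lists.items():
        weight = STRATEGY_WEIGHTS[strategy]
        for rank, cve_id in enumerate(ids, start=1):
            scores[cve_id] = scores.get(cve_id, 0.0) + weight / (K + rank)
            seen_in[cve_id] = seen_in.get(cve_id, 0) + 1
    for cve_id, count in seen_in.items():
        if count > 1:
            scores[cve_id] *= MULTI_STRATEGY_BOOST
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical rankings from two strategies.
fused = fuse({
    "code_similarity": ["CVE-1", "CVE-2"],
    "cross_project": ["CVE-2", "CVE-3"],
})
```

In this example `CVE-2` ends up first even though neither strategy ranked it above everything else on weighted score alone, because it accumulates two terms and then receives the 1.2x boost.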

Query Augmentation

Before retrieval, the system analyzes target code to extract searchable characteristics:

  • Dangerous sinks: Language-specific functions known to be security-sensitive (e.g., memcpy, sprintf, exec, eval)
  • Protocol hints: Detected protocol handling patterns (HTTP/1.x parsing, HTTP/2 frames, WebSocket, TLS)
  • Risk estimation: Heuristic risk level based on detected sinks and patterns

These extracted features are added to the query to improve retrieval precision beyond what pure embedding similarity provides.
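
A simplified sketch of this kind of feature extraction is shown below. The sink lists, protocol keywords, and risk thresholds are illustrative assumptions. The real per-language mappings and heuristics are not documented in this README.

```python
import re

# Illustrative sink lists only (assumed, not VulnRAG's actual mappings).
SINKS = {
    "c": ["memcpy", "sprintf", "strcpy"],
    "python": ["exec", "eval"],
}

# Assumed keyword hints mapped to protocol component names from the taxonomy.
PROTOCOL_HINTS = {
    "http2_codec": ["rst_stream", "http2", "nghttp2"],
    "websocket": ["websocket", "sec-websocket-key"],
    "tls_handler": ["ssl_ctx", "tls_handshake"],
}


def augment(code, language):
    """Extract dangerous sinks, protocol hints, and a coarse risk level."""
    sinks = [
        name for name in SINKS.get(language, [])
        if re.search(r"\b" + re.escape(name) + r"\s*\(", code)
    ]
    lowered = code.lower()
    protocols = [
        component for component, hints in PROTOCOL_HINTS.items()
        if any(hint in lowered for hint in hints)
    ]
    risk = "high" if len(sinks) >= 2 else "medium" if sinks else "low"
    return {"sinks": sinks, "protocols": protocols, "risk": risk}


snippet = "void on_rst_stream(char *dst, const char *src, size_t n) { memcpy(dst, src, n); }"
features = augment(snippet, "c")
```

The resulting `features` dictionary could then be attached to the query as metadata filters for the pattern-based and cross-project strategies.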


Data Sources

CVEfixes Database

Pre-structured vulnerable/fixed code pairs at method-level granularity with CWE mappings. Approximately 12,000 CVEs across languages. Available from Zenodo.

GitHub Security Advisories

Real-time advisory data via GitHub's GraphQL API. For each repository, the collector fetches advisory metadata, affected version ranges, and references to fixing commits. When --with-code is enabled, it clones the repository at the pre-fix and post-fix commits to extract actual code changes.

National Vulnerability Database

NVD records provide CVSS scoring, CWE classification, and version range information. Used to enrich existing records and fill metadata gaps.

HuggingFace Datasets

  • MegaVul (hitoshura25/megavul): 43,690 vulnerable function samples with rich context
  • BigVul (bstee615/bigvul): Large-scale vulnerable function dataset with binary vulnerability labels
  • Security DPO (CyberNative/Code_Vulnerability_Security_DPO): Multi-language vulnerability samples

Supplementary Research Sources

  • Published security research (HTTP request smuggling, protocol-level attacks)
  • Public bug bounty disclosures
  • Academic papers (IEEE S&P, USENIX Security, CCS)
  • Security mailing lists (oss-security, full-disclosure)

Technology Stack

| Layer | Technology | Purpose |
| --- | --- | --- |
| Language | Python 3.10+ | Core runtime |
| CLI | Typer + Rich | Command-line interface |
| Web Framework | FastAPI + Uvicorn | REST API and web UI |
| Templating | Jinja2 | HTML rendering for web UI |
| ORM | SQLAlchemy 2.0 | Relational data persistence |
| Database | SQLite | Primary vulnerability store |
| Vector Store | ChromaDB 0.4+ | Embedding storage and similarity search |
| Code Embeddings | sentence-transformers (UnixCoder) | Code semantic representations |
| Text Embeddings | sentence-transformers (BGE) | Description semantic representations |
| Deep Learning | PyTorch 2.0+ | Embedding model inference |
| HTTP Client | httpx | Async API calls to external APIs |
| GraphQL | gql | GitHub Security Advisory queries |
| Git | GitPython | Repository cloning and diff extraction |
| Scraping | BeautifulSoup + lxml | Advisory page parsing |
| Validation | Pydantic 2.0+ | Schema validation and settings |
| Containerization | Docker | Production deployment |

Installation

Prerequisites

  • Python 3.10 or higher
  • Git
  • 4+ GB RAM (for embedding models)

From Source

git clone https://github.com/chasingimpact/vulnrag.git
cd vulnrag

# Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/macOS

# Install the package and dependencies
pip install -e .

Verify Installation

vulnrag --help

Configuration

Copy the example environment file and configure:

cp .env.example .env

Required

| Variable | Description |
| --- | --- |
| GITHUB_TOKEN | GitHub Personal Access Token with public_repo and read:org scopes. Create one under GitHub Settings > Developer settings > Personal access tokens. |

Optional

| Variable | Default | Description |
| --- | --- | --- |
| ANTHROPIC_API_KEY | (none) | LLM API key for natural language query agent and code assistant features |
| DATA_DIR | ./data | Directory for databases, vector stores, and repo caches |
| GITHUB_REQUESTS_PER_HOUR | 4000 | GitHub API rate limit (max 5000) |
| CODE_EMBEDDING_MODEL | microsoft/unixcoder-base | Model for code embeddings |
| TEXT_EMBEDDING_MODEL | BAAI/bge-base-en-v1.5 | Model for text embeddings |
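
For readers who want to see how these variables and defaults fit together, here is a standard-library sketch. VulnRAG itself uses Pydantic settings. The `Settings` dataclass, the `load_settings` helper, and the range check on the rate limit are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    github_token: str
    anthropic_api_key: Optional[str]
    data_dir: str
    github_requests_per_hour: int
    code_embedding_model: str
    text_embedding_model: str


def load_settings(env=None):
    """Build settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise ValueError("GITHUB_TOKEN is required")
    rate = int(env.get("GITHUB_REQUESTS_PER_HOUR", "4000"))
    if not 1 <= rate <= 5000:
        raise ValueError("GITHUB_REQUESTS_PER_HOUR must be between 1 and 5000")
    return Settings(
        github_token=token,
        anthropic_api_key=env.get("ANTHROPIC_API_KEY"),
        data_dir=env.get("DATA_DIR", "./data"),
        github_requests_per_hour=rate,
        code_embedding_model=env.get("CODE_EMBEDDING_MODEL", "microsoft/unixcoder-base"),
        text_embedding_model=env.get("TEXT_EMBEDDING_MODEL", "BAAI/bge-base-en-v1.5"),
    )


# Passing a dict instead of os.environ keeps the example self-contained.
settings = load_settings({"GITHUB_TOKEN": "example-token"})
```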

Usage

CLI Reference

# Initialize project directories and verify configuration
vulnrag init

# Collect from GitHub Security Advisories (metadata only)
vulnrag collect --source github --owner <org> --project <repo>

# Collect with actual vulnerable/fixed code from commits
vulnrag collect --source github --owner <org> --project <repo> --with-code

# Import from CVEfixes database
vulnrag collect --source cvefixes --db-path ./CVEfixes.db

# Import specific language from CVEfixes
vulnrag collect --source cvefixes --language C --limit 500

# Import from MoreFixes PostgreSQL dump
vulnrag collect --source morefixes --db-path ./morefixes.sql

# Collect from NVD
vulnrag collect --source nvd --project <name>

# Import from MegaVul dataset
vulnrag collect --source megavul

# Import from BigVul dataset
vulnrag collect --source bigvul

# Import from Security DPO dataset
vulnrag collect --source security_dpo --language python

# Build or update the vector index
vulnrag index

# Rebuild index from scratch
vulnrag index --rebuild

# Search by code file
vulnrag search --file target.c --language c --project myproject

# Search by natural language
vulnrag search --query "HTTP/2 rapid reset denial of service"

# Search with specific strategy
vulnrag search --query "buffer overflow in header parsing" --strategy code_similarity

# Export a vulnerability record
vulnrag export --id CVE-2023-44487 --format yaml

# Enrich vulnerabilities with code from referenced commits
vulnrag enrich --limit 100

# Generate detection rules from RAG-matched vulnerabilities
vulnrag generate --file target.go --language go --format both --output ./rules/

# View database and index statistics
vulnrag stats

# Start the web server
vulnrag serve --host 127.0.0.1 --port 8000

Web Interface

Start the server and navigate to http://127.0.0.1:8000:

  • Home (/) -- Dashboard with database statistics, project distribution, and language breakdown
  • Search (/search) -- Full-text and code similarity search interface
  • Browse (/browse) -- Filter vulnerabilities by project, attack class, severity, and code availability
  • Detail (/vuln/{id}) -- Individual vulnerability view with code samples, metadata, and references
  • Research (/research) -- Interactive research dashboard with statistical visualizations
  • Clanker (/clanker) -- Code assistant for indexed repositories

REST API

The server exposes a comprehensive REST API. Interactive documentation is available at /docs (Swagger UI).


Rule Generation

VulnRAG can generate Semgrep rules and CodeQL queries from matched vulnerabilities. Given a target code file, it:

  1. Analyzes the code to detect dangerous sink functions and protocol patterns
  2. Retrieves the most similar known vulnerabilities from the index
  3. Generates detection rules targeting the matched vulnerability patterns
  4. Includes CWE references, confidence scores, and source CVE citations

# Generate both Semgrep and CodeQL rules
vulnrag generate --file server.go --language go --format both --output ./rules/

# Generate only Semgrep rules with more matches
vulnrag generate --file parser.c --language c --format semgrep --limit 5
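
To give a feel for the kind of output involved, the sketch below renders a small Semgrep rule from a matched sink. It is a hand-written approximation under stated assumptions: the rule ID scheme, the message wording, and the `source_cve` and `vulnrag_confidence` metadata keys are invented here, and the actual generator output may differ. The top-level keys (`rules`, `id`, `languages`, `severity`, `message`, `pattern`, `metadata`) follow Semgrep's rule syntax.

```python
def semgrep_rule(cve_id, cwe_id, sink, language, confidence):
    """Render a single-pattern Semgrep rule as YAML text."""
    rule_id = f"vulnrag-{cve_id.lower()}-{sink}"
    lines = [
        "rules:",
        f"  - id: {rule_id}",
        f"    languages: [{language}]",
        "    severity: WARNING",
        f'    message: "Call to {sink}() resembles the pattern behind {cve_id} ({cwe_id})"',
        f"    pattern: {sink}(...)",  # "..." is Semgrep's ellipsis for any arguments
        "    metadata:",
        f'      cwe: "{cwe_id}"',
        f'      source_cve: "{cve_id}"',          # hypothetical metadata key
        f"      vulnrag_confidence: {confidence}",  # hypothetical metadata key
    ]
    return "\n".join(lines) + "\n"


# Placeholder identifiers, not a real match.
rule = semgrep_rule("CVE-2024-XXXXX", "CWE-120", "memcpy", "c", 0.82)
```

A rule this broad flags every `memcpy` call, so in practice the pattern would need narrowing (for example with `pattern-not` clauses for calls that are bounds-checked) before it is useful on a real codebase.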

Supported Languages

Semgrep rules: Go, Java, Python, JavaScript, TypeScript, C, C++, Rust, Ruby, PHP, C#, Kotlin, Scala, Swift

CodeQL queries: Go, Java, Python, JavaScript, C, C++


API Endpoints

Health and Statistics

| Method | Path | Description |
| --- | --- | --- |
| GET | /v2/health | System health check (database + vector store status) |
| GET | /v2/stats | Comprehensive statistics with dimension coverage |

Search

| Method | Path | Description |
| --- | --- | --- |
| POST | /v2/search/code | Code similarity search with vulnerable/fixed code pairs |
| POST | /v2/search/pattern | Structured search by attack class, protocol component, CWE, severity |
| POST | /v2/search/dimensions | 5-dimension knowledge search with RRF fusion |
| POST | /v2/hunt/variants | Cross-project variant hunting for unmatched patterns |
| GET | /api/search | General search (text or code) |
| GET | /api/search/attack-vector | Search by attack technique name |
| GET | /api/search/component | Search by project and component |
| GET | /api/search/similar-tech | Cross-project search for same vulnerability class |
| GET | /api/search/chainable | Find CVEs chainable with a given CVE |
| POST | /api/search/patch-pattern | Search for patterns in vulnerability fixes |
| POST | /api/code-similarity | Direct code similarity with enriched results |

Data Management

| Method | Path | Description |
| --- | --- | --- |
| POST | /v2/submit | Submit a single vulnerability |
| POST | /v2/submit/bulk | Submit up to 100 vulnerabilities in batch |
| GET | /v2/vulnerability/{id} | Retrieve by internal ID or CVE identifier |
| DELETE | /v2/vulnerability/{id} | Delete a vulnerability record |
| POST | /v2/reindex | Trigger vector index rebuild |
| POST | /v2/context/preload | Pre-warm retrieval cache for a project |
| GET | /v2/projects/{project}/cves | List CVEs for a specific project |

Research and Analytics

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/research/stats | Language comparisons, attack distributions, trends |
| GET | /api/research/php-deep-dive | PHP ecosystem deep analysis |
| GET | /api/research/language/{lang} | Detailed stats for a specific language |
| GET | /api/research/compare?languages=php,go,c | Side-by-side language comparison |

Variant and Regression Analysis

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/cve/{id}/related-variants | Find regressions and variants of a known CVE |

Code Assistant

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/clanker/clone | Clone and index a GitHub repository |
| POST | /api/clanker/ask | Ask questions about an indexed codebase |
| POST | /api/clanker/search | Raw vector search for strategy comparison |
| GET | /api/clanker/repos | List indexed repositories |

Canonical Vulnerability Schema

Every vulnerability, regardless of source, is normalized to a unified Pydantic model:

vulnerability:
  id: "unified-{source}-{cve_id}"
  source_ids:
    cve: "CVE-2023-XXXXX"
    ghsa: "GHSA-xxxx-yyyy"

  project:
    name: "project-name"
    language: "c"
    repo_url: "https://github.com/org/repo"
    component: "http2_codec"

  classification:
    cwe_id: "CWE-400"
    cwe_name: "Uncontrolled Resource Consumption"
    attack_class: "dos"
    protocol_component: "http2_codec"

  severity:
    cvss_vector: "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H"
    cvss_score: 7.5
    severity_label: "HIGH"

  trigger:
    input_type: "http2_frame"
    attacker_control: "rst_stream_flood"

  code:
    vulnerable:
      file: "src/http2/codec.c"
      function: "on_frame_received"
      snippet: "// vulnerable code"
    fixed:
      file: "src/http2/codec.c"
      function: "on_frame_received"
      snippet: "// fixed code"
    diff: "// unified diff"

  metadata:
    disclosed_date: "2023-10-10"
    fix_commit: "abc123"
    affected_versions: ["< 1.28.0"]
    fixed_versions: ["1.28.0"]

Taxonomies

Attack Classes: smuggling, overflow, injection, dos, bypass, disclosure, ssrf, traversal, deserialization, race

Protocol Components: http1_parser, http2_codec, http3_codec, websocket, connect_handler, tls_handler, header_handling, body_handling, connection_pool, config_parser, lua_sandbox, wasm_sandbox, health_check, auth_filter, rate_limit

Dangerous Operations: buffer_alloc, memcpy, state_transition, resource_alloc, eval_execute
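
The following is a trimmed-down, standard-library sketch of how such a schema and its taxonomy check might be modelled. The real model is a Pydantic class with many more fields. The class names, the chosen subset of fields, and the validation behaviour here are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Attack classes copied from the taxonomy above.
ATTACK_CLASSES = {
    "smuggling", "overflow", "injection", "dos", "bypass",
    "disclosure", "ssrf", "traversal", "deserialization", "race",
}


@dataclass
class Classification:
    cwe_id: str
    attack_class: str
    protocol_component: Optional[str] = None

    def __post_init__(self):
        # Reject values outside the taxonomy (assumed behaviour).
        if self.attack_class not in ATTACK_CLASSES:
            raise ValueError(f"unknown attack class: {self.attack_class}")


@dataclass
class Vulnerability:
    source: str
    cve_id: str
    project: str
    language: str
    classification: Classification
    cvss_score: Optional[float] = None
    affected_versions: List[str] = field(default_factory=list)

    @property
    def id(self):
        """Build the unified ID in the "unified-{source}-{cve_id}" form."""
        return f"unified-{self.source}-{self.cve_id}"


vuln = Vulnerability(
    source="nvd",
    cve_id="CVE-2023-XXXXX",
    project="project-name",
    language="c",
    classification=Classification("CWE-400", "dos", "http2_codec"),
    cvss_score=7.5,
    affected_versions=["< 1.28.0"],
)
```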


Project Structure

src/vulnrag/
├── cli.py                    # Typer CLI entry point
├── config/
│   └── settings.py           # Pydantic settings (env vars, paths)
│
├── collectors/               # Source-specific data collection
│   ├── base.py               # Base collector interface
│   ├── github_advisory.py    # GitHub Security Advisories (GraphQL + REST)
│   ├── cvefixes.py           # CVEfixes SQLite database
│   ├── morefixes.py          # MoreFixes PostgreSQL dump importer
│   ├── nvd.py                # National Vulnerability Database
│   ├── megavul.py            # MegaVul HuggingFace dataset
│   ├── bigvul.py             # BigVul HuggingFace dataset
│   ├── security_dpo.py       # Security DPO dataset
│   └── code_fetcher.py       # GitHub code extraction from commits
│
├── normalizers/              # Schema definition and transformation
│   ├── schema.py             # Canonical vulnerability model (Pydantic)
│   ├── schema_v2.py          # V2 schema with 5-dimensional knowledge
│   └── transformer.py        # Source-to-canonical transformers
│
├── storage/                  # Relational persistence
│   ├── models.py             # SQLAlchemy ORM models
│   └── sqlite_store.py       # CRUD operations
│
├── embeddings/               # Vector embedding generation
│   ├── code_embedder.py      # Code embeddings (UnixCoder)
│   └── text_embedder.py      # Text embeddings (BGE)
│
├── vectordb/                 # Vector store management
│   ├── collections.py        # ChromaDB collection manager
│   └── indexer.py            # Indexing pipeline
│
├── retrieval/                # Multi-strategy RAG retrieval
│   ├── fusion.py             # Weighted strategy fusion + dedup
│   ├── augmenter.py          # Query augmentation (sinks, protocols)
│   └── hybrid.py             # Hybrid retrieval combining strategies
│
├── api/                      # REST API
│   ├── server.py             # FastAPI application
│   ├── schemas.py            # Request/response Pydantic models
│   ├── v2_endpoints.py       # V2 dimension search, submission, variants
│   └── v3_endpoints.py       # V3 experimental endpoints
│
├── web/                      # Web UI
│   ├── app.py                # FastAPI + Jinja2 web application
│   └── templates/            # HTML templates
│
├── generators/               # Detection rule generation
│   ├── base.py               # Base generator interface
│   ├── semgrep.py            # Semgrep YAML rule generator
│   ├── codeql.py             # CodeQL query generator
│   └── mappings.py           # Language-specific sink/source mappings
│
├── clanker/                  # Code assistant for indexed repos
│   ├── agent.py              # RAG-powered Q&A agent
│   ├── collections.py        # Repository collection manager
│   ├── indexer.py            # Repository code indexer
│   ├── parsers/              # Code parsing strategies
│   │   ├── ast_parser.py     # AST-based parsing
│   │   ├── function_parser.py # Function-level extraction
│   │   └── chunk_parser.py   # Fixed-size chunk parsing
│   ├── query_planner.py      # Multi-query planning for complex questions
│   ├── query_transformer.py  # Natural language to code query translation
│   ├── result_reranker.py    # Result relevance reranking
│   └── enricher.py           # Context enrichment
│
├── agent/                    # Natural language query agent
│   └── query_agent.py        # Translates questions into API calls
│
├── extractors/               # Knowledge dimension extraction
│   ├── dimension_extractor.py
│   └── batch_extractor.py
│
├── graph/                    # Knowledge graph storage
│   ├── store.py
│   └── builder.py
│
└── utils/
    ├── logging.py            # Structured logging
    └── rate_limiter.py       # GitHub API rate limiting

Deployment

Docker

# Build the image
docker build -t vulnrag:latest .

# Run with persistent data volume
docker run -d \
  --name vulnrag \
  -p 8001:8001 \
  -v vulnrag-data:/app/data \
  -e GITHUB_TOKEN=your_token \
  vulnrag:latest

Docker Compose

# Start
docker-compose up -d

# View logs
docker-compose logs -f

# Stop
docker-compose down

The Docker container runs as a non-root user, includes health checks, and is allocated 2-4 GB of memory for embedding model operations. Data persists across restarts via named volumes.


Examples

Cross-Project Pattern Search

Search for vulnerability patterns from one project that might exist in another:

# Collect from two repositories
vulnrag collect --source github --owner <org-a> --project <repo-a> --with-code
vulnrag collect --source github --owner <org-b> --project <repo-b> --with-code

# Build the index
vulnrag index --rebuild

# Search for cross-project patterns while analyzing a code file
vulnrag search --file http2_handler.c --language c --project <repo-b>

Submitting a New Vulnerability via API

curl -X POST http://localhost:8000/v2/submit \
  -H "Content-Type: application/json" \
  -d '{
    "cve_id": "CVE-2024-XXXXX",
    "project": "my-project",
    "language": "php",
    "cwe_id": "CWE-434",
    "attack_class": "injection",
    "severity": "MEDIUM",
    "cvss_score": 4.3,
    "vulnerable_code": {
      "snippet": "public function upload(Request $request) {\n    $file = $request->file(\"file\");\n    $file->move(public_path(\"uploads\"), $file->getClientOriginalName());\n}",
      "file": "src/Controllers/UploadController.php"
    },
    "references": ["https://nvd.nist.gov/vuln/detail/CVE-2024-XXXXX"],
    "auto_index": true
  }'
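
The same submission can be made from Python using only the standard library. This is a sketch that mirrors the curl call above. The helper names are made up, the payload is shortened, and the response is returned as parsed JSON because its exact shape is not documented here.

```python
import json
import urllib.request


def build_submit_request(base_url, payload):
    """Build the POST request for /v2/submit without sending it."""
    return urllib.request.Request(
        url=f"{base_url}/v2/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(base_url, payload):
    """Send the request and return the decoded JSON response."""
    request = build_submit_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


payload = {
    "cve_id": "CVE-2024-XXXXX",
    "project": "my-project",
    "language": "php",
    "cwe_id": "CWE-434",
    "attack_class": "injection",
    "severity": "MEDIUM",
    "cvss_score": 4.3,
    "references": ["https://nvd.nist.gov/vuln/detail/CVE-2024-XXXXX"],
    "auto_index": True,
}

request = build_submit_request("http://localhost:8000", payload)
# result = submit("http://localhost:8000", payload)  # requires a running server
```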

Finding Chainable Vulnerabilities

curl "http://localhost:8000/api/search/chainable?cve=CVE-2023-XXXXX&limit=10"

Returns CVEs in the same project that could be chained with the given CVE based on complementary attack class relationships (e.g., information disclosure -> auth bypass -> injection).
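
One way to picture "complementary attack class relationships" is a small lookup of which class can feed into which. The map below encodes only the example chain mentioned above (disclosure, then bypass, then injection). Both the map and the `chainable` helper are assumptions for illustration, not the endpoint's actual logic.

```python
# Assumed chain relationships, derived from the single example in the text.
CHAINS_INTO = {
    "disclosure": {"bypass"},
    "bypass": {"injection"},
}


def chainable(target, records):
    """Return CVE IDs in the same project whose class can follow the target's."""
    next_classes = CHAINS_INTO.get(target["attack_class"], set())
    return [
        record["cve_id"]
        for record in records
        if record["project"] == target["project"]
        and record["cve_id"] != target["cve_id"]
        and record["attack_class"] in next_classes
    ]


# Made-up records for illustration.
records = [
    {"cve_id": "CVE-1", "project": "proxy-a", "attack_class": "disclosure"},
    {"cve_id": "CVE-2", "project": "proxy-a", "attack_class": "bypass"},
    {"cve_id": "CVE-3", "project": "proxy-a", "attack_class": "injection"},
    {"cve_id": "CVE-4", "project": "proxy-b", "attack_class": "bypass"},
]

next_steps = chainable(records[0], records)
```

Starting from the disclosure bug `CVE-1`, only `CVE-2` qualifies: `CVE-4` is a bypass in a different project, and `CVE-3` would only become reachable one step later in the chain.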

Generating Detection Rules

# Analyze Go code and generate Semgrep rules from similar CVEs
vulnrag generate \
  --file internal/proxy/handler.go \
  --language go \
  --format semgrep \
  --output ./rules/ \
  --limit 3

Output: Semgrep YAML rules with pattern-matching based on CVE-matched sink functions, confidence scores, and CWE references.

Research Analytics

# Compare vulnerability profiles across languages
curl "http://localhost:8000/api/research/compare?languages=c,go,rust,php"

# Get detailed stats for a specific language
curl "http://localhost:8000/api/research/language/c"

Bulk Import from CVEfixes

# Import all C and C++ vulnerabilities with code
vulnrag collect --source cvefixes \
  --db-path ./CVEfixes.db \
  --language C

vulnrag collect --source cvefixes \
  --db-path ./CVEfixes.db \
  --language "C++"

# Rebuild the index
vulnrag index --rebuild

# Check statistics
vulnrag stats

License

MIT
