REXIS is an experimental framework that enhances static malware analysis with Large Language Models (LLMs) and Retrieval‑Augmented Generation (RAG). It explores how contextual retrieval from external knowledge sources can improve the accuracy, interpretability, and justifiability of LLM‑based malware classification compared to a static heuristic baseline.
Built for cybersecurity research, it focuses on static features (e.g., decompiler output, file structure, API calls) and offers two pipelines: a fast heuristic baseline and an LLM+RAG pipeline with guardrails.
- 📦 Two analysis pipelines: heuristic baseline and LLM+RAG
- 🧩 Ghidra/PyGhidra decompilation and feature extraction
- 🔍 Hybrid retrieval (dense + keyword) with optional re‑ranking
- 🛡️ Guardrails for safe, explainable JSON classifications
- 🧭 Decision fusion with VirusTotal and taxonomy normalization
- 📊 Reproducible runs and batch summaries for evaluation
- 🧮 Built‑in results aggregation CLI for cross‑run metrics (accuracy, latency, interpretability)
- Static analysis and features
  - Ghidra + PyGhidra: decompile and extract features
- RAG and LLM
- Vector store
  - PostgreSQL + pgvector: hybrid retrieval
- CLI and configuration
  - Typer + Rich: ergonomic CLI and output
  - Dynaconf: typed settings and secrets
High‑level layout you’ll interact with most:
- src/rexis/cli: Typer-based CLI entrypoints
  - collect (Malpedia, MalwareBazaar)
  - ingest (pdf, html, text, json, or generic file)
  - analyse (baseline, llmrag)
  - decompile (Ghidra/PyGhidra‑based feature extraction)
  - aggregate (stitch results across runs and compute metrics)
- src/rexis/operations: implementation modules
  - collect/: Malpedia and MalwareBazaar collectors
  - ingest/: content parsers and indexers
  - baseline.py: static heuristic baseline pipeline
  - llmrag.py: RAG + LLM analysis pipeline
  - decompile/: Ghidra integration
- src/rexis/tools/aggregate/: aggregation helpers (parsers, VT helpers, metrics)
- config/: Dynaconf settings and secrets
- data/: sample datasets and collected artifacts (local only)
- .docker/: Docker build context for Postgres + pgvector
Chapter 3 of the accompanying report (see main.pdf, Chapter 3) defines the study design and full results. Below is a summary and how to reproduce with this codebase. Use the PDF as the source of truth for dataset sizes and final numbers.
What is evaluated
- Sample classification quality: malicious/suspicious/benign (three labels)
- Family/category tagging alignment (when ground truth exists)
- Retrieval quality and contribution to decisions
- Reliability: strict‑JSON validity, uncertainty, guardrail triggers
- Efficiency: latency and token/cost footprint per sample
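The strict‑JSON validity criterion above can be illustrated with a small checker. This is a minimal sketch under stated assumptions: the field names (`label`, `confidence`, `evidence`) and the rule set are hypothetical and are not taken from the report format REXIS actually emits.

```python
import json

# Hypothetical schema: these field names are illustrative assumptions,
# not the actual REXIS report format.
ALLOWED_LABELS = {"malicious", "suspicious", "benign"}
REQUIRED_FIELDS = {"label", "confidence", "evidence"}


def is_valid_classification(raw: str) -> bool:
    """Return True if `raw` is one JSON object matching the assumed schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        # Any prose around the JSON (e.g. "Sure! {...}") fails strict parsing.
        return False
    if not isinstance(obj, dict) or not REQUIRED_FIELDS.issubset(obj):
        return False
    conf = obj["confidence"]
    return (
        obj["label"] in ALLOWED_LABELS
        and isinstance(conf, (int, float))
        and not isinstance(conf, bool)
        and 0.0 <= conf <= 1.0
        and isinstance(obj["evidence"], list)
    )


def json_validity_rate(outputs: list[str]) -> float:
    """Fraction of raw model outputs that pass strict validation."""
    if not outputs:
        return 0.0
    return sum(is_valid_classification(o) for o in outputs) / len(outputs)
```

A validity rate computed this way counts an output as invalid whenever the model wraps the JSON in extra text, which is the behaviour a strict check is meant to catch.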
Ground truth and datasets
- Samples curated from VX‑Underground/MalwareBazaar and labeled using VirusTotal metadata; family names normalized with Malpedia‑aware rules. See guides/DataSourcing* and guides/Reconciliation.md.
Metrics
- Accuracy, Precision/Recall/F1 (macro), AUROC (binary collapse), calibration (Brier), Top‑k family accuracy, retrieval MRR@k/NDCG@k, JSON validity rate, average latency and token usage.
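For readers less familiar with the retrieval metrics listed above, here is a minimal sketch of MRR@k and NDCG@k using binary relevance. The function names and the binary‑relevance simplification are assumptions made for illustration; the aggregation code in REXIS may compute these differently.

```python
import math


def mrr_at_k(ranked_ids: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ids, rel in zip(ranked_ids, relevant):
        for rank, doc_id in enumerate(ids[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids) if ranked_ids else 0.0


def ndcg_at_k(ids: list[str], rel: set[str], k: int) -> float:
    """Binary-relevance NDCG@k for a single query."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ids[:k], start=1)
        if doc_id in rel
    )
    # Ideal ranking places every relevant document at the top.
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

With one relevant document at rank 2, `ndcg_at_k` gives `1 / log2(3)`, and a query with no relevant hit in the top k contributes zero to `mrr_at_k`.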
Experiment matrix (illustrative; see PDF for exact)
- Baseline: heuristics only vs heuristics⊕VT fusion
- LLMRAG variants: join mode (RRF vs merge), final_top_k ∈ {4,8,12}, rerank on/off, source filters, model choices
- Ablations: no‑retrieval LLM, retrieval‑only (no LLM), guardrails on/off
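The RRF join mode in the matrix above refers to reciprocal rank fusion, which combines the dense and keyword rankings by summing `1 / (k + rank)` per document. The sketch below is a generic illustration, not the REXIS or Haystack implementation; the function name, the constant `k = 60` (a commonly used default), and the tie‑breaking rule are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 8) -> list[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


# Hypothetical document ids from a dense and a keyword retriever.
dense = ["d1", "d2", "d3"]
keyword = ["d3", "d1", "d4"]
fused = rrf_fuse([dense, keyword], top_n=3)
```

Here `top_n` plays the role of a final cut‑off such as `--final-top-k`. Documents that appear in both lists accumulate score from each, so they tend to outrank documents found by only one retriever.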
Reproducing (outline)
- Prepare the vector store and ingest corpora (see Ingestion below).
- Run Baseline on the evaluation set:
pdm run rexis analyse baseline -i <SAMPLES_DIR> -o ./data/analysis/baseline --parallel 4 --vt
- Run LLM+RAG with desired knobs:
pdm run rexis analyse llmrag -i <SAMPLES_DIR> -o ./data/analysis/llmrag --final-top-k 8
- Aggregate reports and compute metrics with the built‑in CLI (writes CSV + JSON):
pdm run rexis aggregate --baseline-dir analysis/baseline/baseline-analysis-*-run-* --baseline-vt-dir analysis/baseline/baseline-analysis-*-run-vt-* --llmrag-dir analysis/llmrag/llmrag-analysis-*-run-* --out-dir analysis/aggregate
Per‑run *.report.json files capture configuration for reproducibility.
Notes
- Exact splits and metrics live in main.pdf (Chapter 3).
- VT is used for enrichment and sometimes for ground truth; vendor disagreement is reconciled per guides/Reconciliation.md.
- Guardrails down‑weight weak evidence and redact leaked family names when necessary.
REXIS targets Python 3.11–3.13 and is managed with PDM. You’ll also need PostgreSQL with the pgvector extension enabled.
- Python 3.11–3.13
- PDM (pip install pdm)
- PostgreSQL with pgvector extension
- OpenAI and/or DeepSeek API credentials
# Clone the repo
git clone https://github.com/andremmfaria/rexis
cd rexis
# Install dependencies
pdm install
# Create a ./config/.secrets.toml file for your API keys and database config
cp ./config/.secrets_template.toml ./config/.secrets.toml
Secret keys and the database password are read by Dynaconf via config/settings.toml. Populate config/.secrets.toml with the following keys (values are placeholders):
db_password = "super_secret_password"
openai_api_key = "sk-..."
deepseek_api_key = "dseek-..."
malware_bazaar_api_key = "malw-bazaar-..."
virus_total_api_key = "vt-..."
Note: Use the key name malware_bazaar_api_key exactly as shown to match config/settings.toml.
Database connection defaults live in config/settings.toml:
[db]
host = "localhost"
port = 5432
name = "rexis"
user = "postgres"
password = "@get db_password"
REXIS uses Docker for the database. Run the app locally (via PDM) and connect to Postgres running in Docker.
Services:
- db: PostgreSQL with the pgvector extension for vector-based semantic search
- Create your .env file in the root of the project by copying from .env.example:
POSTGRES_USER=postgres
POSTGRES_PASSWORD=super_secret_password
POSTGRES_DB=rexis
- Build and start the database container:
docker compose up --build
- App source code and configuration:
  - Application code lives in ./src/
  - Configuration files (via Dynaconf) are in ./config/
- Stopping the containers:
docker compose down
- Persistent data: PostgreSQL data is stored in a Docker volume named pgdata and persists between restarts.
- Verify DB connection from app (optional):
  - DB connection parameters are read from config/settings.toml ([db] section) and secrets in config/.secrets.toml.
The primary entry point is the rexis command.
Global options:
- -v/-vv increase verbosity
- -V/--version prints the version and exits
Use -h or --help on any command/subcommand for details.
rexis --help
Subcommands:
- collect: gather raw malware intelligence
  - malpedia: retrieve families and actors from Malpedia
  - malwarebazaar: fetch samples from MalwareBazaar
- ingest: normalise and index files into the vector store
  - file, pdf, html, text, json
- analyse: run analysis pipelines over samples
  - baseline, llmrag
- decompile: decompile a binary and extract features via Ghidra
- aggregate: aggregate evaluation metrics across multiple runs
Helpers to gather raw intel before (optionally) ingesting.
rexis collect malpedia \
[--family-id ID] [--actor-id ID] [--search-term TEXT] \
[--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD] \
[--max N] [--run-name NAME] [--output-dir PATH] [--ingest]
Options:
- --family-id, --actor-id, or --search-term to filter
- --start-date, --end-date to time-bound results
- --max limit items after filtering
- --run-name custom run identifier; autogenerated if omitted
- --output-dir where JSON + scraped docs are written
- --ingest immediately indexes discovered documents
Example:
rexis collect malpedia -s CobaltStrike --start-date 2024-01-01 --end-date 2024-12-31 -o data/malpedia --ingest
rexis collect malwarebazaar \
[--tags TAGS] [--fetch-limit N] [--batch N] \
[--hash SHA256 | --hash-file FILE] \
[--run-name NAME] [--output-dir PATH] [--ingest]
Options:
- --tags comma-separated tags (requires --batch for ingestion sizing)
- --fetch-limit per-tag fetch cap
- --hash single SHA256, or --hash-file list of hashes
- --run-name, --output-dir, --ingest as above
Example:
rexis collect malwarebazaar -t ransomware,exe --fetch-limit 50 -o data/malwarebazaar --ingest
Index files into the vector store. Provide exactly one of --dir or --file.
rexis ingest file --type [pdf|html|text|json] (--dir DIR | --file FILE) [--batch N] [-m key=value ...] [--out-dir PATH] [--run-name NAME]
rexis ingest pdf (--dir DIR | --file FILE) [--batch N] [-m key=value ...] [--out-dir PATH] [--run-name NAME]
rexis ingest html (--dir DIR | --file FILE) [--batch N] [-m key=value ...] [--out-dir PATH] [--run-name NAME]
rexis ingest text (--dir DIR | --file FILE) [--batch N] [-m key=value ...] [--out-dir PATH] [--run-name NAME]
rexis ingest json (--dir DIR | --file FILE) [--batch N] [-m key=value ...] [--out-dir PATH] [--run-name NAME]
Notes:
- --batch controls chunking for bulk indexing
- -m/--metadata accepts repeated key=value pairs (stored with documents)
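As an illustration of how repeated key=value pairs can be turned into a metadata dictionary, here is a minimal sketch. The `parse_metadata` helper and its error handling are hypothetical; the actual CLI may parse and validate these pairs differently.

```python
def parse_metadata(pairs: list[str]) -> dict[str, str]:
    """Convert ["source=vxug", "year=2022"] into {"source": "vxug", "year": "2022"}."""
    metadata: dict[str, str] = {}
    for pair in pairs:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        metadata[key] = value.strip()  # a repeated key keeps the last value
    return metadata
```

Values stay as strings in this sketch (for example `"2022"`), which keeps the stored metadata uniform regardless of what the user types.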
Examples:
rexis ingest pdf --dir data/vxunderground/2022 -m source=vxug -m year=2022
rexis ingest json --file data/malwarebazaar/MbExe-20250816T161546Z.json -m source=malwarebazaar
Run analysis over samples with either a static baseline or an LLM+RAG pipeline.
Two subcommands are exposed; depending on your branch/state, they may be WIP:
Common options:
rexis analyse baseline \
--input PATH [--out-dir PATH] [--run-name NAME] [--overwrite] [--format json] \
[--project-dir PATH] [--parallel N] [--rules FILE] [--min-severity info|warn|error] \
[--vt] [--vt-timeout SEC] [--vt-qpm N] [--audit/--no-audit]
rexis analyse llmrag \
--input PATH [--out-dir PATH] [--run-name NAME] [--overwrite] [--format json] \
[--project-dir PATH] [--parallel N] \
[--top-k-dense N] [--top-k-keyword N] [--final-top-k N] [--join rrf|merge] \
[--rerank-top-k N] [--ranker-model NAME] [--source NAME ...] \
[--model NAME] [--temperature F] [--max-tokens N] [--prompt-variant classification|justification|comparison] \
[--audit/--no-audit]
Notes:
- Baseline can optionally enrich with VirusTotal (--vt).
- LLMRAG defaults: RRF fusion; generator model gpt-4o-2024-08-06; ranker model gpt-4o-mini.
Examples:
# Baseline over a single PE sample
pdm run rexis analyse baseline -i ./data/samples/c6e3....exe -o ./data/analysis/baseline
# LLM+RAG over a sample (as used in development)
pdm run rexis analyse llmrag -i ./data/samples/c6e3....exe -o ./data/analysis/llmrag --final-top-k 8
Aggregate results across multiple runs to produce a per‑sample CSV and a JSON summary. Ground truth is taken from Baseline+VT run directory names when available (e.g., baseline-analysis-<category>-run-vt-*), otherwise derived from VirusTotal signals. See guides/ResultsAggregation.md for details and reproduction commands.
Usage (globs or directories supported):
rexis aggregate \
--baseline-dir analysis/baseline/baseline-analysis-*-run-* \
--baseline-vt-dir analysis/baseline/baseline-analysis-*-run-vt-* \
--llmrag-dir analysis/llmrag/llmrag-analysis-*-run-* \
[--out-dir PATH] [--debug] [--alpha F] [--beta F] [--gamma F]
Notes:
- You can pass each of the directory options multiple times.
- Outputs: <out-dir>/aggregation-report.csv and <out-dir>/aggregation-output.json.
- Weights (alpha, beta, gamma) control the composite score for accuracy, efficiency^{-1}, and interpretability; when any are set, they’re normalized.
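One plausible reading of the weight normalization is sketched below: the three weights are rescaled to sum to 1 and applied as a linear combination. The linear form, the assumption that each component is already scaled to [0, 1], and both function names are illustrative guesses; see guides/ResultsAggregation.md for the definition the CLI actually uses.

```python
def normalize_weights(alpha: float, beta: float, gamma: float) -> tuple[float, ...]:
    """Rescale the three weights so that they sum to 1."""
    weights = (alpha, beta, gamma)
    total = sum(weights)
    if any(w < 0 for w in weights) or total <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    return tuple(w / total for w in weights)


def composite_score(
    accuracy: float,
    inverse_efficiency: float,
    interpretability: float,
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 1.0,
) -> float:
    """Weighted sum of the three components (assumed to lie in [0, 1])."""
    a, b, g = normalize_weights(alpha, beta, gamma)
    return a * accuracy + b * inverse_efficiency + g * interpretability
```

Because the weights are normalized, passing `alpha=2, beta=1, gamma=1` is equivalent to `alpha=0.5, beta=0.25, gamma=0.25`: only the ratios between the weights matter.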
Example:
pdm run rexis aggregate \
--baseline-dir "./analysis/baseline/baseline-analysis-*-run-*" \
--baseline-vt-dir "./analysis/baseline/baseline-analysis-*-run-vt-*" \
--llmrag-dir "./analysis/llmrag/llmrag-analysis-*-run-*" \
--out-dir ./analysis
Decompile a binary and extract features using Ghidra.
rexis decompile --file FILE --out-dir DIR [--overwrite] [--project-dir PATH] [--project-name NAME] [--run-name NAME]
Example:
pdm run rexis decompile -f ./data/samples/c6e3....exe -o ./data/decompiled
- collect writes JSON manifests and optional scraped artifacts
- ingest normalizes and indexes content into pgvector
- analyse retrieves context (for LLMRAG) and produces reports
- aggregate joins reports across runs and computes metrics (CSV + JSON)
- src/rexis/cli: Typer-based CLI (collect, ingest, analyse, decompile, aggregate)
- src/rexis/operations/ingest: file-type-specific ingestion
- src/rexis/operations/baseline.py: static baseline pipeline
- src/rexis/operations/llmrag.py: LLMRAG pipeline (Haystack + OpenAI)
- src/rexis/operations/decompile: decompiler integration (Ghidra)
- config/: Dynaconf settings and secrets
- data/: sample datasets and collected artifacts (gitignored in real usage)
- BaselinePipeline.md
- IngestionPipeline.md
- LLMRagPipeline.md
- WritingHeuristicRules.md
- Reconciliation.md
- ResultsAggregation.md
For questions or contributions, open an issue or pull request on GitHub.
The malware samples located in ./samples are provided strictly for research and educational purposes within controlled environments. Any other use, distribution, or misuse is the sole responsibility of the user. The owner of this repository assumes no liability for improper use.
This project is licensed under the MIT License.
Andre Faria
MSc in Applied Cybersecurity
Technological University Dublin — School of Informatics and Cyber Security
Research Project: Enhancing Static Malware Analysis with Large Language Models and Retrieval-Augmented Generation