feed-comparison

A reproducible command-line tool to compare and benchmark feeds of malicious URLs: download samples from threat-intelligence providers, normalise the URLs they expose with Google Safe Browsing-style canonicalisation, and quantify how the feeds overlap (SuperVenn) and how their discovery times compare (CDF of per-URL deltas).

Citation. If you use this tool in academic work, please cite it via the metadata in CITATION.cff. A Zenodo DOI will be attached to the first tagged release.

Funding. This work was carried out within AI4CTI (a joint research initiative between Politecnico di Torino and Ermes Browser Security) and is funded by the Italian Ministry of Education, Grant FISA-2023-00168.

Background

Threat-intelligence feeds for malicious URLs are usually consumed in isolation, but their coverage, freshness and mutual overlap vary significantly. This makes it hard to answer practical questions like "if I already subscribe to feed A, what extra signal does feed B give me?" or "how many days earlier does feed A surface a phishing campaign compared to feed B?". feed-comparison provides a small, reproducible workflow to answer those questions on the feeds you have access to.

Features in v0.1.0

Five feed integrations out of the box: PhishStats (anonymous), PhishTank (free username), urlscan.io (API token), MISP (self-hosted instance, optional extra) and the Ermes CTI Feed (STIX/TAXII over OAuth 2.0, optional extra).
URL canonicalisation following the Google Safe Browsing approach (host normalisation, IDNA, percent-encoding, query-string sorting...). 92% line coverage in tests against ~60 reference cases.
Overlap analysis with SuperVenn plots over hostname, registered domain or full normalised URL.
Time-delta analysis: CDF of per-URL discovery deltas relative to a chosen benchmark feed.
Discoverable CLI built with Typer: list-feeds, download, compare, plot. Rich-formatted output, JSON mode for scripting, layered configuration via .env/environment.
Reproducible builds: PEP 621 pyproject.toml, uv-managed lockfile, hatchling build backend, GitHub Actions CI on Python 3.10 to 3.13.

See the Limitations section below for what isn't shipped in v0.1.0.

Install

The package is not yet on PyPI; for the v0.1.0 release we recommend installing from source:

git clone https://github.com/AI4CTI/feed-comparison.git
cd feed-comparison
uv tool install .
# or, with pipx:
# pipx install .

If you want to develop on the codebase:

uv sync --extra dev
uv run pre-commit install
uv run pytest

The MISP and Ermes integrations are opt-in extras (they pull in additional dependencies):

# MISP self-hosted (heavy `pymisp` dependency)
uv tool install '.[misp]'

# Ermes CTI Feed (STIX/TAXII + OAuth 2.0 client credentials)
uv tool install '.[ermes]'

# Both at once
uv tool install '.[misp,ermes]'

Quickstart

# 1. List the feeds the tool knows about, and which credentials they need.
feed-comparison list-feeds

# 2. Download a one-day sample from PhishStats (no credentials required).
feed-comparison download phishstats --days 1 --output-dir ./output

# 3. Compare two feeds: SuperVenn + time-delta CDF in ./output.
feed-comparison compare phishstats phishtank --days 1 --benchmark phishstats

# 4. Same comparison but on a 30-day window, dropping the most recent
#    10 days as a "settling" buffer (lets feeds with slower submission
#    pipelines catch up before measuring overlap).
feed-comparison compare phishstats phishtank --days 30 --ignore-last-days 10

# 5. Re-render plots from previously saved CSVs without re-downloading.
feed-comparison plot supervenn ./output/dataframe_*.csv --metric domain

The time-delta CDF measures per-URL deltas between two feeds only on the URLs they observed in common (intersection on the fully-canonicalised URL, not on hostname or domain). Two feeds that publish the same domain under different paths therefore won't intersect — keep this in mind when interpreting an unexpectedly small intersection size in the legend.

Configuration

Per-feed credentials are read from environment variables (and from a .env file in the current working directory if present). Only the variables for the feeds you actually use are required:

Env var	Used by	Notes
`MISP_URL`	MISP	Base URL of your self-hosted MISP instance
`MISP_KEY`	MISP	API key
`PHISHTANK_USERNAME`	PhishTank	Free username for the User-Agent string
`URLSCAN_URL`	urlscan.io	Search API endpoint, e.g. `.../api/v1/search/`
`URLSCAN_TOKEN`	urlscan.io	API token
`ERMES_API_SERVER`	Ermes	Base URL of the Ermes CTI Feed service
`ERMES_CLIENT_ID`	Ermes	OAuth 2.0 Client Credentials — client id
`ERMES_CLIENT_SECRET`	Ermes	OAuth 2.0 Client Credentials — client secret
`FEED_COMPARISON_OUTPUT_DIR`	global	Default output directory

A reference template lives in .env.example.

Available feeds

Name	Provider	Credentials
`phishstats`	https://phishstats.info/	none
`phishtank`	https://phishtank.org/	free username
`urlscan`	https://urlscan.io/	endpoint URL + API token
`misp`	https://www.misp-project.org/	self-hosted instance URL + API key (extra `[misp]`)
`ermes`	https://www.ermes.company/	OAuth 2.0 endpoint + client id + client secret (extra `[ermes]`)

feed-comparison list-feeds --json prints the same catalogue in machine-readable form for scripting.

Limitations

The original internal version supported additional commercial feeds (BitDefender, BrightCloud, zVelo PhishBlockList) and a "compare-protection" mode that queried Ermes' MongoDB. These are not part of the public release. See CHANGELOG.md for the full list of removed components and the rationale.
The original internal version also supported ~90 OSINT block-lists fetched from a private S3 bucket. A public-friendly OSINT downloader is on the roadmap for v0.2.x; in v0.1.0 only the five API-based feeds above are available.
phishstats.info is occasionally rate-limited or unavailable upstream (HTTP 5xx via Cloudflare). The tool reports this with a warning and exits gracefully.

Contributing

Contributions are welcome — see CONTRIBUTING.md for the development setup, coding conventions and how to add new feed integrations.

Security

Please report vulnerabilities privately via the channel documented in SECURITY.md. Do not open public issues for security-sensitive matters.

License

This project is distributed under the GNU Affero General Public License v3.0 or later. See LICENSE for the full text.

The AGPL choice means that running a modified version as a network service requires sharing the modified source code with the users of that service. We picked AGPLv3 to keep the tool, and any derivative offered as a hosted service, fully open as a deliverable of a publicly-funded research project.

Acknowledgements

feed-comparison was originally developed inside Ermes Browser Security and is being released as open source under the AI4CTI joint research initiative with the Politecnico di Torino, funded by the Italian Ministry of Education, Grant FISA-2023-00168.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
.github		.github
src/feed_comparison		src/feed_comparison
tests		tests
.editorconfig		.editorconfig
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

feed-comparison

Background

Features in v0.1.0

Install

Quickstart

Configuration

Available feeds

Limitations

Contributing

Security

License

Acknowledgements

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

feed-comparison

Background

Features in v0.1.0

Install

Quickstart

Configuration

Available feeds

Limitations

Contributing

Security

License

Acknowledgements

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages