27 commits
5ecb006
Add sites_extra.json, example_site checker, unit tests, and README_EXTRA
dmoney96 Oct 5, 2025
d900ddc
chore: make maigret/ and maigret/sites/ python packages (add __init__…
dmoney96 Oct 5, 2025
78be6b5
Move example_site to separate package (avoid colliding with maigret.s…
dmoney96 Oct 5, 2025
a198c1d
tests: add test_mastodon_resolver.py (mocked)
dmoney96 Oct 5, 2025
e497bef
feat: add example checkers (mastodon, tiktok) and load_extras helper …
dmoney96 Oct 5, 2025
e17dee4
test: add local stub for socid_extractor to satisfy imports during dev
dmoney96 Oct 5, 2025
083c2fe
chore(test): add top-level socid_extractor shim pointing to local stub
dmoney96 Oct 5, 2025
798ff31
chore(test): add parse fallback to socid_extractor shim
dmoney96 Oct 5, 2025
67a8ad0
test: expand socid_extractor shim to include __version__, parse, muta…
dmoney96 Oct 6, 2025
e301b93
test: expand socid_extractor shim to return mapping and accept cookie…
dmoney96 Oct 6, 2025
a7368fa
feat: add NewSite extra checker + tests, register in sites_extra.json
dmoney96 Oct 6, 2025
78f135b
chore: add sites_extra.json with NewSite opt-in entry
dmoney96 Oct 6, 2025
e9c9d76
feat: add example NewSite checker in maigret_sites_example
dmoney96 Oct 6, 2025
a43db2e
test: add tests for NewSite checker (mocked requests)
dmoney96 Oct 6, 2025
25222a4
fix: read SHODAN_API_KEY at runtime so tests can monkeypatch env; ens…
dmoney96 Oct 6, 2025
1f7c81c
docs: add README_EXTRA.md and register shodan in sites_extra.json
dmoney96 Oct 6, 2025
7a9257a
chore: remove local socid_extractor dev shim before upstream
dmoney96 Oct 6, 2025
205a54b
ci: add manual Shodan integration workflow
dmoney96 Oct 6, 2025
8f1a9a9
chore: mark socid_extractor as DEV SHIM and document it
dmoney96 Oct 6, 2025
71c4e57
ci: add manual Shodan integration workflow and integration test
dmoney96 Oct 6, 2025
b645ce7
feat(example): add mastodon API-style checker skeleton + unit tests
dmoney96 Oct 6, 2025
78c1fb3
chore: remove local DEV shim before upstream PR
dmoney96 Oct 6, 2025
1191276
fix(mastodon): enable parsing_enabled when resolver finds account
dmoney96 Oct 6, 2025
7575674
feat: add Mastodon API resolver + checker + unit tests
dmoney96 Oct 8, 2025
964715e
chore: save WIP
dmoney96 Oct 8, 2025
3110a19
chore: final tweaks for mastodon checker
dmoney96 Oct 8, 2025
69dd651
chore: add dev deps and docs for opt-in extras
dmoney96 Oct 8, 2025
24 changes: 24 additions & 0 deletions .github/workflows/integration-shodan.yml
@@ -0,0 +1,24 @@
name: Integration - Shodan (manual)

on:
  workflow_dispatch: {}

jobs:
  shodan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.12'
      - name: Install dev deps
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements-dev.txt
      - name: Run Shodan integration test
        env:
          SHODAN_API_KEY: ${{ secrets.SHODAN_API_KEY }}
          MAIGRET_EXTRA_SITES: ${{ github.workspace }}/sites_extra.json
        run: |
          PYTHONPATH="${{ github.workspace }}" pytest -q tests/test_shodan_integration.py -q
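The workflow passes `SHODAN_API_KEY` to the test run through the environment, and commit 25222a4 says the key is read at runtime so tests can monkeypatch it. A minimal sketch of that pattern follows. The helper name `get_shodan_key` is an assumption for illustration, not the PR's actual function.

```python
import os
from typing import Optional


def get_shodan_key() -> Optional[str]:
    """Read the key on every call (hypothetical helper, not the PR's code)."""
    # Looking the variable up at call time, rather than caching it in a
    # module-level constant at import time, lets tests change the environment.
    key = os.environ.get("SHODAN_API_KEY", "").strip()
    return key or None


# Simulate what the workflow's `env:` block (or a test's monkeypatch) does.
os.environ["SHODAN_API_KEY"] = "dummy-key-for-illustration"
key_when_set = get_shodan_key()

# Simulate a run where the secret is not configured.
os.environ.pop("SHODAN_API_KEY", None)
key_when_unset = get_shodan_key()
```

With this shape, an integration test can skip itself when the helper returns `None` instead of failing on a missing secret.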
17 changes: 17 additions & 0 deletions README_EXTRA.md
@@ -0,0 +1,17 @@
# Opt-in extra checkers (maigretexpanded)

This fork supports *optional* additional site checkers (API-backed or extra scraping) kept separate from upstream Maigret.

## Enabling extras locally

1. Put extra-site definitions in `sites_extra.json` in the repository root. Each key is an extra site id. The loader reads this file when `MAIGRET_EXTRA_SITES` points to it.

2. Example: enable extras at runtime:
```bash
export MAIGRET_EXTRA_SITES="$(pwd)/sites_extra.json"
```

## Dev shims
This branch included small local shims (e.g. `maigret_sites_example/socid_extractor.py`) to let the
test suite run without pulling every heavy dependency. These are marked `DEV SHIM` and should be replaced
by the real dependency or removed before merging upstream if the maintainers prefer that.
2 changes: 1 addition & 1 deletion maigret/resources/data.json
@@ -17537,7 +17537,7 @@
"method": "vimeo"
},
"headers": {
-"Authorization": "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE3MzQxMTc1NDAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbCwianRpIjoiNDc4Y2ZhZGUtZjI0Yy00MDVkLTliYWItN2RlNGEzNGM4MzI5In0.guN7Fg8dqq7EYdckrJ-6Rdkj_5MOl6FaC4YUSOceDpU"
+"Authorization": "jwt eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE3NTk3MTkwMDAsInVzZXJfaWQiOm51bGwsImFwcF9pZCI6NTg0NzksInNjb3BlcyI6InB1YmxpYyIsInRlYW1fdXNlcl9pZCI6bnVsbCwianRpIjoiYmEzYTE0MDEtMTdkZS00ZGIxLTkzNjQtZGY1MDVkMzJkOWU1In0.GaK5Zn059lxEYy04lOq0eh9RCQWm4-a5uyNxfZKf6pg"
},
"urlProbe": "https://api.vimeo.com/users/{username}?fields=name%2Cgender%2Cbio%2Curi%2Clink%2Cbackground_video%2Clocation_details%2Cpictures%2Cverified%2Cmetadata.public_videos.total%2Cavailable_for_hire%2Ccan_work_remotely%2Cmetadata.connections.videos.total%2Cmetadata.connections.albums.total%2Cmetadata.connections.followers.total%2Cmetadata.connections.following.total%2Cmetadata.public_videos.total%2Cmetadata.connections.vimeo_experts.is_enrolled%2Ctotal_collection_count%2Ccreated_time%2Cprofile_preferences%2Cmembership%2Cclients%2Cskills%2Cproject_types%2Crates%2Ccategories%2Cis_expert%2Cprofile_discovery%2Cwebsites%2Ccontact_emails&fetch_user_profile=1",
"checkType": "status_code",
Empty file.
58 changes: 58 additions & 0 deletions maigret_sites_example/example_site.py
@@ -0,0 +1,58 @@
"""
Lightweight example checker module.

This is intentionally defensive:
- If Maigret's BaseChecker is importable, we subclass it.
- Otherwise the module still exposes a callable `check(nickname, user_agent=None)` function
  so you can wire it into Maigret's registry manually if needed.

Adapt registration to Maigret internals (register this checker in their site registry).
"""
import requests

try:
    from maigret.checker import BaseChecker  # best-effort import; adapt if path differs
    _HAS_BASE = True
except Exception:
    BaseChecker = object
    _HAS_BASE = False


class ExampleSiteChecker(BaseChecker):
    site_name = "example_site"

    def __init__(self, user_agent=None):
        self.user_agent = user_agent or "maigret/extended (+https://github.com/dmoney96/maigretexpanded)"

    def check(self, nickname):
        url = f"https://www.example.com/{nickname}"
        headers = {"User-Agent": self.user_agent}
        try:
            r = requests.get(url, headers=headers, timeout=10)
        except Exception as e:
            return {"status": "error", "error": str(e)}

        if r.status_code == 404:
            return {"status": "not_found"}

        # JSON endpoint example
        if "application/json" in r.headers.get("Content-Type", ""):
            try:
                data = r.json()
                if data.get("profile") or data.get("exists"):
                    return {"status": "found", "url": url}
                return {"status": "not_found"}
            except Exception:
                # malformed JSON: fall through to the HTML heuristics
                pass

        # HTML heuristics (refine for real sites)
        text = r.text.lower()
        # An explicit "not found" marker wins. The previous condition treated the
        # mere absence of that marker as a hit, so almost any page counted as found.
        if "profile not found" in text:
            return {"status": "not_found"}
        if 'class="profile-header"' in r.text or "data-user-id" in r.text:
            return {"status": "found", "url": url}

        return {"status": "unknown"}


# convenience function for non-class consumers
def check(nickname, user_agent=None):
    c = ExampleSiteChecker(user_agent=user_agent)
    return c.check(nickname)
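One way to keep heuristics like these testable without network access is to separate classification from fetching. The sketch below assumes a negative marker should take precedence over positive ones. `classify_response` is a hypothetical helper that is not part of this PR, and its markers are the same placeholders the example checker uses.

```python
from typing import Optional


def classify_response(status_code: int, content_type: str, body: str,
                      payload: Optional[dict] = None) -> str:
    """Map an HTTP response to 'found', 'not_found' or 'unknown'.

    Hypothetical helper: the markers are placeholders and would need
    tuning for a real site.
    """
    if status_code == 404:
        return "not_found"
    # JSON endpoints: trust an explicit flag in the decoded payload.
    if "application/json" in content_type and payload is not None:
        hit = payload.get("profile") or payload.get("exists")
        return "found" if hit else "not_found"
    # HTML pages: a negative marker wins over positive ones.
    if "profile not found" in body.lower():
        return "not_found"
    if 'class="profile-header"' in body or "data-user-id" in body:
        return "found"
    return "unknown"


html_hit = classify_response(200, "text/html", '<div class="profile-header">alice</div>')
html_miss = classify_response(200, "text/html", "<h1>Profile not found</h1>")
json_hit = classify_response(200, "application/json", "", {"exists": True})
no_signal = classify_response(200, "text/html", "<p>Welcome</p>")
```

Because the function takes plain values, each branch can be covered by a unit test without mocking `requests`.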
50 changes: 50 additions & 0 deletions maigret_sites_example/load_extras.py
@@ -0,0 +1,50 @@
# maigret_sites_example/load_extras.py
"""Load and merge an extras JSON file into a default sites dict.

Usage patterns:
- default_sites = load_extra_sites(default_sites)  # runtime merge
- merged = merge_sites(default_sites, extra_path)  # pure function

This file intentionally lives outside `maigret/` to avoid colliding with upstream modules.
"""
from pathlib import Path
import json
import os
from typing import Any, Dict, Optional


def read_json_path(path: str) -> Dict[str, Any]:
    p = Path(path)
    with p.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def merge_sites(default_sites: Dict[str, Any], extra_path: Optional[str]) -> Dict[str, Any]:
    """
    Return a new dict with keys from extra_path merged in only when they do not exist.
    Non-destructive: does not overwrite existing keys.
    If extra_path is unset, missing, or unreadable, default_sites itself is returned unchanged.
    """
    if not extra_path:
        return default_sites
    p = Path(extra_path)
    if not p.exists():
        return default_sites
    try:
        extra = read_json_path(extra_path)
    except Exception:
        # fail safe: return defaults if JSON invalid
        return default_sites
    if not isinstance(extra, dict):
        # fail safe: the extras file must be a JSON object keyed by site id
        return default_sites

    merged = dict(default_sites)  # shallow copy
    for k, v in extra.items():
        if k not in merged:
            merged[k] = v
    return merged


def load_extra_sites(default_sites: Dict[str, Any],
                     env_var: str = "MAIGRET_EXTRA_SITES",
                     cli_path: Optional[str] = None) -> Dict[str, Any]:
    """
    Merge extras into default_sites. CLI path (explicit) takes precedence over environment var.
    """
    extra_path = cli_path or os.getenv(env_var)
    return merge_sites(default_sites, extra_path)
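The non-destructive merge rule can be seen in a small standalone run. This sketch re-creates the rule inline rather than importing the module, and the entry fields (`url`) are assumptions, since the schema of `sites_extra.json` is not shown in this diff.

```python
import json
import os
import tempfile

# Built-in definitions; "SiteA" also appears in the extras file below.
default_sites = {"SiteA": {"url": "https://a.example/{username}"}}

# Hypothetical extras content; the real schema of sites_extra.json may differ.
extras = {
    "SiteA": {"url": "https://override.example/{username}"},
    "NewSite": {"url": "https://newsite.com/{username}"},
}

with tempfile.TemporaryDirectory() as tmp:
    extra_path = os.path.join(tmp, "sites_extra.json")
    with open(extra_path, "w", encoding="utf-8") as fh:
        json.dump(extras, fh)
    with open(extra_path, "r", encoding="utf-8") as fh:
        loaded = json.load(fh)

# Same rule as merge_sites: copy the defaults, then add only missing keys.
merged = dict(default_sites)
for site_id, definition in loaded.items():
    merged.setdefault(site_id, definition)
```

After the merge, `SiteA` keeps its built-in definition, `NewSite` is added, and `default_sites` itself is left untouched.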

75 changes: 75 additions & 0 deletions maigret_sites_example/mastodon_api_checker.py
@@ -0,0 +1,75 @@
"""
Mastodon API-style checker (example).

Uses the helper resolve_mastodon_api() (in mastodon_api_resolver.py)
which probes instances using the Mastodon accounts lookup endpoint.
This checker reports a hit when the resolver returns {"status": "found"}.
"""
from typing import Dict, Any, Optional
import os

# import the resolver module (not the function) so tests that patch the
# function on the module will affect calls performed here.
from . import mastodon_api_resolver as resolver

DEFAULT_RANK = 120


def check(username: str, settings: Optional[object] = None, logger: Optional[object] = None, timeout: int = 6) -> Dict[str, Any]:
    """
    Maigret-style checker for Mastodon-like handles.

    Args:
        username: input username (may be '@name', 'name@instance' or 'name')
        settings, logger: optional compatibility parameters (not used here)
        timeout: passed to resolver

    Returns:
        dict with keys at least: http_status, ids_usernames, parsing_enabled, rank, url, raw
    """
    queried = username or ""
    queried_stripped = queried.lstrip("@")

    # allow overriding the instance to probe via env var
    instance_hint = os.getenv("MAIGRET_MASTODON_INSTANCE")

    try:
        # call the resolver through the module so test patching works:
        resolved = resolver.resolve_mastodon_api(queried, instance_hint=instance_hint, timeout=timeout)
    except Exception as exc:
        # Do not raise during checks; treat as not found and log if a logger is present
        if logger:
            try:
                logger.debug("mastodon resolver exception: %s", exc)
            except Exception:
                pass
        resolved = {"status": "not_found"}

    # Default not-found result
    result: Dict[str, Any] = {
        "http_status": None,
        "ids_usernames": {},
        "is_similar": False,
        "parsing_enabled": False,
        "rank": DEFAULT_RANK,
        "url": None,
        "raw": resolved,
    }

    if resolved.get("status") == "found":
        # Extract canonical username (drop leading '@' and any instance part)
        canon = queried_stripped.split("@", 1)[0]
        result.update(
            {
                "http_status": 200,
                "ids_usernames": {canon: "username"},
                "is_similar": False,
                # this checker provides a found profile URL / data so parsing_enabled = True
                "parsing_enabled": True,
                "rank": DEFAULT_RANK,
                "url": resolved.get("url"),
                "raw": resolved,
            }
        )

    return result
52 changes: 52 additions & 0 deletions maigret_sites_example/mastodon_api_resolver.py
@@ -0,0 +1,52 @@
"""Resolve a Mastodon account using instance REST API (acct lookup)."""
from typing import Optional, Dict
import requests

COMMON_INSTANCES = [
    "mastodon.social",
    "fosstodon.org",
    "mstdn.social",
    "chaos.social",
    "mastodon.cloud",
]

DEFAULT_TIMEOUT = 6
USER_AGENT = "maigret-extended/1.0 (+https://github.com/dmoney96/maigretexpanded)"


def lookup_on_instance(nickname: str, instance: str, timeout: int = DEFAULT_TIMEOUT) -> Dict:
    acct = nickname.lstrip("@").split("@")[0]
    url = f"https://{instance}/api/v1/accounts/lookup"
    params = {"acct": acct}
    headers = {"User-Agent": USER_AGENT}
    try:
        r = requests.get(url, params=params, timeout=timeout, headers=headers)
        if r.status_code == 200:
            try:
                j = r.json()
            except Exception:
                return {"status": "not_found"}
            if not isinstance(j, dict):
                # unexpected payload shape: treat as no match
                return {"status": "not_found"}
            profile_url = j.get("url") or f"https://{instance}/@{acct}"
            return {"status": "found", "url": profile_url, "data": j}
        elif r.status_code in (404, 410):
            return {"status": "not_found"}
        else:
            # other statuses are reported as not found, with the code kept for debugging
            return {"status": "not_found", "raw_status": r.status_code}
    except requests.RequestException:
        return {"status": "not_found"}


def resolve_mastodon_api(nickname: str, instance_hint: Optional[str] = None, timeout: int = DEFAULT_TIMEOUT) -> Dict:
    handle = nickname.lstrip("@")
    candidates = []
    if handle.count("@") == 1 and handle.split("@", 1)[1]:
        # explicit name@instance: probe only that instance
        user, inst = handle.split("@", 1)
        candidates.append((user, inst))
    elif instance_hint:
        candidates.append((handle, instance_hint))
    else:
        for inst in COMMON_INSTANCES:
            candidates.append((handle, inst))

    for user, inst in candidates:
        resp = lookup_on_instance(user, inst, timeout=timeout)
        if resp.get("status") == "found":
            return resp
    return {"status": "not_found"}
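The candidate selection in `resolve_mastodon_api` depends on how a handle splits into user and instance. A minimal sketch of that parsing as a pure function is shown here. `split_handle` and `build_candidates` are hypothetical names, and the fallback tuple is a shortened stand-in for `COMMON_INSTANCES`.

```python
from typing import List, Optional, Tuple


def split_handle(handle: str) -> Tuple[str, Optional[str]]:
    """Split '@name', 'name' or 'name@instance' into (user, instance)."""
    stripped = handle.strip().lstrip("@")
    user, _, instance = stripped.partition("@")
    # An empty instance, or one that still contains '@', is not usable.
    if not instance or "@" in instance:
        return user, None
    return user, instance


def build_candidates(handle: str,
                     instance_hint: Optional[str] = None,
                     fallbacks: Tuple[str, ...] = ("mastodon.social", "fosstodon.org"),
                     ) -> List[Tuple[str, str]]:
    """Order of precedence: explicit instance, then hint, then fallbacks."""
    user, instance = split_handle(handle)
    if instance:
        return [(user, instance)]
    if instance_hint:
        return [(user, instance_hint)]
    return [(user, inst) for inst in fallbacks]


leading_at = split_handle("@alice")
full_handle = split_handle("@alice@chaos.social")
trailing_at = split_handle("alice@")
hinted = build_candidates("alice", instance_hint="mstdn.social")
defaulted = build_candidates("alice")
```

Keeping the parsing separate from the HTTP loop makes edge cases such as a lone leading `@` or a trailing `@` easy to test.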
29 changes: 29 additions & 0 deletions maigret_sites_example/mastodon_resolver.py
@@ -0,0 +1,29 @@
# maigret_sites_example/mastodon_resolver.py
import requests

COMMON_INSTANCES = ["mastodon.social", "fosstodon.org", "mstdn.social", "chaos.social"]


def resolve_mastodon(nickname, instance_hint=None, timeout=6):
    # Strip the leading "@" first: checking the raw nickname made "@name"
    # take the name@instance branch and fail to unpack.
    handle = nickname.lstrip("@")
    candidates = []
    if "@" in handle:
        # accept nickname@instance
        user, inst = handle.split("@", 1)
        candidates.append((user, inst))
    elif instance_hint:
        candidates.append((handle, instance_hint))
    else:
        for inst in COMMON_INSTANCES:
            candidates.append((handle, inst))

    for user, inst in candidates:
        url = f"https://{inst}/@{user}"
        try:
            r = requests.get(url, timeout=timeout, headers={"User-Agent": "maigret/extended"})
            if r.status_code == 200:
                return {"status": "found", "url": url}
            if r.status_code == 404:
                continue
        except Exception:
            continue
    return {"status": "not_found"}

26 changes: 26 additions & 0 deletions maigret_sites_example/newsite.py
@@ -0,0 +1,26 @@
"""
Simple example checker for NewSite.
Implements check(username, settings=None, logger=None, timeout=6) -> dict
Keep network calls mocked in tests; this file uses requests normally.
"""
import requests


def check(username, settings=None, logger=None, timeout=6):
    url = f"https://newsite.com/{username}"
    try:
        r = requests.get(url, timeout=timeout, headers={"User-Agent": "maigretexpanded/0.1"})
    except Exception as e:
        if logger:
            logger.debug("newsite network error: %s", e)
        return {"http_status": None, "ids_usernames": {}, "is_similar": False, "parsing_enabled": True, "rank": 999, "url": url}

    found = (r.status_code == 200)
    ids = {username: "username"} if found else {}
    return {
        "http_status": r.status_code,
        "ids_usernames": ids,
        "is_similar": False,
        "parsing_enabled": True,
        "rank": 100,
        "url": url,
    }
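The docstring asks for network calls to be mocked in tests. One alternative to patching `requests` is to inject the HTTP call, sketched below. `check_with` and its signature are assumptions for illustration and differ from the PR's `check(username, settings, logger, timeout)`. The result dict is trimmed to three keys to keep the example short.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict


def check_with(http_get: Callable[..., Any], username: str, timeout: int = 6) -> Dict[str, Any]:
    """Variant of the checker with the HTTP call injected (hypothetical)."""
    url = f"https://newsite.com/{username}"
    try:
        response = http_get(url, timeout=timeout)
    except Exception:
        # Network failure: no status code and no identifiers.
        return {"http_status": None, "ids_usernames": {}, "url": url}
    found = response.status_code == 200
    return {
        "http_status": response.status_code,
        "ids_usernames": {username: "username"} if found else {},
        "url": url,
    }


# Fakes standing in for requests.get; only status_code is read.
def fake_ok(url, timeout):
    return SimpleNamespace(status_code=200)


def fake_missing(url, timeout):
    return SimpleNamespace(status_code=404)


def fake_down(url, timeout):
    raise ConnectionError("simulated outage")


hit = check_with(fake_ok, "alice")
miss = check_with(fake_missing, "alice")
down = check_with(fake_down, "alice")
```

In production the caller would pass `requests.get` as `http_get`. The tests pass a fake, so no patching is needed.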