ghprowl — GitHub Prowler

A multi-target watcher for credentials and internal references that leak into public GitHub. Engineers commit tokens to sample repos, paste connection strings into gists, and push dotfiles that point at internal hosts. ghprowl watches many programs at once, finds those leaks within a polling cycle of their appearance, and pings you, without ever moving the secret off your machine.

It is config-driven: a generic engine in lib/, one small config.env + marker catalog per target. The engine ships in this repo; per-target data never does (see Hygiene).

Built for authorized bug-bounty / security research. Public data only. See Scope & ethics.

The design story: Rare, Not Random — the reasoning behind the method, the engineering tradeoffs, and an A/B against a hand-tuned baseline (more recall, a third of the clone load).


Why it works

Three ideas do most of the lifting.

1. Rare, not random

The naive way to find a program's leaks is to search GitHub for its name. That drowns you in tutorials and forks. ghprowl instead builds a marker catalog — endpoints, internal hostnames, env-var names, custom headers, proto packages, SDK module names — and ranks every marker by its global code-search frequency, then queries the rarest first.

A marker with a global count of 0–10 is a near-perfect signal: almost the only code that contains internal-gateway.corp.example is code that shouldn't be public. A marker with a count of 50k (api_key, AUTH_TOKEN, a vendor's public SDK URL) is noise. Rarity is precision.

This applies to hand-added markers too — "distinctive" is not the same as "rare." A public SDK repository URL or a session-cookie name is distinctive to a target yet appears in every integrator's build config. Rank everything; trust the rare.
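The ranking step can be sketched in a few lines of shell. This is an illustration only: it assumes a hypothetical two-column TSV (marker, global hit count), and the counts and the X-Acme-Trace header are made up for the example. The real markers.tsv layout is not described here.

```shell
# Sketch only: rank markers rarest-first.
# Assumed input format (hypothetical): marker<TAB>global_count

rank_markers() {
  # Numeric sort on column 2 (ascending = rarest first), marker name as tie-break.
  sort -t "$(printf '\t')" -k2,2n -k1,1
}

# Made-up counts for illustration; real counts would come from code search.
ranked="$(printf '%s\t%s\n' \
  'AUTH_TOKEN' 50000 \
  'internal-gateway.corp.example' 3 \
  'api_key' 48000 \
  'X-Acme-Trace' 7 | rank_markers)"

printf '%s\n' "$ranked"
# Order printed: internal-gateway.corp.example, X-Acme-Trace, api_key, AUTH_TOKEN
```

The two low-count markers land at the top, which is where the code-search budget would be spent first.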

2. Two-depth ledger

Watching every candidate by cloning it doesn't scale. ghprowl splits findings into two tiers:

  • deep — high-confidence (a rare-marker hit). Cloned and gitleaks-scanned every cycle.
  • light — speculative (a name-pattern match, a contributor, a fork). Tracked only; promoted to deep automatically when it later trips a marker (sweep).

A third state, confidence none, quarantines squatters and recon dumps so they never waste a clone. The result: a wide net with a small, high-signal clone set. Most light entries are tripwires — empty or unrelated accounts whose value is the day one of them fat-fingers a push.
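As a rough model of the tiering, the sketch below maps a finding's signal to a tier and shows the promotion rule. The signal labels and function names are hypothetical, chosen for the example, and leaving quarantined entries unpromoted is an assumption rather than documented behavior.

```shell
# Sketch only: signal labels and function names are hypothetical, not ghprowl's own.

tier_for() {
  case "$1" in
    marker-hit)                    echo deep  ;;  # rare-marker hit: cloned and scanned every cycle
    name-pattern|contributor|fork) echo light ;;  # speculative: tracked only
    squatter|recon-dump)           echo none  ;;  # quarantined: never cloned
    *)                             echo light ;;  # assumption: unknown signals stay cheap
  esac
}

# Promotion, conceptually what a sweep does: light becomes deep once it trips a marker.
# Assumption: entries already quarantined as "none" are left alone.
promote() {
  tier="$1"; trips_marker="$2"
  if [ "$tier" = light ] && [ "$trips_marker" = yes ]; then
    echo deep
  else
    echo "$tier"
  fi
}

tier_for fork          # prints: light
promote light yes      # prints: deep
```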

3. The escape hatch

Auto-onboarding (setup) derives markers from public sources — program scope, org-repo READMEs, on-disk recon. That reliably reconstructs a target's service inventory. It cannot reconstruct the non-public gold: a client_id captured from traffic, an internal API-gateway id, a proto package name. Those are the highest-signal markers, and they are added by hand as manual markers, which are always queried regardless of the per-run budget. The escape hatch is mandatory, not optional — but its markers get rarity-ranked like everything else.
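One way to picture "always queried regardless of the per-run budget": rank everything by rarity, spend the budget on the rarest derived markers, and let manual markers through unconditionally. The three-column layout (marker, kind, count), the sample values, and the choice that manual markers do not consume budget are all assumptions made for this sketch.

```shell
# Sketch only. Assumed input format (hypothetical): marker<TAB>kind<TAB>global_count

select_markers() {
  budget="$1"
  sort -t "$(printf '\t')" -k3,3n |
    awk -F '\t' -v budget="$budget" '
      $2 == "manual" { print $1; next }     # manual markers bypass the budget
      used < budget  { print $1; used++ }   # otherwise spend the budget on the rarest
    '
}

# Made-up catalog; acme-client-id-9f3 stands in for a hand-added marker.
selected="$(printf '%s\t%s\t%s\n' \
  'acme-client-id-9f3' manual 120 \
  'internal-gateway.corp.example' internal-host 3 \
  'X-Acme-Trace' header 7 \
  'AUTH_TOKEN' env-var 50000 | select_markers 1)"

printf '%s\n' "$selected"
# With a budget of 1 this prints:
#   internal-gateway.corp.example
#   acme-client-id-9f3
```

The manual marker is still sorted by count with the rest, so it keeps its place in the rarity order even though the budget never excludes it.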


Install

Requirements: gh (authenticated), gitleaks, jq, curl. Optional: qrencode (for topics qr), sqlite3 + a HackerOne scope cache (richer setup).

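git clone https://github.com/DoctorGoz/ghprowl ~/.ghprowl
mkdir -p ~/.local/bin                                # make sure the link directory exists
ln -s ~/.ghprowl/bin/ghprowl ~/.local/bin/ghprowl    # put it on PATH
ghprowl list                                         # sanity check: lists configured targets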

The clone is the install: ~/.ghprowl is the working tree, and your per-target data lives inside it under targets/<handle>/ (gitignored).


Quickstart

ghprowl setup acme-corp        # onboard: derive scope -> markers -> FIT check -> DRAFT config (stops for review)
#                              # REVIEW targets/acme-corp/{config.env,markers.tsv}; add manual gold markers
ghprowl discover acme-corp     # widen the net: rare-marker code-search + pivots, tiered into the ledger
ghprowl watch acme-corp        # detect: clone+gitleaks the deep set, alert on a LIVE token
ghprowl status                 # dashboard across all targets

setup deliberately stops for review — it produces a draft, never a live config, and reports a FIT verdict (if a target has no meaningful public GitHub footprint, it says so rather than running a dead watcher). That review step — prune noisy markers, paste in your non-public gold — is where the precision comes from.


Commands

| Command | What it does |
| --- | --- |
| `setup <handle>` | Onboard a target: derive scope → markers → FIT check → draft config.env + catalog. Stops for review. |
| `rank <target>` | (Re)rank the marker catalog by global rarity, rarest first. |
| `discover <target>` | Widen the net: rare-marker code-search + contributor/fork pivots, tiered into the ledger. |
| `sweep <target> [N]` | Promotion pass: light → deep when a tracked entity now trips a marker. |
| `watch <target>` | Detection: clone/refresh the deep set, gitleaks-scan changed repos, alert on a LIVE hit. |
| `status [target]` | No-network dashboard: per-target tiers, markers, alerts, last run. |
| `topics [target]` | List each target's ntfy topic (the subscribe key) + URL. |
| `topics qr <target>` | Render the subscribe URL as a scannable terminal QR (no phone typing). |
| `topics test <target>` | Send a harmless confirmation ping to verify a subscription end-to-end. |
| `list` | List configured targets. |

watch, discover, and sweep accept --all instead of a target to run every configured target — the entry point for cron.


Alerting

A hit is announced three ways, and the secret never leaves the host:

  • ntfy push — a heads-up only ("Actionable hit in <repo>. Check the local file."). The token is never in the message. Each target gets a unique auto-minted topic; subscribe once (ghprowl topics qr <target> → scan).
  • URGENT-LIVE-TOKEN.txt — local, LIVE hits only, with the repo, commit URL, and the exact git show command to view the secret offline.
  • alerts/ALERTS.md — local append-only log of every hit.

A baseline (first) scan of a repo suppresses historical findings and alerts only on live tokens, so onboarding a target doesn't bury you. watch exits non-zero on a hit for cron-side handling.
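The "heads-up only" push can be sketched as below. The message builder takes the repo name and nothing else, so the secret cannot end up in the notification by construction. The use of the public ntfy.sh server, the Title header value, and the function names are assumptions for illustration; ghprowl's actual sender may differ.

```shell
# Sketch only: function names, server URL, and topic are placeholders.

# Build the push text from the repo name alone; the secret is deliberately not a parameter.
alert_message() {
  printf 'Actionable hit in %s. Check the local file.' "$1"
}

# Publish via ntfy's plain HTTP POST interface (message in the request body).
send_alert() {
  topic="$1"; repo="$2"
  curl -fsS -H 'Title: ghprowl' \
    -d "$(alert_message "$repo")" \
    "https://ntfy.sh/$topic" >/dev/null
}

# Usage (not run here, needs network):
#   send_alert "example-topic" "someuser/sample-repo"
```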


Per-target config (targets/<handle>/config.env)

| Key | Purpose |
| --- | --- |
| `GP_GH_ORGS` | Org(s) that root discovery and are watched directly. |
| `GP_DOMAINS` | In-scope domains (host markers + issuer derivation). |
| `GP_ISSUER_SUBSTR` | JWT iss substrings that mark a token as this target's (the live check). |
| `GP_SOFT_KEYWORDS` | Broad relevance keywords (issuer/code detection). |
| `GP_SWEEP_KEYWORDS` | Curated brand-coined subset for the name-sweep + interest test. Prune generic words here. |
| `GP_NAME_PATTERNS` | Work-account login patterns to sweep (e.g. `*-acme` / `acme-*`). |
| `GP_FORK_PIVOT` | 0 by default; 1 enables bounded fork enumeration, which is only sane on small targets. |
| `GP_NTFY_TOPIC` | Alert topic (auto-minted; the token itself is never transmitted). |
| `GP_MARKER_BUDGET` | How many of the rarest markers to spend the code-search budget on per discover. |
| `GP_MAX_REPO_MB` | watch skips repos larger than this (default 500) so one monorepo can't wedge a cycle. |

The marker catalog (markers.tsv → ranked markers-ranked.tsv) carries kinds like endpoint, internal-host, env-var, header, proto-pkg, and manual (your hand-added gold).
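For orientation, a filled-in config might look roughly like this. Every value is invented, and the space-separated list syntax is an assumption; check targets/_template/ for the real format before copying anything.

```shell
# targets/acme-corp/config.env (illustrative values only; list separators are assumed)
GP_GH_ORGS="acme-corp"
GP_DOMAINS="acme.example corp.example"
GP_ISSUER_SUBSTR="auth.acme.example"
GP_SOFT_KEYWORDS="acme acmecloud"
GP_SWEEP_KEYWORDS="acmecloud"          # brand-coined only; generic words pruned
GP_NAME_PATTERNS="*-acme acme-*"
GP_FORK_PIVOT=0                        # leave off unless the target is small
GP_NTFY_TOPIC="example-topic-change-me"
GP_MARKER_BUDGET=25                    # hypothetical budget, not a documented default
GP_MAX_REPO_MB=500                     # the documented default
```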


How a watch cycle works

For each repo in the deep set (org repos + marker-hit leaks):

  1. Big-repo guard — check GitHub's reported size first; skip + warn if over GP_MAX_REPO_MB (override a specific repo via force-big.txt). Unattended cron can't prompt, so the design is skip-and-warn, not ask.
  2. Clone or fetch; skip unchanged repos via a per-repo commit-set checkpoint.
  3. gitleaks full-history scan → an issuer-aware post-filter keeps only target-relevant secrets.
  4. Alert on a live hit (and only live hits on a baseline scan).

A concurrency flock per operation means a long cycle and the next cron tick never collide.
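Two of the mechanics above are easy to sketch: the size guard and the non-blocking lock. GitHub's REST API reports repository size in kilobytes, so the guard converts the megabyte limit before comparing. The function names and the lock path are hypothetical, and the lock helper assumes the util-linux flock command is installed.

```shell
# Sketch only: function names and lock path are hypothetical.

# Succeeds (exit 0) when the repo exceeds the limit and should be skipped.
# size_kb: the "size" field from the GitHub API (kilobytes); max_mb defaults to 500.
too_big() {
  size_kb="$1"; max_mb="${2:-500}"
  [ "$size_kb" -gt $((max_mb * 1024)) ]
}

# Run a command under a non-blocking lock: if another run holds the lock,
# flock -n exits non-zero immediately instead of queueing behind it.
with_lock() {
  lockfile="$1"; shift
  flock -n "$lockfile" "$@"
}

# Usage (not run here; needs an authenticated gh and util-linux flock):
#   size_kb="$(gh api repos/OWNER/REPO --jq .size)"
#   if too_big "$size_kb" "$GP_MAX_REPO_MB"; then
#     echo "skip: OWNER/REPO exceeds ${GP_MAX_REPO_MB} MB" >&2
#   fi
#   with_lock /tmp/ghprowl-watch.lock ghprowl watch acme-corp
```

Skipping rather than waiting is what keeps an unattended cron tick from stacking up behind a slow cycle.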


Cron

*/30 * * * *  ~/.local/bin/ghprowl watch --all     >/dev/null 2>&1   # detect
17 5   * * *  ~/.local/bin/ghprowl discover --all   >/dev/null 2>&1   # widen
37 */6 * * *  ~/.local/bin/ghprowl sweep --all      >/dev/null 2>&1   # promote

Scope & ethics

ghprowl is a defensive / authorized-research tool, built to operate within those bounds:

  • Public data only. It reads public repositories and public code search. Nothing else.
  • Offline verification. Detection runs locally; a found secret is reported, never used.
  • The token never moves. Push notifications say "check the host"; the secret stays on disk.
  • Logins are pivots, not people. Account handles are search leads, not targets for profiling.

Use it on programs you are authorized to test, and follow each program's disclosure rules.


Repo layout & hygiene

bin/ghprowl            # thin dispatcher
lib/                   # generic engine (target-agnostic)
targets/_template/     # the only target dir that ships
targets/<handle>/      # YOUR per-target data — gitignored, never committed

Everything operator-specific (scope, markers, ntfy topics, ledgers, clones, alerts) lives under targets/<handle>/ and is gitignored. Before committing, confirm nothing slipped through:

git diff --cached --name-only | grep -v _template   # should list only engine/docs files
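
A stricter variant of the same check prints only the offending paths, which makes it usable as a gate. This is a sketch: the function name is hypothetical, and wiring it into a pre-commit hook is a suggestion, not something the repo is known to ship.

```shell
# Sketch only: reads staged paths on stdin and prints any that sit under
# targets/ but outside targets/_template/. Prints nothing when the index is clean.
leaked_target_paths() {
  grep '^targets/' | grep -v '^targets/_template/'
}

# Usage inside the repo (hook wiring is an assumption):
#   if git diff --cached --name-only | leaked_target_paths; then
#     echo "refusing to commit per-target data" >&2
#     exit 1
#   fi
```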

License

MIT — see LICENSE.
