Crawls websites and generates standard-compliant sitemap.xml files. Uses Playwright for JavaScript rendering or httpx for fast HTTP crawling.
Linux / macOS:

```bash
curl -fsSL https://raw.githubusercontent.com/michaelblaess/sitemap-tracker/main/install.sh | bash
```

Windows (PowerShell):

```powershell
irm https://raw.githubusercontent.com/michaelblaess/sitemap-tracker/main/install.ps1 | iex
```

```bash
# Simple crawl (httpx mode, fast)
sitemap-tracker https://example.com

# With JavaScript rendering (Playwright)
sitemap-tracker https://example.com --render

# Save sitemap directly
sitemap-tracker https://example.com --output sitemap.xml

# Limit crawl depth
sitemap-tracker https://example.com --max-depth 5

# More concurrency
sitemap-tracker https://example.com --concurrency 16

# Ignore robots.txt
sitemap-tracker https://example.com --ignore-robots

# With cookies (e.g. for login)
sitemap-tracker https://example.com --cookie session=abc123
```

| Parameter | Description | Default |
|---|---|---|
| `URL` | Start URL of the website | - |
| `--output`, `-o` | Output path for sitemap.xml | `sitemap_<host>_<timestamp>.xml` |
| `--max-depth`, `-d` | Maximum crawl depth | 10 |
| `--concurrency`, `-c` | Parallel requests | 8 |
| `--timeout`, `-t` | Timeout per page (seconds) | 30 |
| `--render` | Render JavaScript with Playwright | off |
| `--no-headless` | Browser visible (debugging) | off |
| `--ignore-robots` | Ignore robots.txt | off |
| `--user-agent` | Custom User-Agent | Chrome 131 |
| `--cookie` | Set cookie (`NAME=VALUE`, can be given multiple times) | - |

| Key | Function |
|---|---|
| `c` | Start crawl (URL dialog) |
| `x` | Cancel crawl / JSON error report |
| `m` | Save sitemap (save-as dialog) |
| `s` | Settings |
| `g` | Export form report (JSON) |
| `j` | JIRA table to clipboard |
| `e` | Show errors only |
| `f` | Sitemap diff |
| `d` | Copy URL details |
| `l` | Toggle log |
| `h` | History |
| `z` | Crawl summary (score, findings) |
| `?` | HTTP status code reference |
| `i` | Info dialog |
| `q` | Quit |

To copy or export the log, right-click the log panel. Hovering over a shortcut in the footer reveals a full tooltip explaining what it does. URLs in the log, header and detail panel are clickable without holding Ctrl.
- Dual mode: httpx (fast, HTML only) or Playwright (JavaScript rendering)
- robots.txt: Respected by default, `--ignore-robots` to disable
- Auto-split: With more than 50,000 URLs, a sitemap index with partial sitemaps is generated automatically
- Priority: Set automatically based on crawl depth (home page = 1.0)
- lastmod: Taken from the HTTP `Last-Modified` header
- URL normalization: URLs are normalized to avoid duplicates
- Redirect-target resolution (same-host aliases): A sitemap normally lists only crawled HTTP-200 HTML pages; pure redirects (301/302) are left out because they have no content of their own. There is one special case: when a redirect points to a page on the same host that is not otherwise listed (an internal alias, e.g. `/old-path` → `/new-path` where `/new-path` is only reachable via the redirect), that resolved target would fall through the cracks and never end up in the sitemap. The writer therefore writes the resolved target into the sitemap instead of silently dropping the redirect. Three rules keep this safe and standard-compliant:
  - Already covered → skipped. If the redirect resolves to a page that is already listed as a 200 page, it is dropped, because scanning it again would just duplicate the existing page.
  - Same target → one entry. Several redirects pointing at the same target collapse into a single sitemap entry.
  - Same host only. Only targets on the exact same host are added. Per the sitemaps.org standard, a `sitemap.xml` may only contain URLs from the host it lives on, so redirects to sibling subdomains (e.g. `www.waerme.enviam.de` redirecting to `www.enviam.de` or `www.solar.enviam.de`) or to foreign domains are excluded. They belong in the sitemap of their own site, not this one. If you want those pages scanned, crawl that host directly.

  Note: the resolved target's final HTTP status is not re-fetched (the crawler only stored the original 301), so a broken same-host alias target shows up as an error in the downstream scan, which is exactly what you want to see.
- Form detection: `<form>` tags are detected, marked in the table and exportable as JSON
- Dead-link source viewer: For every 4xx/5xx page, jump to the referring page's HTML and see the exact line that contains the broken link, Pygments-highlighted, with the matching line painted in a warm gold band. From there: open the source in your browser, copy a paste-ready snippet (broken URL + ±3 lines of context + line number) to the clipboard, or save the full HTML as evidence
- Results context menu: Right-click on any results row for the five bulk actions (toggle errors-only filter, save sitemap as XML, save error report as JSON, copy JIRA table, generate forms report). 4xx/5xx rows additionally get a one-click jump into the broken-link source viewer
- Live TUI: Progress, statistics and URL details in real time, with the results table and page tree split across two tabs
- Sortable results: Click any column header (Status, HTTP, Depth, Links, Form, Time, Size, Date, URL) to sort; a second click reverses the order. The active column gets a ▲/▼ marker. Hovering the "Links" header shows a tooltip clarifying that it counts the internal links found on the page
- Crawl summary: When a crawl finishes, a modal shows a site score (0-100 %, the share of error-free pages, with an A-F grade) and a findings table (pages crawled / error-free / with errors, HTTP status breakdown, internal links found, URLs in the sitemap and how many of those are resolved redirect targets; see "Redirect-target resolution" above). A "Save sitemap" button sits right next to "Close". Re-open it any time with `z`
- HTTP status code reference: Press `?` for a quick lookup table of all common HTTP status codes, grouped by class (2xx/3xx/4xx/5xx) with meaning and explanation; handy for telling a 301 from a 307
- Proxy/SSO detection: Before crawling, the start URL is probed once. If it is redirected to a foreign domain (typical of a proxy, an auth gateway such as Zscaler, or an SSO login like E.ON/Microsoft), a warning modal explains the situation and the crawl is aborted instead of silently returning just the seed URLs. Redirects within the same registrable domain (e.g. alias 301s to a sibling subdomain) do not trigger it
- Date & size columns: Last-Modified date and page size are shown directly in the results table, side by side with the URL. Note on size: the "Size" value is the HTML document size (the delivered source). While crawling, resources such as images, CSS, JS and fonts are not loaded, so they do not count towards it; the real total page weight (as a browser reports it) is higher. Hovering the "Size" header explains this
- Clickable links: URLs in the log, crawl header and detail panel open in your default browser on a single click (no Ctrl required); local result files (sitemap.xml, JSON reports) open in the OS default app
- Page tree: Hierarchical view of all crawled URLs with HTTP status, dead-link and not-in-sitemap markers, embedded as a tab with siblings sorted alphabetically; the table's filter applies to the tree as well (matching nodes plus their ancestors stay visible)
- URL dialog: `c` opens a dialog (pre-filled with the last URL) to enter or change the target URL; no restart needed
- Crawl header: All crawl statistics (mode, robots.txt, concurrency, status codes, progress) grouped in one collapsible header
- Page details: Selecting a URL shows grouped panels: page info, issues, tech stack, SEO/meta data and HTTP headers
- Footer tooltips: Every shortcut shows a hover tooltip explaining what it does, even the cryptic ones like JIRA table, sitemap diff or form report
- Issue detection: Flags common problems per page: HTTP errors, missing/overlong title & description, missing H1/viewport/canonical, `noindex`, slow load, large page
- Tech-stack detection: Detects the CMS, JS/CSS frameworks and server software of each page
- Page preview: Optional in-terminal screenshot of the selected page (TGP/Sixel with half-block fallback), toggled in settings. Before the shot it accepts the cookie consent, waits for the network to settle and triggers lazy-loaded images, so hero images render instead of being cut off. A live phase indicator (loading page, accepting consent, loading images, capturing) explains why it can take 2-3 seconds
- Save-as dialog: `m` opens a file dialog to choose where to write the `sitemap.xml`, pre-filled with a suggested name and remembering the last folder used
- Resizable panels: Splitters to freely resize the URL table, log and stats panels
- Log panel: Right-click context menu with copy, export to file, or hide
- Settings dialog: Language, robots.txt, Playwright, page preview, concurrency, timeout and crawl depth, persisted across runs
- Filter with history: Filter the URL table by URL/status; recent filter terms in a dropdown
- Crawl history: Past crawls with date, URL, parameters and final stats (crawled / 2xx / errors); date in the UI's locale (DE: `dd.MM.yyyy`, EN: ISO). After picking a URL from the history, the footer `c` (crawl) key blinks to signal you are ready to start
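
The three rules listed under "Redirect-target resolution" can be illustrated with a short Python sketch. This is not the tool's actual writer code: the function name `resolve_redirect_targets`, the shape of its inputs (a list of 200 pages and a mapping from redirecting URL to resolved target) and the `example.com` URLs are assumptions made purely for illustration.

```python
from urllib.parse import urlsplit


def resolve_redirect_targets(ok_pages, redirects, sitemap_host):
    """Return the sitemap URL list: 200 pages plus eligible redirect targets.

    ok_pages     -- URLs crawled with HTTP 200 (hypothetical input shape)
    redirects    -- mapping of redirecting URL -> resolved target URL
    sitemap_host -- host the sitemap.xml lives on
    """
    listed = list(ok_pages)
    seen = set(ok_pages)
    host = sitemap_host.lower()
    for target in redirects.values():
        # Same host only: exact match, so sibling subdomains are excluded.
        if urlsplit(target).hostname != host:
            continue
        # Already covered / same target: skip anything listed before,
        # whether as a 200 page or via an earlier redirect.
        if target in seen:
            continue
        seen.add(target)
        listed.append(target)
    return listed


pages = ["https://www.example.com/", "https://www.example.com/a"]
redirects = {
    "https://www.example.com/old": "https://www.example.com/new",
    "https://www.example.com/older": "https://www.example.com/new",
    "https://www.example.com/a-alias": "https://www.example.com/a",
    "https://www.example.com/shop": "https://shop.example.com/",
}
print(resolve_redirect_targets(pages, redirects, "www.example.com"))
```

Only `/new` is added, and only once: `/a` is already listed as a 200 page, and `shop.example.com` is a different host.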
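
The auto-split feature follows the sitemaps.org limit of 50,000 URLs per file. A minimal sketch of the idea is shown here, assuming a hypothetical `build_sitemaps` helper and a hypothetical `sitemap-N.xml` naming scheme for the partial files; the tool's real file names and writer may differ.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file limit from the sitemaps.org protocol


def _document(root, body):
    """Wrap already-serialized child elements in a sitemap root element."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<{root} xmlns="{SITEMAP_NS}">\n{body}</{root}>\n'
    )


def build_sitemaps(urls, base_url, max_urls=MAX_URLS):
    """Return {file name: XML text}, adding an index when a split is needed."""
    chunks = [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]
    parts = [
        _document(
            "urlset",
            "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in chunk),
        )
        for chunk in chunks
    ]
    if len(parts) <= 1:
        return {"sitemap.xml": parts[0] if parts else _document("urlset", "")}

    # Hypothetical naming scheme: sitemap-1.xml, sitemap-2.xml, ...
    files = {f"sitemap-{n}.xml": xml for n, xml in enumerate(parts, start=1)}
    base = base_url.rstrip("/")
    index_body = "".join(
        f"  <sitemap><loc>{escape(base + '/' + name)}</loc></sitemap>\n"
        for name in files
    )
    files["sitemap.xml"] = _document("sitemapindex", index_body)
    return files
```

With the default limit, a crawl of up to 50,000 URLs yields a single `sitemap.xml`; anything larger yields the partial files plus an index that points at them.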
- System Chrome preferred (faster startup, less memory)
- Bundled Chromium as fallback (included in standalone installation)
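
The browser preference above could be expressed with Playwright's `channel` launch option. The following is only a sketch under assumptions: the helper name `launch_browser` is hypothetical, and treating any launch failure as "no system Chrome available" is a simplification that the tool itself may not make.

```python
def launch_browser(chromium, headless=True):
    """Prefer the system Chrome, fall back to the bundled Chromium.

    `chromium` is expected to be a Playwright BrowserType,
    e.g. `playwright.chromium` from `sync_playwright()`.
    """
    try:
        # channel="chrome" asks Playwright to use the installed Google Chrome.
        return chromium.launch(channel="chrome", headless=headless)
    except Exception:
        # Assumption: any failure here means no usable system Chrome,
        # so retry with the Chromium build that ships with Playwright.
        return chromium.launch(headless=headless)


# Usage (requires the playwright package and installed browsers):
#
#     from playwright.sync_api import sync_playwright
#
#     with sync_playwright() as p:
#         browser = launch_browser(p.chromium)
#         browser.close()
```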
Important: Crawling a website may be perceived as unusual traffic by the operator. Please note:
- Inform the website operator before crawling, especially for large websites
- Respect `robots.txt` (enabled by default)
- Use reasonable concurrency and timeout values
- This tool is intended for your own websites and authorized analyses
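
To see what respecting `robots.txt` means for a given site before crawling it, the rules can be checked with Python's standard library. This sketch is independent of how sitemap-tracker evaluates robots.txt internally; the helper name `build_robots_checker`, the sample rules and the `sitemap-tracker` user-agent string are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent):
    """Return a function that tells whether `user_agent` may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Sample rules; a real check would use the site's own /robots.txt content.
rules = """\
User-agent: *
Disallow: /admin/
"""

allowed = build_robots_checker(rules, "sitemap-tracker")
print(allowed("https://example.com/"))         # True
print(allowed("https://example.com/admin/x"))  # False
```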
```bash
git clone https://github.com/michaelblaess/sitemap-tracker.git
cd sitemap-tracker

# Windows
.\bootstrap.ps1

# Linux/macOS
./bootstrap.sh
```

```bash
# Windows
.\run.ps1 https://example.com

# Linux/macOS
./run.sh https://example.com
```

```bash
git tag vX.Y.Z
git push origin vX.Y.Z
```

GitHub Actions automatically builds executables for Windows, Linux and macOS.

Apache License 2.0 - see LICENSE