michaelblaess/sitemap-tracker

Sitemap Tracker

Crawls websites and generates standards-compliant sitemap.xml files. Uses Playwright for JavaScript rendering or httpx for fast HTTP crawling.

Screenshots

Main View

Sitemap Tree

Crawl History

Installation

One-Liner (Standalone, no Python required)

Linux / macOS:

curl -fsSL https://raw.githubusercontent.com/michaelblaess/sitemap-tracker/main/install.sh | bash

Windows (PowerShell):

irm https://raw.githubusercontent.com/michaelblaess/sitemap-tracker/main/install.ps1 | iex

Usage

# Simple crawl (httpx mode, fast)
sitemap-tracker https://example.com

# With JavaScript rendering (Playwright)
sitemap-tracker https://example.com --render

# Save sitemap directly
sitemap-tracker https://example.com --output sitemap.xml

# Limit crawl depth
sitemap-tracker https://example.com --max-depth 5

# More concurrency
sitemap-tracker https://example.com --concurrency 16

# Ignore robots.txt
sitemap-tracker https://example.com --ignore-robots

# With cookies (e.g. for login)
sitemap-tracker https://example.com --cookie session=abc123

CLI Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| URL | Start URL of the website | - |
| --output, -o | Output path for sitemap.xml | sitemap_<host>_<timestamp>.xml |
| --max-depth, -d | Maximum crawl depth | 10 |
| --concurrency, -c | Parallel requests | 8 |
| --timeout, -t | Timeout per page (seconds) | 30 |
| --render | Render JavaScript with Playwright | off |
| --no-headless | Browser visible (debugging) | off |
| --ignore-robots | Ignore robots.txt | off |
| --user-agent | Custom User-Agent | Chrome 131 |
| --cookie | Set cookie (NAME=VALUE, multiple) | - |
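
The --cookie parameter takes NAME=VALUE pairs, and the table marks it as "multiple", which this sketch reads as "may be repeated". A minimal illustration of how such values could be turned into a cookie mapping follows. parse_cookies is a hypothetical helper, not the tool's actual code, and the error handling is an assumption.

```python
def parse_cookies(pairs: list[str]) -> dict[str, str]:
    """Turn repeated NAME=VALUE arguments into a name -> value mapping."""
    cookies: dict[str, str] = {}
    for pair in pairs:
        # Split on the first "=" only, so a value may itself contain "=".
        name, sep, value = pair.partition("=")
        if not sep or not name.strip():
            # Assumed behaviour: reject anything that is not NAME=VALUE.
            raise ValueError(f"expected NAME=VALUE, got {pair!r}")
        cookies[name.strip()] = value
    return cookies
```

A mapping of this shape can be handed to an HTTP client, for example through the cookies parameter of httpx.Client.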

Keyboard Shortcuts (TUI)

| Key | Function |
| --- | --- |
| c | Start crawl (URL dialog) |
| x | Cancel crawl / JSON error report |
| m | Save sitemap (save-as dialog) |
| s | Settings |
| g | Export form report (JSON) |
| j | JIRA table to clipboard |
| e | Show errors only |
| f | Sitemap diff |
| d | Copy URL details |
| l | Toggle log |
| h | History |
| z | Crawl summary (score, findings) |
| ? | HTTP status code reference |
| i | Info dialog |
| q | Quit |

To copy or export the log, right-click the log panel. Hovering a shortcut in the footer reveals a tooltip explaining what it does. URLs in the log, header and detail panel are clickable without holding Ctrl.

Features

  • Dual mode: httpx (fast, HTML only) or Playwright (JavaScript rendering)

  • robots.txt: Respected by default, --ignore-robots to disable

  • Auto-split: When a crawl yields more than 50,000 URLs, the output is automatically split into partial sitemaps referenced by a sitemap index

  • Priority: Automatically based on crawl depth (home page = 1.0)

  • lastmod: From HTTP Last-Modified header

  • URL normalization: Duplicates avoided through normalization

  • Redirect-target resolution (same-host aliases): A sitemap normally lists only crawled HTTP-200 HTML pages; pure redirects (301/302) are left out because they have no content of their own. There is one special case: when a redirect points to a page on the same host that is not otherwise listed (an internal alias, e.g. /old-path → /new-path, where /new-path is only reachable via the redirect), the resolved target would otherwise never end up in the sitemap. The writer therefore writes the resolved target into the sitemap instead of silently dropping the redirect. Three rules keep this safe and standards-compliant:

    1. Already covered → skipped. If the redirect resolves to a page that is already listed as a 200 page, it is dropped — scanning it again would just duplicate the existing page.
    2. Same target → one entry. Several redirects pointing at the same target collapse into a single sitemap entry.
    3. Same host only. Only targets on the exact same host are added. A sitemap.xml may, per the sitemaps.org standard, only contain URLs from the host it lives on — so redirects to sibling subdomains (e.g. www.waerme.enviam.de redirecting to www.enviam.de or www.solar.enviam.de) or to foreign domains are excluded. They belong in the sitemap of their own site, not this one. If you want those pages scanned, crawl that host directly.

    Note: the resolved target's final HTTP status is not re-fetched (the crawler only stored the original 301), so a broken same-host alias target shows up as an error in the downstream scan — which is exactly what you want to see.

  • Form detection: <form> tags are detected, marked in the table and exportable as JSON

  • Dead-link source viewer: For every 4xx/5xx page, jump to the referring page's HTML and see the exact line that contains the broken link — Pygments-highlighted, the match line painted in a warm gold band. From there: open the source in your browser, copy a paste-ready snippet (broken URL + ±3 lines of context + line number) to the clipboard, or save the full HTML as evidence

  • Results context menu: Right-click on any results row for the five bulk actions (toggle errors-only filter, save sitemap as XML, save error report as JSON, copy JIRA table, generate forms report). 4xx/5xx rows additionally get a one-click jump into the broken-link source viewer

  • Live TUI: Progress, statistics and URL details in real time — results table and page tree split across two tabs

  • Sortable results: Click any column header (Status, HTTP, Depth, Links, Form, Time, Size, Date, URL) to sort — second click reverses. Active column gets a ▲/▼ marker. Hovering the "Links" header shows a tooltip clarifying it counts the internal links found on the page

  • Crawl summary: When a crawl finishes, a modal shows a site score (0-100 %, share of error-free pages, with an A-F grade) and a findings table (pages crawled / error-free / with errors, HTTP status breakdown, internal links found, URLs in the sitemap and how many of those are resolved redirect targets — see "Redirect-target resolution" above). A "Save sitemap" button sits right next to "Close". Re-open it any time with z

  • HTTP status code reference: Press ? for a quick lookup table of all common HTTP status codes, grouped by class (2xx/3xx/4xx/5xx) with meaning and explanation - handy to tell a 301 from a 307

  • Proxy/SSO detection: Before crawling, the start URL is probed once. If it is redirected to a foreign domain (typical of a proxy, an auth gateway such as Zscaler, or an SSO login like E.ON/Microsoft), a warning modal explains the situation and the crawl is aborted - instead of silently returning just the seed URLs. Redirects within the same registrable domain (e.g. alias 301s to a sibling subdomain) do not trigger it

  • Date & size columns: Last-Modified date and page size are shown directly in the results table, side by side with the URL. Note on size: the "Size" value is the HTML document size (the delivered source). While crawling, resources such as images, CSS, JS and fonts are not loaded, so they do not count towards it - the real total page weight (as a browser reports it) is higher. Hovering the "Size" header explains this

  • Clickable links: URLs in the log, crawl header and detail panel open in your default browser on a single click (no Ctrl required); local result files (sitemap.xml, JSON reports) open in the OS default app

  • Page tree: Hierarchical view of all crawled URLs with HTTP status, dead-link and not-in-sitemap markers — embedded as a tab, siblings sorted alphabetically; the table's filter applies to the tree as well (matching nodes plus their ancestors stay visible)

  • URL dialog: c opens a dialog (pre-filled with the last URL) to enter or change the target URL — no restart needed

  • Crawl header: All crawl statistics — mode, robots.txt, concurrency, status codes, progress — grouped in one collapsible header

  • Page details: Selecting a URL shows grouped panels — page info, issues, tech stack, SEO/meta data and HTTP headers

  • Footer tooltips: Every shortcut shows a hover-tooltip explaining what it does — even the cryptic ones like JIRA table, sitemap diff or form report

  • Issue detection: Flags common problems per page — HTTP errors, missing/overlong title & description, missing H1/viewport/canonical, noindex, slow load, large page

  • Tech-stack detection: Detects the CMS, JS/CSS frameworks and server software of each page

  • Page preview: Optional in-terminal screenshot of the selected page (TGP/Sixel with half-block fallback) — toggle in settings. Before the shot it accepts the cookie consent, waits for the network to settle and triggers lazy-loaded images, so hero images render instead of being cut off. A live phase indicator (loading page, accepting consent, loading images, capturing) explains why it can take 2-3 seconds

  • Save-as dialog: m opens a file dialog to choose where to write the sitemap.xml, pre-filled with a suggested name and remembering the last folder used

  • Resizable panels: Splitters to freely resize the URL table, log and stats panels

  • Log panel: Right-click context menu — copy, export to file, or hide

  • Settings dialog: Language, robots.txt, Playwright, page preview, concurrency, timeout and crawl depth — persisted across runs

  • Filter with history: Filter the URL table by URL/status; recent filter terms in a dropdown

  • Crawl history: Past crawls with date, URL, parameters and final stats (crawled / 2xx / errors); date in the UI's locale (DE: dd.MM.yyyy, EN: ISO). After picking a URL from the history, the footer c (crawl) key blinks to signal you are ready to start
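
The three redirect-target rules from the feature list above (already covered, same target, same host only) can be sketched as a small filter. This is an illustration under stated assumptions, not the tool's actual writer: resolve_redirect_targets and its parameters are hypothetical names, and the inputs are assumed to be already-normalized absolute URLs.

```python
from urllib.parse import urlsplit


def resolve_redirect_targets(
    ok_pages: set[str],
    redirects: dict[str, str],
    sitemap_host: str,
) -> list[str]:
    """Return the extra sitemap URLs contributed by redirect targets.

    ok_pages:     URLs that were crawled with HTTP 200
    redirects:    redirecting URL -> resolved target URL
    sitemap_host: host the sitemap.xml lives on
    """
    extra: list[str] = []
    seen: set[str] = set()
    for target in redirects.values():
        if target in ok_pages:
            continue  # rule 1: already listed as a 200 page
        if target in seen:
            continue  # rule 2: several redirects, one entry
        if (urlsplit(target).hostname or "") != sitemap_host.lower():
            continue  # rule 3: exact same host only
        seen.add(target)
        extra.append(target)
    return extra
```

The host comparison is an exact match on purpose: a sibling subdomain counts as a different host, in line with rule 3.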

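The Priority, lastmod and Auto-split entries in the feature list describe how sitemap entries are derived. The sketch below shows one way this could look with the Python standard library. It is not the tool's implementation: all function names are hypothetical, and the priority formula (1.0 for the home page, minus 0.1 per depth level, never below 0.1) is an assumption, since the text only states that the home page gets 1.0.

```python
from email.utils import parsedate_to_datetime
from typing import Optional
from xml.etree import ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit of the sitemaps.org protocol


def priority_for(depth: int) -> str:
    # Assumed formula: 1.0 at depth 0, minus 0.1 per level, floor of 0.1.
    return f"{max(0.1, 1.0 - 0.1 * depth):.1f}"


def lastmod_from(header: Optional[str]) -> Optional[str]:
    """Convert an HTTP Last-Modified value into a W3C date (YYYY-MM-DD)."""
    if not header:
        return None
    try:
        return parsedate_to_datetime(header).date().isoformat()
    except (TypeError, ValueError):
        return None  # unparseable header: leave <lastmod> out


def build_urlset(pages) -> ET.Element:
    """pages: iterable of (url, depth, last_modified_header) tuples."""
    urlset = ET.Element("urlset", xmlns=NS)
    for url, depth, header in pages:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        lastmod = lastmod_from(header)
        if lastmod:
            ET.SubElement(node, "lastmod").text = lastmod
        ET.SubElement(node, "priority").text = priority_for(depth)
    return urlset


def split_pages(pages: list, limit: int = MAX_URLS) -> list:
    """Cut the page list into chunks that each fit into one sitemap file."""
    return [pages[i:i + limit] for i in range(0, len(pages), limit)]
```

Each chunk returned by split_pages would be written as its own partial sitemap and referenced from a sitemap index. How the partial files are named is not specified in this README.
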
Browser Strategy

  1. System Chrome preferred (faster startup, less memory)
  2. Bundled Chromium as fallback (included in standalone installation)
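
The preference order above could be expressed with Playwright's Python API roughly as follows. The channel="chrome" launch option is how Playwright selects an installed Google Chrome; launch_browser and its broad exception handling are assumptions made for illustration, not the tool's actual code.

```python
def launch_browser(playwright, headless: bool = True):
    """Try system Chrome first, then fall back to the bundled Chromium.

    `playwright` is the object yielded by
    playwright.sync_api.sync_playwright().
    """
    attempts = [
        {"channel": "chrome", "headless": headless},  # 1. system Chrome
        {"headless": headless},                       # 2. bundled Chromium
    ]
    last_error = None
    for options in attempts:
        try:
            return playwright.chromium.launch(**options)
        except Exception as exc:  # e.g. Chrome is not installed
            last_error = exc
    raise RuntimeError("no usable browser found") from last_error
```

With Playwright installed, this would be called inside a `with sync_playwright() as p:` block as `launch_browser(p)`.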

Privacy

Important: Crawling a website may be perceived as unusual traffic by the operator. Please note:

  • Inform the website operator before crawling, especially for large websites
  • Respect robots.txt (enabled by default)
  • Use reasonable concurrency and timeout values
  • This tool is intended for your own websites and authorized analyses

Development

Setup

git clone https://github.com/michaelblaess/sitemap-tracker.git
cd sitemap-tracker

# Windows
.\bootstrap.ps1

# Linux/macOS
./bootstrap.sh

Local Start

# Windows
.\run.ps1 https://example.com

# Linux/macOS
./run.sh https://example.com

Creating a Release

git tag vX.Y.Z
git push origin vX.Y.Z

GitHub Actions automatically builds executables for Windows, Linux and macOS.

License

Apache License 2.0 - see LICENSE
