Website Change Detector

This Python script checks whether a list of URLs have changed since the last snapshot. It is useful for monitoring web pages for updates, ignoring common sources of noise such as ad tracking parameters and dynamic hidden fields.

Features

Loads URLs from pages.txt (one per line, lines starting with # are ignored as comments)
Supports per-URL ignore patterns: specify a filename after a semicolon to remove custom patterns from the page before comparison (see the ignore/ directory)
Fetches each URL and cleans the HTML to remove noise (e.g., ad/tracking query parameters, dynamic hidden fields, scripts, comments)
Computes a checksum of the cleaned content
Compares with previously saved checksums in checksum.csv
Prints a list of URLs that have changed since the last snapshot
Saves updated checksums and change dates for future runs
Saves a snapshot of the cleaned HTML for each changed page in the snapshots/ directory, along with a log of changes, for debugging purposes.

Usage

Install dependencies
Make sure you have Python 3 and install required packages:
```
pip install requests
```
Prepare pages.txt
List the URLs you want to monitor, one per line. Lines starting with # are treated as comments.
- To use custom ignore patterns for a URL, add a semicolon and the ignore filename (e.g., https://example.com;example.com).
- Place your ignore pattern files in the ignore/ directory, one regex per line.
Run the script
```
python main.py
```
View results
- Changed URLs will be printed to the console.
- A log of changes is kept in snapshots/snapshots.log.

How it works

The script fetches each URL and removes common sources of noise (such as ad image query parameters and dynamic hidden fields) before calculating a checksum.
If an ignore pattern file is specified for a URL, all matching patterns are removed from the page before comparison.
If the checksum differs from the previous run, the URL is reported as changed and the date is updated.
When a change is detected, the cleaned HTML is saved as a snapshot in the snapshots/ directory, and the change is logged.

Customization

You can adjust the clean_html function in main.py to further refine what is considered "noise" for your use case.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
ignore		ignore
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main.py		main.py
pages.txt		pages.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Website Change Detector

Features

Usage

How it works

Customization

About

Uh oh!

Releases

Packages

Languages

License

epklein/website-change-detector

Folders and files

Latest commit

History

Repository files navigation

Website Change Detector

Features

Usage

How it works

Customization

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages