A comprehensive, production-ready web crawler built in Rust to analyze and improve website quality.
RustCrawler provides three types of website analysis with multiple output formats:
The SEO crawler analyzes search engine optimization aspects:
- Title tag presence and length
- Meta description tags
- H1 heading tags
- Canonical URL tags
- Robots meta tags
- Internal link validation (configurable limit)
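As an illustration of what a title check can look like, here is a minimal sketch based on plain string search; the helper names and thresholds are hypothetical, and the actual seo.rs logic may differ:

```rust
/// Return the character count of the <title> text, if a title tag is present.
/// Illustrative helper only; not the project's actual implementation.
fn title_length(html: &str) -> Option<usize> {
    let lower = html.to_lowercase();
    let start = lower.find("<title")?;
    let open_end = start + lower[start..].find('>')? + 1;
    let close = open_end + lower[open_end..].find("</title>")?;
    Some(lower[open_end..close].trim().chars().count())
}

/// Flag common title problems. The 10-60 character range is a typical
/// rule of thumb, assumed here for the sketch.
fn title_issue(html: &str) -> Option<&'static str> {
    match title_length(html) {
        None => Some("missing <title> tag"),
        Some(n) if n < 10 => Some("title is very short"),
        Some(n) if n > 60 => Some("title is longer than 60 characters"),
        Some(_) => None,
    }
}
```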
The Performance crawler evaluates website performance metrics:
- Response time measurement
- Page size analysis
- External resource counting (scripts, stylesheets)
- Compression detection (Brotli, Gzip, Deflate)
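For example, the response-time and page-size measurements can be as simple as timing a single request. This is an illustrative sketch assuming a `reqwest::Client` and the tokio runtime, not the exact performance.rs code:

```rust
use std::time::Instant;

/// Fetch a page once, returning (elapsed milliseconds, body size in bytes).
/// Hypothetical helper for illustration; the real crawler may measure differently.
async fn measure(client: &reqwest::Client, url: &str) -> reqwest::Result<(u128, usize)> {
    let start = Instant::now();
    let body = client.get(url).send().await?.text().await?;
    let elapsed_ms = start.elapsed().as_millis();
    Ok((elapsed_ms, body.len()))
}
```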
The A11Y (accessibility) crawler checks web accessibility standards:
- HTML lang attribute
- Image alt attributes
- ARIA landmarks and attributes
- Semantic HTML5 tags
- Form label associations
- Skip navigation links
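Two of these checks, the `lang` attribute and image `alt` attributes, could be approximated with plain string scanning, as in this hypothetical sketch (the real a11y.rs implementation may differ):

```rust
/// Does the <html> tag carry a lang attribute? Illustrative only.
fn has_lang_attribute(html: &str) -> bool {
    let lower = html.to_lowercase();
    match lower.find("<html") {
        Some(i) => {
            let tag_end = lower[i..].find('>').map(|e| i + e).unwrap_or(lower.len());
            lower[i..tag_end].contains(" lang=")
        }
        None => false,
    }
}

/// Count <img> tags that declare no alt attribute. Illustrative only.
fn images_missing_alt(html: &str) -> usize {
    let lower = html.to_lowercase();
    let mut missing = 0;
    for (i, _) in lower.match_indices("<img") {
        let tag_end = lower[i..].find('>').map(|e| i + e).unwrap_or(lower.len());
        if !lower[i..tag_end].contains(" alt=") {
            missing += 1;
        }
    }
    missing
}
```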
Reports can be produced in three formats:
- Terminal: Color-coded, human-readable output
- JSON: Machine-readable format for integration
- HTML: Styled report for sharing
To build and run the project you need:
- Docker
- Make
Note: Rust and Cargo are NOT required on your host machine. They are included in the Docker container.
First, build the Docker image with the latest Rust version:
make install

This command downloads and sets up a Docker container with the latest version of Rust.
make run # Run in debug mode
make run-release # Run in release mode
Both commands follow an interactive prompt to select the URL and which crawlers to run.
# Analyze a URL with all crawlers
docker run --rm rustcrawler cargo run -- --url https://example.com --all
# Run specific crawlers
docker run --rm rustcrawler cargo run -- --url https://example.com --seo --performance
# Generate JSON report
docker run --rm rustcrawler cargo run -- --url https://example.com --all --format json --output report.json
# Generate HTML report
docker run --rm rustcrawler cargo run -- --url https://example.com --all --format html --output report.html
# Use custom configuration
docker run --rm rustcrawler cargo run -- --url https://example.com --all --config config.json
# Override settings
docker run --rm rustcrawler cargo run -- --url https://example.com --all --timeout 60 --max-links 20

Command-line options:
- --url <URL>: URL to analyze
- --seo: Run the SEO crawler
- --performance: Run the Performance crawler
- --a11y: Run the A11Y crawler
- --all: Run all crawlers
- --format <terminal|json|html>: Output format (default: terminal)
- --output <FILE>: Output file for JSON/HTML reports
- --config <FILE>: Configuration file path
- --timeout <SECONDS>: Request timeout in seconds
- --max-links <N>: Maximum number of internal links to check
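For reference, options like these map naturally onto a `clap` derive struct. The following is an illustrative sketch with assumed field names, not the project's actual cli.rs:

```rust
use clap::Parser;

/// Hypothetical CLI definition mirroring the flags listed above.
#[derive(Parser, Debug)]
#[command(name = "rustcrawler", about = "Analyze a website's SEO, performance, and accessibility")]
struct Cli {
    /// URL to analyze
    #[arg(long)]
    url: Option<String>,
    /// Run the SEO crawler
    #[arg(long)]
    seo: bool,
    /// Run the Performance crawler
    #[arg(long)]
    performance: bool,
    /// Run the A11Y crawler
    #[arg(long)]
    a11y: bool,
    /// Run all crawlers
    #[arg(long)]
    all: bool,
    /// Output format: terminal, json, or html
    #[arg(long, default_value = "terminal")]
    format: String,
    /// Output file for JSON/HTML reports
    #[arg(long)]
    output: Option<String>,
    /// Path to a JSON configuration file
    #[arg(long)]
    config: Option<String>,
    /// Request timeout in seconds
    #[arg(long)]
    timeout: Option<u64>,
    /// Maximum number of internal links to check
    #[arg(long)]
    max_links: Option<usize>,
}

fn main() {
    // Parse the process arguments into the struct above.
    let cli = Cli::parse();
    println!("{cli:?}");
}
```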
Create a config.json:
{
"timeout_secs": 30,
"max_links_to_check": 10,
"user_agent": "RustCrawler/0.1.0",
"follow_redirects": true,
"max_redirects": 5
}
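On the Rust side, such a file can be deserialized into a config struct with serde. This is a minimal sketch whose type and function names are assumptions, not the project's actual config.rs:

```rust
use serde::Deserialize;

/// Mirrors the fields of config.json shown above (illustrative type).
#[derive(Debug, Deserialize)]
struct Config {
    timeout_secs: u64,
    max_links_to_check: usize,
    user_agent: String,
    follow_redirects: bool,
    max_redirects: usize,
}

/// Read and parse a JSON configuration file from disk.
fn load_config(path: &str) -> Result<Config, Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&raw)?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = load_config("config.json")?;
    println!("{config:?}");
    Ok(())
}
```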
All commands run inside the Docker container, so you don't need Rust installed locally.

make build # Build in debug mode
make build-release # Build in release mode
make test # Run all tests (17 tests)
make test-verbose # Run tests with verbose output
make format # Format code with rustfmt
make format-check # Check formatting without modifying files
make lint # Run clippy linter
make check # Check if code compiles
make shell # Open a shell in the Docker container
make clean # Remove build artifacts and Docker image
make help # Display all available targets

The project uses the following main dependencies:
- `reqwest` - HTTP client for making requests
- `url` - URL parsing and validation
- `colored` - Terminal color output
- `tokio` - Async runtime
- `thiserror` - Custom error types
- `serde` / `serde_json` - Serialization for JSON output
- `clap` - Command-line argument parsing
- `chrono` - Date/time handling for reports
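As an example of how `thiserror` is typically used for the custom error types, a hypothetical error enum might look like this (variant names are illustrative, not the project's actual error.rs):

```rust
use thiserror::Error;

/// Illustrative error type wrapping the failure modes a crawler typically hits.
#[derive(Debug, Error)]
enum CrawlerError {
    #[error("request failed: {0}")]
    Http(#[from] reqwest::Error),

    #[error("invalid URL: {0}")]
    InvalidUrl(#[from] url::ParseError),

    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),
}
```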
The project follows Rust best practices with a modular architecture:
RustCrawler/
├── src/
│   ├── main.rs # Application entry point with CLI
│   ├── lib.rs # Library root with public exports
│   ├── cli.rs # CLI argument definitions
│   ├── config.rs # Configuration management
│   ├── error.rs # Custom error types
│   ├── client.rs # HTTP client wrapper
│   ├── models.rs # Data models and validation
│   ├── output.rs # JSON/HTML report generation
│   ├── utils.rs # Utility functions for I/O and display
│   └── crawlers/
│       ├── mod.rs # Crawler trait and common functions
│       ├── seo.rs # SEO crawler implementation
│       ├── performance.rs # Performance crawler implementation
│       └── a11y.rs # Accessibility crawler implementation
├── Cargo.toml # Rust dependencies and project configuration
├── Dockerfile # Docker container setup
├── Makefile # Build and run commands
├── ARCHITECTURE.md # Detailed architecture documentation
└── README.md # This file
- Modular Design: Each crawler is implemented in its own module with the `Crawler` trait
- Separation of Concerns: HTTP client, models, configuration, and utilities are separate modules
- Error Handling: Custom error types using `thiserror` for better error messages
- Configuration: Externalized configuration with JSON file support
- CLI + Interactive: Supports both command-line and interactive modes
- Multiple Outputs: Terminal, JSON, and HTML report formats
- Testable: 17 unit tests covering all major functionality
- Extensible: Easy to add new crawlers by implementing the `Crawler` trait (see the sketch after this list)
- Type Safety: Strong typing with custom models for data structures
- Library + Binary: Can be used as a library or standalone application
When contributing to this project:
- Ensure your code builds with `make build`
- Run tests with `make test` (17 tests should pass)
- Format code with `make format`
- Check for linting issues with `make lint`
- Follow Rust naming conventions and best practices
- Add tests for new functionality
Implemented features:
- ✅ Custom error types with `thiserror`
- ✅ Configuration management (JSON file support)
- ✅ CLI with `clap` for non-interactive use
- ✅ JSON and HTML output formats
- ✅ Configurable timeouts and limits
- ✅ User-agent customization
- ✅ Redirect policy configuration
- ✅ 17 comprehensive unit tests
Planned improvements:
- Async/await for parallel crawling
- HTML parser (`scraper` crate) for more accurate analysis
- Integration tests with mock servers
- Sitemap crawling
- Rate limiting
- Retry logic with exponential backoff
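One possible shape for the planned retry logic, sketched here purely as an illustration (none of this exists in the project yet):

```rust
use std::time::Duration;

/// Retry a GET request with exponential backoff, up to max_attempts tries.
/// Assumes a reqwest::Client and the tokio runtime; names are illustrative.
async fn get_with_retry(
    client: &reqwest::Client,
    url: &str,
    max_attempts: u32,
) -> reqwest::Result<reqwest::Response> {
    let mut delay = Duration::from_millis(500);
    let mut attempt = 1;
    loop {
        match client.get(url).send().await {
            Ok(resp) => return Ok(resp),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                tokio::time::sleep(delay).await;
                delay *= 2; // double the wait before the next attempt
                attempt += 1;
            }
        }
    }
}
```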