@Mearman Mearman commented Jul 30, 2025

Summary

Implements web page to Markdown conversion, as requested in issue #38. This adds a clip command to the markmv CLI that extracts content from web pages using one of five strategies: auto, readability, manual, full, or structured.

Key Features

🚀 Multi-Strategy Content Extraction

  • Auto Strategy: Intelligently selects the best extraction method based on content patterns
  • Readability Strategy: Uses Mozilla Readability algorithm for clean article extraction
  • Manual Strategy: Custom CSS selectors for complex or non-standard sites
  • Full Strategy: Complete page extraction including navigation and footers
  • Structured Strategy: Schema.org and JSON-LD structured data extraction
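
As a rough illustration of how the auto strategy might choose among the others, here is a minimal sketch. The function name `pickStrategy`, the `ConcreteStrategy` type, and both heuristics are assumptions made for this example. They are not taken from the markmv source, which may use different signals entirely.

```typescript
// Hypothetical sketch: these names are illustrative, not taken from the markmv source.
type ConcreteStrategy = "readability" | "full" | "structured";

// Guess a strategy from raw HTML. "manual" is left out because it needs
// user-supplied CSS selectors and so cannot be chosen automatically.
function pickStrategy(html: string): ConcreteStrategy {
  // Assumed heuristic 1: embedded JSON-LD suggests structured data is available.
  if (/<script[^>]+type=["']application\/ld\+json["']/i.test(html)) {
    return "structured";
  }
  // Assumed heuristic 2: an <article> element suggests article-style content.
  if (/<article[\s>]/i.test(html)) {
    return "readability";
  }
  // Fall back to converting the whole page.
  return "full";
}
```

Keeping the selection step as a pure function of the page content makes it easy to unit test without network access.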

🌐 Comprehensive Site Support

  • Blog Posts & Articles: Optimized extraction with Readability algorithm
  • Documentation Sites: Custom selector support for technical docs
  • Single Page Applications: Multiple strategies to handle dynamic content
  • E-commerce & Complex Sites: Flexible extraction options
  • News Sites & Publications: Metadata-aware extraction with author/date

📋 Advanced Processing Options

  • Image Handling: Skip, link-only, download locally, or embed as base64
  • Link Classification: Automatic internal/external link detection and processing
  • Metadata Extraction: Title, author, published date, description with fallbacks
  • Frontmatter Generation: YAML frontmatter with configurable metadata inclusion
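
The metadata fallbacks and frontmatter generation listed above could look roughly like the following sketch. `ClipMetadata`, `firstNonEmpty`, and `toFrontmatter` are hypothetical names, and the field names and their order are assumptions rather than the format markmv actually emits.

```typescript
// Hypothetical sketch: field names and helpers are assumptions, not markmv's API.
interface ClipMetadata {
  title?: string;
  author?: string;
  publishedDate?: string;
  description?: string;
}

// Return the first candidate that is a non-blank string, e.g. an og:title
// value falling back to the <title> text.
function firstNonEmpty(...candidates: Array<string | undefined>): string | undefined {
  for (const candidate of candidates) {
    if (candidate !== undefined && candidate.trim().length > 0) {
      return candidate.trim();
    }
  }
  return undefined;
}

const FIELD_ORDER = ["title", "author", "publishedDate", "description"] as const;

// Build a YAML frontmatter block. JSON.stringify produces a double-quoted
// string, which is also a valid YAML double-quoted scalar.
function toFrontmatter(meta: ClipMetadata): string {
  const lines: string[] = [];
  for (const key of FIELD_ORDER) {
    const value = meta[key];
    if (value) {
      lines.push(`${key}: ${JSON.stringify(value)}`);
    }
  }
  return lines.length === 0 ? "" : `---\n${lines.join("\n")}\n---\n`;
}
```

Missing fields are omitted instead of being written as empty values, so a page with no usable metadata produces no frontmatter block at all.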

🔧 Professional Features

  • Authentication Support: Custom headers, cookies, and user-agent strings
  • Batch Processing: Process multiple URLs from text files
  • Error Handling: Comprehensive error recovery and detailed reporting
  • Dry-Run Mode: Preview extraction without creating files
  • JSON Output: Programmatic interface for integration with other tools
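
For batch processing, the URL file has to be turned into a list of targets. The sketch below assumes one URL per line, with blank lines and lines starting with `#` ignored. That file format and the name `parseUrlList` are assumptions for illustration, since the description does not specify them.

```typescript
// Hypothetical sketch: the file format (one URL per line, "#" comments) is assumed.
function parseUrlList(text: string): string[] {
  return text
    .split(/\r?\n/) // accept both Unix and Windows line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && line.charAt(0) !== "#")
    // Keep only http(s) URLs. A real tool might report the rejected lines
    // instead of dropping them silently.
    .filter((line) => /^https?:\/\//i.test(line));
}
```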

Implementation Highlights

🏗️ Architecture

  • Core WebClipper Class: Main extraction engine with strategy pattern
  • CLI Command Interface: Full-featured command with extensive options
  • TypeScript Strict Mode: Zero type coercion, comprehensive interfaces
  • Modular Design: Extensible strategy system for future enhancements

🧪 Testing

  • 42 Total Tests: Comprehensive coverage of all functionality
  • Core Tests (28): WebClipper class methods, strategies, error handling
  • CLI Tests (14): Command interface, option parsing, file operations
  • Mock Integration: External dependencies properly mocked (jsdom, readability, turndown)

🔒 Quality Assurance

  • Strict TypeScript: No any types in production code
  • Error Boundaries: Graceful handling of network issues, parsing failures
  • Cross-Platform: Proper filename sanitization and path handling
  • Resource Management: Timeout controls and memory-conscious processing
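
To make the cross-platform filename point concrete, here is one way a page title could be turned into a safe file name. `sanitizeFilename`, the 100-character limit, and the `untitled` fallback are assumptions for this sketch, and it does not handle Windows reserved device names such as `CON` or `NUL`.

```typescript
// Hypothetical sketch: not the sanitizer used in markmv.
function sanitizeFilename(title: string, maxLength = 100): string {
  const cleaned = title
    // Replace characters that Windows forbids in file names, plus control characters.
    .replace(/[<>:"/\\|?*\x00-\x1f]/g, " ")
    // Collapse runs of whitespace into single hyphens.
    .replace(/\s+/g, "-")
    .slice(0, maxLength)
    // Trim leading and trailing hyphens and dots (Windows rejects trailing dots).
    .replace(/^[-.]+|[-.]+$/g, "")
    .toLowerCase();
  return (cleaned.length > 0 ? cleaned : "untitled") + ".md";
}
```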

Usage Examples

# Basic article extraction
markmv clip https://example.com/article

# Custom output location
markmv clip https://blog.com/post -o article.md

# Batch processing with custom strategy
markmv clip urls.txt --batch --strategy manual --selectors "article,.content"

# Download images locally
markmv clip https://tutorial.com --image-strategy download --image-dir ./images

# Authentication-aware extraction
markmv clip https://protected.com --headers '{"Authorization": "Bearer token"}'

# Dry-run with verbose output
markmv clip https://example.com --dry-run --verbose
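
The `--headers` option in the example above takes a JSON object. A minimal sketch of validating such a value before it is attached to a request might look like this. `parseHeaders` and its error messages are hypothetical and do not reflect how markmv parses the flag.

```typescript
// Hypothetical sketch: validates a --headers value such as '{"Authorization": "Bearer token"}'.
function parseHeaders(raw: string): Record<string, string> {
  const parsed: unknown = JSON.parse(raw); // throws SyntaxError on malformed JSON
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("--headers must be a JSON object");
  }
  const source = parsed as Record<string, unknown>;
  const headers: Record<string, string> = {};
  for (const name of Object.keys(source)) {
    const value = source[name];
    if (typeof value !== "string") {
      throw new Error(`Header "${name}" must be a string`);
    }
    headers[name] = value;
  }
  return headers;
}
```

Rejecting non-string values early gives a clearer error than letting the HTTP client fail later.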

CLI Integration

The new clip command integrates seamlessly with the existing markmv CLI:

  • Follows established patterns for options and output formatting
  • Supports dry-run mode consistent with other commands
  • Provides JSON output for programmatic usage
  • Includes comprehensive help text and examples

Test Coverage

The test suite covers:

  • ✅ Strategy detection and selection
  • ✅ Content extraction from various HTML structures
  • ✅ Metadata parsing and frontmatter generation
  • ✅ Image and link processing
  • ✅ Error handling and recovery
  • ✅ CLI option parsing and validation
  • ✅ Batch processing workflows
  • ✅ File output and directory creation

Technical Implementation

  • Dependencies: Added @mozilla/readability, turndown, jsdom, node-html-parser
  • Type Safety: Comprehensive interfaces without any type coercion
  • Performance: Efficient parsing with timeout controls and resource management
  • Maintainability: Clean separation of concerns with strategy pattern
  • Extensibility: Easy to add new extraction strategies or output formats

Closes #38

Test Plan

  • All existing tests pass
  • New functionality has comprehensive test coverage
  • CLI integration works correctly
  • Error handling is robust
  • Cross-platform compatibility verified
  • TypeScript strict mode compliance
  • Linting standards met (test file warnings acceptable)

Implement multi-strategy web page to markdown conversion with:

**Core Features:**
- Multiple extraction strategies: auto, readability, manual, full, structured
- Mozilla Readability integration for article extraction
- Custom CSS selector support for complex sites
- Schema.org structured data extraction
- Auto-strategy detection based on content patterns

**Content Processing:**
- HTML to Markdown conversion with Turndown.js
- Image extraction and processing (skip, link-only, download, base64)
- Link extraction and classification (internal/external)
- Metadata extraction (title, author, published date, description)
- Frontmatter generation with configurable options

**Advanced Features:**
- Custom HTTP headers and authentication support
- Cookie file support for protected content
- Configurable timeouts and redirect handling
- Batch processing from URL files
- Comprehensive error handling and retry logic
- Dry-run mode for preview without file creation

**CLI Interface:**
- Full-featured `markmv clip` command with extensive options
- Support for single URLs, batch processing, and custom output paths
- JSON output format for programmatic usage
- Verbose logging and detailed progress reporting
- Integration with existing markmv command structure

**TypeScript Implementation:**
- Strict type safety without any type coercion
- Comprehensive interfaces for all data structures
- Proper error handling with typed exceptions
- Full test coverage for all functionality

**Testing:**
- Comprehensive test suite for WebClipper core class (28 tests)
- CLI command tests covering all scenarios (14 tests)
- Mocked external dependencies (jsdom, readability, turndown)
- Error condition testing and edge case coverage
- Cross-platform filename sanitization tests

This addresses the need for robust web content extraction that handles
various site architectures including SPAs, documentation sites, blogs,
and structured content with appropriate strategy selection.
Development

Successfully merging this pull request may close these issues.

Feature Request: Add web page to markdown conversion (like Markdown Web Clipper)