feat: Add comprehensive web clipper functionality for issue #38 #42

Mearman · 2025-07-30T14:01:04Z

Summary

Implements comprehensive web page to markdown conversion functionality as requested in issue #38. This adds a powerful clip command to the markmv CLI that can extract content from various types of web pages using multiple intelligent strategies.

Key Features

🚀 Multi-Strategy Content Extraction

Auto Strategy: Intelligently selects the best extraction method based on content patterns
Readability Strategy: Uses Mozilla Readability algorithm for clean article extraction
Manual Strategy: Custom CSS selectors for complex or non-standard sites
Full Strategy: Complete page extraction including navigation and footers
Structured Strategy: Schema.org and JSON-LD structured data extraction
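As a rough illustration of how the auto strategy could resolve to a concrete one, here is a minimal sketch. The `chooseStrategy` function, the `ClipHints` shape, and the specific heuristics are assumptions made for this example, not the actual WebClipper implementation.

```typescript
// Hypothetical names throughout; the real WebClipper API may differ.
type Strategy = "readability" | "manual" | "full" | "structured";

interface ClipHints {
  html: string;
  selectors?: string[]; // user-supplied CSS selectors, if any
}

// Resolve "auto" to a concrete strategy using simple content patterns.
// These heuristics are illustrative only.
function chooseStrategy({ html, selectors }: ClipHints): Strategy {
  // Explicit selectors mean the caller already knows where the content lives.
  if (selectors && selectors.length > 0) return "manual";
  // A JSON-LD script block suggests Schema.org structured data is available.
  if (/<script[^>]+type=["']application\/ld\+json["']/i.test(html)) return "structured";
  // An <article> element is a reasonable signal for Readability-style extraction.
  if (/<article[\s>]/i.test(html)) return "readability";
  // Otherwise fall back to converting the whole page.
  return "full";
}
```

Checking the cheapest and most explicit signals first keeps the fallback order predictable.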

🌐 Comprehensive Site Support

Blog Posts & Articles: Optimized extraction with Readability algorithm
Documentation Sites: Custom selector support for technical docs
Single Page Applications: Multiple strategies to handle dynamic content
E-commerce & Complex Sites: Flexible extraction options
News Sites & Publications: Metadata-aware extraction with author/date

📋 Advanced Processing Options

Image Handling: Skip, link-only, download locally, or embed as base64
Link Classification: Automatic internal/external link detection and processing
Metadata Extraction: Title, author, published date, description with fallbacks
Frontmatter Generation: YAML frontmatter with configurable metadata inclusion
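To make link classification and frontmatter generation concrete, the sketch below shows one way they could work. `classifyLink`, `buildFrontmatter`, and the `ClipMetadata` fields are hypothetical names chosen for this example; the hostname comparison and the JSON-style quoting are assumptions, not the PR's actual behaviour.

```typescript
// Hypothetical helpers; names and behaviour are assumptions for illustration.
interface ClipMetadata {
  title?: string;
  author?: string;
  publishedDate?: string;
  description?: string;
}

// Treat a link as internal when it resolves to the same hostname as the page.
// Note: new URL() throws on input that cannot be resolved.
function classifyLink(href: string, pageUrl: string): "internal" | "external" {
  const target = new URL(href, pageUrl); // resolves relative hrefs against the page
  return target.hostname === new URL(pageUrl).hostname ? "internal" : "external";
}

// Emit YAML frontmatter for the selected keys, skipping missing or empty values.
function buildFrontmatter(
  meta: ClipMetadata,
  include: (keyof ClipMetadata)[] = ["title", "author", "publishedDate", "description"],
): string {
  const lines: string[] = [];
  for (const key of include) {
    const value = meta[key];
    if (value === undefined || value === "") continue;
    // JSON string syntax doubles as a YAML double-quoted scalar, so colons are safe.
    lines.push(`${key}: ${JSON.stringify(value)}`);
  }
  return lines.length === 0 ? "" : `---\n${lines.join("\n")}\n---\n\n`;
}
```

Quoting every value avoids having to special-case titles that contain colons or leading symbols.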

🔧 Professional Features

Authentication Support: Custom headers, cookies, and user-agent strings
Batch Processing: Process multiple URLs from text files
Error Handling: Comprehensive error recovery and detailed reporting
Dry-Run Mode: Preview extraction without creating files
JSON Output: Programmatic interface for integration with other tools
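For batch processing, a URL file has to be turned into a list of targets before any fetching happens. The following is a minimal sketch under stated assumptions: one URL per line, blank lines ignored, and `#` starting a comment line. The `parseUrlList` name and the comment convention are hypothetical and not confirmed by this PR.

```typescript
// Hypothetical helper: splits a URL file into usable and rejected entries.
function parseUrlList(text: string): { urls: string[]; invalid: string[] } {
  const urls: string[] = [];
  const invalid: string[] = [];
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    // Assumption: blank lines and '#' comment lines are skipped.
    if (line === "" || line.startsWith("#")) continue;
    try {
      const parsed = new URL(line);
      // Only http(s) targets make sense for a web clipper.
      if (parsed.protocol === "http:" || parsed.protocol === "https:") {
        urls.push(parsed.href);
      } else {
        invalid.push(line);
      }
    } catch {
      // new URL() throws on strings that are not absolute URLs.
      invalid.push(line);
    }
  }
  return { urls, invalid };
}
```

Collecting invalid lines instead of throwing lets a batch run report every bad entry at once.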

Implementation Highlights

🏗️ Architecture

Core WebClipper Class: Main extraction engine with strategy pattern
CLI Command Interface: Full-featured command with extensive options
TypeScript Strict Mode: Zero type coercion, comprehensive interfaces
Modular Design: Extensible strategy system for future enhancements

🧪 Testing

42 Total Tests: Comprehensive coverage of all functionality
Core Tests (28): WebClipper class methods, strategies, error handling
CLI Tests (14): Command interface, option parsing, file operations
Mock Integration: External dependencies properly mocked (jsdom, readability, turndown)

🔒 Quality Assurance

Strict TypeScript: No `any` types in production code
Error Boundaries: Graceful handling of network issues, parsing failures
Cross-Platform: Proper filename sanitization and path handling
Resource Management: Timeout controls and memory-conscious processing
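Cross-platform filename sanitization usually means satisfying the strictest target, which is Windows. The sketch below is one possible approach; `sanitizeFilename`, the replacement character, and the default length limit are assumptions for illustration rather than the code in this PR.

```typescript
// Device names that Windows refuses as file names, regardless of case.
const WINDOWS_RESERVED = /^(con|prn|aux|nul|com[1-9]|lpt[1-9])$/i;

// Hypothetical helper: derive a safe markdown filename from a page title.
function sanitizeFilename(title: string, maxLength = 100): string {
  let name = title
    .replace(/[<>:"/\\|?*\u0000-\u001f]/g, "-") // characters Windows forbids in names
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim()
    .replace(/[. ]+$/, ""); // Windows rejects trailing dots and spaces
  if (name === "") name = "untitled";
  if (WINDOWS_RESERVED.test(name)) name = `_${name}`;
  return `${name.slice(0, maxLength)}.md`;
}
```

For example, a title such as `CON` becomes `_CON.md` and a title made only of dots falls back to `untitled.md`.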

Usage Examples

# Basic article extraction
markmv clip https://example.com/article

# Custom output location
markmv clip https://blog.com/post -o article.md

# Batch processing with custom strategy
markmv clip urls.txt --batch --strategy manual --selectors "article,.content"

# Download images locally
markmv clip https://tutorial.com --image-strategy download --image-dir ./images

# Authentication-aware extraction
markmv clip https://protected.com --headers '{"Authorization": "Bearer token"}'

# Dry-run with verbose output
markmv clip https://example.com --dry-run --verbose
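The `--headers` example above passes a JSON object on the command line. A minimal sketch of how such a value might be validated before use follows; `parseHeadersOption` and its error messages are hypothetical, and the actual option parsing in this PR may differ.

```typescript
// Hypothetical helper: validate the JSON passed to --headers.
function parseHeadersOption(raw: string): Record<string, string> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("--headers must be valid JSON");
  }
  // Only a plain object maps cleanly onto header name/value pairs.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("--headers must be a JSON object");
  }
  const headers: Record<string, string> = {};
  for (const [name, value] of Object.entries(parsed)) {
    if (typeof value !== "string") {
      throw new Error(`Header "${name}" must be a string`);
    }
    headers[name] = value;
  }
  return headers;
}
```

Failing early with a specific message is friendlier than letting a malformed header surface later as a network error.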

CLI Integration

The new clip command integrates seamlessly with the existing markmv CLI:

Follows established patterns for options and output formatting
Supports dry-run mode consistent with other commands
Provides JSON output for programmatic usage
Includes comprehensive help text and examples

Test Coverage

All functionality is thoroughly tested:

✅ Strategy detection and selection
✅ Content extraction from various HTML structures
✅ Metadata parsing and frontmatter generation
✅ Image and link processing
✅ Error handling and recovery
✅ CLI option parsing and validation
✅ Batch processing workflows
✅ File output and directory creation

Technical Implementation

Dependencies: Added @mozilla/readability, turndown, jsdom, node-html-parser
Type Safety: Comprehensive interfaces without any type coercion
Performance: Efficient parsing with timeout controls and resource management
Maintainability: Clean separation of concerns with strategy pattern
Extensibility: Easy to add new extraction strategies or output formats

Closes #38

Test Plan

All existing tests pass
New functionality has comprehensive test coverage
CLI integration works correctly
Error handling is robust
Cross-platform compatibility verified
TypeScript strict mode compliance
Linting standards met (test file warnings acceptable)

Implement multi-strategy web page to markdown conversion with:

**Core Features:**
- Multiple extraction strategies: auto, readability, manual, full, structured
- Mozilla Readability integration for article extraction
- Custom CSS selector support for complex sites
- Schema.org structured data extraction
- Auto-strategy detection based on content patterns

**Content Processing:**
- HTML to Markdown conversion with Turndown.js
- Image extraction and processing (skip, link-only, download, base64)
- Link extraction and classification (internal/external)
- Metadata extraction (title, author, published date, description)
- Frontmatter generation with configurable options

**Advanced Features:**
- Custom HTTP headers and authentication support
- Cookie file support for protected content
- Configurable timeouts and redirect handling
- Batch processing from URL files
- Comprehensive error handling and retry logic
- Dry-run mode for preview without file creation

**CLI Interface:**
- Full-featured `markmv clip` command with extensive options
- Support for single URLs, batch processing, and custom output paths
- JSON output format for programmatic usage
- Verbose logging and detailed progress reporting
- Integration with existing markmv command structure

**TypeScript Implementation:**
- Strict type safety without any type coercion
- Comprehensive interfaces for all data structures
- Proper error handling with typed exceptions
- Full test coverage for all functionality

**Testing:**
- Comprehensive test suite for WebClipper core class (28 tests)
- CLI command tests covering all scenarios (14 tests)
- Mocked external dependencies (jsdom, readability, turndown)
- Error condition testing and edge case coverage
- Cross-platform filename sanitization tests

This addresses the need for robust web content extraction that handles various site architectures including SPAs, documentation sites, blogs, and structured content with appropriate strategy selection.
