
DataScrapexter


DataScrapexter is a universal web scraper built with Go that combines high performance, intelligent anti-detection mechanisms, and configuration-driven operation to enable seamless data extraction from any website structure.

Status: work in progress.

🚀 Features

  • Universal Compatibility: Scrape any website type - e-commerce, news, directories, social media
  • Advanced Anti-Detection: Sophisticated evasion techniques including proxy rotation, browser fingerprinting, and CAPTCHA solving
  • Configuration-Driven: No-code setup through YAML configuration files
  • High Performance: Go's concurrency model for processing 10,000+ pages per hour
  • JavaScript Support: Headless browser automation for dynamic content
  • Legal Compliance: Built-in ethical scraping and legal compliance features
  • Multiple Output Formats: JSON, CSV, Excel, databases, and cloud storage
  • Real-time Monitoring: Comprehensive metrics and health monitoring

📦 Installation

Binary Releases

Download the latest binary for your platform from the releases page.

From Source

go install github.com/valpere/DataScrapexter/cmd/datascrapexter@latest

Docker

docker pull ghcr.io/valpere/datascrapexter:latest
docker run -v $(pwd)/configs:/app/configs ghcr.io/valpere/datascrapexter:latest run /app/configs/example.yaml

Go Module

go get github.com/valpere/DataScrapexter

๐Ÿƒโ€โ™‚๏ธ Quick Start

1. Create a Configuration File

# example.yaml
name: "example_scraper"
base_url: "https://example.com"

# Anti-detection settings
user_agents:
  - "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
rate_limit: "2s"

# Data extraction rules
fields:
  - name: "title"
    selector: "h1"
    type: "text"
  - name: "price"
    selector: ".price"
    type: "text"
  - name: "description"
    selector: ".description"
    type: "text"

# Output configuration
output:
  format: "json"
  file: "output.json"

2. Run the Scraper

# Validate configuration
datascrapexter validate example.yaml

# Run scraper
datascrapexter run example.yaml

# Generate template
datascrapexter template > new_config.yaml

3. View Results

cat output.json

📖 Documentation

  • Getting Started
  • Configuration & Examples
  • Developer Resources
  • Support

🔧 Configuration

DataScrapexter uses YAML configuration files to define scraping behavior. Here's a comprehensive example:

name: "advanced_scraper"
base_url: "https://target-site.com"

# Browser automation
browser:
  enabled: true
  headless: true
  user_data_dir: "/tmp/datascrapexter"

# Anti-detection mechanisms
anti_detection:
  proxy:
    enabled: true
    rotation: "random"
    providers:
      - "http://proxy1:8080"
      - "http://proxy2:8080"
  
  captcha:
    solver: "2captcha"
    api_key: "${CAPTCHA_API_KEY}"
  
  fingerprinting:
    randomize_viewport: true
    spoof_canvas: true
    rotate_user_agents: true

# Rate limiting
rate_limit:
  requests_per_second: 2
  burst: 5
  adaptive: true

# Data extraction
fields:
  - name: "product_name"
    selector: "h1.product-title"
    type: "text"
    required: true
  
  - name: "price"
    selector: ".price-current"
    type: "text"
    transform:
      - type: "regex"
        pattern: "\\$([0-9,]+\\.?[0-9]*)"
        replacement: "$1"
      - type: "parse_float"

# Pagination
pagination:
  type: "next_button"
  selector: ".pagination .next"
  max_pages: 100

# Output options
output:
  format: "json"
  file: "products.json"
  database:
    type: "postgresql"
    url: "${DATABASE_URL}"
    table: "products"
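
The transform chain on the price field above (a regex capture followed by parse_float) is worth unpacking. The following is a rough sketch of what that pipeline does to a scraped price string, written against assumed semantics of the two transform types rather than DataScrapexter's actual internals:

package main

import (
    "fmt"
    "log"
    "regexp"
    "strconv"
    "strings"
)

// applyTransforms mirrors the price field's transform chain above:
// a regex capture (replacement "$1") followed by parse_float.
func applyTransforms(raw string) (float64, error) {
    // Step 1: regex transform — extract the numeric part of the price.
    re := regexp.MustCompile(`\$([0-9,]+\.?[0-9]*)`)
    m := re.FindStringSubmatch(raw)
    if m == nil {
        return 0, fmt.Errorf("no price found in %q", raw)
    }
    // Step 2: parse_float — drop thousands separators, then parse.
    return strconv.ParseFloat(strings.ReplaceAll(m[1], ",", ""), 64)
}

func main() {
    price, err := applyTransforms("Now only $1,299.99!")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(price) // 1299.99
}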

🛠️ Advanced Usage

Programmatic Usage

package main

import (
    "context"
    "log"
    
    "github.com/valpere/DataScrapexter/pkg/scraper"
    "github.com/valpere/DataScrapexter/pkg/config"
)

func main() {
    // Load configuration
    cfg, err := config.LoadFromFile("config.yaml")
    if err != nil {
        log.Fatal(err)
    }
    
    // Create scraper instance
    s, err := scraper.New(cfg)
    if err != nil {
        log.Fatal(err)
    }
    
    // Run scraping
    ctx := context.Background()
    results, err := s.Scrape(ctx)
    if err != nil {
        log.Fatal(err)
    }
    
    // Process results
    for _, result := range results {
        log.Printf("Extracted: %+v", result.Data)
    }
}

Docker Compose

version: '3.8'
services:
  datascrapexter:
    image: ghcr.io/valpere/datascrapexter:latest
    volumes:
      - ./configs:/app/configs
      - ./output:/app/output
    environment:
      - CAPTCHA_API_KEY=${CAPTCHA_API_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: run /app/configs/production.yaml
    
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
      
  postgres:
    image: postgres:13
    environment:
      POSTGRES_DB: datascrapexter
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD: password
    ports:
      - "5432:5432"

🔒 Legal & Ethical Usage

DataScrapexter includes built-in compliance features:

  • Robots.txt Respect: Automatic robots.txt parsing and compliance (see the sketch after this list)
  • Rate Limiting: Configurable delays to avoid server overload
  • User-Agent Identification: Transparent identification in requests
  • Data Privacy: GDPR-compliant data handling and anonymization
  • Terms of Service: Monitoring and compliance checking
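
As a concrete illustration of the robots.txt point, the check typically reduces to a few lines of Go. This sketch uses the third-party github.com/temoto/robotstxt parser; the source does not say which parser DataScrapexter uses, so treat the library choice as an assumption:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"

    "github.com/temoto/robotstxt"
)

func main() {
    // Fetch the site's robots.txt before scraping anything.
    resp, err := http.Get("https://example.com/robots.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    data, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    robots, err := robotstxt.FromBytes(data)
    if err != nil {
        log.Fatal(err)
    }

    // Ask whether our user agent may fetch a given path.
    group := robots.FindGroup("DataScrapexter")
    fmt.Println("allowed:", group.Test("/products/page/1"))
}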

Best Practices

  1. Always check and respect robots.txt
  2. Implement reasonable rate limiting (see the limiter sketch after this list)
  3. Avoid scraping personal data without consent
  4. Review website terms of service
  5. Use scraped data responsibly and legally
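
For practice 2, a minimal client-side limiter in Go is easy to build on golang.org/x/time/rate. The values mirror the rate_limit block shown earlier (requests_per_second: 2, burst: 5); this is an illustrative sketch, not the project's implementation:

package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // 2 requests/second with a burst of 5, matching the rate_limit
    // block in the configuration section above.
    limiter := rate.NewLimiter(rate.Limit(2), 5)

    ctx := context.Background()
    for i := 0; i < 3; i++ {
        // Wait blocks until the next request is allowed.
        if err := limiter.Wait(ctx); err != nil {
            return
        }
        fmt.Println("request", i, "at", time.Now().Format("15:04:05.000"))
    }
}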

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Clone repository
git clone https://github.com/valpere/DataScrapexter.git
cd DataScrapexter

# Install dependencies
go mod download

# Run tests
go test ./...

# Build
go build -o bin/datascrapexter cmd/datascrapexter/main.go

# Run locally
./bin/datascrapexter run examples/basic.yaml

Project Structure

DataScrapexter/
├── cmd/                    # Command-line applications
├── internal/               # Private application code
├── pkg/                    # Public API packages
├── configs/                # Configuration templates
├── docs/                   # Documentation
├── examples/               # Usage examples
├── scripts/                # Build and deployment scripts
└── test/                   # Integration tests

📊 Performance

DataScrapexter is designed for high performance:

  • Throughput: 10,000+ pages per hour per instance
  • Concurrency: 1,000+ concurrent goroutines (see the worker-pool sketch after this list)
  • Memory Usage: <100MB for typical workloads
  • Scalability: Horizontal scaling with distributed processing
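
Those throughput and concurrency numbers rest on Go's cheap goroutines combined with a bounded worker pool. A minimal sketch of that pattern, illustrative only and not DataScrapexter's actual pipeline:

package main

import (
    "fmt"
    "sync"
)

// fetch stands in for a real page download; here it just echoes the URL.
func fetch(url string) string { return "fetched " + url }

func main() {
    urls := []string{"https://example.com/1", "https://example.com/2", "https://example.com/3"}

    jobs := make(chan string)
    results := make(chan string)

    // A fixed pool of workers bounds memory while goroutines stay cheap.
    const workers = 2
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for u := range jobs {
                results <- fetch(u)
            }
        }()
    }

    // Close results once every worker has drained the job queue.
    go func() {
        wg.Wait()
        close(results)
    }()

    go func() {
        for _, u := range urls {
            jobs <- u
        }
        close(jobs)
    }()

    for r := range results {
        fmt.Println(r)
    }
}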

Benchmarks

# Run performance benchmarks
go test -bench=. ./internal/scraper
go test -bench=. ./internal/antidetect

# Example results:
BenchmarkHTTPClient-8           1000000    1200 ns/op     320 B/op    4 allocs/op
BenchmarkHTMLParser-8            500000    2400 ns/op     512 B/op    8 allocs/op
BenchmarkProxyRotation-8        2000000     800 ns/op     128 B/op    2 allocs/op

🔧 Configuration Templates

DataScrapexter includes pre-built templates for common use cases:

E-commerce Sites

datascrapexter template --type ecommerce > ecommerce.yaml

News Websites

datascrapexter template --type news > news.yaml

Job Boards

datascrapexter template --type jobs > jobs.yaml

Social Media

datascrapexter template --type social > social.yaml

๐Ÿณ Docker Usage

Basic Usage

# Pull latest image
docker pull ghcr.io/valpere/datascrapexter:latest

# Run with local config
docker run -v $(pwd)/config.yaml:/app/config.yaml \
           -v $(pwd)/output:/app/output \
           ghcr.io/valpere/datascrapexter:latest run /app/config.yaml

Production Deployment

# Build production image
docker build -t datascrapexter:prod .

# Run with environment variables
docker run -e CAPTCHA_API_KEY=$CAPTCHA_KEY \
           -e DATABASE_URL=$DB_URL \
           -v $(pwd)/configs:/app/configs \
           datascrapexter:prod run /app/configs/production.yaml

📈 Monitoring & Observability

Metrics Endpoint

DataScrapexter exposes Prometheus metrics at /metrics:

# Start with metrics enabled
datascrapexter serve --metrics-port 9090

# View metrics
curl http://localhost:9090/metrics

Key Metrics

  • datascrapexter_requests_total - Total HTTP requests made (see the registration sketch after this list)
  • datascrapexter_requests_duration_seconds - Request duration histogram
  • datascrapexter_extraction_success_rate - Data extraction success rate
  • datascrapexter_proxy_health - Proxy health status
  • datascrapexter_captcha_solve_rate - CAPTCHA solving success rate
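
For reference, metrics with these names are typically registered and served via github.com/prometheus/client_golang; the sketch below wires up the first counter in the list and assumes that library, which the source does not name:

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "datascrapexter_requests_total",
        Help: "Total HTTP requests made by the scraper.",
    },
    []string{"status"},
)

func main() {
    prometheus.MustRegister(requestsTotal)

    // Record one successful request, then serve /metrics on :9090.
    requestsTotal.WithLabelValues("200").Inc()

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}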

Health Checks

# Health check endpoint
curl http://localhost:8080/health

# Readiness check
curl http://localhost:8080/ready

๐ŸŒ API Server Mode

Run DataScrapexter as a web service:

# Start API server
datascrapexter serve --port 8080

# Submit scraping job via API
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test_job",
    "config": {
      "base_url": "https://example.com",
      "fields": [
        {"name": "title", "selector": "h1", "type": "text"}
      ]
    }
  }'

# Check job status
curl http://localhost:8080/api/v1/jobs/{job_id}

# Get results
curl http://localhost:8080/api/v1/jobs/{job_id}/results
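
The same job submission is straightforward from Go with only the standard library. The endpoint and payload mirror the curl call above; the printed response is whatever shape the server returns:

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    payload := []byte(`{
        "name": "test_job",
        "config": {
            "base_url": "https://example.com",
            "fields": [{"name": "title", "selector": "h1", "type": "text"}]
        }
    }`)

    // POST the job definition to the running API server.
    resp, err := http.Post(
        "http://localhost:8080/api/v1/jobs",
        "application/json",
        bytes.NewReader(payload),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, string(body))
}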

🔌 Integrations

Proxy Providers

  • Bright Data (formerly Luminati)
  • Oxylabs
  • Smartproxy
  • ProxyMesh
  • Custom proxy lists

CAPTCHA Solvers

  • 2Captcha
  • Anti-Captcha
  • CapMonster
  • DeathByCaptcha

Cloud Storage

  • AWS S3
  • Google Cloud Storage
  • Azure Blob Storage
  • MinIO

Databases

  • PostgreSQL
  • MySQL
  • MongoDB
  • Redis
  • InfluxDB

Message Queues

  • Apache Kafka
  • RabbitMQ
  • Redis Streams
  • AWS SQS

🚨 Troubleshooting

Common Issues

High Error Rates

# Check proxy health
datascrapexter proxy-test --config config.yaml

# Validate configuration
datascrapexter validate config.yaml --strict

# Enable debug logging
datascrapexter run config.yaml --log-level debug

Memory Usage

# Profile memory usage
go tool pprof http://localhost:6060/debug/pprof/heap

# Enable memory profiling
datascrapexter run config.yaml --pprof-port 6060
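
The --pprof-port flag presumably exposes Go's standard net/http/pprof handlers; in plain Go, that wiring looks like this minimal sketch:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
    // Serve pprof on localhost:6060; `go tool pprof
    // http://localhost:6060/debug/pprof/heap` then works as shown above.
    log.Fatal(http.ListenAndServe("localhost:6060", nil))
}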

CAPTCHA Issues

# Test CAPTCHA solver
datascrapexter captcha-test --solver 2captcha --api-key YOUR_KEY

# Check CAPTCHA solve rates
curl http://localhost:9090/metrics | grep captcha_solve_rate

Debug Mode

# Run with maximum verbosity
datascrapexter run config.yaml \
  --log-level debug \
  --log-format json \
  --pprof-port 6060 \
  --metrics-port 9090

📚 Examples

Basic E-commerce Scraping

name: "product_scraper"
base_url: "https://shop.example.com"

fields:
  - name: "product_name"
    selector: "h1.product-title"
    type: "text"
  - name: "price"
    selector: ".price"
    type: "text"
  - name: "in_stock"
    selector: ".stock-status"
    type: "text"

pagination:
  type: "next_button"
  selector: ".pagination-next"
  max_pages: 50

output:
  format: "csv"
  file: "products.csv"

News Article Extraction

name: "news_scraper"
base_url: "https://news.example.com"

fields:
  - name: "headline"
    selector: "h1.article-headline"
    type: "text"
  - name: "author"
    selector: ".article-author"
    type: "text"
  - name: "publish_date"
    selector: ".publish-date"
    type: "text"
  - name: "content"
    selector: ".article-content"
    type: "text"

navigation:
  follow_links:
    - selector: ".article-link"
      max_depth: 2

output:
  format: "json"
  file: "articles.json"

JavaScript-Heavy Site

name: "spa_scraper"
base_url: "https://spa.example.com"

browser:
  enabled: true
  headless: true
  wait_for_element: ".dynamic-content"
  timeout: "30s"

fields:
  - name: "dynamic_data"
    selector: ".ajax-loaded"
    type: "text"
    wait_for_element: true

anti_detection:
  fingerprinting:
    randomize_viewport: true
    spoof_canvas: true
    disable_webdriver_flags: true

output:
  format: "json"
  file: "spa_data.json"

🔮 Roadmap

v0.1.0 - MVP (Current)

  • Basic HTTP scraping engine
  • Configuration-driven setup
  • CLI interface
  • Anti-detection basics

v0.5.0 - Enhanced

  • JavaScript rendering support
  • Advanced proxy management
  • Web dashboard
  • Template marketplace

v1.0.0 - Professional

  • Distributed processing
  • Advanced monitoring
  • Enterprise features
  • API server mode

v1.5.0 - AI-Enhanced

  • ML-powered adaptation
  • Intelligent content detection
  • Predictive scaling
  • Auto-configuration

v2.0.0 - Enterprise

  • Multi-tenant architecture
  • Advanced compliance
  • Professional services
  • Enterprise integrations

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Colly - Elegant scraper framework for Go
  • Goquery - jQuery-like HTML parsing
  • chromedp - Chrome DevTools Protocol
  • Cobra - CLI library for Go
  • Viper - Configuration management

📞 Support



Made with ❤️ by the DataScrapexter team
