DataScrapexter is a universal web scraper built with Go that combines high performance, intelligent anti-detection mechanisms, and configuration-driven operation to enable seamless data extraction from any website structure.
- Universal Compatibility: Scrape any website type - e-commerce, news, directories, social media
- Advanced Anti-Detection: Sophisticated evasion techniques including proxy rotation, browser fingerprinting, and CAPTCHA solving
- Configuration-Driven: No-code setup through YAML configuration files
- High Performance: Go's concurrency model for processing 10,000+ pages per hour
- JavaScript Support: Headless browser automation for dynamic content
- Legal Compliance: Built-in ethical scraping and legal compliance features
- Multiple Output Formats: JSON, CSV, Excel, databases, and cloud storage
- Real-time Monitoring: Comprehensive metrics and health monitoring
Download the latest binary for your platform from the releases page, or use one of the methods below:

```bash
# Install with Go
go install github.com/valpere/DataScrapexter/cmd/datascrapexter@latest

# Run with Docker
docker pull ghcr.io/valpere/datascrapexter:latest
docker run -v $(pwd)/configs:/app/configs ghcr.io/valpere/datascrapexter:latest run /app/configs/example.yaml

# Add as a library dependency
go get github.com/valpere/DataScrapexter
```
```yaml
# example.yaml
name: "example_scraper"
base_url: "https://example.com"

# Anti-detection settings
user_agents:
  - "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
rate_limit: "2s"

# Data extraction rules
fields:
  - name: "title"
    selector: "h1"
    type: "text"
  - name: "price"
    selector: ".price"
    type: "text"
  - name: "description"
    selector: ".description"
    type: "text"

# Output configuration
output:
  format: "json"
  file: "output.json"
```
```bash
# Validate configuration
datascrapexter validate example.yaml

# Run scraper
datascrapexter run example.yaml

# Generate template
datascrapexter template > new_config.yaml

# Inspect results
cat output.json
```
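DataScrapexter's CSS-selector extraction builds on goquery (see Acknowledgments). Conceptually, the `title` field above reduces to a query like this minimal sketch:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://example.com")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the response body into a queryable document.
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		panic(err)
	}

	// Equivalent of the config's selector "h1" with type "text".
	fmt.Println(doc.Find("h1").First().Text())
}
```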
- Installation Guide - Detailed installation instructions
- User Guide - Comprehensive guide for using DataScrapexter
- Quick Start Tutorial - Get scraping in under 5 minutes
- Configuration Reference - Complete configuration options
- Example: E-commerce Scraper - Build a price monitoring system
- Example: News Scraper - Extract articles and metadata
- Example: Job Board Scraper - Collect job listings
- Example: Real Estate Scraper - Property listing extraction
- API Documentation - Go package API reference
- Code Documentation - Internal architecture and design
- CLI Reference - Command-line interface documentation
- Contributing Guide - How to contribute to the project
- Troubleshooting Guide - Common issues and solutions
- FAQ - Frequently asked questions
- GitHub Discussions - Community support
DataScrapexter uses YAML configuration files to define scraping behavior. Here's a comprehensive example:
name: "advanced_scraper"
base_url: "https://target-site.com"
# Browser automation
browser:
enabled: true
headless: true
user_data_dir: "/tmp/datascrapexter"
# Anti-detection mechanisms
anti_detection:
proxy:
enabled: true
rotation: "random"
providers:
- "http://proxy1:8080"
- "http://proxy2:8080"
captcha:
solver: "2captcha"
api_key: "${CAPTCHA_API_KEY}"
fingerprinting:
randomize_viewport: true
spoof_canvas: true
rotate_user_agents: true
# Rate limiting
rate_limit:
requests_per_second: 2
burst: 5
adaptive: true
# Data extraction
fields:
- name: "product_name"
selector: "h1.product-title"
type: "text"
required: true
- name: "price"
selector: ".price-current"
type: "text"
transform:
- type: "regex"
pattern: "\\$([0-9,]+\\.?[0-9]*)"
replacement: "$1"
- type: "parse_float"
# Pagination
pagination:
type: "next_button"
selector: ".pagination .next"
max_pages: 100
# Output options
output:
format: "json"
file: "products.json"
database:
type: "postgresql"
url: "${DATABASE_URL}"
table: "products"
```go
package main

import (
	"context"
	"log"

	"github.com/valpere/DataScrapexter/pkg/config"
	"github.com/valpere/DataScrapexter/pkg/scraper"
)

func main() {
	// Load configuration
	cfg, err := config.LoadFromFile("config.yaml")
	if err != nil {
		log.Fatal(err)
	}

	// Create scraper instance
	s, err := scraper.New(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Run scraping
	ctx := context.Background()
	results, err := s.Scrape(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Process results
	for _, result := range results {
		log.Printf("Extracted: %+v", result.Data)
	}
}
```
```yaml
version: '3.8'
services:
  datascrapexter:
    image: ghcr.io/valpere/datascrapexter:latest
    volumes:
      - ./configs:/app/configs
      - ./output:/app/output
    environment:
      - CAPTCHA_API_KEY=${CAPTCHA_API_KEY}
      - DATABASE_URL=${DATABASE_URL}
    command: run /app/configs/production.yaml

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"

  postgres:
    image: postgres:13
    environment:
      POSTGRES_DB: datascrapexter
      POSTGRES_USER: scraper
      POSTGRES_PASSWORD: password
    ports:
      - "5432:5432"
```
DataScrapexter includes built-in compliance features:
- Robots.txt Respect: Automatic robots.txt parsing and compliance
- Rate Limiting: Configurable delays to avoid server overload
- User-Agent Identification: Transparent identification in requests
- Data Privacy: GDPR-compliant data handling and anonymization
- Terms of Service: Monitoring and compliance checking
- Always check and respect robots.txt (see the sketch after this list)
- Implement reasonable rate limiting
- Avoid scraping personal data without consent
- Review website terms of service
- Use scraped data responsibly and legally
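Robots.txt can be checked programmatically before any request is issued. A standalone sketch using the third-party github.com/temoto/robotstxt parser (DataScrapexter does this automatically; the `allowed` helper below is hypothetical):

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/temoto/robotstxt"
)

// allowed reports whether the given agent may fetch path on site.
func allowed(site, path, agent string) (bool, error) {
	resp, err := http.Get(site + "/robots.txt")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		return false, err
	}
	return robots.TestAgent(path, agent), nil
}

func main() {
	ok, err := allowed("https://example.com", "/products", "DataScrapexter")
	if err != nil {
		panic(err)
	}
	fmt.Println("allowed:", ok)
}
```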
We welcome contributions! Please see our Contributing Guidelines for details.
```bash
# Clone repository
git clone https://github.com/valpere/DataScrapexter.git
cd DataScrapexter

# Install dependencies
go mod download

# Run tests
go test ./...

# Build
go build -o bin/datascrapexter cmd/datascrapexter/main.go

# Run locally
./bin/datascrapexter run examples/basic.yaml
```
```
DataScrapexter/
├── cmd/         # Command-line applications
├── internal/    # Private application code
├── pkg/         # Public API packages
├── configs/     # Configuration templates
├── docs/        # Documentation
├── examples/    # Usage examples
├── scripts/     # Build and deployment scripts
└── test/        # Integration tests
```
DataScrapexter is designed for high performance:
- Throughput: 10,000+ pages per hour per instance
- Concurrency: 1,000+ concurrent goroutines
- Memory Usage: <100MB for typical workloads
- Scalability: Horizontal scaling with distributed processing
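Go's concurrency model is what makes these numbers attainable: pages are fetched by a pool of goroutines draining a shared channel. A simplified, self-contained sketch of the pattern (illustrative only, not DataScrapexter's actual engine):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// fetchAll downloads the given URLs using a fixed-size worker pool.
func fetchAll(urls []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				resp, err := http.Get(url)
				if err != nil {
					fmt.Println("error:", err)
					continue
				}
				resp.Body.Close()
				fmt.Println(url, resp.Status)
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // no more work; workers exit their range loops
	wg.Wait()
}

func main() {
	fetchAll([]string{"https://example.com", "https://example.org"}, 2)
}
```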
```bash
# Run performance benchmarks
go test -bench=. ./internal/scraper
go test -bench=. ./internal/antidetect
```

Example results:

```
BenchmarkHTTPClient-8      1000000    1200 ns/op    320 B/op    4 allocs/op
BenchmarkHTMLParser-8       500000    2400 ns/op    512 B/op    8 allocs/op
BenchmarkProxyRotation-8   2000000     800 ns/op    128 B/op    2 allocs/op
```
DataScrapexter includes pre-built templates for common use cases:
```bash
datascrapexter template --type ecommerce > ecommerce.yaml
datascrapexter template --type news > news.yaml
datascrapexter template --type jobs > jobs.yaml
datascrapexter template --type social > social.yaml
```
```bash
# Pull latest image
docker pull ghcr.io/valpere/datascrapexter:latest

# Run with local config
docker run -v $(pwd)/config.yaml:/app/config.yaml \
  -v $(pwd)/output:/app/output \
  ghcr.io/valpere/datascrapexter:latest run /app/config.yaml

# Build production image
docker build -t datascrapexter:prod .

# Run with environment variables
docker run -e CAPTCHA_API_KEY=$CAPTCHA_KEY \
  -e DATABASE_URL=$DB_URL \
  -v $(pwd)/configs:/app/configs \
  datascrapexter:prod run /app/configs/production.yaml
```
DataScrapexter exposes Prometheus metrics at `/metrics`:

```bash
# Start with metrics enabled
datascrapexter serve --metrics-port 9090

# View metrics
curl http://localhost:9090/metrics
```
- `datascrapexter_requests_total` - Total HTTP requests made
- `datascrapexter_requests_duration_seconds` - Request duration histogram
- `datascrapexter_extraction_success_rate` - Data extraction success rate
- `datascrapexter_proxy_health` - Proxy health status
- `datascrapexter_captcha_solve_rate` - CAPTCHA solving success rate
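For readers unfamiliar with the format, this is how such metrics are defined and exposed in Go with prometheus/client_golang (a generic sketch, not DataScrapexter's instrumentation code):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "datascrapexter_requests_total",
	Help: "Total HTTP requests made",
})

func main() {
	prometheus.MustRegister(requestsTotal)

	// Incremented once per scraped request in real code.
	requestsTotal.Inc()

	// Serve the /metrics endpoint scraped by Prometheus.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```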
```bash
# Health check endpoint
curl http://localhost:8080/health

# Readiness check
curl http://localhost:8080/ready
```
Run DataScrapexter as a web service:
```bash
# Start API server
datascrapexter serve --port 8080

# Submit scraping job via API
curl -X POST http://localhost:8080/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test_job",
    "config": {
      "base_url": "https://example.com",
      "fields": [
        {"name": "title", "selector": "h1", "type": "text"}
      ]
    }
  }'

# Check job status
curl http://localhost:8080/api/v1/jobs/{job_id}

# Get results
curl http://localhost:8080/api/v1/jobs/{job_id}/results
```
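The same job submission works from Go with only the standard library; a minimal client sketch built from the endpoints shown above (the payload shape is taken from the curl example):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	payload := []byte(`{
		"name": "test_job",
		"config": {
			"base_url": "https://example.com",
			"fields": [{"name": "title", "selector": "h1", "type": "text"}]
		}
	}`)

	// Submit the job to the local API server.
	resp, err := http.Post("http://localhost:8080/api/v1/jobs",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body)) // response carries the job ID
}
```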
Supported proxy providers:

- Bright Data (formerly Luminati)
- Oxylabs
- Smartproxy
- ProxyMesh
- Custom proxy lists

Supported CAPTCHA solvers:

- 2Captcha
- Anti-Captcha
- CapMonster
- DeathByCaptcha

Supported cloud storage:

- AWS S3
- Google Cloud Storage
- Azure Blob Storage
- MinIO

Supported databases:

- PostgreSQL
- MySQL
- MongoDB
- Redis
- InfluxDB

Supported message queues:

- Apache Kafka
- RabbitMQ
- Redis Streams
- AWS SQS
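As one concrete example of these integrations, the PostgreSQL output from the advanced configuration maps to ordinary SQL inserts. A minimal sketch with database/sql and the lib/pq driver (the table and columns are hypothetical):

```go
package main

import (
	"database/sql"
	"os"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

func main() {
	// DATABASE_URL as referenced in the configuration examples.
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Insert one extracted record (hypothetical schema).
	_, err = db.Exec(
		`INSERT INTO products (product_name, price) VALUES ($1, $2)`,
		"Example Widget", 1299.99,
	)
	if err != nil {
		panic(err)
	}
}
```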
```bash
# Check proxy health
datascrapexter proxy-test --config config.yaml

# Validate configuration
datascrapexter validate config.yaml --strict

# Enable debug logging
datascrapexter run config.yaml --log-level debug

# Enable memory profiling
datascrapexter run config.yaml --pprof-port 6060

# Profile memory usage
go tool pprof http://localhost:6060/debug/pprof/heap

# Test CAPTCHA solver
datascrapexter captcha-test --solver 2captcha --api-key YOUR_KEY

# Check CAPTCHA solve rates
curl http://localhost:9090/metrics | grep captcha_solve_rate

# Run with maximum verbosity
datascrapexter run config.yaml \
  --log-level debug \
  --log-format json \
  --pprof-port 6060 \
  --metrics-port 9090
```
name: "product_scraper"
base_url: "https://shop.example.com"
fields:
- name: "product_name"
selector: "h1.product-title"
type: "text"
- name: "price"
selector: ".price"
type: "text"
- name: "in_stock"
selector: ".stock-status"
type: "text"
pagination:
type: "next_button"
selector: ".pagination-next"
max_pages: 50
output:
format: "csv"
file: "products.csv"
name: "news_scraper"
base_url: "https://news.example.com"
fields:
- name: "headline"
selector: "h1.article-headline"
type: "text"
- name: "author"
selector: ".article-author"
type: "text"
- name: "publish_date"
selector: ".publish-date"
type: "text"
- name: "content"
selector: ".article-content"
type: "text"
navigation:
follow_links:
- selector: ".article-link"
max_depth: 2
output:
format: "json"
file: "articles.json"
name: "spa_scraper"
base_url: "https://spa.example.com"
browser:
enabled: true
headless: true
wait_for_element: ".dynamic-content"
timeout: 30s
fields:
- name: "dynamic_data"
selector: ".ajax-loaded"
type: "text"
wait_for_element: true
anti_detection:
fingerprinting:
randomize_viewport: true
spoof_canvas: true
disable_webdriver_flags: true
output:
format: "json"
file: "spa_data.json"
Roadmap highlights:

- Basic HTTP scraping engine
- Configuration-driven setup
- CLI interface
- Anti-detection basics
- JavaScript rendering support
- Advanced proxy management
- Web dashboard
- Template marketplace
- Distributed processing
- Advanced monitoring
- Enterprise features
- API server mode
- ML-powered adaptation
- Intelligent content detection
- Predictive scaling
- Auto-configuration
- Multi-tenant architecture
- Advanced compliance
- Professional services
- Enterprise integrations
This project is licensed under the MIT License - see the LICENSE file for details.
- Colly - Elegant scraper framework for Go
- Goquery - jQuery-like HTML parsing
- chromedp - Chrome DevTools Protocol
- Cobra - CLI library for Go
- Viper - Configuration management
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
- Discord: DataScrapexter Community
Made with ❤️ by the DataScrapexter team