volk6022/Atomic-Scraper-Service

Atomic Scraper Service

High-throughput atomic scraping and stateful interactive browser sessions with LLM orchestration.

Features

  • Stateless Scraper: Fast atomic scraping and Google search transformation (Serper-compatible).
  • Stateful Sessions: Interactive browser sessions via DSL over WebSockets with Taskiq Actors.
  • AI Integration: Omni-Parser for UI grounding (SoM approach) and local HTML→Markdown conversion (/html-to-md).
  • Resource Management: Inactivity timeout for stateful sessions via SESSION_INACTIVITY_TIMEOUT (default 600 s; dev docker-compose.override.yml sets 1800 s).
  • Modular Design: Clean architecture with layers for API, Domain, Infrastructure, and Actions.
  • Docker Production Ready: Dockerfile with Playwright, docker-compose with api/worker/redis, health endpoint.
  • Anti-Bot Evasion: Stealth browser pool with User-Agent rotation, proxy integration, human-like interactions.
  • Yandex Maps Extraction: Extract structured business data (name, address, phone, website, geo coordinates).
  • Site Content Enrichment: Extract clean text from company websites with optional about/services page crawling.
  • Research Agent: Autonomous AI research as a flat tool-calling loop (chat.completions with tool_choice="auto"). Supports two output modes (free-form markdown or caller-supplied JSON Schema) and a critic gate on submit. The agent is generic: the service contains no domain logic.
  • Per-Domain Rate Limiting: Redis-based token bucket (30/hour for *.yandex.*, 1000/hour fallback).

Tech Stack

  • Language: Python 3.11+
  • Framework: FastAPI
  • Async Logic: Playwright, Taskiq (Redis Broker)
  • AI Tools: Flexible OpenAI-compatible configuration (LM Studio, OpenAI, etc.) with two logical endpoints (EXTRACTION_* and ORCHESTRATION_*), Omni-Parser, optional Jina/Reader-LM as the extraction model.
  • Infrastructure: Redis (Pub/Sub, Taskiq broker and KV store), SearXNG (search backend), Docker

Detailed directory layout and layer responsibilities are documented in STRUCTURE.md.


Quickstart

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose
  • AI Providers: LM Studio (local), OpenAI (cloud), or any OpenAI-compatible API.

Installation (local dev)

# 1. Install dependencies
uv sync

# 2. Install Playwright browsers (must use uv run — playwright lives in the venv)
uv run playwright install --with-deps chromium

# 3. Copy and edit environment config
cp .env.example .env

Key .env values:

API_KEY=default_internal_key          # internal auth header value

EXTRACTION_API_BASE=http://localhost:1234/v1
EXTRACTION_API_KEY=lm-studio
EXTRACTION_MODEL_NAME=jina-reader-lm

ORCHESTRATION_API_BASE=http://localhost:20022/v1/   # local llama.cpp / vLLM / LM Studio
ORCHESTRATION_API_KEY=lm-studio
ORCHESTRATION_MODEL_NAME=qwen3.5-9b-claude-4.6-opus-reasoning-distilled

REDIS_URL=redis://localhost:16379      # Taskiq broker + pub/sub + KV; docker-compose maps host 16379 → container 6379
SESSION_INACTIVITY_TIMEOUT=600         # seconds; dev override sets 1800

# Optional — Research Agent tuning (defaults match production v2.1):
# RESEARCH_COMPACT_TRIGGER_TOKENS=50000
# RESEARCH_CRITIC_PASS_SCORE=8.5
# RESEARCH_MAX_SUBMIT_REJECTS=2
# RESEARCH_DEFAULT_LANGUAGE=ru
# RESEARCH_PROMPTS_PATH=src/actions/research/research_agent_prompts.yaml

Full settings list lives in src/core/config.py (pydantic-settings); the full optional RESEARCH_* block is documented in .env.example.
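
To illustrate how the values above map onto typed settings, here is a minimal stdlib-only sketch. It is not the project's settings class (that one uses pydantic-settings in src/core/config.py); the names SketchSettings and load_settings are hypothetical, and the REDIS_URL fallback is simply copied from the example .env, not confirmed as the real default.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSettings:
    """Hypothetical stand-in for two fields of the real settings class."""
    redis_url: str
    session_inactivity_timeout: int  # seconds


def load_settings(env=None) -> SketchSettings:
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return SketchSettings(
        # Assumed fallback: the value shown in the example .env above.
        redis_url=env.get("REDIS_URL", "redis://localhost:16379"),
        # 600 s is the documented default; the dev override sets 1800.
        session_inactivity_timeout=int(env.get("SESSION_INACTIVITY_TIMEOUT", "600")),
    )
```

Passing a plain dict instead of os.environ makes the override behaviour easy to check, for example load_settings({"SESSION_INACTIVITY_TIMEOUT": "1800"}).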

Proxy Configuration (optional)

Create proxies.txt in the project root, one proxy per line in http://user:pass@host:port format:

http://user:pass@1.2.3.4:8080
http://user:pass@5.6.7.8:8080

Note for Yandex Maps: Yandex blocks datacenter proxy IPs at the browser level. Use residential proxies (e.g., Bright Data, Oxylabs) to get actual scraping results.
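
The proxies.txt format above can be parsed with the standard library alone. The sketch below is an illustration, not the service's loader: parse_proxy_lines is a hypothetical helper, skipping lines that start with # is an assumption, and the output keys (server, username, password) simply mirror the shape that Playwright's proxy option accepts.

```python
from urllib.parse import urlsplit


def parse_proxy_lines(lines):
    """Turn http://user:pass@host:port lines into a list of proxy dicts."""
    proxies = []
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        parts = urlsplit(line)
        if parts.scheme not in ("http", "https") or not parts.hostname or parts.port is None:
            raise ValueError(f"not a valid proxy URL: {line!r}")
        proxies.append({
            "server": f"{parts.scheme}://{parts.hostname}:{parts.port}",
            "username": parts.username,
            "password": parts.password,
        })
    return proxies
```

For a file on disk, something like parse_proxy_lines(open("proxies.txt", encoding="utf-8")) would work, since the function only needs an iterable of lines.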

Docker Production

# Build and start all services (api, worker, redis)
docker compose up -d

# Check health
curl http://localhost:8000/healthz
# {"status":"healthy", "timestamp":"...", "services":{...}}
# Note: handler always returns 200; the unhealthy/degraded → 503 branch is not implemented today.

proxies.txt is automatically bind-mounted into the container if it exists. Important: create proxies.txt before the first docker compose up. If Docker creates it as a directory (a known Docker Desktop on Windows quirk), run docker compose down && docker compose up -d to remount correctly.

Run API without Docker (PM2)

pm2 start ecosystem.config.js

API Reference

All endpoints except /healthz require the header X-API-Key: <API_KEY>.
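
As a rough client-side illustration of the header requirement, the following sketch builds an authenticated request with the standard library. build_request is a hypothetical helper, and the base URL http://localhost:8000 is taken from the health-check example above.

```python
import json
import urllib.request


def build_request(base_url, path, api_key, payload=None):
    """Build a urllib Request carrying the X-API-Key header.

    A payload makes it a JSON POST; without one it is a plain GET.
    """
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"X-API-Key": api_key}
    if data is not None:
        headers["Content-Type"] = "application/json"
    method = "GET" if data is None else "POST"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )
```

The request can then be sent with urllib.request.urlopen(req) against a running instance. Nothing is sent by build_request itself, so it is safe to call offline.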

Health

GET /healthz
→ 200 {"status": "healthy", "timestamp": "...", "services": {...}}

Probes Redis (ping) and pool_manager. The handler currently always returns 200: the 503/degraded branch documented in the spec is not yet wired (see docs/codebase-report/20-spec-vs-reality.md, C-09).

Stateless Atomic Endpoints

POST /scraper      {"url": "...", ...}                   # Playwright one-shot page fetch
POST /serper       {"q": "...", "num": 10}               # SERP via SearXNG (Serper-compatible shape)
POST /omni-parse   {"base64_image": "...", "prompt": ""} # OmniParser vision call (via LLM facade)
POST /html-to-md   {"html": "...", ...}                  # local HTML → Markdown / text

The historical /jina-extract endpoint from earlier specs is not implemented — it was replaced by /html-to-md (local conversion via content_cleaner).

Yandex Maps Extraction

POST /api/v1/yandex-maps/extract
{
  "category": "restaurants",
  "center": {"lat": 59.934, "lng": 30.306},   // lat ∈ [-90,90], lng ∈ [-180,180]
  "radius": 1000                                // metres, 100–5000
}
→ {"businesses": [...], "total": N, "category": "...", "center": {...}, "radius": N}

Each business card: name, address, and optionally phone, website, geo.
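
The documented bounds on the request (latitude, longitude, radius) can be checked client-side before sending. This is a sketch under that assumption only: build_extract_payload is a hypothetical helper, and the server's own validation may differ in details such as error messages.

```python
def build_extract_payload(category, lat, lng, radius):
    """Validate the documented ranges and return the request body as a dict."""
    if not -90 <= lat <= 90:
        raise ValueError("lat must be within [-90, 90]")
    if not -180 <= lng <= 180:
        raise ValueError("lng must be within [-180, 180]")
    if not 100 <= radius <= 5000:
        raise ValueError("radius must be within 100-5000 metres")
    return {
        "category": category,
        "center": {"lat": lat, "lng": lng},
        "radius": radius,
    }
```

Calling build_extract_payload("restaurants", 59.934, 30.306, 1000) reproduces the example body shown above.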

Site Content Enrichment

POST /api/v1/enrich
{
  "url": "https://example.com",
  "crawl_about": false,
  "crawl_services": false
}
→ {"url": "...", "text": "...", "word_count": N, "truncated": bool, "pages_crawled": [...]}

Content is truncated to ≤ 500 words. Raw HTML is stripped.
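
The 500-word cap can be pictured with a few lines of Python. This is only a sketch of the idea: truncate_words is a hypothetical helper, splitting on whitespace is an assumption about how words are counted, and reporting word_count after truncation is likewise assumed.

```python
def truncate_words(text, max_words=500):
    """Keep at most max_words whitespace-separated words."""
    words = text.split()
    kept = words[:max_words]
    return {
        "text": " ".join(kept),
        "word_count": len(kept),
        "truncated": len(words) > max_words,
    }
```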

Research Agent

POST /api/v1/research/run
{
  "query": "research topic",                  // 3-8000 chars
  "mode": "speed" | "balanced" | "quality",   // default: "balanced"
  "language": "ru" | "en" | ...,              // BCP-47-ish hint; routed into prompts + SearXNG
  "max_iters": 15,                            // optional override of preset.max_turns (1-50)
  "max_tokens": 100000,                       // optional override of preset.token_budget (1k-2M)
  "output_schema": {...}                      // optional JSON Schema → structured-output mode
}
→ 202 {"task_id": "...", "status": "pending", "message": "Research task queued"}

GET /api/v1/research/status/{task_id}
→ {"task_id": "...", "status": "completed"|"running"|"failed", "result": ResearchReport, ...}

GET /api/v1/research/stream/{task_id}
→ text/event-stream with progress events
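
A client consuming the stream endpoint needs to split the text/event-stream body into events. The sketch below follows the generic server-sent events framing (event: and data: fields, blank line as separator, lines starting with a colon as comments). The event names and payload fields this service actually emits are not documented here, so the ones in the usage note are hypothetical.

```python
def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of decoded text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
```

Feeding it the lines "event: progress", 'data: {"turn": 1}' and an empty line yields one ("progress", '{"turn": 1}') tuple, which the caller can pass to json.loads if the payload turns out to be JSON.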

The agent is a flat tool-calling loop (src/actions/research/agent.py:run_research). It exposes three tools to the LLM: web_serp (SearXNG), web_scrape (Playwright via SiteEnrichAction), and a dynamic terminal submit tool. The model decides which tool to call (tool_choice="auto"). A second LLM acts as critic on submit: low-scoring answers are rejected with feedback, and an answer is force-accepted once RESEARCH_MAX_SUBMIT_REJECTS rejections have been reached.
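
The critic gate on submit can be sketched as a small decision function. This is an illustration rather than the code in agent.py: critic_gate and simulate are hypothetical names, the defaults are taken from the commented RESEARCH_CRITIC_PASS_SCORE and RESEARCH_MAX_SUBMIT_REJECTS values in the example .env, and both the >= comparison and the exact point at which force-accept kicks in are assumptions.

```python
def critic_gate(score, rejects_so_far, pass_score=8.5, max_rejects=2):
    """Return "accept", "reject", or "force_accept" for one submit attempt."""
    if score >= pass_score:
        return "accept"
    if rejects_so_far >= max_rejects:
        # Assumption: once the reject budget is spent, the answer goes through.
        return "force_accept"
    return "reject"


def simulate(scores, pass_score=8.5, max_rejects=2):
    """Replay a sequence of critic scores; return (attempts, final decision)."""
    rejects = 0
    for attempt, score in enumerate(scores, start=1):
        decision = critic_gate(score, rejects, pass_score, max_rejects)
        if decision != "reject":
            return attempt, decision
        rejects += 1  # in the real loop, critic feedback would go back to the model
    return len(scores), "pending"
```

Under these assumptions, three consecutive low scores end in a forced accept on the third attempt, while a single score of 9.0 is accepted immediately.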

ResearchReport shape (free-form mode fills answer_markdown; schema mode fills structured_output; the other stays at its empty default):

{
  "query": "...",
  "mode": "balanced",
  "answer_markdown": "<markdown>",
  "structured_output": null,
  "sources": [{"url": "https://...", "what_it_provided": "..."}],
  "critic": {"score": 9.0, "verdict": "pass", "feedback": "..."},
  "stats": {
    "turns": 4, "tool_calls": {"web_serp": 1, "web_scrape": 2, "submit_answer": 1},
    "tokens": {"main": {...}, "aux": {...}, "grand_total": 7644},
    "elapsed_seconds": 51.4, "mode_used": "balanced",
    "submit_attempts": 1, "compactions": 0,
    "target_language": "en", "had_output_schema": false
  },
  "trace_summary": {...}
}

All numeric knobs are in src/core/config.py (RESEARCH_* settings, overridable via .env) and all prompts in src/actions/research/research_agent_prompts.yaml.

Rate Limiting

  • *.yandex.* domains: 30 requests/hour
  • All other domains: 1000 requests/hour
  • Exceeded limit: 429 Too Many Requests with Retry-After header
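
For readers unfamiliar with token buckets, here is a minimal in-memory sketch of the limits listed above. The service itself keeps its buckets in Redis; this version swaps that for a plain Python object so it runs standalone. TokenBucket, hourly_limit and LIMITS are hypothetical names, and the literal glob reading of *.yandex.* is an assumption (it matches maps.yandex.ru but not a bare yandex.ru).

```python
import fnmatch
import time

# (pattern, requests per hour); first match wins.
LIMITS = [("*.yandex.*", 30), ("*", 1000)]


def hourly_limit(host):
    """Return the hourly request budget for a hostname."""
    for pattern, limit in LIMITS:
        if fnmatch.fnmatchcase(host.lower(), pattern):
            return limit
    return LIMITS[-1][1]


class TokenBucket:
    """Bucket that refills continuously at capacity / period tokens per second."""

    def __init__(self, capacity, period=3600.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = capacity / period
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        """Return 0.0 if the request may proceed, else seconds to wait."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        return (1.0 - self.tokens) / self.rate
```

A non-zero return value is the natural candidate for the Retry-After header on a 429 response, rounded up to whole seconds.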

Interactive Browser Sessions (DSL)

POST /sessions                              → {"session_id": "..."}
POST /sessions/{id}/command  {"type": "goto", "params": {"url": "..."}}
DELETE /sessions/{id}
WS   /ws/{session_id}                       # streaming commands

DSL commands: goto, scroll, click_coord, click_omni, type, screenshot, extract_jina
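
The command envelope shown above ({"type": ..., "params": {...}}) can be built and checked with a small helper. This is a sketch: build_command is hypothetical, the set of command names is copied from the list above, and parameter names other than url for goto are not documented here, so none are assumed.

```python
import json

DSL_COMMANDS = frozenset({
    "goto", "scroll", "click_coord", "click_omni",
    "type", "screenshot", "extract_jina",
})


def build_command(command_type, **params):
    """Serialize one DSL command as a JSON string."""
    if command_type not in DSL_COMMANDS:
        raise ValueError(f"unknown DSL command: {command_type!r}")
    return json.dumps({"type": command_type, "params": params})
```

The resulting string can serve as the body of POST /sessions/{id}/command, for example build_command("goto", url="https://example.com").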


Testing

# Full test suite (99 tests, no Docker needed except 2 live E2E tests)
python -m pytest tests/ -q

# Run the two live E2E tests (requires docker compose up)
python -m pytest tests/e2e/test_site_enrichment_flow.py::test_enrichment_returns_clean_text \
                 tests/e2e/test_yandex_maps_full_flow.py::test_yandex_maps_endpoint_returns_businesses -v

Test breakdown:

Suite          Tests   Notes
unit/          17      No external dependencies
contract/      28      In-process FastAPI via ASGITransport
integration/   31      Mocked browser; Docker config checks
e2e/           23      Structural + middleware; 2 hit live localhost:8000

MCP Server

The project includes an MCP (Model Context Protocol) server exposing all web interactions as tools.

Session tools communicate via POST /sessions/{id}/command (HTTP) instead of WebSocket — MCP runs over stdio and cannot maintain a persistent WS connection.

Running

uv run python -m src.mcp_server

Claude Desktop / OpenCode Configuration

{
  "mcpServers": {
    "atomic-scraper": {
      "command": "uv",
      "args": [
        "run", "--project", "C:/[repo_path]/Atomic-Scraper-Service",
        "python", "C:/[repo_path]/Atomic-Scraper-Service/src/mcp_server.py"
      ],
      "env": {
        "API_KEY": "default_internal_key"
      }
    }
  }
}

Available Tools

  • Stateless: scrape, search, omni_parse, jina_extract
  • Data Extraction: yandex_maps_extract, enrich_website
  • Research Agent: research_run, research_status, research_stream
  • Session Management: create_session, delete_session
  • Interactive (DSL): session_goto, session_scroll, session_click, session_type, session_screenshot, session_click_omni, session_extract_jina
