Atomic Scraper Service

High-throughput atomic scraping and stateful interactive browser sessions with LLM orchestration.

Features

Stateless Scraper: Fast atomic scraping and a Serper-compatible search endpoint backed by SearXNG.
Stateful Sessions: Interactive browser sessions via DSL over WebSockets with Taskiq Actors.
AI Integration: Omni-Parser for UI grounding (SoM approach) and local HTML→Markdown conversion (/html-to-md).
Resource Management: Inactivity timeout for stateful sessions via SESSION_INACTIVITY_TIMEOUT (default 600 s; dev docker-compose.override.yml sets 1800 s).
Modular Design: Clean architecture with layers for API, Domain, Infrastructure, and Actions.
Docker Production Ready: Dockerfile with Playwright, docker-compose with api/worker/redis, health endpoint.
Anti-Bot Evasion: Stealth browser pool with User-Agent rotation, proxy integration, human-like interactions.
Yandex Maps Extraction: Extract structured business data (name, address, phone, website, geo coordinates).
Site Content Enrichment: Extract clean text from company websites with optional about/services page crawling.
Research Agent: Autonomous AI research implemented as a flat tool-calling loop (chat.completions with tool_choice="auto"). It offers two output modes (free-form markdown or a caller-supplied JSON Schema) and a critic gate on submit. The agent is generic: the service contains no domain logic.
Per-Domain Rate Limiting: Redis-based token bucket (30/hour for *.yandex.*, 1000/hour fallback).

Tech Stack

Language: Python 3.11+
Framework: FastAPI
Async Logic: Playwright, Taskiq (Redis Broker)
AI Tools: Flexible OpenAI-compatible configuration (LM Studio, OpenAI, etc.) with two logical endpoints (EXTRACTION_* and ORCHESTRATION_*), Omni-Parser, optional Jina/Reader-LM as the extraction model.
Infrastructure: Redis (Pub/Sub, Taskiq broker and KV store), SearXNG (search backend), Docker

Project Structure

Detailed directory layout and layer responsibilities are documented in STRUCTURE.md.

Quickstart

Prerequisites

Python 3.11+
Docker & Docker Compose
AI Providers: LM Studio (local), OpenAI (cloud), or any OpenAI-compatible API.

Installation (local dev)

# 1. Install dependencies
uv sync

# 2. Install Playwright browsers (must use uv run — playwright lives in the venv)
uv run playwright install --with-deps chromium

# 3. Copy and edit environment config
cp .env.example .env

Key .env values:

API_KEY=default_internal_key          # internal auth header value

EXTRACTION_API_BASE=http://localhost:1234/v1
EXTRACTION_API_KEY=lm-studio
EXTRACTION_MODEL_NAME=jina-reader-lm

ORCHESTRATION_API_BASE=http://localhost:20022/v1/   # local llama.cpp / vLLM / LM Studio
ORCHESTRATION_API_KEY=lm-studio
ORCHESTRATION_MODEL_NAME=qwen3.5-9b-claude-4.6-opus-reasoning-distilled

REDIS_URL=redis://localhost:16379      # Taskiq broker + pub/sub + KV; docker-compose maps host 16379 → container 6379
SESSION_INACTIVITY_TIMEOUT=600         # seconds; dev override sets 1800

# Optional — Research Agent tuning (defaults match production v2.1):
# RESEARCH_COMPACT_TRIGGER_TOKENS=50000
# RESEARCH_CRITIC_PASS_SCORE=8.5
# RESEARCH_MAX_SUBMIT_REJECTS=2
# RESEARCH_DEFAULT_LANGUAGE=ru
# RESEARCH_PROMPTS_PATH=src/actions/research/research_agent_prompts.yaml

Full settings list lives in src/core/config.py (pydantic-settings); the full optional RESEARCH_* block is documented in .env.example.
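As a rough illustration of how these values are read, the sketch below loads a handful of them with the standard library only. The service itself uses pydantic-settings in src/core/config.py; the ScraperSettings class and load_settings function here are hypothetical names, and the fallback values are simply copied from the sample .env above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ScraperSettings:
    """Hypothetical stdlib-only mirror of a few .env values (not the real config class)."""
    api_key: str
    redis_url: str
    session_inactivity_timeout: int  # seconds
    research_critic_pass_score: float
    research_max_submit_rejects: int


def load_settings(env=None) -> ScraperSettings:
    """Read settings from a mapping (os.environ by default).

    Fallbacks are the values shown in the sample .env, used here as assumptions.
    """
    env = os.environ if env is None else env
    return ScraperSettings(
        api_key=env.get("API_KEY", "default_internal_key"),
        redis_url=env.get("REDIS_URL", "redis://localhost:16379"),
        session_inactivity_timeout=int(env.get("SESSION_INACTIVITY_TIMEOUT", "600")),
        research_critic_pass_score=float(env.get("RESEARCH_CRITIC_PASS_SCORE", "8.5")),
        research_max_submit_rejects=int(env.get("RESEARCH_MAX_SUBMIT_REJECTS", "2")),
    )
```

Passing an explicit mapping, for example load_settings({"SESSION_INACTIVITY_TIMEOUT": "1800"}), reproduces the dev override without touching the process environment.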

Proxy Configuration (optional)

Create proxies.txt in the project root, one proxy per line in http://user:pass@host:port format:

http://user:pass@1.2.3.4:8080
http://user:pass@5.6.7.8:8080

Note for Yandex Maps: Yandex blocks datacenter proxy IPs at the browser level. Use residential proxies (e.g., Bright Data, Oxylabs) to get actual scraping results.
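To sanity-check a proxies.txt file before starting the service, a small parser along these lines may help. It is a sketch only: parse_proxy_line and parse_proxies are hypothetical helpers, and the returned server/username/password dict is an assumed shape (it happens to resemble Playwright's proxy option), not a statement about how the service loads the file.

```python
from urllib.parse import urlsplit


def parse_proxy_line(line: str) -> dict:
    """Parse one http://user:pass@host:port line into its parts."""
    parts = urlsplit(line.strip())
    if parts.scheme not in ("http", "https") or not parts.hostname or parts.port is None:
        raise ValueError(f"invalid proxy line: {line!r}")
    return {
        "server": f"{parts.scheme}://{parts.hostname}:{parts.port}",
        "username": parts.username,
        "password": parts.password,
    }


def parse_proxies(text: str) -> list:
    """Parse the whole file content, skipping blank lines."""
    return [parse_proxy_line(line) for line in text.splitlines() if line.strip()]
```

For example, parse_proxies(open("proxies.txt").read()) raises ValueError on the first malformed line instead of failing later inside the browser pool.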

Docker Production

# Build and start all services (api, worker, redis)
docker compose up -d

# Check health
curl http://localhost:8000/healthz
# {"status":"healthy", "timestamp":"...", "services":{...}}
# Note: handler always returns 200; the unhealthy/degraded → 503 branch is not implemented today.

proxies.txt is automatically bind-mounted into the container if it exists. Important: create proxies.txt before the first docker compose up. If Docker creates it as a directory (a known Docker Desktop on Windows quirk), run docker compose down && docker compose up -d to remount correctly.

Run API without Docker (PM2)

pm2 start ecosystem.config.js

API Reference

All endpoints except /healthz require the header X-API-Key: <API_KEY>.

Health

GET /healthz
→ 200 {"status": "healthy", "timestamp": "...", "services": {...}}

Probes Redis (ping) and pool_manager. Currently always returns 200: the 503/degraded branch documented in the spec is not yet wired (see docs/codebase-report/20-spec-vs-reality.md, C-09).
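Because the status code is always 200 today, a monitoring client probably needs to inspect the response body rather than rely on the HTTP status. A minimal sketch, assuming the "status" field carries "healthy" when all probes pass (is_healthy is a hypothetical helper, not part of the service):

```python
import json


def is_healthy(status_code: int, body: str) -> bool:
    """Treat the service as healthy only if the JSON body says so.

    The status code alone is not enough while /healthz always returns 200.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"
```

Checking the status code first keeps the helper correct if the 503 branch is wired up later.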

Stateless Atomic Endpoints

POST /scraper      {"url": "...", ...}                   # Playwright one-shot page fetch
POST /serper       {"q": "...", "num": 10}               # SERP via SearXNG (Serper-compatible shape)
POST /omni-parse   {"base64_image": "...", "prompt": ""} # OmniParser vision call (via LLM facade)
POST /html-to-md   {"html": "...", ...}                  # local HTML → Markdown / text

The historical /jina-extract endpoint from earlier specs is not implemented — it was replaced by /html-to-md (local conversion via content_cleaner).
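A client call to any of these endpoints could look roughly like the sketch below, which uses only the standard library. BASE_URL is an assumption taken from the Docker health-check example, and build_request and call are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: port from the Docker health-check example


def build_request(path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build a JSON POST request carrying the X-API-Key header."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


def call(path: str, payload: dict, api_key: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(path, payload, api_key), timeout=timeout) as resp:
        return json.load(resp)
```

With the stack running, call("/serper", {"q": "playwright stealth", "num": 10}, api_key="default_internal_key") would return the Serper-shaped search result as a dict.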

Yandex Maps Extraction

POST /api/v1/yandex-maps/extract
{
  "category": "restaurants",
  "center": {"lat": 59.934, "lng": 30.306},   // lat ∈ [-90,90], lng ∈ [-180,180]
  "radius": 1000                                // metres, 100–5000
}
→ {"businesses": [...], "total": N, "category": "...", "center": {...}, "radius": N}

Each business card: name, address, and optionally phone, website, geo.
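The ranges in the request comments above can be checked on the client before sending. The following is a sketch with a hypothetical helper name (build_yandex_maps_payload); the server's own validation and error format may differ.

```python
def build_yandex_maps_payload(category: str, lat: float, lng: float, radius: int) -> dict:
    """Validate inputs against the documented ranges and build the request body."""
    if not category.strip():
        raise ValueError("category must not be empty")
    if not -90 <= lat <= 90:
        raise ValueError("lat must be within [-90, 90]")
    if not -180 <= lng <= 180:
        raise ValueError("lng must be within [-180, 180]")
    if not 100 <= radius <= 5000:
        raise ValueError("radius must be within 100-5000 metres")
    return {
        "category": category,
        "center": {"lat": lat, "lng": lng},
        "radius": radius,
    }
```

For example, build_yandex_maps_payload("restaurants", 59.934, 30.306, 1000) yields the body shown in the request example.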

Site Content Enrichment

POST /api/v1/enrich
{
  "url": "https://example.com",
  "crawl_about": false,
  "crawl_services": false
}
→ {"url": "...", "text": "...", "word_count": N, "truncated": bool, "pages_crawled": [...]}

Content is truncated to ≤ 500 words. Raw HTML is stripped.
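The word-based truncation can be pictured with the sketch below. It is not the service's implementation: truncate_words is a hypothetical function, and it assumes word_count reports the number of words actually returned (the source does not say whether the count is taken before or after truncation).

```python
def truncate_words(text: str, max_words: int = 500) -> dict:
    """Keep at most max_words whitespace-separated words.

    Assumption: word_count is the count after truncation.
    """
    words = text.split()
    kept = words[:max_words]
    return {
        "text": " ".join(kept),
        "word_count": len(kept),
        "truncated": len(words) > max_words,
    }
```

Splitting on whitespace also collapses runs of spaces and newlines, which is one simple way to obtain clean text.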

Research Agent

POST /api/v1/research/run
{
  "query": "research topic",                  // 3-8000 chars
  "mode": "speed" | "balanced" | "quality",   // default: "balanced"
  "language": "ru" | "en" | ...,              // BCP-47-ish hint; routed into prompts + SearXNG
  "max_iters": 15,                            // optional override of preset.max_turns (1-50)
  "max_tokens": 100000,                       // optional override of preset.token_budget (1k-2M)
  "output_schema": {...}                      // optional JSON Schema → structured-output mode
}
→ 202 {"task_id": "...", "status": "pending", "message": "Research task queued"}

GET /api/v1/research/status/{task_id}
→ {"task_id": "...", "status": "completed"|"running"|"failed", "result": ResearchReport, ...}

GET /api/v1/research/stream/{task_id}
→ text/event-stream with progress events

The agent is a flat tool-calling loop (src/actions/research/agent.py:run_research). It exposes three tools to the LLM: web_serp (SearXNG), web_scrape (Playwright via SiteEnrichAction), and a dynamic terminal submit tool. The model decides what to call (tool_choice="auto"). A second LLM acts as a critic on submit: low-scoring answers are rejected with feedback, and once RESEARCH_MAX_SUBMIT_REJECTS rejections have accumulated the answer is force-accepted.
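The critic gate described above can be sketched as a small decision function. This is an illustration under assumptions, not the code in agent.py: GateDecision and critic_gate are hypothetical names, a score equal to the pass score is assumed to pass, and the defaults mirror RESEARCH_CRITIC_PASS_SCORE=8.5 and RESEARCH_MAX_SUBMIT_REJECTS=2 from the sample .env.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    accepted: bool
    forced: bool   # True when accepted only because the reject budget ran out
    rejects: int   # rejections accumulated so far, including this one if rejected


def critic_gate(score: float, rejects_so_far: int,
                pass_score: float = 8.5, max_rejects: int = 2) -> GateDecision:
    """Decide what happens to one submit attempt given the critic's score."""
    if score >= pass_score:
        # Assumption: the threshold is inclusive.
        return GateDecision(accepted=True, forced=False, rejects=rejects_so_far)
    if rejects_so_far >= max_rejects:
        # Reject budget exhausted: accept anyway so the loop can terminate.
        return GateDecision(accepted=True, forced=True, rejects=rejects_so_far)
    # Reject; the caller would feed the critic's feedback back to the model.
    return GateDecision(accepted=False, forced=False, rejects=rejects_so_far + 1)
```

With the defaults, three consecutive low-scoring submissions produce two rejections followed by a forced accept, which bounds the number of submit attempts per task.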

ResearchReport shape (free-form mode fills answer_markdown; schema mode fills structured_output; the other stays at its empty default):

{
  "query": "...",
  "mode": "balanced",
  "answer_markdown": "<markdown>",
  "structured_output": null,
  "sources": [{"url": "https://...", "what_it_provided": "..."}],
  "critic": {"score": 9.0, "verdict": "pass", "feedback": "..."},
  "stats": {
    "turns": 4, "tool_calls": {"web_serp": 1, "web_scrape": 2, "submit_answer": 1},
    "tokens": {"main": {...}, "aux": {...}, "grand_total": 7644},
    "elapsed_seconds": 51.4, "mode_used": "balanced",
    "submit_attempts": 1, "compactions": 0,
    "target_language": "en", "had_output_schema": false
  },
  "trace_summary": {...}
}
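On the client side, picking the populated field out of a ResearchReport might look like this. The helpers extract_answer and source_urls are hypothetical, and the sketch assumes stats.had_output_schema reliably indicates which mode was used.

```python
def extract_answer(report: dict):
    """Return structured_output in schema mode, answer_markdown otherwise."""
    stats = report.get("stats") or {}
    if stats.get("had_output_schema"):
        return report.get("structured_output")
    return report.get("answer_markdown")


def source_urls(report: dict) -> list:
    """Collect the URLs listed under sources, skipping malformed entries."""
    return [s["url"] for s in report.get("sources", []) if isinstance(s, dict) and "url" in s]
```

Keying on had_output_schema avoids guessing what the "empty default" of the unused field looks like.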

All numeric knobs are in src/core/config.py (RESEARCH_* settings, overridable via .env) and all prompts in src/actions/research/research_agent_prompts.yaml.

Rate Limiting

*.yandex.* domains: 30 requests/hour
All other domains: 1000 requests/hour
Exceeded limit: 429 Too Many Requests with Retry-After header
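The limits above follow a token-bucket model. The sketch below is an in-memory illustration only: the service keeps its buckets in Redis, and TokenBucket, limit_for, and the glob-style pattern matching are assumptions made for this example.

```python
from fnmatch import fnmatchcase

# Patterns are checked in order; the last entry is the fallback.
LIMITS_PER_HOUR = [("*.yandex.*", 30), ("*", 1000)]


def limit_for(host: str) -> int:
    """Return the hourly limit for a host using glob-style matching."""
    for pattern, limit in LIMITS_PER_HOUR:
        if fnmatchcase(host.lower(), pattern):
            return limit
    return LIMITS_PER_HOUR[-1][1]


class TokenBucket:
    """In-memory token bucket; timestamps are passed in to keep it testable."""

    def __init__(self, per_hour: int, now: float):
        self.capacity = float(per_hour)
        self.refill_per_second = per_hour / 3600.0
        self.tokens = float(per_hour)
        self.updated = now

    def try_acquire(self, now: float) -> float:
        """Return 0.0 if the request is allowed, else seconds until a token is available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        return (1.0 - self.tokens) / self.refill_per_second
```

A non-zero return value corresponds to the Retry-After header: for a 30/hour bucket that has just been drained, the wait is about 120 seconds. Note that with this glob pattern a bare yandex.ru host falls through to the fallback limit; whether the service treats it the same way is not stated in the source.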

Interactive Browser Sessions (DSL)

POST /sessions                              → {"session_id": "..."}
POST /sessions/{id}/command  {"type": "goto", "params": {"url": "..."}}
DELETE /sessions/{id}
WS   /ws/{session_id}                       # streaming commands

DSL commands: goto, scroll, click_coord, click_omni, type, screenshot, extract_jina
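A command body for POST /sessions/{id}/command can be assembled as in the sketch below. build_command is a hypothetical helper; only the goto command's url parameter appears in this README, so parameter names for the other commands are left to the caller.

```python
# Command names copied from the DSL list above.
DSL_COMMANDS = frozenset(
    {"goto", "scroll", "click_coord", "click_omni", "type", "screenshot", "extract_jina"}
)


def build_command(command_type: str, **params) -> dict:
    """Build the {"type": ..., "params": {...}} envelope, rejecting unknown command names."""
    if command_type not in DSL_COMMANDS:
        raise ValueError(f"unknown DSL command: {command_type!r}")
    return {"type": command_type, "params": params}
```

For example, build_command("goto", url="https://example.com") produces the same body as the goto request shown above.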

Testing

# Full test suite (99 tests, no Docker needed except 2 live E2E tests)
python -m pytest tests/ -q

# Run the two live E2E tests (requires docker compose up)
python -m pytest tests/e2e/test_site_enrichment_flow.py::test_enrichment_returns_clean_text \
                 tests/e2e/test_yandex_maps_full_flow.py::test_yandex_maps_endpoint_returns_businesses -v

Test breakdown:

Suite         Tests  Notes
unit/         17     No external dependencies
contract/     28     In-process FastAPI via `ASGITransport`
integration/  31     Mocked browser; Docker config checks
e2e/          23     Structural + middleware; 2 hit live `localhost:8000`

MCP Server

The project includes an MCP (Model Context Protocol) server exposing all web interactions as tools.

Session tools communicate via POST /sessions/{id}/command (HTTP) instead of WebSocket — MCP runs over stdio and cannot maintain a persistent WS connection.

Running

uv run python -m src.mcp_server

Claude Desktop / OpenCode Configuration

{
  "mcpServers": {
    "atomic-scraper": {
      "command": "uv",
      "args": [
        "run", "--project", "C:/[repo_path]/Atomic-Scraper-Service",
        "python", "C:/[repo_path]/Atomic-Scraper-Service/src/mcp_server.py"
      ],
      "env": {
        "API_KEY": "default_internal_key"
      }
    }
  }
}

Available Tools

Stateless: scrape, search, omni_parse, jina_extract
Data Extraction: yandex_maps_extract, enrich_website
Research Agent: research_run, research_status, research_stream
Session Management: create_session, delete_session
Interactive (DSL): session_goto, session_scroll, session_click, session_type, session_screenshot, session_click_omni, session_extract_jina
