Run DOMShell with local models via nexa-sdk for fully on-device browser automation — no cloud API needed.
This integration connects DOMShell's MCP-based browser tools to nexa-sdk's local inference engine. The agent script acts as an MCP client that:
- Connects to DOMShell's MCP server (via the stdio proxy)
- Fetches tool definitions and converts them to function-calling format
- Sends the user's task to a local LLM running via `nexa serve` (OpenAI-compatible API)
- Parses function calls from LLM output, executes them via MCP, and loops until done
```
                                                   ┌─────────────────┐
                                                   │   nexa serve    │
                                                   │   (local LLM)   │
                                                   │   OpenAI API    │
                                                   │    :8080/v1     │
                                                   └────────▲────────┘
                                                            │ HTTP
┌──────────────┐  stdio   ┌─────────────┐          ┌──────────────┐    WS     ┌─────────────┐
│  agent.py    │ ◄──────► │  proxy.ts   │ ◄──────► │  MCP Server  │ ◄───────► │ Chrome Ext  │
│ (MCP client) │   MCP    │  (bridge)   │   HTTP   │  (index.ts)  │ port 9876 │ (DOMShell)  │
│              │ protocol │             │   /mcp   │              │           │             │
└──────────────┘          └─────────────┘          └──────────────┘           └─────────────┘
```
Nexa's existing Web-Agent-Qwen3VL cookbook uses Playwright + screenshots + a 4B vision model. DOMShell takes a different approach:
| Aspect | DOMShell (this integration) | Web-Agent (browser-use) |
|---|---|---|
| Input to LLM | Structured text (accessibility tree) | Screenshots (pixels) |
| Model requirement | Any text LLM with function calling | Vision-language model (4B+) |
| Token cost | Low (text commands + text results) | High (image tokens are expensive) |
| Element targeting | By name/role/path (e.g. `click submit_btn`) | By screen coordinates |
| Output format | Structured data (markdown tables, link lists) | Raw text from screenshots |
| Minimum model size | 0.6B+ (text-only) | 4B+ (VLM required) |
- DOMShell Chrome extension installed and running
- DOMShell MCP server running (`cd mcp-server && npx tsx index.ts --no-confirm --allow-all`)
- nexa-sdk installed: see nexa-sdk installation
- Node.js and npm (for the MCP proxy)
- Python 3.10+
```shell
cd integrations/nexa
pip install -r requirements.txt

# Pull a model (examples — use whatever is available for your platform)
nexa pull NexaAI/Qwen3-1.7B-4bit-MLX   # Apple Silicon (MLX)
nexa pull NexaAI/Qwen3-4B-4bit-MLX     # Apple Silicon, larger
nexa pull NexaAI/Granite-4-Micro-GGUF  # Cross-platform (GGUF)

# Start the local inference server (default: 127.0.0.1:18181)
nexa serve
```

`nexa serve` exposes an OpenAI-compatible API at http://127.0.0.1:18181/v1. The agent auto-discovers which model is loaded.
In a separate terminal:
```shell
cd mcp-server
npx tsx index.ts --no-confirm --allow-all
```

```shell
# Read-only task (extract content)
python agent.py --task "Open wikipedia.org/wiki/Artificial_intelligence and extract the first paragraph" --verbose

# With write access (clicking, typing, form submission)
python agent.py --task "Go to google.com and search for nexa ai" --allow-write --verbose

# Compact mode for smaller models (uses domshell_execute as a single tool)
python agent.py --task "Open wikipedia.org/wiki/AI and list all headings" --mode compact

# Specify a model hint (matches against loaded models)
python agent.py --task "Summarize this page" --model qwen3-4b --verbose
```

| Flag | Default | Description |
|---|---|---|
| `--task` | (required) | Task for the agent to perform |
| `--nexa-endpoint` | `http://127.0.0.1:18181/v1` | Nexa serve OpenAI-compatible API endpoint |
| `--model` | (auto-discover) | Model name hint — matches against models loaded in nexa serve |
| `--port` | `3001` | DOMShell MCP server port |
| `--token` | (none) | DOMShell MCP auth token |
| `--mode` | `full` | `full` (all tools) or `compact` (`domshell_execute` only) |
| `--allow-write` | off | Include write-tier tools (click, type, submit, navigate, open) |
| `--max-turns` | `20` | Maximum agent loop iterations |
| `--verbose` | off | Print each turn's tool calls and results |
Full mode (--mode full): Exposes all DOMShell tools as individual functions. Best for models with 8K+ context windows that can handle 15+ tool definitions.
Compact mode (--mode compact): Exposes only domshell_execute as a single tool that accepts any command string. Better for smaller models with limited context, at the cost of losing structured parameter validation.
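In compact mode the whole tool surface collapses to a single function schema along these lines (a sketch; the exact schema the agent sends may differ):

```python
# Hypothetical single-tool schema for --mode compact: the model emits one
# free-form DOMShell command string instead of choosing among 15+ tools.
DOMSHELL_EXECUTE_TOOL = {
    "type": "function",
    "function": {
        "name": "domshell_execute",
        "description": "Run a single DOMShell command string, e.g. "
                       "'open wikipedia.org/wiki/AI' or 'click submit_btn'.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The DOMShell command to execute.",
                }
            },
            "required": ["command"],
        },
    },
}
```

One free-form string parameter keeps the prompt small, but it means a malformed command is only caught at execution time rather than by schema validation.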
```shell
# Content extraction
python agent.py --task "Open wikipedia.org/wiki/Machine_learning and extract the first 5 headings"

# Link harvesting
python agent.py --task "Open wikipedia.org/wiki/AI and list all links in the See Also section"

# Table extraction
python agent.py --task "Open wikipedia.org/wiki/Large_language_model and extract the comparison table"

# Multi-page navigation (requires --allow-write)
python agent.py --task "Open the AI Wikipedia article, find the first link in See Also, follow it, and extract the first paragraph" --allow-write

# Form interaction (requires --allow-write)
python agent.py --task "Go to wikipedia.org, search for 'neural network', and extract the first paragraph" --allow-write
```

The agent auto-discovers models from nexa serve. Use whatever model you have pulled locally. Some options:
| Model | Size | Platform | Function Calling | Notes |
|---|---|---|---|---|
| `NexaAI/Qwen3-1.7B-4bit-MLX` | ~1GB | Apple Silicon | Basic | Good starting point for Mac users |
| `NexaAI/Qwen3-4B-4bit-MLX` | ~2.5GB | Apple Silicon | Good | Better reasoning, recommended for multi-step tasks |
| `NexaAI/Granite-4-Micro-GGUF` | ~2GB | Cross-platform | Good | Optimized for tool use |
| `NexaAI/Qwen3-0.6B-GGUF` | ~400MB | Cross-platform | Basic | Use with `--mode compact` for minimal footprint |
Tip: The agent sends enable_thinking: false in API requests to disable Qwen3's <think> blocks. If you use a backend that doesn't support this parameter, the agent will still strip think blocks during function call extraction.
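The fallback stripping can be done with one regex; a sketch (the agent's real extraction logic may differ):

```python
import re

# Qwen3-style reasoning blocks; DOTALL so the pattern spans newlines.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think_blocks(text: str) -> str:
    """Remove <think>...</think> blocks before trying to parse a
    function call from the model's raw output."""
    return THINK_RE.sub("", text).strip()
```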
Issues discovered during the nexa_ollama experiment:
| Issue | Impact | Workaround |
|---|---|---|
| MLX 4-bit quantization quality (#688) | MLX 4-bit hallucinates more than GGUF K-quants for the same model. Uniform bit allocation loses precision on important weights. | Use GGUF models when quality matters more than speed, or use Ollama as backend. |
| No constrained generation (#459) | Small models can hallucinate arbitrary tokens in tool call JSON with no guardrails. Open since Oct 2024. | Use larger models (8B+) or specialized function-calling models like Octopus-V2. |
| Multi-turn context dropped (#634) | nexa serve may ignore system prompts and earlier messages without the `Nexa-KeepCache: true` header. | The agent now sends this header automatically. |
| Safety over-refusal | Qwen3's RLHF safety tuning can trigger on benign content when quantized aggressively (e.g., refusing to return Wikipedia text). | Use larger models or less aggressive quantization. |
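The #634 workaround amounts to one extra HTTP header per request. A hypothetical helper showing what each chat request would carry; everything beyond the `Nexa-KeepCache` header and `enable_thinking` field is an assumption, not the agent's actual code:

```python
def build_chat_request(endpoint: str, model: str, messages: list[dict]) -> dict:
    """Assemble a chat completion request for nexa serve.

    Illustrates two workarounds from the table above: the
    Nexa-KeepCache header (issue #634) and enable_thinking: false
    (suppresses Qwen3 <think> blocks).
    """
    return {
        "url": f"{endpoint}/chat/completions",
        "headers": {"Nexa-KeepCache": "true"},
        "json": {
            "model": model,
            "messages": messages,
            "enable_thinking": False,
        },
    }
```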
The agent works with any OpenAI-compatible API. To use Ollama instead of nexa serve:
```shell
# Start Ollama and pull a model
ollama serve
ollama pull qwen3:4b

# Run agent with Ollama endpoint
python agent.py --task "..." --nexa-endpoint http://127.0.0.1:11434/v1 --model qwen3 --allow-write --verbose
```

In our experiments, Ollama + Qwen3-4B significantly outperformed nexa serve + Qwen3-4B (1.50 vs 0.67 average correctness), likely due to better GGUF K-quant quantization and cleaner thinking/content separation.
The agent follows this loop:
- User provides a task (e.g., "Extract the first paragraph from the AI article")
- Agent connects to `nexa serve` (auto-discovers the loaded model) and the DOMShell MCP server
- The system prompt includes DOMShell instructions and tool definitions
- The LLM generates a JSON function call: `{"name": "domshell_open", "arguments": {"url": "wikipedia.org/wiki/AI"}}`
- The agent parses the JSON, executes the tool via MCP, gets the result
- The result is fed back to the LLM as context
- The LLM decides: call another tool, or give the final answer
- Loop repeats until the LLM responds in plain text (no function call)
This is the same pattern as Nexa's function-calling cookbook, extended from single-shot (1 tool call) to multi-turn (5-15 calls for browser tasks).
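The loop can be sketched in a few lines. Here `llm_chat` and `mcp_call` are stand-ins for the real OpenAI-API and MCP calls, and treating the whole reply as a bare JSON function call is a simplification of what the agent actually parses:

```python
import json


def run_agent(task: str, llm_chat, mcp_call, max_turns: int = 20) -> str:
    """Minimal agent loop: feed tool results back until the model
    answers in plain text.

    llm_chat(messages) -> str (model reply)
    mcp_call(name, args) -> str (tool result)
    """
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = llm_chat(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            # e.g. {"name": "domshell_open", "arguments": {"url": "..."}}
            call = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # plain text => final answer
        result = mcp_call(call["name"], call.get("arguments", {}))
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Gave up: max turns reached"
```

The `max_turns` guard plays the same role as the `--max-turns` flag: it is the only thing that stops a model that never produces a plain-text answer.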
Make sure nexa serve is running:
```shell
# Check if it's running
curl http://127.0.0.1:18181/v1/models

# Start it (default port 18181)
nexa serve
```

Pull a model first:

```shell
nexa pull NexaAI/Qwen3-1.7B-4bit-MLX   # Apple Silicon
nexa pull NexaAI/Granite-4-Micro-GGUF  # Cross-platform
```

The DOMShell MCP server isn't running. Start it with:

```shell
cd mcp-server && npx tsx index.ts --no-confirm --allow-all
```

The DOMShell Chrome extension isn't connected. Open the extension popup and verify the WebSocket connection is active.
The model may not support function calling well. Try:
- A larger model (4B+ instead of 0.6B)
- Compact mode (`--mode compact`), which simplifies the tool interface
- Adding `--verbose` to see what the model is actually outputting
The task may be too complex for the model. Try:
- Breaking it into smaller sub-tasks
- Using `--max-turns 30` for complex multi-page tasks
- A more capable model