DOMShell + Nexa SDK: Local LLM Browser Automation

Run DOMShell with local models via nexa-sdk for fully on-device browser automation — no cloud API needed.

Overview

This integration connects DOMShell's MCP-based browser tools to nexa-sdk's local inference engine. The agent script acts as an MCP client that:

  1. Connects to DOMShell's MCP server (via the stdio proxy)
  2. Fetches tool definitions and converts them to function-calling format
  3. Sends the user's task to a local LLM running via nexa serve (OpenAI-compatible API)
  4. Parses function calls from LLM output, executes via MCP, loops until done
┌─────────────────┐
│   nexa serve    │
│   (local LLM)   │
│   OpenAI API    │
│   :18181/v1     │
└────────▲────────┘
         │ HTTP
┌────────┴─────┐     stdio      ┌─────────────┐    HTTP     ┌──────────────┐      WS      ┌─────────────┐
│  agent.py    │ ◄────────────► │  proxy.ts   │ ◄─────────► │  MCP Server  │ ◄──────────► │  Chrome Ext │
│  (MCP client)│  MCP protocol  │  (bridge)   │    /mcp     │  (index.ts)  │  port 9876   │  (DOMShell) │
└──────────────┘                └─────────────┘             └──────────────┘              └─────────────┘
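Step 2 of the loop above (converting MCP tool definitions to function-calling format) is mostly mechanical. A minimal sketch, assuming the standard MCP tool-listing fields (name, description, inputSchema) and the OpenAI tools schema; the domshell_open definition shown is illustrative, not DOMShell's exact schema:

```python
# Sketch: wrap an MCP tool definition in the OpenAI function-calling schema.
# Field names follow the MCP tool-listing shape; agent.py's actual helper
# may differ.
def mcp_tool_to_openai(tool: dict) -> dict:
    """Convert one MCP tool definition to an OpenAI 'tools' entry."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
        },
    }

# Illustrative MCP tool definition (not DOMShell's exact schema)
example = {
    "name": "domshell_open",
    "description": "Open a URL in the connected browser tab",
    "inputSchema": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}
print(mcp_tool_to_openai(example)["function"]["name"])  # domshell_open
```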

Why DOMShell vs Browser-Use (Playwright)?

Nexa's existing Web-Agent-Qwen3VL cookbook uses Playwright + screenshots + a 4B vision model. DOMShell takes a different approach:

| Aspect | DOMShell (this integration) | Web-Agent (browser-use) |
|---|---|---|
| Input to LLM | Structured text (accessibility tree) | Screenshots (pixels) |
| Model requirement | Any text LLM with function calling | Vision-language model (4B+) |
| Token cost | Low (text commands + text results) | High (image tokens are expensive) |
| Element targeting | By name/role/path (e.g. click submit_btn) | By screen coordinates |
| Output format | Structured data (markdown tables, link lists) | Raw text from screenshots |
| Minimum model size | 0.6B+ (text-only) | 4B+ (VLM required) |

Prerequisites

  1. DOMShell Chrome extension installed and running
  2. DOMShell MCP server running (cd mcp-server && npx tsx index.ts --no-confirm --allow-all)
  3. nexa-sdk installed: see nexa-sdk installation
  4. Node.js and npm (for the MCP proxy)
  5. Python 3.10+

Quick Start

1. Install dependencies

cd integrations/nexa
pip install -r requirements.txt

2. Pull a model and start nexa serve

# Pull a model (examples — use whatever is available for your platform)
nexa pull NexaAI/Qwen3-1.7B-4bit-MLX    # Apple Silicon (MLX)
nexa pull NexaAI/Qwen3-4B-4bit-MLX       # Apple Silicon, larger
nexa pull NexaAI/Granite-4-Micro-GGUF    # Cross-platform (GGUF)

# Start the local inference server (default: 127.0.0.1:18181)
nexa serve

nexa serve exposes an OpenAI-compatible API at http://127.0.0.1:18181/v1. The agent auto-discovers which model is loaded.

3. Start the DOMShell MCP server

In a separate terminal:

cd mcp-server
npx tsx index.ts --no-confirm --allow-all

4. Run the agent

# Read-only task (extract content)
python agent.py --task "Open wikipedia.org/wiki/Artificial_intelligence and extract the first paragraph" --verbose

# With write access (clicking, typing, form submission)
python agent.py --task "Go to google.com and search for nexa ai" --allow-write --verbose

# Compact mode for smaller models (uses domshell_execute as single tool)
python agent.py --task "Open wikipedia.org/wiki/AI and list all headings" --mode compact

# Specify a model hint (matches against loaded models)
python agent.py --task "Summarize this page" --model qwen3-4b --verbose

Usage

Command-line options

| Flag | Default | Description |
|---|---|---|
| --task | (required) | Task for the agent to perform |
| --nexa-endpoint | http://127.0.0.1:18181/v1 | nexa serve OpenAI-compatible API endpoint |
| --model | (auto-discover) | Model name hint, matched against models loaded in nexa serve |
| --port | 3001 | DOMShell MCP server port |
| --token | (none) | DOMShell MCP auth token |
| --mode | full | full (all tools) or compact (domshell_execute only) |
| --allow-write | off | Include write-tier tools (click, type, submit, navigate, open) |
| --max-turns | 20 | Maximum agent loop iterations |
| --verbose | off | Print each turn's tool calls and results |

Modes

Full mode (--mode full): Exposes all DOMShell tools as individual functions. Best for models with 8K+ context windows that can handle 15+ tool definitions.

Compact mode (--mode compact): Exposes only domshell_execute as a single tool that accepts any command string. Better for smaller models with limited context, at the cost of losing structured parameter validation.
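For a sense of why compact mode suits small contexts, here is what its single tool definition could look like. This is a sketch only: the parameter name "command" and the description text are assumptions, not agent.py's actual schema.

```python
# Hypothetical compact-mode tool definition: one function that accepts any
# DOMShell command string, instead of 15+ individual tool schemas.
COMPACT_TOOL = {
    "type": "function",
    "function": {
        "name": "domshell_execute",
        "description": (
            "Run any DOMShell command string, "
            "e.g. 'open wikipedia.org' or 'click submit_btn'."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "DOMShell command to execute",
                },
            },
            "required": ["command"],
        },
    },
}
```

The trade-off is visible in the schema itself: one short definition fits any context window, but the model loses per-tool parameter validation and must produce well-formed command strings on its own.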

Example tasks

# Content extraction
python agent.py --task "Open wikipedia.org/wiki/Machine_learning and extract the first 5 headings"

# Link harvesting
python agent.py --task "Open wikipedia.org/wiki/AI and list all links in the See Also section"

# Table extraction
python agent.py --task "Open wikipedia.org/wiki/Large_language_model and extract the comparison table"

# Multi-page navigation (requires --allow-write)
python agent.py --task "Open the AI Wikipedia article, find the first link in See Also, follow it, and extract the first paragraph" --allow-write

# Form interaction (requires --allow-write)
python agent.py --task "Go to wikipedia.org, search for 'neural network', and extract the first paragraph" --allow-write

Model Recommendations

The agent auto-discovers models from nexa serve. Use whatever model you have pulled locally. Some options:

| Model | Size | Platform | Function Calling | Notes |
|---|---|---|---|---|
| NexaAI/Qwen3-1.7B-4bit-MLX | ~1GB | Apple Silicon | Basic | Good starting point for Mac users |
| NexaAI/Qwen3-4B-4bit-MLX | ~2.5GB | Apple Silicon | Good | Better reasoning; recommended for multi-step tasks |
| NexaAI/Granite-4-Micro-GGUF | ~2GB | Cross-platform | Good | Optimized for tool use |
| NexaAI/Qwen3-0.6B-GGUF | ~400MB | Cross-platform | Basic | Use with --mode compact for minimal footprint |

Tip: The agent sends enable_thinking: false in API requests to disable Qwen3's <think> blocks. If you use a backend that doesn't support this parameter, the agent will still strip think blocks during function call extraction.
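The fallback stripping described in the tip can be done with a small parser. A sketch under the assumption that the model emits a bare {"name": ..., "arguments": ...} object after any think block; extract_function_call is a hypothetical helper, and the greedy brace match is deliberately simplistic:

```python
import json
import re

def extract_function_call(text: str):
    """Strip Qwen3 <think>...</think> blocks, then try to parse the first
    JSON object that looks like a function call. Returns None when the
    reply is plain text (the agent's stop condition)."""
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not match:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and "name" in call:
        return call
    return None
```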

Known Limitations

Issues discovered during the nexa_ollama experiment:

| Issue | Impact | Workaround |
|---|---|---|
| MLX 4-bit quantization quality (#688) | MLX 4-bit hallucinates more than GGUF K-quants for the same model; uniform bit allocation loses precision on important weights. | Use GGUF models when quality matters more than speed, or use Ollama as the backend. |
| No constrained generation (#459) | Small models can hallucinate arbitrary tokens in tool-call JSON with no guardrails. Open since Oct 2024. | Use larger models (8B+) or specialized function-calling models like Octopus-V2. |
| Multi-turn context dropped (#634) | nexa serve may ignore system prompts and earlier messages without a Nexa-KeepCache: true header. | The agent now sends this header automatically. |
| Safety over-refusal | Qwen3's RLHF safety tuning can trigger on benign content when quantized aggressively (e.g., refusing to return Wikipedia text). | Use larger models or less aggressive quantization. |
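The Nexa-KeepCache workaround for #634 amounts to attaching one extra header to every chat-completions request. A stdlib sketch; build_chat_request is a hypothetical helper, and the header name is taken from the issue discussion above, so adjust it if your nexa-sdk version differs:

```python
import json
from urllib.request import Request

def build_chat_request(endpoint: str, model: str, messages: list) -> Request:
    """Build an OpenAI-compatible chat-completions request that carries
    the Nexa-KeepCache header (workaround for #634) and disables Qwen3
    thinking via enable_thinking: false."""
    body = json.dumps({
        "model": model,
        "messages": messages,
        "enable_thinking": False,
    }).encode()
    return Request(
        endpoint.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Nexa-KeepCache": "true",
        },
        method="POST",
    )

# Live usage (requires nexa serve to be running):
#   from urllib.request import urlopen
#   req = build_chat_request("http://127.0.0.1:18181/v1", "qwen3",
#                            [{"role": "user", "content": "hi"}])
#   print(json.load(urlopen(req)))
```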

Using Ollama as an alternative backend

The agent works with any OpenAI-compatible API. To use Ollama instead of nexa serve:

# Start Ollama and pull a model
ollama serve
ollama pull qwen3:4b

# Run agent with Ollama endpoint
python agent.py --task "..." --nexa-endpoint http://127.0.0.1:11434/v1 --model qwen3 --allow-write --verbose

In our experiments, Ollama + Qwen3-4B outperformed Nexa serve + Qwen3-4B significantly (1.50 vs 0.67 avg correctness), likely due to better GGUF K-quant quantization and cleaner thinking/content separation.

How It Works

The agent follows this loop:

  1. User provides a task (e.g., "Extract the first paragraph from the AI article")
  2. Agent connects to nexa serve (auto-discovers loaded model) and DOMShell MCP server
  3. The system prompt includes DOMShell instructions and tool definitions
  4. The LLM generates a JSON function call: {"name": "domshell_open", "arguments": {"url": "wikipedia.org/wiki/AI"}}
  5. The agent parses the JSON, executes the tool via MCP, gets the result
  6. The result is fed back to the LLM as context
  7. The LLM decides: call another tool, or give the final answer
  8. Loop repeats until the LLM responds in plain text (no function call)

This is the same pattern as Nexa's function-calling cookbook, extended from single-shot (1 tool call) to multi-turn (5-15 calls for browser tasks).

Troubleshooting

"Cannot reach nexa serve"

Make sure nexa serve is running:

# Check if it's running
curl http://127.0.0.1:18181/v1/models

# Start it (default port 18181)
nexa serve

"nexa serve has no models loaded"

Pull a model first:

nexa pull NexaAI/Qwen3-1.7B-4bit-MLX    # Apple Silicon
nexa pull NexaAI/Granite-4-Micro-GGUF    # Cross-platform

"Connection refused" on port 3001

The DOMShell MCP server isn't running. Start it with:

cd mcp-server && npx tsx index.ts --no-confirm --allow-all

"No browser tab found"

The DOMShell Chrome extension isn't connected. Open the extension popup and verify the WebSocket connection is active.

Model outputs text instead of JSON

The model may not support function calling well. Try:

  • A larger model (4B+ instead of 0.6B)
  • Compact mode (--mode compact) which simplifies the tool interface
  • Adding --verbose to see what the model is actually outputting

Max turns reached

The task may be too complex for the model. Try:

  • Breaking it into smaller sub-tasks
  • Using --max-turns 30 for complex multi-page tasks
  • A more capable model