An MCP server that gives LLMs the ability to see and control a Windows desktop via screenshots, mouse, keyboard, and UI element detection.
The server provides tools — the LLM client does all the reasoning:
- Screenshot the desktop to see what's on screen
- Identify UI elements via vision or Windows UI Automation
- Act using mouse clicks, keyboard input, or hotkeys
- Verify by taking another screenshot after each action
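The loop above can be sketched from the client's side. This is a minimal illustration, not the project's code: `call_tool` and `decide` are hypothetical callables standing in for an MCP client session and the LLM's reasoning, and the argument names passed to each tool are assumptions.

```python
from typing import Any, Callable, Optional

# Hypothetical: call_tool(name, arguments) forwards a request to the MCP server.
ToolCaller = Callable[[str, dict], Any]
# Hypothetical: decide(observation) returns the next (tool_name, arguments)
# pair, or None once the goal has been reached.
Policy = Callable[[Any], Optional[tuple]]


def run_loop(call_tool: ToolCaller, decide: Policy, max_steps: int = 20) -> int:
    """Screenshot, choose an action, act, then repeat. Returns actions taken."""
    steps = 0
    while steps < max_steps:
        observation = call_tool("screenshot", {})  # see the current state
        action = decide(observation)               # reasoning happens client-side
        if action is None:                         # nothing left to do
            break
        name, arguments = action
        call_tool(name, arguments)                 # act; the next screenshot verifies
        steps += 1
    return steps
```

Each iteration starts with a fresh screenshot, so every action is verified by the observation that precedes the next decision.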
| Layer | Method | Best For |
|---|---|---|
| CDP | Chrome DevTools Protocol via WebSocket | Electron apps (Discord, VS Code, Slack) — requires method="path" launch |
| UIA | Windows UI Automation COM API | Native apps (Notepad, Explorer, Settings, Office) |
| Vision | Screenshot + LLM vision | Everything else — LLM estimates coordinates from the image |
cd Desktop-Control-MCP
pip install -e .
Add the server to your MCP client's configuration. Examples for popular clients:
Claude Desktop — %APPDATA%\Claude\claude_desktop_config.json:
{
"mcpServers": {
"desktop-control": {
"command": "python",
"args": ["-m", "desktop_control.server"],
"cwd": "C:\\path\\to\\Desktop-Control-MCP"
}
}
}
Claude Code - .claude/settings.json or via claude mcp add:
claude mcp add desktop-control -- python -m desktop_control.server
Cursor - .cursor/mcp.json:
{
"mcpServers": {
"desktop-control": {
"command": "python",
"args": ["-m", "desktop_control.server"]
}
}
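}
OpenAI Agents SDK - via Python:
import asyncio
from agents import Agent
from agents.mcp import MCPServerStdio
async def main():
    # MCPServerStdio takes a params dict and is used as an async context manager
    async with MCPServerStdio(params={"command": "python", "args": ["-m", "desktop_control.server"]}) as server:
        agent = Agent(name="desktop", mcp_servers=[server])  # the agent discovers the tools from the server
asyncio.run(main())
Any MCP-compatible client (Windsurf, Cline, Continue, etc.) can connect using the standard stdio transport with python -m desktop_control.server.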
Then restart your client.
| Tool | Description |
|---|---|
| screenshot | Capture desktop as JPEG. Returns image + screen dimensions. Coordinates map 1:1 to mouse coordinates. |
| get_elements | Get interactive UI elements with exact bounding boxes. Auto-selects CDP or UIA based on the app. |
| get_screen_info | Monitor geometries, cursor position, DPI scale. |
| list_open_windows | All visible windows with titles, process names, and positions. |

| Tool | Description |
|---|---|
| mouse_click | Click at (x, y). Supports left/right/middle, double-click, modifier keys. |
| mouse_drag | Drag from one point to another. |
| mouse_scroll | Scroll vertically or horizontally at a position. |

| Tool | Description |
|---|---|
| keyboard_type | Type text. Handles Unicode via clipboard paste. |
| keyboard_hotkey | Press key combos like ["ctrl", "c"], ["alt", "tab"], ["win"]. |

| Tool | Description |
|---|---|
| open_application | Open an app by name via Start menu search, Win+R, or direct path. |
| click_element | Find a UI element by name and click its center, with no coordinate guessing. |
| wait | Pause for loading screens (max 30s). |
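As an illustration of what "click its center" means for click_element, the click point can be derived from an element's bounding box. This is a sketch only: the (left, top, right, bottom) layout and the function name `element_center` are assumptions, not the format that get_elements is confirmed to return.

```python
def element_center(left: int, top: int, right: int, bottom: int) -> tuple:
    """Return the integer midpoint of a bounding box in screen pixels."""
    if right < left or bottom < top:
        raise ValueError("bounding box has negative width or height")
    return ((left + right) // 2, (top + bottom) // 2)


# A button spanning x 100..300 and y 200..260
center = element_center(100, 200, 300, 260)
```

The resulting pair could then be passed as the (x, y) of a mouse_click call.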
src/desktop_control/
  server.py             # FastMCP server, all tool definitions, entry point
  screen.py             # DPI awareness init, screenshot capture via mss
  mouse.py              # Click, drag, scroll via pyautogui
  keyboard.py           # Text typing with Unicode clipboard fallback, hotkeys
  ui_automation.py      # Windows UI Automation (native app element detection)
  cdp.py                # Chrome DevTools Protocol (Electron app element detection)
  element_detection.py  # Unified facade, auto-selects CDP or UIA
  windows.py            # App launching, window enumeration, Electron app registry
- mcp: MCP Python SDK (FastMCP)
- pyautogui: Mouse/keyboard control
- Pillow: Image processing
- mss: Fast multi-monitor screenshots
- comtypes: Windows UI Automation COM access
- websockets + aiohttp: CDP communication for Electron apps
screen.py calls SetProcessDpiAwareness(2) at module level before any GUI library imports. This ensures mss captures at physical pixel resolution and pyautogui coordinates match screen pixels. The server import order in server.py enforces this.
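A minimal sketch of such a DPI initialization, assuming the call goes through ctypes to shcore.dll; the actual code in screen.py may differ, and the helper name `enable_dpi_awareness` is hypothetical. The value 2 corresponds to per-monitor DPI awareness in the Windows PROCESS_DPI_AWARENESS enum.

```python
import ctypes
import sys

PROCESS_PER_MONITOR_DPI_AWARE = 2  # from the Windows PROCESS_DPI_AWARENESS enum


def enable_dpi_awareness() -> bool:
    """Request per-monitor DPI awareness; return True if the call succeeded."""
    if sys.platform != "win32":
        return False  # nothing to do on other platforms
    try:
        # Returns an HRESULT; 0 (S_OK) means success. The call fails if the
        # awareness level has already been set for this process.
        result = ctypes.windll.shcore.SetProcessDpiAwareness(
            PROCESS_PER_MONITOR_DPI_AWARE
        )
        return result == 0
    except (AttributeError, OSError):
        return False  # shcore.dll is unavailable before Windows 8.1


# Run this before importing GUI libraries such as mss or pyautogui.
dpi_enabled = enable_dpi_awareness()
```

Running the call at module level is what lets import order alone guarantee that it happens first.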
Screenshots default to monitor_index=1 (primary monitor). Using monitor_index=0 captures the virtual combined monitor which has different dimensions — this caused a ~40% coordinate mismatch on systems where the virtual and primary monitor sizes differ.
Region screenshots include offset instructions so the LLM can map pixel positions back to screen coordinates.
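Both points come down to the same arithmetic: a pixel position in a captured image is relative to the top-left corner of whatever was captured. The sketch below assumes geometry dictionaries in the style mss uses (left, top, width, height) and no image scaling; the monitor layout, region, and helper names are made up for illustration.

```python
def virtual_bounds(monitors: list) -> dict:
    """Bounding box covering every monitor (what the combined capture spans)."""
    left = min(m["left"] for m in monitors)
    top = min(m["top"] for m in monitors)
    right = max(m["left"] + m["width"] for m in monitors)
    bottom = max(m["top"] + m["height"] for m in monitors)
    return {"left": left, "top": top, "width": right - left, "height": bottom - top}


def image_to_screen(px: int, py: int, area: dict) -> tuple:
    """Map a pixel inside a captured area back to screen coordinates."""
    return (area["left"] + px, area["top"] + py)


# Hypothetical layout: a secondary monitor placed to the left of the primary.
primary = {"left": 0, "top": 0, "width": 1920, "height": 1080}
secondary = {"left": -1280, "top": 0, "width": 1280, "height": 1024}
virtual = virtual_bounds([primary, secondary])

on_primary = image_to_screen(100, 100, primary)   # same pixel, primary capture
on_virtual = image_to_screen(100, 100, virtual)   # same pixel, combined capture

# A region screenshot works the same way: add the region's offset.
region = {"left": 400, "top": 300, "width": 640, "height": 480}
in_region = image_to_screen(50, 40, region)
```

In this layout the same image pixel lands 1280 pixels apart depending on which capture it came from, which is why the origin of the capture has to be known before clicking.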
Electron apps (Discord, VS Code, Slack, etc.) expose limited UIA elements. For full DOM access, launch them with open_application(name, method="path") which adds --remote-debugging-port. Known apps and their debug ports are registered in windows.py:ELECTRON_APPS.
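As a rough sketch of what a method="path" launch involves: the executable is started with Chromium's --remote-debugging-port flag, after which the list of debuggable targets can be fetched over HTTP from the /json endpoint on that port. The helper names, the executable path, and port 9222 below are placeholders; the real values live in windows.py:ELECTRON_APPS, whose structure is not shown here.

```python
def build_debug_launch(exe_path: str, port: int) -> list:
    """Argument vector that starts an Electron app with CDP enabled."""
    if not 0 < port < 65536:
        raise ValueError("port must be in 1..65535")
    return [exe_path, f"--remote-debugging-port={port}"]


def cdp_targets_url(port: int, host: str = "127.0.0.1") -> str:
    """URL of the HTTP endpoint that lists debuggable targets as JSON."""
    return f"http://{host}:{port}/json"


# Placeholder path and port, for illustration only.
argv = build_debug_launch(r"C:\path\to\App.exe", 9222)
targets_url = cdp_targets_url(9222)
```

The argument list is suitable for subprocess.Popen, and each target returned by the endpoint carries a webSocketDebuggerUrl to connect to.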
- pyautogui.FAILSAFE = True: move the mouse to the top-left corner (0,0) to abort
- wait is capped at 30 seconds to prevent hangs
- Cannot interact with UAC prompts (secure desktop)
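The cap on wait can be pictured as a simple clamp. This is a sketch under assumptions: the name `clamp_wait` is hypothetical, and treating negative values as zero is a choice made here, not documented behavior of the server.

```python
MAX_WAIT_SECONDS = 30.0  # upper bound stated for the wait tool


def clamp_wait(seconds: float) -> float:
    """Limit a requested pause to the range 0..MAX_WAIT_SECONDS."""
    return max(0.0, min(float(seconds), MAX_WAIT_SECONDS))


capped = clamp_wait(120)  # a request for two minutes is reduced to the cap
```

Clamping rather than rejecting keeps a single overlong request from stalling the client while still letting the call succeed.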
From any MCP client (after registering the server):
"Open Discord and join the General voice channel"
The LLM will:
1. open_application("Discord") or click the taskbar icon
2. screenshot() to see the Discord window
3. mouse_click() on the correct server icon
4. screenshot() to verify navigation
5. mouse_click() on the voice channel
6. screenshot() to confirm joined