
Commit 380d9af

tdimino and claude committed
docs: add Claude Code skill for desktop GUI automation
Add a community-contributed Claude Code skill that wraps Open Interpreter's Computer API (pyautogui, pytesseract) for desktop GUI automation. Provides standalone scripts for screenshot capture, mouse clicking (by coordinates or OCR text), keyboard input, and screen text detection.

Three integration modes:

- Library: Claude Code reasons from screenshots, dispatches actions via scripts
- OS subprocess: delegates entire GUI tasks to OI's --os agent loop
- Local agent: offline computer use via Ollama

No changes to Open Interpreter's source code or package.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 681f5ce commit 380d9af

File tree

13 files changed: +1892 −0 lines changed

Lines changed: 193 additions & 0 deletions
# open-interpreter — Claude Code Skill

A [Claude Code skill](https://code.claude.com/docs/en/skills) for desktop GUI automation, built on top of Open Interpreter's Computer API. Provides mouse, keyboard, screenshot, and OCR control for native macOS/Linux applications that have no CLI or API.

## What is this?

[Claude Code](https://github.com/anthropics/claude-code) is Anthropic's terminal-based AI coding tool. It reads `.claude/skills/` directories for specialized capabilities. This skill gives Claude Code the ability to interact with desktop GUIs by wrapping Open Interpreter's pyautogui + pytesseract primitives in standalone scripts.

## When to Use

- Interacting with desktop apps (System Preferences, Calculator, browsers, any GUI)
- Automating GUI workflows (form filling, menu navigation, data extraction)
- Reading screen content via OCR (finding buttons, labels, prices, status text)
- Controlling mouse and keyboard programmatically
## Modes

| Mode | LLM | Script | Best For |
|------|-----|--------|----------|
| **Library** | Claude Code (native) | Individual scripts | Surgical GUI actions — Claude sees screenshots, reasons, dispatches |
| **OS subprocess** | Claude API (via OI) | `oi_os_mode.py` | Delegating entire GUI tasks to OI's agent loop |
| **Local agent** | Ollama (offline) | `oi_os_mode.py --local` | Offline computer use, no API costs |

Use Library mode by default, OS subprocess for self-contained GUI tasks, and Local agent when offline.
## Prerequisites

- Python 3.10+
- [uv](https://github.com/astral-sh/uv) package manager
- macOS: Accessibility + Screen Recording permissions for the terminal app
- tesseract (`brew install tesseract`)
## Installation

To use this skill, copy the folder into your Claude Code skills directory:

```bash
cp -r .claude/skills/open-interpreter ~/.claude/skills/open-interpreter
```

Then run the install script:

```bash
~/.claude/skills/open-interpreter/scripts/oi_install.sh
```

Verify permissions:

```bash
python3 ~/.claude/skills/open-interpreter/scripts/oi_permission_check.py
```
## Directory Structure

```
open-interpreter/
├── SKILL.md                       # Skill instructions for Claude Code
├── README.md                      # This file
├── scripts/
│   ├── oi_install.sh              # One-shot install + permissions check
│   ├── oi_screenshot.py           # Screen capture with Retina metadata
│   ├── oi_click.py                # Mouse click by coordinates or OCR text
│   ├── oi_type.py                 # Keyboard input, hotkeys, key presses
│   ├── oi_find_text.py            # OCR: find text on screen → JSON coords
│   ├── oi_computer.py             # Unified dispatch for all actions
│   ├── oi_os_mode.py              # Launch OI as managed subprocess
│   └── oi_permission_check.py     # Check macOS permissions
└── references/
    ├── computer-api.md            # OI Computer API reference
    ├── os-mode.md                 # OS Mode usage and architecture
    └── safety-and-permissions.md  # Permissions guide and safety model
```
## Scripts

### oi_screenshot.py — Screen capture

```bash
python3 scripts/oi_screenshot.py                        # Full screen
python3 scripts/oi_screenshot.py --region 0,0,800,600   # Region
python3 scripts/oi_screenshot.py --active-window        # Active window only
```

Outputs file path + `SCALE_FACTOR` + `SCREEN_SIZE` metadata (3 lines to stdout).
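A caller can consume those three metadata lines programmatically. A minimal sketch, assuming the script prints the file path followed by `KEY=VALUE` pairs (the exact output format is an assumption, not specified above):

```python
def parse_screenshot_output(stdout: str) -> dict:
    """Parse the 3-line stdout of oi_screenshot.py.

    Assumed (hypothetical) format:
        /tmp/screen.png
        SCALE_FACTOR=2.0
        SCREEN_SIZE=1440x900
    """
    path_line, *meta_lines = stdout.strip().splitlines()
    meta = dict(line.split("=", 1) for line in meta_lines)
    width, height = map(int, meta["SCREEN_SIZE"].split("x"))
    return {
        "path": path_line,
        "scale": float(meta["SCALE_FACTOR"]),
        "screen_size": (width, height),
    }
```

The parsed `scale` is what later coordinate conversions need when clicking positions taken from the screenshot image.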
### oi_click.py — Mouse click

```bash
python3 scripts/oi_click.py --x 450 --y 300                  # Coordinate click
python3 scripts/oi_click.py --x 900 --y 600 --image-coords   # Auto-divide by Retina scale
python3 scripts/oi_click.py --text "Submit"                  # OCR: find and click text
python3 scripts/oi_click.py --x 450 --y 300 --double         # Double click
python3 scripts/oi_click.py --x 450 --y 300 --right          # Right click
```
### oi_type.py — Keyboard input

```bash
python3 scripts/oi_type.py --text "hello world"              # Clipboard-paste (default)
python3 scripts/oi_type.py --key enter                       # Single key press
python3 scripts/oi_type.py --hotkey command space            # Hotkey (AppleScript on macOS)
python3 scripts/oi_type.py --text "search" --method typewrite  # Character-by-character
```
### oi_find_text.py — OCR screen reading

```bash
python3 scripts/oi_find_text.py --text "Submit"
python3 scripts/oi_find_text.py --text "Price" --all --min-conf 80
```

Returns JSON: `[{"text": "Submit", "x": 450, "y": 300, "w": 80, "h": 24, "confidence": 95}]`
### oi_computer.py — Unified dispatch

```bash
python3 scripts/oi_computer.py screenshot
python3 scripts/oi_computer.py click --x 450 --y 300
python3 scripts/oi_computer.py type --text "hello"
python3 scripts/oi_computer.py find --text "Submit"
python3 scripts/oi_computer.py scroll --clicks 3
python3 scripts/oi_computer.py mouse-position
python3 scripts/oi_computer.py screen-size
```
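A unified entry point like this is essentially a subcommand-to-function dispatch. A minimal sketch of the pattern (the handler names and return strings here are illustrative, not the script's actual internals):

```python
import argparse

# Illustrative handlers; the real script wraps pyautogui/pytesseract calls.
def do_screenshot(args):
    return f"screenshot region={args.region}"

def do_click(args):
    return f"click at ({args.x}, {args.y})"

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="oi_computer.py")
    sub = parser.add_subparsers(dest="action", required=True)

    shot = sub.add_parser("screenshot")
    shot.add_argument("--region", default=None)
    shot.set_defaults(func=do_screenshot)

    click = sub.add_parser("click")
    click.add_argument("--x", type=int, required=True)
    click.add_argument("--y", type=int, required=True)
    click.set_defaults(func=do_click)
    return parser

def dispatch(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Each subcommand owns its flags, so `click --x 450 --y 300` and `screenshot` parse independently.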
### oi_os_mode.py — Delegate full GUI tasks

```bash
python3 scripts/oi_os_mode.py "Open Calculator and compute 2+2"
python3 scripts/oi_os_mode.py --local "What apps are open?"   # Ollama (offline)
```
## Quick Examples

### Open an app via Spotlight

```bash
python3 scripts/oi_type.py --hotkey command space
sleep 0.5
python3 scripts/oi_type.py --text "Calculator"
sleep 0.3
python3 scripts/oi_type.py --key enter
```

### Click a button by label

```bash
python3 scripts/oi_click.py --text "Save"
```

### Read text from screen

```bash
python3 scripts/oi_find_text.py --text "Total" --all
```

### Fill a form

```bash
python3 scripts/oi_click.py --text "Email"
python3 scripts/oi_type.py --text "user@example.com"
python3 scripts/oi_type.py --key tab
python3 scripts/oi_type.py --text "password123"
```
## Retina Display Handling

macOS Retina displays render at 2x scaling, so screenshot image pixels differ from pyautogui screen coordinates. Use `--image-coords` on `oi_click.py` to auto-divide coordinates by the scale factor when targeting positions taken from screenshot pixels.
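The conversion itself is a simple divide. A minimal sketch of what `--image-coords` does (an assumption about the implementation, but consistent with the 2x behavior described above):

```python
def image_to_screen(x_img: int, y_img: int, scale: float) -> tuple[int, int]:
    """Convert screenshot pixel coordinates to pyautogui screen coordinates.

    On a 2x Retina display a screenshot has twice the logical resolution,
    so image coordinates must be divided by the scale factor.
    """
    return (round(x_img / scale), round(y_img / scale))
```

For example, pixel (900, 600) in a 2x screenshot maps to screen point (450, 300), matching the `--image-coords` usage shown earlier.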
## Safety

1. Confirm with the user before clicking Send, Delete, Submit, or Confirm buttons
2. Screenshot before and after every action for verification
3. No unbounded autonomous loops
4. pyautogui failsafe: moving the mouse to a screen corner raises an exception
5. Every script logs actions to stderr: `[oi] click at (450, 300) button=left`
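Rule 1 can also be enforced mechanically rather than by convention. A hypothetical sketch of a confirmation gate a caller might wrap around click actions (not part of the skill's scripts):

```python
# Labels that require explicit user confirmation before clicking.
DESTRUCTIVE_LABELS = {"send", "delete", "submit", "confirm"}

def requires_confirmation(label: str) -> bool:
    """Return True if clicking this label should be confirmed with the user first."""
    return label.strip().lower() in DESTRUCTIVE_LABELS

def guarded_click(label: str, confirmed: bool, click_fn) -> bool:
    """Dispatch the click only if it is safe or explicitly confirmed.

    Returns False (without clicking) when the caller still needs to
    ask the user; returns True after the click is dispatched.
    """
    if requires_confirmation(label) and not confirmed:
        return False
    click_fn(label)
    return True
```

In Library mode, Claude Code would ask the user and retry with `confirmed=True` before any destructive click goes through.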
## Troubleshooting

| Symptom | Fix |
|---------|-----|
| Black screenshot | Grant Screen Recording permission to terminal app |
| Click/type no effect | Grant Accessibility permission to terminal app |
| OCR finds no text | Verify tesseract: `which tesseract && tesseract --version` |
| Coordinates off by 2x | Use `--image-coords` flag on `oi_click.py` |
| OS Mode hangs | Verify `ANTHROPIC_API_KEY` is set |
| Local mode fails | Verify Ollama running: `ollama list` |
## Credits

- [OpenInterpreter](https://github.com/OpenInterpreter/open-interpreter) by Killian Lucas — the foundation this skill builds on
- [Claudicle](https://github.com/tdimino/claudicle) by Tom di Mino — open-source soul agent framework, LLM-agnostic at the cognitive level
- Built as a [Claude Code skill](https://code.claude.com/docs/en/skills) following the [Agent Skills](https://agentskills.io/) open standard
