[S2 blog] · [S2 Paper (COLM 2025)] · [S2 Video]
[S1 blog] · [S1 Paper (ICLR 2025)] · [S1 Video]
- 2025/08/01: Agent S2.5 is released: simpler, better, and faster! New SOTA on OSWorld-Verified!
- 2025/07/07: The Agent S2 paper is accepted to COLM 2025! See you in Montreal!
- 2025/04/01: Released the Agent S2 paper with new SOTA results on OSWorld, WindowsAgentArena, and AndroidWorld!
- 2025/03/12: Released Agent S2 along with v0.2.0 of gui-agents, the new state-of-the-art for computer use agents (CUA), outperforming OpenAI's CUA/Operator and Anthropic's Claude 3.7 Sonnet Computer-Use!
- 2025/01/22: The Agent S paper is accepted to ICLR 2025!
- 2025/01/21: Released v0.1.2 of gui-agents library, with support for Linux and Windows!
- 2024/12/05: Released v0.1.0 of gui-agents library, allowing you to use Agent-S for Mac, OSWorld, and WindowsAgentArena with ease!
- 2024/10/10: Released the Agent S paper and codebase!
- Introduction
- Current Results
- Installation & Setup
- Usage
- Acknowledgements
- Citation
Welcome to Agent S, an open-source framework designed to enable autonomous interaction with computers through an Agent-Computer Interface. Our mission is to build intelligent GUI agents that can learn from past experiences and perform complex tasks autonomously on your computer.
Whether you're interested in AI, automation, or contributing to cutting-edge agent-based systems, we're excited to have you here!
| Benchmark | Agent S2.5 | Previous SOTA |
|---|---|---|
| OSWorld Verified (100 step) | 56.0% | 53.1% |
| OSWorld Verified (50 step) | 54.2% | 50.6% |
Note: Our agent returns `pyautogui` code and is intended for a single-monitor screen.
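Since the generated actions use absolute coordinates on that single screen, it can help to confirm the resolution the agent will act on; a quick, optional check:

```python
import pyautogui

# Resolution of the primary monitor; the generated pyautogui code targets
# coordinates on this single screen.
width, height = pyautogui.size()
print(f"Screen resolution: {width}x{height}")
```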
❗Warning❗: If you are on a Linux machine, creating a `conda` environment will interfere with `pyatspi`. As of now, there is no clean solution for this issue, so proceed through the installation without using `conda` or any virtual environment.
⚠️ Disclaimer ⚠️: To leverage the full potential of Agent S2, we utilize UI-TARS as a grounding model (7B-DPO, or 72B-DPO for better performance). It can be hosted locally or on Hugging Face Inference Endpoints; our code supports the latter. Check out Hugging Face Inference Endpoints for more information on how to set up and query such an endpoint. However, running Agent S2 does not require this model: you can also use API-based models for visual grounding, such as Claude.
Install the package:

```sh
pip install gui-agents
```
Set your LLM API keys and other environment variables. You can do this by adding the following lines to your .bashrc (Linux) or .zshrc (macOS) file:
```sh
export OPENAI_API_KEY=<YOUR_API_KEY>
export ANTHROPIC_API_KEY=<YOUR_ANTHROPIC_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
```
Alternatively, you can set the environment variable in your Python script:
```python
import os

os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
```
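You can also keep the keys in a project-local `.env` file and load them with `python-dotenv`, as the Usage walkthrough below does; a minimal sketch (the key names are the same environment variables listed above, and the `.env` location is an assumption):

```python
# Requires: pip install python-dotenv
# Assumes a .env file next to your script, e.g.:
#   OPENAI_API_KEY=<YOUR_API_KEY>
#   ANTHROPIC_API_KEY=<YOUR_ANTHROPIC_API_KEY>
#   HF_TOKEN=<YOUR_HF_TOKEN>
from dotenv import load_dotenv

load_dotenv()  # loads the values above into this process's environment
```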
We also support Azure OpenAI, Anthropic, Gemini, Open Router, and vLLM inference. For more information refer to models.md.
❗Warning❗: The agent directly runs Python code to control your computer. Please use with care.
Note: Our best configuration uses o3 and UI-TARS-1.5-7B.
Run Agent S2 with a specific model (default is `gpt-4o`):
```sh
agent_s2 \
  --provider "anthropic" \
  --model "claude-3-7-sonnet-20250219" \
  --grounding_model_provider "anthropic" \
  --grounding_model "claude-3-7-sonnet-20250219"
```
Or use a custom endpoint:

```sh
agent_s2 \
  --provider "anthropic" \
  --model "claude-3-7-sonnet-20250219" \
  --endpoint_provider "huggingface" \
  --endpoint_url "<endpoint_url>/v1/"
```
- `--provider`, `--model`
  - Purpose: Specifies the main generation model
  - Supports: all model providers in models.md
  - Default: `--provider "anthropic" --model "claude-3-7-sonnet-20250219"`
- `--model_url`, `--model_api_key`
  - Purpose: Specifies a custom endpoint for the main generation model and your API key
  - Note: These are optional. If not specified, `gui-agents` defaults to your environment variables for the URL and API key.
  - Supports: all model providers in models.md
  - Default: None
You can use either Configuration 1 or Configuration 2:
Configuration 1 (API-based grounding model):
- `--grounding_model_provider`, `--grounding_model`
  - Purpose: Specifies the model for visual grounding (coordinate prediction)
  - Supports: all model providers in models.md
  - Default: `--grounding_model_provider "anthropic" --grounding_model "claude-3-7-sonnet-20250219"`
- ❗Important❗ `--grounding_model_resize_width`
  - Purpose: Some API providers automatically rescale images, so the generated (x, y) coordinates are relative to the rescaled image dimensions rather than the original ones.
  - Supports: Anthropic rescaling
  - Tips: If your grounding is inaccurate even for very simple queries, double-check that the rescaling width is correct for your machine's resolution.
  - Default: `--grounding_model_resize_width 1366` (Anthropic)
Configuration 2 (custom grounding endpoint):
- `--endpoint_provider`
  - Purpose: Specifies the endpoint provider
  - Supports: HuggingFace TGI, vLLM, Open Router
  - Default: None
- `--endpoint_url`
  - Purpose: The URL of your custom endpoint
  - Default: None
- `--endpoint_api_key`
  - Purpose: Your API key for your custom endpoint
  - Note: This is optional. If not specified, `gui-agents` defaults to your environment variables for the API key.
  - Default: None
Note: Configuration 2 takes precedence over Configuration 1.
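For example, an invocation that spells out the Configuration 1 grounding flags, including the resize width (the values mirror the documented defaults; adjust the width to your provider and screen resolution):

```sh
agent_s2 \
  --provider "anthropic" \
  --model "claude-3-7-sonnet-20250219" \
  --grounding_model_provider "anthropic" \
  --grounding_model "claude-3-7-sonnet-20250219" \
  --grounding_model_resize_width 1366
```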
This will show a user query prompt where you can enter your query and interact with Agent S2. You can use any model from the list of supported models in models.md.
First, we import the necessary modules. `AgentS2` is the main agent class for Agent S2, and `OSWorldACI` is our grounding agent that translates agent actions into executable Python code.
```python
import pyautogui
import io
from gui_agents.s2.agents.agent_s import AgentS2
from gui_agents.s2.agents.grounding import OSWorldACI

# Load in your API keys.
from dotenv import load_dotenv
load_dotenv()

current_platform = "linux"  # "darwin", "windows"
```
Next, we define our engine parameters. `engine_params` is used for the main agent, and `engine_params_for_grounding` is for grounding. For `engine_params_for_grounding`, we support the Claude and GPT series as well as Hugging Face Inference Endpoints.
```python
engine_params = {
    "engine_type": provider,
    "model": model,
    "base_url": model_url,     # Optional
    "api_key": model_api_key,  # Optional
}
```
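`provider`, `model`, `model_url`, and `model_api_key` above are placeholders you supply yourself. As a sketch (assuming Anthropic for generation, matching the CLI defaults earlier), and since `base_url` and `api_key` are optional, a minimal setup could be:

```python
provider = "anthropic"
model = "claude-3-7-sonnet-20250219"

# Minimal engine parameters: the optional "base_url" and "api_key" entries are
# omitted, so the API key is read from your environment variables instead.
engine_params = {
    "engine_type": provider,
    "model": model,
}
```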
```python
# Grounding Configuration 1: Load the grounding engine from an API-based model
grounding_model_provider = "<your_grounding_model_provider>"
grounding_model = "<your_grounding_model>"
grounding_model_resize_width = 1366
screen_width, screen_height = pyautogui.size()

engine_params_for_grounding = {
    "engine_type": grounding_model_provider,
    "model": grounding_model,
    "grounding_width": grounding_model_resize_width,
    "grounding_height": screen_height * grounding_model_resize_width / screen_width,
}
```
```python
# Grounding Configuration 2: Load the grounding engine from a HuggingFace TGI endpoint
endpoint_provider = "<your_endpoint_provider>"
endpoint_url = "<your_endpoint_url>"
endpoint_api_key = "<your_api_key>"

engine_params_for_grounding = {
    "engine_type": endpoint_provider,
    "base_url": endpoint_url,
    "api_key": endpoint_api_key,  # Optional
}
```
Then, we define our grounding agent and Agent S2.
```python
grounding_agent = OSWorldACI(
    platform=current_platform,
    engine_params_for_generation=engine_params,
    engine_params_for_grounding=engine_params_for_grounding
)

agent = AgentS2(
    engine_params,
    grounding_agent,
    platform=current_platform,
    action_space="pyautogui",
    observation_type="screenshot",
    search_engine="Perplexica",     # Assuming you have set up Perplexica.
    embedding_engine_type="openai"  # Supports "gemini", "openai"
)
```
Finally, let's query the agent!
```python
# Get screenshot.
screenshot = pyautogui.screenshot()
buffered = io.BytesIO()
screenshot.save(buffered, format="PNG")
screenshot_bytes = buffered.getvalue()

obs = {
    "screenshot": screenshot_bytes,
}
instruction = "Close VS Code"
info, action = agent.predict(instruction=instruction, observation=obs)

exec(action[0])
```
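The single predict/exec step above extends naturally to a loop; a minimal sketch, reusing the same `agent`, `instruction`, and imports (the step budget and the completion check are illustrative assumptions, not the library's API):

```python
import time

MAX_STEPS = 15  # illustrative budget, not a library default

for step in range(MAX_STEPS):
    # Capture the current screen and wrap it as the observation.
    screenshot = pyautogui.screenshot()
    buffered = io.BytesIO()
    screenshot.save(buffered, format="PNG")
    obs = {"screenshot": buffered.getvalue()}

    info, action = agent.predict(instruction=instruction, observation=obs)

    # Assumption: treat an action mentioning DONE or FAIL as the agent signaling completion.
    if "done" in action[0].lower() or "fail" in action[0].lower():
        break

    exec(action[0])  # run the generated pyautogui code
    time.sleep(1.0)  # give the UI a moment to settle before the next screenshot
```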
Refer to `gui_agents/s2/cli_app.py` for more details on how the inference loop works.
To deploy Agent S2 in OSWorld, follow the OSWorld Deployment instructions.
To deploy Agent S2 in WindowsAgentArena, follow the WindowsAgentArena Deployment Instructions.
If you find this codebase useful, please cite:
```bibtex
@misc{Agent-S2,
    title={Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents},
    author={Saaket Agashe and Kyle Wong and Vincent Tu and Jiachen Yang and Ang Li and Xin Eric Wang},
    year={2025},
    eprint={2504.00906},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2504.00906},
}

@inproceedings{Agent-S,
    title={{Agent S: An Open Agentic Framework that Uses Computers Like a Human}},
    author={Saaket Agashe and Jiuzhou Han and Shuyu Gan and Jiachen Yang and Ang Li and Xin Eric Wang},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2025},
    url={https://arxiv.org/abs/2410.08164}
}
```