
Agent S: Use Computer Like a Human

🌐 [S2 blog]  📄 [S2 Paper (COLM 2025)]  🎥 [S2 Video]

🌐 [S1 blog]  📄 [S1 Paper (ICLR 2025)]  🎥 [S1 Video]

🥳 Updates

  • 2025/08/01: Agent S2.5 is released: simpler, better, and faster! New SOTA on OSWorld-Verified!
  • 2025/07/07: The Agent S2 paper is accepted to COLM 2025! See you in Montreal!
  • 2025/04/01: Released the Agent S2 paper with new SOTA results on OSWorld, WindowsAgentArena, and AndroidWorld!
  • 2025/03/12: Released Agent S2 along with v0.2.0 of gui-agents, the new state-of-the-art for computer use agents (CUA), outperforming OpenAI's CUA/Operator and Anthropic's Claude 3.7 Sonnet Computer-Use!
  • 2025/01/22: The Agent S paper is accepted to ICLR 2025!
  • 2025/01/21: Released v0.1.2 of gui-agents library, with support for Linux and Windows!
  • 2024/12/05: Released v0.1.0 of gui-agents library, allowing you to use Agent-S for Mac, OSWorld, and WindowsAgentArena with ease!
  • 2024/10/10: Released the Agent S paper and codebase!

Table of Contents

  1. 💡 Introduction
  2. 🎯 Current Results
  3. 🛠️ Installation & Setup
  4. 🚀 Usage
  5. 🤝 Acknowledgements
  6. 💬 Citations

💡 Introduction

Welcome to Agent S, an open-source framework designed to enable autonomous interaction with computers through an Agent-Computer Interface. Our mission is to build intelligent GUI agents that can learn from past experiences and perform complex tasks autonomously on your computer.

Whether you're interested in AI, automation, or contributing to cutting-edge agent-based systems, we're excited to have you here!

🎯 Current Results

Benchmark                      Agent S2.5   Previous SOTA
OSWorld Verified (100 step)    56.0%        53.1%
OSWorld Verified (50 step)     54.2%        50.6%

πŸ› οΈ Installation & Setup

Note: Our agent returns pyautogui code and is intended for use on a single monitor.

❗Warning❗: If you are on a Linux machine, creating a conda environment will interfere with pyatspi. There is currently no clean workaround, so proceed through the installation without conda or any virtual environment.

⚠️Disclaimer⚠️: To leverage the full potential of Agent S2, we utilize UI-TARS as a grounding model (7B-DPO, or 72B-DPO for better performance). It can be hosted locally or on Hugging Face Inference Endpoints, which our code supports; see the Hugging Face Inference Endpoints documentation for how to set up and query such an endpoint. However, running Agent S2 does not require this model: you can use alternative API-based models for visual grounding, such as Claude.

Install the package:

pip install gui-agents

Set your LLM API keys and other environment variables. You can do this by adding the following lines to your .bashrc (Linux) or .zshrc (macOS) file.

export OPENAI_API_KEY=<YOUR_API_KEY>
export ANTHROPIC_API_KEY=<YOUR_ANTHROPIC_API_KEY>
export HF_TOKEN=<YOUR_HF_TOKEN>
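
After editing, reload your shell configuration so the variables take effect:

source ~/.bashrc  # or: source ~/.zshrc on macOS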

Alternatively, you can set the environment variable in your Python script:

import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

We also support Azure OpenAI, Anthropic, Gemini, OpenRouter, and vLLM inference. For more information, refer to models.md.

❗Warning❗: The agent will directly run python code to control your computer. Please use with care.

🚀 Usage

Note: Our best configuration uses o3 and UI-TARS-1.5-7B.

CLI

Run Agent S2 with a specific model (default is gpt-4o):

agent_s2 \
  --provider "anthropic" \
  --model "claude-3-7-sonnet-20250219" \
  --grounding_model_provider "anthropic" \
  --grounding_model "claude-3-7-sonnet-20250219"

Or use a custom endpoint:

agent_s2 \
  --provider "anthropic" \
  --model "claude-3-7-sonnet-20250219" \
  --endpoint_provider "huggingface" \
  --endpoint_url "<endpoint_url>/v1/"

Main Model Settings

  • --provider, --model
    • Purpose: Specifies the main generation model
    • Supports: all model providers in models.md
    • Default: --provider "anthropic" --model "claude-3-7-sonnet-20250219"
  • --model_url, --model_api_key
    • Purpose: Specifies a custom endpoint for the main generation model, along with your API key (see the example below)
    • Note: These are optional. If not specified, gui-agents will default to your environment variables for the URL and API key.
    • Supports: all model providers in models.md
    • Default: None
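
For example, a hypothetical invocation that points the main generation model at a custom endpoint (the flags are documented above; the values are placeholders to fill in):

agent_s2 \
  --provider "openai" \
  --model "gpt-4o" \
  --model_url "<your_model_url>" \
  --model_api_key "<your_model_api_key>"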

Grounding Configuration Options

You can use either Configuration 1 or Configuration 2:

(Default) Configuration 1: API-Based Models
  • --grounding_model_provider, --grounding_model
    • Purpose: Specifies the model for visual grounding (coordinate prediction)
    • Supports: all model providers in models.md
    • Default: --grounding_model_provider "anthropic" --grounding_model "claude-3-7-sonnet-20250219"
  • ❗Important❗ --grounding_model_resize_width
    • Purpose: Some API providers automatically rescale images, so the generated (x, y) coordinates are relative to the rescaled image dimensions rather than the original ones (see the sketch after these options)
    • Supports: Anthropic rescaling
    • Tips: If your grounding is inaccurate even for very simple queries, double-check that the rescaling width is correct for your machine's resolution.
    • Default: --grounding_model_resize_width 1366 (Anthropic)
Configuration 2: Custom Endpoint
  • --endpoint_provider
    • Purpose: Specifies the endpoint provider
    • Supports: HuggingFace TGI, vLLM, OpenRouter
    • Default: None
  • --endpoint_url
    • Purpose: The URL of your custom endpoint
    • Default: None
  • --endpoint_api_key
    • Purpose: Your API key for your custom endpoint
    • Note: This is optional. If not specified, gui-agents will default to your environment variables for the API key.
    • Default: None

Note: Configuration 2 takes precedence over Configuration 1.
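
To make the rescaling concrete, here is a minimal sketch (not the library's internal code; the function name is hypothetical) of mapping a coordinate predicted on a provider-rescaled image back to native screen pixels, assuming the provider scales the screenshot to a fixed width while preserving the aspect ratio:

import pyautogui

def rescaled_to_screen(x, y, resize_width=1366):
    # The provider saw the screenshot scaled down to `resize_width` pixels
    # wide, so scale the predicted coordinate back up to native resolution.
    screen_width, screen_height = pyautogui.size()
    scale = screen_width / resize_width
    return x * scale, y * scale

The same aspect-ratio-preserving scaling is used to compute grounding_height in the SDK example below.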

Running agent_s2 shows a user query prompt where you can enter your query and interact with Agent S2. You can use any model from the list of supported models in models.md.

gui_agents SDK

First, we import the necessary modules. AgentS2 is the main agent class for Agent S2, and OSWorldACI is our grounding agent that translates agent actions into executable Python code.

import pyautogui
import io
from gui_agents.s2.agents.agent_s import AgentS2
from gui_agents.s2.agents.grounding import OSWorldACI

# Load in your API keys.
from dotenv import load_dotenv
load_dotenv()

current_platform = "linux"  # "darwin", "windows"
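
load_dotenv() reads keys from a .env file in your working directory; a minimal .env mirrors the shell exports shown earlier:

OPENAI_API_KEY=<YOUR_API_KEY>
ANTHROPIC_API_KEY=<YOUR_ANTHROPIC_API_KEY>
HF_TOKEN=<YOUR_HF_TOKEN>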

Next, we define our engine parameters: engine_params is used for the main agent, and engine_params_for_grounding is used for grounding. For engine_params_for_grounding, we support the Claude and GPT series as well as Hugging Face Inference Endpoints.

engine_params = {
  "engine_type": provider,     # e.g., "anthropic"
  "model": model,              # e.g., "claude-3-7-sonnet-20250219"
  "base_url": model_url,       # Optional custom endpoint
  "api_key": model_api_key,    # Optional
}

# Grounding Configuration 1: Load the grounding engine from an API based model
grounding_model_provider = "<your_grounding_model_provider>"
grounding_model = "<your_grounding_model>"
grounding_model_resize_width = 1366
screen_width, screen_height = pyautogui.size()

engine_params_for_grounding = {
  "engine_type": grounding_model_provider,
  "model": grounding_model,
  "grounding_width": grounding_model_resize_width,
  # Scale the height by the same factor as the width to preserve aspect ratio.
  "grounding_height": screen_height * grounding_model_resize_width / screen_width,
}

# Grounding Configuration 2: Load the grounding engine from a HuggingFace TGI endpoint
endpoint_provider = "<your_endpoint_provider>"
endpoint_url = "<your_endpoint_url>"
endpoint_api_key = "<your_api_key>"

engine_params_for_grounding = {
  "engine_type": endpoint_provider,
  "base_url": endpoint_url,
  "api_key": endpoint_api_key,  # Optional
}

Then, we define our grounding agent and Agent S2.

grounding_agent = OSWorldACI(
    platform=current_platform,
    engine_params_for_generation=engine_params,
    engine_params_for_grounding=engine_params_for_grounding
)

agent = AgentS2(
  engine_params,
  grounding_agent,
  platform=current_platform,
  action_space="pyautogui",
  observation_type="screenshot",
  search_engine="Perplexica",  # Assuming you have set up Perplexica.
  embedding_engine_type="openai"  # Supports "gemini", "openai"
)

Finally, let's query the agent!

# Get screenshot.
screenshot = pyautogui.screenshot()
buffered = io.BytesIO() 
screenshot.save(buffered, format="PNG")
screenshot_bytes = buffered.getvalue()

obs = {
  "screenshot": screenshot_bytes,
}

instruction = "Close VS Code"
info, action = agent.predict(instruction=instruction, observation=obs)

exec(action[0])

Refer to gui_agents/s2/cli_app.py for more details on how the inference loop works.
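
As a rough sketch of that loop (the real cli_app.py adds error handling, user confirmation, and a proper termination check; the step cap and sleep below are illustrative choices), the predict-and-execute cycle can simply be repeated:

import time

for _ in range(10):  # cap the number of steps for this sketch
    screenshot = pyautogui.screenshot()
    buffered = io.BytesIO()
    screenshot.save(buffered, format="PNG")
    obs = {"screenshot": buffered.getvalue()}

    info, action = agent.predict(instruction=instruction, observation=obs)
    exec(action[0])  # runs generated pyautogui code -- use with care
    time.sleep(1)    # give the UI time to settle before the next screenshot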

OSWorld

To deploy Agent S2 in OSWorld, follow the OSWorld Deployment instructions.

WindowsAgentArena

To deploy Agent S2 in WindowsAgentArena, follow the WindowsAgentArena Deployment Instructions.

💬 Citations

If you find this codebase useful, please cite:

@misc{Agent-S2,
      title={Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents}, 
      author={Saaket Agashe and Kyle Wong and Vincent Tu and Jiachen Yang and Ang Li and Xin Eric Wang},
      year={2025},
      eprint={2504.00906},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2504.00906}, 
}

@inproceedings{Agent-S,
    title={{Agent S: An Open Agentic Framework that Uses Computers Like a Human}},
    author={Saaket Agashe and Jiuzhou Han and Shuyu Gan and Jiachen Yang and Ang Li and Xin Eric Wang},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2025},
    url={https://arxiv.org/abs/2410.08164}
}
