[FEATURE] Use AppleScripts for MacOS Computer Use when possible #371

@edenreich

Description

Summary

Screenshots are expensive.

I need to rethink the approach of relying purely on vision models for Computer Use. LLMs can also interact with the Accessibility Tree instead of working from screenshots alone, which would make the system more compatible and more token-efficient.

Not every OS exposes this kind of structured data, so the task is to explore where it is available and how it can make Computer Use more efficient.
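As a rough illustration of what reading structured UI data on macOS could look like, here is a minimal sketch that shells out to `osascript` and asks System Events for the contents of an application's first window. The function names (`build_osascript_command`, `run_command`, `read_ui_tree`) are hypothetical and not part of this project, and the sketch assumes the calling process has been granted Accessibility permissions. On any other platform it simply returns `None`, so the caller can decide what to do next.

```python
import subprocess
import sys
from typing import Callable, List, Optional


def build_osascript_command(process_name: str) -> List[str]:
    """Build an osascript invocation that dumps the UI elements of window 1.

    Hypothetical helper: the name and shape are assumptions for illustration.
    """
    # Escape backslashes and quotes so the name is a valid AppleScript string.
    safe = process_name.replace("\\", "\\\\").replace('"', '\\"')
    script = (
        'tell application "System Events" to tell process "' + safe + '" '
        "to get entire contents of window 1"
    )
    return ["osascript", "-e", script]


def run_command(command: List[str]) -> Optional[str]:
    """Run a command and return its stdout, or None on any failure."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def read_ui_tree(
    process_name: str,
    platform: str = sys.platform,
    runner: Callable[[List[str]], Optional[str]] = run_command,
) -> Optional[str]:
    """Return a textual UI element listing on macOS, or None elsewhere."""
    if platform != "darwin":
        # No AppleScript available: the caller should use another strategy.
        return None
    return runner(build_osascript_command(process_name))
```

The `platform` and `runner` parameters exist only so the behaviour can be exercised without a Mac, for example `read_ui_tree("Finder", platform="linux")` returns `None` without running anything.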

One option is a tool that reads the tree and selects an element, with a fallback: when that fails, it consumes a screenshot instead (GetLatestScreenshot).
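The tree-first, screenshot-fallback idea could be sketched roughly as follows. Everything here is an assumption made for illustration: `UIElement`, `Observation`, `find_element`, and `locate` are hypothetical names, and `take_screenshot` stands in for whatever GetLatestScreenshot does in the real system. The tree reader and the screenshot function are passed in as callables so the sketch does not depend on any particular OS API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class UIElement:
    """A simplified accessibility node (hypothetical shape)."""
    role: str
    title: str
    children: List["UIElement"] = field(default_factory=list)


@dataclass
class Observation:
    """What the model receives: a matched element or a screenshot."""
    source: str  # "tree" or "screenshot"
    element: Optional[UIElement] = None
    screenshot: Optional[bytes] = None


def find_element(root: UIElement, role: str, title: str) -> Optional[UIElement]:
    """Depth-first search for the first node matching role and title."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.role == role and node.title == title:
            return node
        # Reverse so children are visited in their original order.
        stack.extend(reversed(node.children))
    return None


def locate(
    role: str,
    title: str,
    read_tree: Callable[[], Optional[UIElement]],
    take_screenshot: Callable[[], bytes],
) -> Observation:
    """Try the accessibility tree first; fall back to a screenshot."""
    try:
        root = read_tree()
    except Exception:
        # Any failure reading the tree is treated as "tree unavailable".
        root = None
    if root is not None:
        element = find_element(root, role, title)
        if element is not None:
            return Observation(source="tree", element=element)
    # Tree missing or element not found: pay for a screenshot only now.
    return Observation(source="screenshot", screenshot=take_screenshot())
```

With this shape, a screenshot is only captured when the tree is unavailable or the element cannot be found, which is where the token savings would come from.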

Acceptance Criteria

  • A pre-step tool call exists for reading and using the Accessibility Tree, making operations more efficient
  • Fewer tokens are ingested
  • It's documented
  • It's tested

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Status: Todo
Milestone: No milestone