Open
Labels
enhancement: New feature or request
Description
Summary
Screenshots are expensive.
I need to rethink the use of vision models for Computer Use. Instead of relying purely on vision, the LLM can also interact with the Accessibility Tree, which should make the system more compatible and more token-efficient.
Not every OS exposes this kind of structured data, so the task is to explore where it can be used to make operations more efficient.
One option is a tool that reads the tree and selects an element, and that falls back to consuming a screenshot (GetLatestScreenshot) when that fails.
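A rough sketch of that tree-first flow, to make the idea concrete. Everything here is an assumption for illustration: `AXNode`, `serialize_tree`, `find_element`, and `select_element` are hypothetical names, `read_tree` stands in for whatever platform-specific reader ends up existing, and `get_latest_screenshot` stands in for the existing GetLatestScreenshot tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AXNode:
    """Hypothetical, platform-neutral view of one accessibility-tree node."""
    role: str
    name: str = ""
    children: list["AXNode"] = field(default_factory=list)


def serialize_tree(node: AXNode, depth: int = 0) -> str:
    """Render the tree as indented text, one node per line, for the LLM prompt."""
    line = "  " * depth + f"{node.role}: {node.name}".rstrip()
    return "\n".join([line] + [serialize_tree(c, depth + 1) for c in node.children])


def find_element(node: AXNode, role: str, name: str) -> Optional[AXNode]:
    """Depth-first search for the first node matching both role and name."""
    if node.role == role and node.name == name:
        return node
    for child in node.children:
        match = find_element(child, role, name)
        if match is not None:
            return match
    return None


def select_element(
    read_tree: Callable[[], AXNode],
    get_latest_screenshot: Callable[[], bytes],
    role: str,
    name: str,
) -> dict:
    """Try the accessibility tree first; fall back to a screenshot on failure."""
    try:
        root = read_tree()
    except NotImplementedError:
        # Assumed convention: the reader raises this on an OS without support.
        root = None
    if root is not None:
        match = find_element(root, role, name)
        if match is not None:
            return {"source": "accessibility_tree", "element": match}
    # Tree unavailable or element not found: pay for the screenshot.
    return {"source": "screenshot", "image": get_latest_screenshot()}
```

The screenshot callable is only invoked on the fallback path, so a successful tree lookup never pays the image cost.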
Acceptance Criteria
- A pre-step tool call exists that reads and uses the Accessibility Tree to make operations more efficient
- Fewer tokens are ingested than with the screenshot-only approach
- It's documented
- It's tested
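One possible way to check the token criterion in a test, sketched under loud assumptions: `estimate_text_tokens` and `tree_is_cheaper` are hypothetical helpers, the 4-characters-per-token ratio is only a rough heuristic rather than a real tokenizer, and the screenshot cost is passed in by the caller (for example from the model provider's reported usage) rather than assumed here.

```python
import math


def estimate_text_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for text; a heuristic, not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def tree_is_cheaper(tree_text: str, screenshot_tokens: int) -> bool:
    """True when the serialized tree is estimated to cost fewer tokens
    than the measured cost of one screenshot."""
    return estimate_text_tokens(tree_text) < screenshot_tokens
```

A real test would likely swap the heuristic for the actual tokenizer of the model in use.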
Status
Todo