Principle:Openai Openai agents python Computer Use
| Knowledge Sources | |
|---|---|
| Domains | Computer_Use, Agent_Architecture |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
Mechanism that enables an agent to interact with a graphical user interface through screenshot observation and mouse/keyboard actions.
Description
Computer Use allows an LLM agent to control a computer environment by observing screenshots and issuing structured actions (click, type, scroll, keypress, drag). The SDK defines an AsyncComputer abstract interface that implementations must satisfy, providing methods for screenshot capture and all supported input actions (click, double_click, scroll, type, wait, move, keypress, drag).
The architecture separates the computer abstraction from the agent via ComputerTool, which wraps an `AsyncComputer` instance (or a `ComputerProvider` factory) and exposes it as a tool the agent can call. The model (typically `computer-use-preview`) receives screenshots as base64-encoded images and returns structured action commands.
Two lifecycle patterns are supported: singleton (a single shared `AsyncComputer` instance for all runs) and per-request (a `ComputerProvider` factory that creates and disposes computer instances per run context). The per-request pattern avoids state leakage between concurrent agent runs.
Usage
Use this principle when building agents that need to interact with graphical interfaces such as web browsers, desktop applications, or any visual environment. Implement the `AsyncComputer` interface for your specific environment (Playwright for browsers, VNC for desktops, etc.). Use `ComputerProvider` for production scenarios requiring isolation between concurrent runs.
Theoretical Basis
Computer Use implements a perception-action loop adapted for GUI interaction:
Pseudo-code Logic:
# Abstract computer-use loop
while not done:
screenshot = await computer.screenshot() # base64 image
actions = model.decide(screenshot, goal) # structured actions
for action in actions:
if action.type == "click":
await computer.click(action.x, action.y, action.button)
elif action.type == "type":
await computer.type(action.text)
elif action.type == "scroll":
await computer.scroll(action.x, action.y, action.dx, action.dy)
# ... other action types
screenshot = await computer.screenshot() # observe result
The key design decisions are:
- Abstract interface: `AsyncComputer` decouples the agent from specific computer implementations.
- Provider pattern: `ComputerProvider` enables per-request lifecycle management.
- Structured actions: Actions are typed (click, type, scroll, etc.) rather than raw commands.