Principle:Microsoft Playwright Define Agent Actions with Perform
| Knowledge Sources | |
|---|---|
| Domains | AI_Testing, Browser_Automation, Natural_Language_Processing |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
An AI agent that receives a multi-step browser task in natural language can interact with the browser autonomously: it selects and executes the appropriate actions without each step being explicitly programmed.
Description
Traditional browser automation requires the test author to specify every interaction explicitly: click this selector, type into that field, wait for this element. AI-agent-driven task execution inverts this model. The test author describes what should happen in natural language, and the agent autonomously determines how to accomplish it.
This principle encompasses:
- Natural language task specification: The test author writes human-readable instructions like "Fill out the registration form with valid data and submit it" rather than scripting individual clicks and keystrokes.
- Autonomous action selection: The large language model (LLM) observes the current page state (via accessibility snapshots or screenshots), reasons about the task, and selects the appropriate browser action from a defined tool set.
- Multi-step execution: A single natural language task may require multiple sequential browser actions. The agent loops: observe the page, decide the next action, execute it, observe the result, and repeat until the task is complete or limits are reached.
- Action budgeting: Each task execution is bounded by maximum action counts and token limits to prevent infinite loops and control costs.
- Retry logic: When an action fails (e.g., element not found, navigation timeout), the agent can retry with adjusted parameters up to a configurable retry limit.
Usage
Apply this principle when:
- You want to automate browser workflows using natural language instead of explicit selectors
- The page structure may change frequently, making selector-based tests brittle
- You need to test complex user journeys that span multiple pages and interactions
- You want to reduce test maintenance overhead by abstracting away DOM details
- You are building exploratory tests that adapt to varying page states
Theoretical Basis
Agent task execution follows the ReAct (Reasoning + Acting) loop pattern:
    AgentLoop(task):
        history = []
        retries = 0                                   // Failed attempts so far
        for step in 1..maxActions:
            observation = snapshot(page)              // Capture current page state
            history.append(observation)
            action = LLM.decide(task, history)        // LLM selects next action
            if action == TASK_COMPLETE:
                return SUCCESS
            try:
                result = execute(page, action)        // Run browser action
                history.append(result)
            catch error:
                if retries < maxActionRetries:
                    history.append(error)             // Let the LLM see the failure
                    retries += 1
                    continue                          // Next step re-observes and retries
                else:
                    return FAILURE
        return EXCEEDED_ACTION_LIMIT
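The loop above can be sketched in TypeScript as follows. This is a minimal illustration rather than Playwright's implementation: `runAgentLoop`, `AgentDeps`, and the scripted stand-ins for the LLM and the browser are hypothetical names, and the sketch assumes, as the pseudocode does, that retries are counted across the whole run and that a failed attempt still consumes one step of the action budget.

```typescript
// Hypothetical types for illustration; none of these names come from Playwright.
type Action = { tool: string; args: Record<string, string> } | "TASK_COMPLETE";
type Outcome = "SUCCESS" | "FAILURE" | "EXCEEDED_ACTION_LIMIT";

interface AgentDeps {
  snapshot: () => string;                                         // capture current page state
  decide: (task: string, history: string[]) => Action;            // stand-in for the LLM call
  execute: (action: Exclude<Action, "TASK_COMPLETE">) => string;  // run the action; may throw
}

function runAgentLoop(
  task: string,
  deps: AgentDeps,
  maxActions: number,
  maxActionRetries: number,
): Outcome {
  const history: string[] = [];
  let retries = 0;
  for (let step = 1; step <= maxActions; step++) {
    history.push(`observation: ${deps.snapshot()}`);
    const action = deps.decide(task, history);
    if (action === "TASK_COMPLETE") return "SUCCESS";
    try {
      history.push(`result: ${deps.execute(action)}`);
    } catch (error) {
      if (retries >= maxActionRetries) return "FAILURE";
      retries += 1;
      // Record the failure so the next decision can adjust.
      history.push(`error: ${String(error)}`);
    }
  }
  return "EXCEEDED_ACTION_LIMIT";
}

// Scripted stand-ins: the fake "LLM" clicks once, then declares the task complete.
let fakePage = "login form";
const outcome = runAgentLoop(
  "Log in",
  {
    snapshot: () => fakePage,
    decide: (_task, history) =>
      history.some((entry) => entry.indexOf("result:") === 0)
        ? "TASK_COMPLETE"
        : { tool: "browser_click", args: { target: "Submit" } },
    execute: (action) => {
      fakePage = "dashboard";
      return `${action.tool} ok`;
    },
  },
  5, // maxActions
  2, // maxActionRetries
);
```

Passing the snapshot, decision, and execution functions in as dependencies keeps the control flow testable without a browser or a model; a real integration would replace them with asynchronous calls.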
Tool set design:
The agent operates with a fixed set of browser action tools. Each tool maps to a fundamental browser interaction:
| Tool | Browser Action | Description |
|---|---|---|
| browser_navigate | page.goto() | Navigate to a URL |
| browser_snapshot | accessibility tree | Capture page state for observation |
| browser_click | locator.click() | Click an element |
| browser_drag | locator.dragTo() | Drag an element to a target |
| browser_hover | locator.hover() | Hover over an element |
| browser_select_option | locator.selectOption() | Select from a dropdown |
| browser_press_key | keyboard.press() | Press a keyboard key |
| browser_type | keyboard.type() | Type text character by character |
| browser_fill_form | locator.fill() | Fill a form field with text |
| browser_set_checked | locator.setChecked() | Set checkbox/radio state |
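Because the tool set is fixed, a tool name proposed by the LLM can be validated against it before anything touches the browser. The sketch below shows one way to do that with a lookup table built from the tool names listed above. `TOOL_TO_API`, `ToolCall`, and `resolveToolCall` are hypothetical names used only for illustration, and the table stores API names as plain strings instead of calling Playwright, so the example runs on its own.

```typescript
// Tool names are taken from the table; the values are labels for the Playwright
// call each tool is described as mapping to, not live function references.
const TOOL_TO_API = {
  browser_navigate: "page.goto",
  browser_snapshot: "accessibility tree",
  browser_click: "locator.click",
  browser_drag: "locator.dragTo",
  browser_hover: "locator.hover",
  browser_select_option: "locator.selectOption",
  browser_press_key: "keyboard.press",
  browser_type: "keyboard.type",
  browser_fill_form: "locator.fill",
  browser_set_checked: "locator.setChecked",
} as const;

type ToolName = keyof typeof TOOL_TO_API;

interface ToolCall {
  tool: ToolName;
  api: string;
  args: Record<string, string>;
}

// Reject any tool name outside the fixed set, then attach the mapped API label.
function resolveToolCall(name: string, args: Record<string, string>): ToolCall {
  if (!Object.prototype.hasOwnProperty.call(TOOL_TO_API, name)) {
    throw new Error(`Unknown tool: ${name}`);
  }
  const tool = name as ToolName;
  return { tool, api: TOOL_TO_API[tool], args };
}

const call = resolveToolCall("browser_click", { target: "Submit button" });
```

Rejecting unknown names at this boundary means a hallucinated tool surfaces as an ordinary error that can be fed back into the retry logic.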
The observation-action cycle:
Each iteration of the loop produces an observation (page snapshot) and consumes it to produce an action. The LLM maintains context through the conversation history, which accumulates observations and action results. This enables the agent to:
- Track progress toward the task goal
- Recover from unexpected page states
- Adapt to dynamic content loading
- Handle multi-page workflows
Token economy:
Each observation and action decision consumes tokens, so the total token budget limits how complex a task the agent can complete. Efficient agents avoid unnecessary observations and act decisively.
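To make the budget idea concrete, here is a small sketch of a token budget that an agent loop could consult before adding an observation to its history. The names `estimateTokens` and `createTokenBudget` are hypothetical, and the figure of roughly four characters per token is an assumed heuristic for illustration only; a real agent would more likely rely on the usage figures reported by its model provider.

```typescript
// Assumption for illustration: about 4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function createTokenBudget(maxTokens: number) {
  let used = 0;
  return {
    // Charges the estimated cost and returns true, or returns false and
    // charges nothing if the text would push usage past the budget.
    tryConsume(text: string): boolean {
      const cost = estimateTokens(text);
      if (used + cost > maxTokens) return false;
      used += cost;
      return true;
    },
    remaining(): number {
      return maxTokens - used;
    },
  };
}

const budget = createTokenBudget(10);
// 20 characters, estimated at 5 tokens: fits within the budget of 10.
const firstOk = budget.tryConsume("abcdefghijklmnopqrst");
// 24 characters, estimated at 6 tokens: exceeds the 5 tokens still available.
const secondOk = budget.tryConsume("abcdefghijklmnopqrstuvwx");
```

When `tryConsume` returns false, the loop can stop, or drop older observations from the history before continuing.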