Principle:Microsoft Playwright Define Agent Actions with Perform

From Leeroopedia
Knowledge Sources
Domains AI_Testing, Browser_Automation, Natural_Language_Processing
Last Updated 2026-02-11 00:00 GMT

Overview

Instructing an AI agent to perform multi-step browser tasks described in natural language enables autonomous browser interaction: the agent selects and executes the appropriate actions itself, so no step has to be programmed explicitly.

Description

Traditional browser automation requires the test author to specify every interaction explicitly: click this selector, type into that field, wait for this element. AI-agent-driven task execution inverts this model. The test author describes what should happen in natural language, and the agent autonomously determines how to accomplish it.

This principle encompasses:

  • Natural language task specification: The test author writes human-readable instructions like "Fill out the registration form with valid data and submit it" rather than scripting individual clicks and keystrokes.
  • Autonomous action selection: The LLM observes the current page state (via accessibility snapshots or screenshots), reasons about the task, and selects the appropriate browser action from a defined tool set.
  • Multi-step execution: A single natural language task may require multiple sequential browser actions. The agent loops: observe the page, decide the next action, execute it, observe the result, and repeat until the task is complete or limits are reached.
  • Action budgeting: Each task execution is bounded by maximum action counts and token limits to prevent infinite loops and control costs.
  • Retry logic: When an action fails (e.g., element not found, navigation timeout), the agent can retry with adjusted parameters up to a configurable retry limit.
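
The budgeting and retry points above can be made concrete with a small limit check. This is a minimal TypeScript sketch under stated assumptions, not the framework's implementation: `maxActions` and `maxActionRetries` echo the names used in the loop pseudocode in this article, while `maxTokens`, `AgentUsage`, `BudgetVerdict`, and `checkBudget` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical limit names. maxActions and maxActionRetries follow the loop
// pseudocode in this article; maxTokens is an assumed name for the token limit.
interface AgentLimits {
  maxActions: number;
  maxActionRetries: number;
  maxTokens: number;
}

// Running totals that an agent loop would update after every step.
interface AgentUsage {
  actions: number;
  retries: number;
  tokens: number;
}

type BudgetVerdict = "ok" | "action_limit" | "retry_limit" | "token_limit";

// Returns the first exhausted limit, or "ok" if the agent may take another step.
// Retries are allowed up to and including maxActionRetries.
function checkBudget(usage: AgentUsage, limits: AgentLimits): BudgetVerdict {
  if (usage.actions >= limits.maxActions) return "action_limit";
  if (usage.retries > limits.maxActionRetries) return "retry_limit";
  if (usage.tokens >= limits.maxTokens) return "token_limit";
  return "ok";
}

// Example values are arbitrary.
const demoLimits: AgentLimits = { maxActions: 20, maxActionRetries: 2, maxTokens: 50000 };
console.log(checkBudget({ actions: 20, retries: 0, tokens: 1200 }, demoLimits)); // action_limit
```

Checking all limits in one place before each step keeps the stopping conditions explicit and makes it easy to report which bound ended a run.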

Usage

Apply this principle when:

  • You want to automate browser workflows using natural language instead of explicit selectors
  • The page structure may change frequently, making selector-based tests brittle
  • You need to test complex user journeys that span multiple pages and interactions
  • You want to reduce test maintenance overhead by abstracting away DOM details
  • You are building exploratory tests that adapt to varying page states

Theoretical Basis

Agent task execution follows a ReAct (Reasoning + Acting) loop:

AgentLoop(task):
  history = []
  retries = 0                              // Failed actions seen so far
  for step in 1..maxActions:
    observation = snapshot(page)           // Capture current page state
    history.append(observation)

    action = LLM.decide(task, history)     // LLM selects next action

    if action == TASK_COMPLETE:
      return SUCCESS

    try:
      result = execute(page, action)       // Run browser action
      history.append(result)
    catch error:
      if retries < maxActionRetries:
        history.append(error)              // Let the LLM see the failure and adjust
        retries += 1
        continue                           // A failed attempt still uses up a step
      else:
        return FAILURE

  return EXCEEDED_ACTION_LIMIT
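
The loop above might be sketched in TypeScript roughly as follows. This is an illustration under assumptions rather than the actual agent code: the `AgentAction` shape, the `LoopDeps` interface, and `runAgentLoop` are hypothetical, the observation, decision, and execution steps are injected as plain functions, and everything is kept synchronous for brevity even though real browser and model calls are asynchronous.

```typescript
type LoopOutcome = "SUCCESS" | "FAILURE" | "EXCEEDED_ACTION_LIMIT";

// Hypothetical action shape: either a completion signal or a named tool call.
type AgentAction =
  | { kind: "complete" }
  | { kind: "tool"; tool: string; args: Record<string, string> };

// Everything the loop needs is injected, so the sketch runs without a browser or a model.
interface LoopDeps {
  snapshot: () => string;
  decide: (task: string, transcript: string[]) => AgentAction;
  execute: (action: AgentAction) => string;
}

function runAgentLoop(
  task: string,
  deps: LoopDeps,
  maxActions: number,
  maxActionRetries: number,
): LoopOutcome {
  const transcript: string[] = [];
  let retries = 0;
  for (let step = 0; step < maxActions; step++) {
    transcript.push("observation: " + deps.snapshot());
    const action = deps.decide(task, transcript);
    if (action.kind === "complete") return "SUCCESS";
    try {
      transcript.push("result: " + deps.execute(action));
    } catch (err) {
      if (retries >= maxActionRetries) return "FAILURE";
      retries += 1;
      // Feed the error back so the next decision can adjust.
      transcript.push("error: " + String(err));
    }
  }
  return "EXCEEDED_ACTION_LIMIT";
}

// Stubbed run: the first click fails, the second succeeds, then the stub "model" reports completion.
let clickAttempts = 0;
const demoOutcome = runAgentLoop(
  "Submit the form",
  {
    snapshot: () => "button 'Submit'",
    decide: (_task, transcript) =>
      transcript.some((line) => line.indexOf("result:") === 0)
        ? { kind: "complete" }
        : { kind: "tool", tool: "browser_click", args: { element: "Submit" } },
    execute: () => {
      clickAttempts += 1;
      if (clickAttempts === 1) throw new Error("element not found");
      return "clicked";
    },
  },
  5,
  2,
);
console.log(demoOutcome); // SUCCESS
```

Injecting the three dependencies keeps the control flow testable in isolation: the same function can be driven by stubs, as here, or by wrappers around a real page and model.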

Tool set design:

The agent operates with a fixed set of browser action tools. Each tool maps to a fundamental browser interaction:

Tool                  | Browser Action         | Description
browser_navigate      | page.goto()            | Navigate to a URL
browser_snapshot      | accessibility tree     | Capture page state for observation
browser_click         | locator.click()        | Click an element
browser_drag          | locator.dragTo()       | Drag an element to a target
browser_hover         | locator.hover()        | Hover over an element
browser_select_option | locator.selectOption() | Select from a dropdown
browser_press_key     | keyboard.press()       | Press a keyboard key
browser_type          | keyboard.type()        | Type text character by character
browser_fill_form     | locator.fill()         | Fill a form field with text
browser_set_checked   | locator.setChecked()   | Set checkbox/radio state
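
One way to picture the mapping in the table is a dispatcher that routes each tool call to the corresponding browser method. The sketch below is hypothetical: `PageLike` and `LocatorLike` are local stand-ins for a small subset of Playwright's `Page` and `Locator` (so the code compiles without the playwright package), the selector-based `ToolCall` argument shapes are assumptions (a real agent may address elements by snapshot references instead), and `browser_snapshot` is left out because it produces an observation rather than an interaction. Dragging uses `dragTo`, the drag method available on Playwright locators.

```typescript
// Local structural stand-ins for a small subset of Playwright's Page and Locator.
interface LocatorLike {
  click(): Promise<void>;
  hover(): Promise<void>;
  fill(value: string): Promise<void>;
  setChecked(checked: boolean): Promise<void>;
  selectOption(value: string): Promise<unknown>;
  dragTo(target: LocatorLike): Promise<void>;
}

interface PageLike {
  goto(url: string): Promise<unknown>;
  locator(selector: string): LocatorLike;
  keyboard: { press(key: string): Promise<void>; type(text: string): Promise<void> };
}

// Hypothetical argument shapes, one per tool in the table.
type ToolCall =
  | { tool: "browser_navigate"; url: string }
  | { tool: "browser_click"; selector: string }
  | { tool: "browser_hover"; selector: string }
  | { tool: "browser_drag"; selector: string; targetSelector: string }
  | { tool: "browser_select_option"; selector: string; value: string }
  | { tool: "browser_press_key"; key: string }
  | { tool: "browser_type"; text: string }
  | { tool: "browser_fill_form"; selector: string; value: string }
  | { tool: "browser_set_checked"; selector: string; checked: boolean };

// Route one tool call to the matching browser method.
function dispatchTool(page: PageLike, call: ToolCall): Promise<unknown> {
  switch (call.tool) {
    case "browser_navigate": return page.goto(call.url);
    case "browser_click": return page.locator(call.selector).click();
    case "browser_hover": return page.locator(call.selector).hover();
    case "browser_drag":
      return page.locator(call.selector).dragTo(page.locator(call.targetSelector));
    case "browser_select_option": return page.locator(call.selector).selectOption(call.value);
    case "browser_press_key": return page.keyboard.press(call.key);
    case "browser_type": return page.keyboard.type(call.text);
    case "browser_fill_form": return page.locator(call.selector).fill(call.value);
    case "browser_set_checked": return page.locator(call.selector).setChecked(call.checked);
  }
}

// Recording fake used to exercise the dispatcher without launching a browser.
const toolCallLog: string[] = [];
function recordCall(entry: string): Promise<void> {
  toolCallLog.push(entry);
  return Promise.resolve();
}
function fakeLocator(selector: string): LocatorLike {
  return {
    click: () => recordCall("click " + selector),
    hover: () => recordCall("hover " + selector),
    fill: (value) => recordCall("fill " + selector + " " + value),
    setChecked: (checked) => recordCall("setChecked " + selector + " " + checked),
    selectOption: (value) => recordCall("selectOption " + selector + " " + value),
    dragTo: () => recordCall("dragTo from " + selector),
  };
}
const fakePage: PageLike = {
  goto: (url) => recordCall("goto " + url),
  locator: fakeLocator,
  keyboard: {
    press: (key) => recordCall("press " + key),
    type: (text) => recordCall("type " + text),
  },
};

void dispatchTool(fakePage, { tool: "browser_fill_form", selector: "#email", value: "a@b.co" });
console.log(toolCallLog); // one recorded entry: "fill #email a@b.co"
```

Modelling the tool set as a discriminated union means the compiler rejects a call with a missing argument and flags the dispatcher if a tool is added without a matching case.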

The observation-action cycle:

Each iteration of the loop produces an observation (page snapshot) and consumes it to produce an action. The LLM maintains context through the conversation history, which accumulates observations and action results. This enables the agent to:

  • Track progress toward the task goal
  • Recover from unexpected page states
  • Adapt to dynamic content loading
  • Handle multi-page workflows

Token economy:

Each observation and action decision consumes tokens, so the total token budget constrains the complexity of the tasks the agent can handle. Efficient agents avoid unnecessary observations and act decisively.
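
A rough worked example shows why observation size matters. The sketch assumes that every decision resends the whole accumulated history to the model and that only prompt tokens are counted (model output is ignored); both are simplifying assumptions, and `promptTokensPerDecision` and `decisionsWithinBudget` are hypothetical helper names. Under these assumptions the prompt grows with each step, so total consumption grows faster than the step count.

```typescript
// Assumption: decision i resends all history entries recorded so far, so its
// prompt costs the sum of the first i + 1 entry sizes (in tokens).
function promptTokensPerDecision(entryTokens: number[]): number[] {
  const perDecision: number[] = [];
  let running = 0;
  for (const size of entryTokens) {
    running += size;
    perDecision.push(running);
  }
  return perDecision;
}

// Counts how many decisions fit before the cumulative prompt cost would exceed maxTokens.
function decisionsWithinBudget(entryTokens: number[], maxTokens: number): number {
  let spent = 0;
  let count = 0;
  for (const prompt of promptTokensPerDecision(entryTokens)) {
    if (spent + prompt > maxTokens) break;
    spent += prompt;
    count += 1;
  }
  return count;
}

// Example sizes are arbitrary: four snapshots of 500 tokens versus four of 200 tokens.
const largeSnapshots = [500, 500, 500, 500];
const smallSnapshots = [200, 200, 200, 200];
console.log(promptTokensPerDecision(largeSnapshots)); // [500, 1000, 1500, 2000]
console.log(decisionsWithinBudget(largeSnapshots, 3000)); // 3
console.log(decisionsWithinBudget(smallSnapshots, 3000)); // 4
```

With a 3000-token budget, the larger snapshots allow three decisions (500 + 1000 + 1500 = 3000) while the smaller ones allow all four (200 + 400 + 600 + 800 = 2000), which is the sense in which leaner observations stretch the same budget further.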
