Principle:Microsoft Playwright Extract Data with Agent

From Leeroopedia
Knowledge Sources
Domains: AI_Testing, Browser_Automation, Data_Extraction
Last Updated: 2026-02-11 00:00 GMT

Overview

An AI agent can extract structured data from a web page when given a natural-language description of the desired data and a schema for the output. This enables type-safe data retrieval without hand-written DOM traversal code.

Description

Web pages contain rich, semi-structured data embedded in HTML: tables, lists, cards, forms, and text blocks. Extracting this data programmatically typically requires writing fragile CSS selectors or XPath queries that break when the page structure changes. AI-driven data extraction offers an alternative approach:

  1. The test author describes what data to extract in natural language (e.g., "extract all product names and prices from the catalog")
  2. The test author provides a schema that defines the expected shape of the extracted data (e.g., an array of objects with name and price fields)
  3. The AI agent reads the page content, identifies the relevant data, and returns it in the specified schema format
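
From the caller's side, the three steps above might look like the following minimal sketch. The `ExtractingAgent` interface, the `Schema` type guard, and `fakeAgent` are hypothetical names introduced for illustration and are not Playwright's actual API; the agent returns canned data so the example runs without a browser or a model.

```typescript
// Hypothetical types for illustration only; not Playwright's actual API.
interface Product {
  name: string;
  price: number;
}

// In this sketch a schema is a type guard that reports whether a value has the expected shape.
type Schema<T> = (value: unknown) => value is T;

interface ExtractingAgent {
  extract<T>(query: string, schema: Schema<T>): T;
}

// Step 2: the schema, an array of objects with name and price fields.
const isProductList: Schema<Product[]> = (value): value is Product[] =>
  Array.isArray(value) &&
  value.every(
    (item) =>
      typeof item === "object" &&
      item !== null &&
      typeof (item as Record<string, unknown>).name === "string" &&
      typeof (item as Record<string, unknown>).price === "number"
  );

// Stand-in for step 3: a fake agent that returns canned data instead of reading a page.
const fakeAgent: ExtractingAgent = {
  extract<T>(query: string, schema: Schema<T>): T {
    const candidate: unknown = [
      { name: "Laptop Stand", price: 29.99 },
      { name: "USB-C Hub", price: 49.99 },
    ];
    if (!schema(candidate)) {
      throw new Error(`Result for "${query}" does not match the schema`);
    }
    return candidate;
  },
};

// Step 1: the natural-language query. The result is typed as Product[].
const products = fakeAgent.extract(
  "extract all product names and prices from the catalog",
  isProductList
);
const productNames = products.map((p) => p.name);
```

Because the schema doubles as a type guard, the compiler knows `products` is a `Product[]`, so downstream code such as the `map` call needs no casts.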

Key properties of this approach:

  • Schema-driven output: The extraction result is validated against a user-defined schema before it is returned, which provides type safety and enables downstream processing.
  • No browser actions: Unlike perform() and expect(), data extraction is purely observational. The agent reads the page but does not click, navigate, or modify any state.
  • Natural language flexibility: The query can describe data at any level of abstraction, from simple field extraction to complex aggregation.
  • LLM-powered understanding: The agent uses the LLM's natural language understanding to identify relevant data even when it is presented in varied formats across different page layouts.

Usage

Apply this principle when:

  • You need to extract structured data from a web page for test validation
  • You want to scrape tabular data, lists, or card-based layouts without writing selectors
  • You need type-safe extraction results validated against a schema
  • The page layout may change but the data semantics remain stable
  • You are building data-driven tests that compare extracted data against expected values
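
For the data-driven case in the last bullet, a test could compare the extracted rows against expected values roughly as follows. This is a sketch under assumptions: `Product` and `diffProducts` are hypothetical names, the extracted rows are hard-coded sample data rather than real agent output, and matching rows by name with a small price tolerance is one possible policy among several.

```typescript
interface Product {
  name: string;
  price: number;
  inStock: boolean;
}

// Returns human-readable mismatches; an empty list means the data agrees.
function diffProducts(extracted: Product[], expected: Product[]): string[] {
  const problems: string[] = [];
  for (const want of expected) {
    // Look up the extracted row with the same name.
    let got: Product | undefined;
    for (const candidate of extracted) {
      if (candidate.name === want.name) got = candidate;
    }
    if (got === undefined) {
      problems.push(`missing: ${want.name}`);
      continue;
    }
    // Compare prices with a tolerance to avoid floating-point noise.
    if (Math.abs(got.price - want.price) > 0.005) {
      problems.push(`price mismatch: ${want.name}`);
    }
    if (got.inStock !== want.inStock) {
      problems.push(`stock mismatch: ${want.name}`);
    }
  }
  return problems;
}

// Sample data standing in for an extraction result.
const extractedRows: Product[] = [
  { name: "Laptop Stand", price: 29.99, inStock: true },
  { name: "USB-C Hub", price: 49.99, inStock: true },
];
const expectedRows: Product[] = [
  { name: "Laptop Stand", price: 29.99, inStock: true },
  { name: "USB-C Hub", price: 49.99, inStock: false },
];

const mismatches = diffProducts(extractedRows, expectedRows);
// mismatches is ["stock mismatch: USB-C Hub"]
```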

Theoretical Basis

AI-driven data extraction can be modeled as a function from page content and a query to structured output:

Extract(page, query, schema):
  content = snapshot(page)          // Capture page content

  prompt = buildPrompt(
    instruction: "Extract the requested data. Do not perform any actions.",
    query: query,
    content: content,
    outputSchema: schema
  )

  response = LLM.generate(prompt)   // LLM produces structured output
  result = validate(response, schema)  // Validate against schema

  return result
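
The pseudocode above can be turned into a small runnable sketch. Every name here (`ExtractDeps`, `Validator`, `buildPrompt`, `extract`) is illustrative and not taken from Playwright's source. The model call is stubbed and synchronous to keep the example self-contained; a real model call would be asynchronous.

```typescript
// A validator returns the typed value or throws when the value does not fit.
type Validator<T> = (value: unknown) => T;

interface ExtractDeps {
  snapshot: () => string; // captures page content as text
  generate: (prompt: string) => string; // model call, assumed to return JSON text
}

function buildPrompt(query: string, content: string, schemaDescription: string): string {
  return [
    "Extract the requested data. Do not perform any actions.",
    `Query: ${query}`,
    `Output schema: ${schemaDescription}`,
    `Page content:\n${content}`,
  ].join("\n\n");
}

function extract<T>(
  deps: ExtractDeps,
  query: string,
  schemaDescription: string,
  validate: Validator<T>
): T {
  const content = deps.snapshot(); // capture page content
  const prompt = buildPrompt(query, content, schemaDescription);
  const response = deps.generate(prompt); // model produces structured output
  const parsed: unknown = JSON.parse(response);
  return validate(parsed); // validate before returning
}

// Stubbed dependencies so the sketch runs without a browser or a model.
const seenPrompts: string[] = [];
const stubDeps: ExtractDeps = {
  snapshot: () => "heading: Catalog\nrow: Laptop Stand $29.99",
  generate: (prompt) => {
    seenPrompts.push(prompt);
    return '["Laptop Stand"]';
  },
};

const validateStringArray: Validator<string[]> = (value) => {
  if (!Array.isArray(value) || !value.every((v) => typeof v === "string")) {
    throw new Error("Expected an array of strings");
  }
  return value as string[];
};

const extractedNames = extract(
  stubDeps,
  "Extract all product names",
  "array of strings",
  validateStringArray
);
```

Passing the snapshot and model call in as dependencies keeps `extract` free of browser actions by construction: the only thing it can do with the page is read the captured text.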

Schema as contract:

The schema serves a dual purpose:

  1. LLM guidance: It tells the LLM exactly what structure to produce, reducing ambiguity and improving extraction accuracy.
  2. Runtime validation: The extracted data is validated against the schema before being returned, catching LLM hallucinations or malformed output.

Example schema (conceptual):

{
  type: "array",
  items: {
    type: "object",
    properties: {
      name:  { type: "string" },
      price: { type: "number" },
      inStock: { type: "boolean" }
    }
  }
}

Query: "Extract all products from the catalog page"

Result: [
  { name: "Laptop Stand", price: 29.99, inStock: true },
  { name: "USB-C Hub", price: 49.99, inStock: false },
  ...
]
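
The runtime validation step can be illustrated with a small checker for the conceptual schema shown above. This is a sketch under assumptions: the `Schema` type covers only the keywords used in the example, `check` is a hypothetical helper rather than a real library function, and every listed property is treated as required, which the conceptual schema does not state.

```typescript
// Minimal schema vocabulary mirroring the conceptual example.
type Schema =
  | { type: "string" }
  | { type: "number" }
  | { type: "boolean" }
  | { type: "array"; items: Schema }
  | { type: "object"; properties: Record<string, Schema> };

// Returns a list of problems; an empty list means the value conforms.
function check(value: unknown, schema: Schema, path: string = "$"): string[] {
  if (schema.type === "array") {
    if (!Array.isArray(value)) return [`${path}: expected array`];
    const problems: string[] = [];
    for (let i = 0; i < value.length; i++) {
      problems.push(...check(value[i], schema.items, `${path}[${i}]`));
    }
    return problems;
  }
  if (schema.type === "object") {
    if (typeof value !== "object" || value === null || Array.isArray(value)) {
      return [`${path}: expected object`];
    }
    const record = value as Record<string, unknown>;
    const problems: string[] = [];
    for (const key of Object.keys(schema.properties)) {
      const sub = schema.properties[key];
      if (sub === undefined) continue;
      // Assumption: every listed property is required.
      problems.push(...check(record[key], sub, `${path}.${key}`));
    }
    return problems;
  }
  // Remaining cases are the primitive types string, number, and boolean.
  return typeof value === schema.type ? [] : [`${path}: expected ${schema.type}`];
}

const productListSchema: Schema = {
  type: "array",
  items: {
    type: "object",
    properties: {
      name: { type: "string" },
      price: { type: "number" },
      inStock: { type: "boolean" },
    },
  },
};

const wellFormed: unknown = [
  { name: "Laptop Stand", price: 29.99, inStock: true },
  { name: "USB-C Hub", price: 49.99, inStock: false },
];
// Malformed output of the kind validation is meant to catch: price as a string.
const malformed: unknown = [{ name: "USB-C Hub", price: "49.99", inStock: false }];

const wellFormedProblems = check(wellFormed, productListSchema); // []
const malformedProblems = check(malformed, productListSchema); // ["$[0].price: expected number"]
```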

Distinction from perform() and expect():

Method     Purpose                  Browser Actions              Output
perform()  Execute browser actions  Yes (click, type, navigate)  Usage statistics
expect()   Verify page state        No (read-only)               void (pass/fail)
extract()  Extract structured data  No (read-only)               Typed data matching schema

No tool invocation:

Unlike perform() and expect(), which use browser tools (action tools and assertion tools, respectively), extract() uses no browser tools at all. The LLM is instructed to extract data purely from the page content provided in the prompt. The only tool available to it is a built-in report_result tool, which returns the structured data conforming to the user's schema.

This design ensures that extraction is always side-effect-free and cannot accidentally modify page state.
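
One way to picture this guarantee is as a per-method tool set, sketched below. The names are hypothetical except `report_result`, which the text above names. The action tool names are borrowed from the examples in the comparison table, and the assertion tool names are invented for illustration.

```typescript
type AgentMethod = "perform" | "expect" | "extract";

// Hypothetical tool names, except report_result.
const ACTION_TOOLS: string[] = ["click", "type", "navigate"];
const ASSERTION_TOOLS: string[] = ["assert_visible", "assert_text"];
const REPORT_RESULT = "report_result";

// Which tools the model is offered for each method in this sketch.
function toolsFor(method: AgentMethod): string[] {
  if (method === "perform") return ACTION_TOOLS.slice();
  if (method === "expect") return ASSERTION_TOOLS.slice();
  return [REPORT_RESULT]; // extract(): no browser tools at all
}

// A method can change page state only if it is offered at least one action tool.
function canChangePageState(method: AgentMethod): boolean {
  for (const tool of toolsFor(method)) {
    if (ACTION_TOOLS.indexOf(tool) !== -1) return true;
  }
  return false;
}

const extractTools = toolsFor("extract");
const extractIsReadOnly = !canChangePageState("extract");
```

Since the model can only act through the tools it is offered, withholding every action tool makes side effects impossible rather than merely discouraged by the prompt.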
