Principle:Microsoft Playwright Extract Data with Agent
| Knowledge Sources | |
|---|---|
| Domains | AI_Testing, Browser_Automation, Data_Extraction |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
AI agents can extract structured data from web pages: the test author describes the desired data in natural language and supplies a schema for the output, which yields type-safe results without hand-written DOM traversal code.
Description
Web pages contain rich, semi-structured data embedded in HTML: tables, lists, cards, forms, and text blocks. Extracting this data programmatically typically requires writing fragile CSS selectors or XPath queries that break when the page structure changes. AI-driven data extraction offers an alternative approach:
- The test author describes what data to extract in natural language (e.g., "extract all product names and prices from the catalog")
- The test author provides a schema that defines the expected shape of the extracted data (e.g., an array of objects with name and price fields)
- The AI agent reads the page content, identifies the relevant data, and returns it in the specified schema format
Key properties of this approach:
- Schema-driven output: The extraction result is validated against a user-defined schema before it is returned, providing type safety and enabling downstream processing.
- No browser actions: Unlike perform(), data extraction is purely observational, and unlike expect(), it invokes no browser tools. The agent reads the page but does not click, navigate, or modify any state.
- Natural language flexibility: The query can describe data at any level of abstraction, from simple field extraction to complex aggregation.
- LLM-powered understanding: The agent uses the LLM's natural language understanding to identify relevant data even when it is presented in varied formats across different page layouts.
Usage
Apply this principle when:
- You need to extract structured data from a web page for test validation
- You want to scrape tabular data, lists, or card-based layouts without writing selectors
- You need type-safe extraction results validated against a schema
- The page layout may change but the data semantics remain stable
- You are building data-driven tests that compare extracted data against expected values
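For the data-driven case in the last bullet, the comparison step can be as small as the sketch below. It assumes the extraction has already returned an array of objects with name and price fields; the `Product` type and the `diffProducts` helper are illustrative names, not part of Playwright.

```typescript
// Assumed shape of one extracted row (illustrative, not a Playwright type).
interface Product {
  name: string;
  price: number;
}

// Compares extracted rows with expected rows, matching them by name.
// Returns human-readable mismatch descriptions; an empty array means a match.
function diffProducts(extracted: Product[], expected: Product[]): string[] {
  const problems: string[] = [];
  const byName = new Map<string, Product>();
  for (const row of extracted) {
    byName.set(row.name, row);
  }
  for (const want of expected) {
    const got = byName.get(want.name);
    if (got === undefined) {
      problems.push(`missing: ${want.name}`);
    } else if (got.price !== want.price) {
      problems.push(`${want.name}: price ${got.price}, expected ${want.price}`);
    }
  }
  return problems;
}
```

A test would then assert that the returned list is empty, which keeps the failure message specific to the rows that differ.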
Theoretical Basis
AI-driven data extraction can be modeled as a function from page content and a query to structured output:
Extract(page, query, schema):
    content  = snapshot(page)                // Capture page content
    prompt   = buildPrompt(
        instruction:  "Extract the requested data. Do not perform any actions.",
        query:        query,
        content:      content,
        outputSchema: schema
    )
    response = LLM.generate(prompt)          // LLM produces structured output
    result   = validate(response, schema)    // Validate against schema
    return result
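The pseudocode above could be sketched in TypeScript roughly as follows. This is a minimal illustration under stated assumptions, not Playwright's implementation: the `Generate` callback stands in for whatever LLM client is used, the page snapshot is passed in as plain text, and validation is delegated to a caller-supplied type guard. All names here are hypothetical.

```typescript
// Hypothetical stand-ins; none of these names come from Playwright.
type Generate = (prompt: string) => Promise<string>;
type Validator<T> = (value: unknown) => value is T;

interface ExtractRequest<T> {
  content: string;        // text snapshot of the page
  query: string;          // natural-language description of the data
  schemaText: string;     // schema serialized for inclusion in the prompt
  isValid: Validator<T>;  // runtime check applied to the LLM response
}

// Assembles the prompt from the fixed instruction, the query, the schema, and the page content.
function buildPrompt(req: { content: string; query: string; schemaText: string }): string {
  return [
    "Extract the requested data. Do not perform any actions.",
    `Query: ${req.query}`,
    `Output schema: ${req.schemaText}`,
    `Page content:\n${req.content}`,
  ].join("\n\n");
}

// Prompt, generate, parse, validate: the steps of the pseudocode, minus the snapshot.
async function extract<T>(req: ExtractRequest<T>, generate: Generate): Promise<T> {
  const raw = await generate(buildPrompt(req));
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("LLM response is not valid JSON");
  }
  if (!req.isValid(parsed)) {
    throw new Error("LLM response does not match the schema");
  }
  return parsed;
}

// Example pieces for trying the sketch without a real model.
type Product = { name: string; price: number };

function isProductList(value: unknown): value is Product[] {
  return Array.isArray(value) && value.every(
    (p) => typeof p === "object" && p !== null &&
      typeof p.name === "string" && typeof p.price === "number",
  );
}

// Stub LLM: ignores the prompt and returns canned JSON.
const stubGenerate: Generate = async () => '[{"name":"Laptop Stand","price":29.99}]';
```

Calling `extract({ content, query, schemaText, isValid: isProductList }, stubGenerate)` resolves to a `Product[]`. Passing the validator in as a type guard is one way to let the return type follow from the schema check.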
Schema as contract:
The schema serves a dual purpose:
- LLM guidance: It tells the LLM exactly what structure to produce, reducing ambiguity and improving extraction accuracy.
- Runtime validation: The extracted data is validated against the schema before being returned, catching LLM hallucinations or malformed output.
Example schema (conceptual):
{
  type: "array",
  items: {
    type: "object",
    properties: {
      name: { type: "string" },
      price: { type: "number" },
      inStock: { type: "boolean" }
    }
  }
}

Query: "Extract all products from the catalog page"

Result: [
  { name: "Laptop Stand", price: 29.99, inStock: true },
  { name: "USB-C Hub", price: 49.99, inStock: false },
  ...
]
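To make the runtime-validation role of the schema concrete, here is a small validator for the schema subset used in the example. It is a sketch only: the `Schema` type and `validate` function are illustrative, cover just the five types shown, and treat every listed property as required, which is stricter than standard JSON Schema. A real setup would more likely rely on an existing schema library.

```typescript
// Minimal schema subset for illustration; real JSON Schema has many more keywords.
type Schema =
  | { type: "string" }
  | { type: "number" }
  | { type: "boolean" }
  | { type: "array"; items: Schema }
  | { type: "object"; properties: Record<string, Schema> };

// Returns a list of violations; an empty list means the value conforms.
// Assumption: every property listed in the schema must be present.
function validate(value: unknown, schema: Schema, path: string = "$"): string[] {
  const errors: string[] = [];
  switch (schema.type) {
    case "string":
    case "number":
    case "boolean":
      if (typeof value !== schema.type) errors.push(`${path}: expected ${schema.type}`);
      break;
    case "array":
      if (!Array.isArray(value)) {
        errors.push(`${path}: expected array`);
      } else {
        value.forEach((item, i) => {
          errors.push(...validate(item, schema.items, `${path}[${i}]`));
        });
      }
      break;
    case "object":
      if (typeof value !== "object" || value === null || Array.isArray(value)) {
        errors.push(`${path}: expected object`);
      } else {
        const record = value as Record<string, unknown>;
        for (const key of Object.keys(schema.properties)) {
          const sub = schema.properties[key];
          if (sub !== undefined) errors.push(...validate(record[key], sub, `${path}.${key}`));
        }
      }
      break;
  }
  return errors;
}

const productListSchema: Schema = {
  type: "array",
  items: {
    type: "object",
    properties: {
      name: { type: "string" },
      price: { type: "number" },
      inStock: { type: "boolean" },
    },
  },
};

// A well-formed response passes; a price returned as text is reported.
const wellFormed = [{ name: "Laptop Stand", price: 29.99, inStock: true }];
const malformed = [{ name: "USB-C Hub", price: "49.99", inStock: false }];
const okErrors = validate(wellFormed, productListSchema);   // []
const badErrors = validate(malformed, productListSchema);   // ["$[0].price: expected number"]
```

Returning the path of each violation, rather than a bare boolean, makes it easier to see which field of the LLM output was malformed.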
Distinction from perform() and expect():
| Method | Purpose | Browser Actions | Output |
|---|---|---|---|
| perform() | Execute browser actions | Yes (click, type, navigate) | Usage statistics |
| expect() | Verify page state | No (read-only) | void (pass/fail) |
| extract() | Extract structured data | No (read-only) | Typed data matching schema |
No tool invocation:
Unlike perform() and expect(), which use browser tools (action tools and assertion tools, respectively), extract() uses no browser tools at all. The LLM is instructed to extract data purely from the page content provided in the prompt. The only "tool" it uses is a built-in report_result tool, which returns the structured data conforming to the user's schema.
This design ensures that extraction is always side-effect-free and cannot accidentally modify page state.
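One way to picture the single-tool setup is sketched below. The `ToolDefinition` and `ToolCall` shapes, and the choice to use the user's schema directly as the tool's input schema, are assumptions made for illustration; only the tool name report_result comes from the description above.

```typescript
// Hypothetical shapes for illustration; these are not Playwright's internal types.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: object;
}

interface ToolCall {
  name: string;
  arguments: unknown;
}

// The tool list offered to the LLM during extraction contains report_result only.
// Assumption: the user's schema is used as the tool's input schema.
function toolsForExtract(userSchema: object): ToolDefinition[] {
  return [
    {
      name: "report_result",
      description: "Return the extracted data in the requested structure.",
      inputSchema: userSchema,
    },
  ];
}

// Accepts a report_result call and returns its payload; any other tool name is an error.
function resultFromToolCall(call: ToolCall): unknown {
  if (call.name !== "report_result") {
    throw new Error(`Unexpected tool call during extraction: ${call.name}`);
  }
  return call.arguments;
}
```

Because no action or assertion tool appears in the list, the model has no channel through which it could click or navigate, which is what makes the read-only property structural rather than a matter of prompt wording.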