Principle:Puppeteer Puppeteer Data Extraction
| Knowledge Sources | |
|---|---|
| Domains | Browser_Automation, Web_Scraping, Data_Processing |
| Last Updated | 2026-02-11 23:00 GMT |
Overview
A technique that executes JavaScript functions within the browser context to extract structured data from the DOM and transfer it to the Node.js environment.
Description
Data Extraction bridges the gap between the browser's JavaScript runtime (where DOM elements live) and the Node.js process (where Puppeteer runs). Since DOM objects cannot be directly serialized across this boundary, Puppeteer provides methods to execute JavaScript functions inside the browser and return their serializable results to Node.js.
Key methods:
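- evaluate(): Execute a function in the browser and return the result. Arguments and return values must be serializable (roughly, JSON-compatible data); arguments may also be JSHandle or ElementHandle instances, which resolve to the in-page objects they reference. If the function returns a Promise, evaluate() waits for it to resolve.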
- $$eval(): Query all elements matching a selector, pass them as an array to a function, and return the function's result. Useful for extracting data from multiple elements in a single call.
- $eval(): Query the first element matching a selector, pass it to a function, and return the function's result. Throws if no element matches.
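The three methods can be compared side by side in a minimal sketch. The helper name `extractSummary` and the selectors `h1` and `a[href]` are assumptions chosen for illustration; `page` is expected to be a Puppeteer Page that has already navigated somewhere.

```javascript
// Hypothetical helper: `page` is a Puppeteer Page; selectors are illustrative.
async function extractSummary(page) {
  // evaluate(): run a function in the page and get its serializable result.
  const title = await page.evaluate(() => document.title);

  // $eval(): the first element matching the selector is passed to the function.
  const heading = await page.$eval('h1', (el) => el.textContent.trim());

  // $$eval(): all matching elements are passed to the function as an array.
  // Each element is mapped to a plain object so the result can cross the boundary.
  const links = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => ({ text: a.textContent.trim(), href: a.href }))
  );

  return { title, heading, links };
}

// Usage (assumes `npm install puppeteer`):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   await page.goto('https://example.com');
//   console.log(await extractSummary(page));
//   await browser.close();
```

Note that every callback returns strings or plain objects rather than the elements themselves, which keeps the results serializable.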
The serialization boundary means:
- DOM elements, functions, and Symbols cannot be returned directly
- Return values must be JSON-serializable (strings, numbers, booleans, null, and plain objects or arrays built from these)
- For DOM references, use evaluateHandle() to get a JSHandle or ElementHandle
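The sketch below illustrates both sides of this boundary. The first part uses a plain JSON round trip in Node.js as an approximation of what survives; Puppeteer's actual serialization may differ in details, so treat it as a mental model only. The second part shows one way to keep a DOM reference with evaluateHandle() and read plain data from it later. The names `approximateBoundary` and `describeElement` are hypothetical.

```javascript
// Part 1: approximate the boundary with a JSON round trip (no browser needed).
// This is an approximation, not Puppeteer's real serialization code.
function approximateBoundary(value) {
  return JSON.parse(JSON.stringify(value));
}

const survived = approximateBoundary({
  text: 'hello',
  count: 3,
  nested: { ok: true, items: [1, 2] },
  callback: () => 1, // function-valued property
  tag: Symbol('x'),  // Symbol-valued property
});
// Function- and Symbol-valued properties are dropped by the round trip.
console.log(Object.keys(survived));

// Part 2: keep a reference to a DOM element, then extract plain data from it.
// `page` is assumed to be a Puppeteer Page.
async function describeElement(page, selector) {
  // evaluateHandle() returns a handle to the in-page object, not a copy.
  const handle = await page.evaluateHandle(
    (sel) => document.querySelector(sel),
    selector
  );
  try {
    // Passing the handle back in resolves it to the real element inside the page.
    return await page.evaluate(
      (el) => (el ? { tag: el.tagName, text: el.textContent.trim() } : null),
      handle
    );
  } finally {
    // Release the in-page reference once it is no longer needed.
    await handle.dispose();
  }
}
```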
Usage
Use data extraction after the page has loaded and dynamic content has settled, for example after waiting for a selector that only appears once the data is rendered. Prefer $$eval() when extracting data from multiple elements matching a pattern (e.g., all table rows, all search results). Use evaluate() for general-purpose JavaScript execution.
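As an illustration of this guidance, the following sketch waits for table rows to appear and then extracts every row in a single $$eval() call. The helper name `extractTableRows` and the default selector `table tbody tr` are assumptions; real pages will need their own selector and possibly a different readiness signal.

```javascript
// Hypothetical helper: `page` is a Puppeteer Page; the selector is illustrative.
async function extractTableRows(page, rowSelector = 'table tbody tr') {
  // Wait until at least one row exists, so dynamically rendered data is present.
  await page.waitForSelector(rowSelector);

  // One round trip: every row is converted to an array of trimmed cell strings
  // inside the browser, and only that plain data is returned to Node.js.
  return page.$$eval(rowSelector, (rows) =>
    rows.map((row) =>
      Array.from(row.querySelectorAll('td'), (cell) => cell.textContent.trim())
    )
  );
}

// Usage (assumes an open Puppeteer page):
//   const rows = await extractTableRows(page);
//   for (const cells of rows) console.log(cells.join(' | '));
```

Doing the mapping inside the $$eval() callback avoids one evaluate() call per row, which is the main reason to prefer it for repeated structures.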
Theoretical Basis
# Data extraction across the serialization boundary
Node.js Process                    Browser Process
───────────────                    ───────────────
1. Serialize function         →
2. Serialize arguments        →
                                   3. Deserialize and execute function
                                   4. Serialize return value
                              ←    5. Transfer result
6. Deserialize result
7. Return to caller
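The seven steps can be imitated in plain Node.js to make the boundary tangible. The sketch below is a simulation only, not Puppeteer's implementation: it uses the built-in `node:vm` module as a stand-in for the browser context and JSON as the wire format, and `simulateEvaluate` and `fakeDocument` are hypothetical names.

```javascript
const vm = require('node:vm');

// Simulated evaluate(): a teaching model of the seven steps, not Puppeteer code.
function simulateEvaluate(browserGlobals, fn, ...args) {
  // Steps 1-2 (Node.js side): the function travels as source text,
  // the arguments travel as JSON text.
  const fnSource = fn.toString();
  const argsJson = JSON.stringify(args);

  // Steps 3-4 ("browser" side): rebuild the function in a separate context,
  // call it with the parsed arguments, and serialize the return value.
  const context = vm.createContext({ ...browserGlobals });
  const resultJson = vm.runInContext(
    `JSON.stringify((${fnSource})(...JSON.parse(${JSON.stringify(argsJson)})))`,
    context
  );

  // Steps 5-7 (back in Node.js): the text is transferred, parsed, and returned.
  return resultJson === undefined ? undefined : JSON.parse(resultJson);
}

// A stand-in for the page's global `document` object.
const fakeDocument = { title: 'Example page' };

const result = simulateEvaluate(
  { document: fakeDocument },
  (prefix) => prefix + document.title, // `document` exists only in the simulated context
  'Title: '
);
console.log(result); // prints "Title: Example page"
```

Because only the function's source text is sent, variables captured from the surrounding Node.js scope are not available on the other side in this model; values have to be passed explicitly as arguments, as `'Title: '` is here.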