Principle: DocETL Data Preparation (ucbepic/docetl)
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
A data ingestion principle that loads raw data from files or in-memory sources and optionally applies parsing tools to transform unstructured content into structured records.
Description
Data Preparation is the foundational step in any ETL pipeline. Before operations like map, reduce, or resolve can process documents, the raw data must be loaded from its source format (JSON, CSV, Parquet) and optionally transformed through parsing tools. Parsing tools convert unstructured file content (e.g., PDF text, HTML) into structured key-value records suitable for LLM processing.
In DocETL, this principle manifests through the Dataset class, which handles both file-based and in-memory data sources. The class supports pluggable parsing tools that can be chained to progressively transform raw data into the structured format operations expect.
Usage
Apply this principle at the beginning of any data processing pipeline when raw data needs to be loaded from files or external sources. It is especially important when:
- Input data is in structured file formats (JSON, CSV, Parquet)
- Unstructured files (PDFs, DOCX) need parsing into text fields
- Custom parsing logic is required to extract structured records from raw content
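For the structured formats above, loading reduces to turning file content into a list of dict records. A minimal sketch using only the standard library (Parquet is omitted because it typically requires pyarrow or pandas; the helper names here are illustrative):

```python
import csv
import io
import json
from typing import Any

def load_json(text: str) -> list[dict[str, Any]]:
    # JSON sources are expected to be an array of objects.
    return json.loads(text)

def load_csv(text: str) -> list[dict[str, Any]]:
    # Each CSV row becomes a dict keyed by the header row.
    return list(csv.DictReader(io.StringIO(text)))

print(load_csv("id,title\n1,intro\n2,methods"))
# [{'id': '1', 'title': 'intro'}, {'id': '2', 'title': 'methods'}]
```

Note that `csv.DictReader` yields all values as strings; any type coercion would be the job of a subsequent parsing tool.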
Theoretical Basis
Data preparation follows the Extract phase of the ETL pattern:
- Source Identification: Determine data format and location
- Schema Mapping: Map raw fields to expected record structure
- Transformation: Apply parsing tools to convert unstructured content
- Validation: Ensure all records conform to expected schema
```python
# Pseudo-code for data preparation
records = load_from_source(path, format)
for tool in parsing_tools:
    records = apply_tool(records, tool)
validate_schema(records)