Principle: DocETL Data Preparation (ucbepic/docetl)
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
A data ingestion principle that loads raw data from files or in-memory sources and optionally applies parsing tools to transform unstructured content into structured records.
Description
Data Preparation is the foundational step in any ETL pipeline. Before operations like map, reduce, or resolve can process documents, the raw data must be loaded from its source format (JSON, CSV, Parquet) and optionally transformed through parsing tools. Parsing tools convert unstructured file content (e.g., PDF text, HTML) into structured key-value records suitable for LLM processing.
In DocETL, this principle manifests through the Dataset class, which handles both file-based and in-memory data sources. The class supports pluggable parsing tools that can be chained to progressively transform raw data into the structured format operations expect.
Usage
Apply this principle at the beginning of any data processing pipeline when raw data needs to be loaded from files or external sources. It is especially important when:
- Input data is in structured file formats (JSON, CSV, Parquet)
- Unstructured files (PDFs, DOCX) need parsing into text fields
- Custom parsing logic is required to extract structured records from raw content
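For the structured formats above, loading reduces to turning file content into a list of dict records. A minimal sketch using only the standard library (Parquet is omitted because it typically requires pyarrow or pandas; the helper names here are illustrative):

```python
import csv
import io
import json
from typing import Any

def load_json(text: str) -> list[dict[str, Any]]:
    # JSON sources are expected to be an array of objects.
    return json.loads(text)

def load_csv(text: str) -> list[dict[str, Any]]:
    # Each CSV row becomes a dict keyed by the header row.
    return list(csv.DictReader(io.StringIO(text)))

print(load_csv("id,title\n1,intro\n2,methods"))
# [{'id': '1', 'title': 'intro'}, {'id': '2', 'title': 'methods'}]
```

Note that `csv.DictReader` yields all values as strings; any type coercion would be the job of a subsequent parsing tool.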
Theoretical Basis
Data preparation follows the Extract phase of the ETL pattern:
- Source Identification: Determine data format and location
- Schema Mapping: Map raw fields to expected record structure
- Transformation: Apply parsing tools to convert unstructured content
- Validation: Ensure all records conform to expected schema
```python
# Pseudo-code for data preparation
records = load_from_source(path, format)
for tool in parsing_tools:
    records = apply_tool(records, tool)
validate_schema(records)