Implementation:Ucbepic Docetl Dataset Load
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
A concrete tool, provided by the DocETL framework, for loading and parsing datasets.
Description
The Dataset class in DocETL handles loading data from JSON, CSV, or Parquet files (or in-memory lists/DataFrames) and optionally applying a chain of parsing tools that transform raw content into structured records. It supports both built-in parsers (for PDFs, HTML, etc.) and user-defined custom parsing functions.
Usage
Import and use this class when you need to load raw data into a DocETL pipeline. In YAML pipelines, datasets are declared in the datasets section. In the Python API, Dataset objects are passed to the Pipeline constructor.
Code Reference
Source Location
- Repository: docetl
- File: docetl/dataset.py
- Lines: L81-315
Signature
class Dataset:
    def __init__(
        self,
        runner,
        type: str,
        path_or_data: str | list[dict],
        source: str = "local",
        parsing: list[dict[str, str]] = None,
        user_defined_parsing_tool_map: dict[str, ParsingTool] = {},
    ):
        """
        Args:
            runner: The pipeline runner instance.
            type: Dataset type ('file' or 'memory').
            path_or_data: File path (str) or inline data (list[dict]).
            source: Data source ('local' or 's3').
            parsing: List of parsing tool configurations.
            user_defined_parsing_tool_map: Map of user-defined parsing tools.
        """

    def load(self) -> list[dict]:
        """Load dataset from path or return in-memory data."""
Import
from docetl.dataset import Dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| runner | (not annotated) | Yes | The pipeline runner instance |
| type | str | Yes | Dataset type: "file" (a path to a JSON, CSV, or Parquet file) or "memory" (inline data) |
| path_or_data | str or list[dict] | Yes | File path string or inline data list |
| source | str | No | Data source location, defaults to "local" |
| parsing | list[dict] | No | Parsing tool chain configurations |
| user_defined_parsing_tool_map | dict | No | Custom parsing tool implementations |
Outputs
| Name | Type | Description |
|---|---|---|
| load() returns | list[dict] | Loaded and parsed dataset records |
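The contract above can be illustrated with a minimal, standard-library sketch. It is not the DocETL implementation: `load_records` is a hypothetical helper, it assumes the file format is chosen from the file extension, and it covers only JSON and CSV.

```python
import csv
import json
import os
import tempfile


def load_records(type: str, path_or_data):
    """Hypothetical stand-in for Dataset.load(); not DocETL's code."""
    if type == "memory":
        # In-memory data is returned as a list of records.
        return list(path_or_data)
    if type != "file":
        raise ValueError("type must be 'file' or 'memory'")

    # Assumption: the format is chosen from the file extension.
    ext = os.path.splitext(path_or_data)[1].lower()
    if ext == ".json":
        with open(path_or_data, encoding="utf-8") as f:
            return json.load(f)
    if ext == ".csv":
        with open(path_or_data, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported file extension: {ext!r}")


# Usage: write a small JSON file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "documents.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"id": 1, "text": "hello"}], f)
    records = load_records("file", path)

print(records)
```

In both branches the result is a `list[dict]`, which matches the output type stated in the table.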
Usage Examples
YAML Dataset Declaration
datasets:
  input:
    type: file
    path: data/documents.json
    source: local
    parsing:
      - function: txt_to_string
        input_key: file_path
        output_key: content
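To make the roles of `input_key` and `output_key` concrete, the following sketch applies one parsing step to a list of records. It is an illustration under assumptions, not DocETL's code: `apply_parsing_step` and `read_text` are hypothetical helpers, and the sketch assumes a parsing function returns a list of values, each of which yields one output record.

```python
import os
import tempfile
from typing import Callable


def read_text(path: str) -> list[str]:
    """Hypothetical parser: return the file's text as a one-element list."""
    with open(path, encoding="utf-8") as f:
        return [f.read()]


def apply_parsing_step(
    records: list[dict],
    function: Callable[[str], list[str]],
    input_key: str,
    output_key: str,
) -> list[dict]:
    """Run `function` on record[input_key]; store each result under output_key."""
    parsed = []
    for record in records:
        for value in function(record[input_key]):
            # Copy the record so the input list is left unchanged.
            parsed.append({**record, output_key: value})
    return parsed


# Usage: one record pointing at a text file, parsed into a "content" field.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello world")
    result = apply_parsing_step(
        [{"file_path": path}], read_text, "file_path", "content"
    )

print(result[0]["content"])
```

Each output record keeps its original fields and gains the parsed value under `content`, so later steps in a chain can read either key.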
Python API Usage
from docetl.schemas import Dataset

dataset = Dataset(
    type="file",
    path="data/documents.json",
    source="local",
)