Implementation:Ucbepic Docetl Directive DocChunking

Knowledge Sources	Ucbepic_Docetl
Domains	Pipeline_Optimization, LLM_Operations
Last Updated	2026-02-08 00:00 GMT

Overview

A concrete tool, provided by the DocETL reasoning optimizer, that transforms a single Map operation into a full chunking pipeline (Split, Gather, optional Sample, Map, Reduce).

Description

The DocumentChunkingDirective class rewrites a single Map operation as a chunking pipeline with five stages: Split breaks each long document into chunks; Gather attaches surrounding context to each chunk; an optional Sample step keeps a subset of chunks for efficiency; a new Map operation processes each chunk; and Reduce combines the per-chunk results. Sampling is applied by default and is skipped only when the task requires processing every chunk. The directive can only be applied to a top-level Map operation, not to a sub-map inside an existing split/gather/reduce pipeline.
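To make the stage order concrete, the following is a minimal, library-free sketch of a Split -> Gather -> Map -> Reduce flow over one document. It is not DocETL's implementation: the helper names (`split`, `gather`, `run_chunked`), the word-count chunking, and the "previous chunk as context" rule are all assumptions chosen for illustration, and the Sample stage is omitted for brevity.

```python
from typing import Any, Callable, Dict, List


def split(text: str, chunk_size: int) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def gather(chunks: List[str], before: int = 1, after: int = 0) -> List[Dict[str, str]]:
    """Pair each chunk with its neighbouring chunks as surrounding context."""
    gathered = []
    for i, chunk in enumerate(chunks):
        neighbours = chunks[max(0, i - before):i] + chunks[i + 1:i + 1 + after]
        gathered.append({"chunk": chunk, "context": " ".join(neighbours)})
    return gathered


def run_chunked(
    text: str,
    map_fn: Callable[[Dict[str, str]], Any],
    reduce_fn: Callable[[List[Any]], Any],
    chunk_size: int = 50,
) -> Any:
    """Hypothetical driver: split, gather context, map each chunk, reduce the results."""
    chunks = split(text, chunk_size)
    gathered = gather(chunks)
    mapped = [map_fn(item) for item in gathered]
    return reduce_fn(mapped)
```

For example, `run_chunked("alpha beta gamma delta epsilon zeta eta", lambda item: len(item["chunk"].split()), sum, chunk_size=3)` counts words per chunk and sums them. In the real pipeline the map step would be an LLM call rather than a plain function.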

Usage

The MOAR agent applies this directive when a Map operation must extract information from documents that are too long to process in a single Map call. The agent decides automatically whether to sample chunks (suitable for tasks such as categorization or theme extraction) or to process all chunks (needed for comprehensive extraction).
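The sample-or-process-all choice can be sketched as a small selection step. This is an illustration under stated assumptions, not the directive's actual logic: `select_chunks`, its `process_all` flag, and uniform random sampling are hypothetical, and the sampling method DocETL applies may differ.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_chunks(
    chunks: Sequence[T],
    process_all: bool,
    sample_size: int = 5,
    seed: Optional[int] = None,
) -> List[T]:
    """Return every chunk, or a uniform random subset kept in document order."""
    # Comprehensive extraction: every chunk must be processed.
    if process_all or len(chunks) <= sample_size:
        return list(chunks)
    # Categorization or theme extraction: a subset is assumed to be enough.
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(chunks)), sample_size))
    return [chunks[i] for i in picked]
```

Sorting the sampled indices keeps the surviving chunks in their original order, which matters if a later step relies on document order.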

Code Reference

Source Location

Repository: Ucbepic_Docetl
File: docetl/reasoning_optimizer/directives/doc_chunking.py
Lines: 1-466

Signature

class DocumentChunkingDirective(Directive):
    name = "doc_chunking"
    description = "Transforms a single Map operation into a chunking pipeline: Split -> Gather -> [Sample] -> Map -> Reduce."

    def check_applicability(self, ...) -> Tuple[bool, str]: ...
    def apply(self, ...) -> Tuple[List[Dict], List[Dict], str, dict]: ...

Import

from docetl.reasoning_optimizer.directives.doc_chunking import DocumentChunkingDirective

I/O Contract

Inputs

Name	Type	Required	Description
op_config	Dict	Yes	Operation configuration to transform
pipeline_ops	List[Dict]	Yes	Full pipeline operations list
op_idx	int	Yes	Index of target operation
dataset_descriptions	Dict	Yes	Dataset schema descriptions
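The top-level-Map restriction from the Description can be expressed against `pipeline_ops` and `op_idx` roughly as follows. This is a heuristic sketch, not the body of `check_applicability`: the function name `is_top_level_map`, the `"type"` and `"name"` keys, and the rule "a split or gather opens a chunking pipeline and a reduce closes it" are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed operation type names that open a chunking pipeline.
CHUNKING_OP_TYPES = {"split", "gather"}


def is_top_level_map(pipeline_ops: List[Dict], op_idx: int) -> Tuple[bool, str]:
    """Return (applicable, reason) for the operation at `op_idx`."""
    if not 0 <= op_idx < len(pipeline_ops):
        return False, f"op_idx {op_idx} is out of range"
    target = pipeline_ops[op_idx]
    if target.get("type") != "map":
        return False, f"operation '{target.get('name')}' is not a map"
    # Walk the preceding operations and track whether a chunking pipeline is open.
    inside_chunking = False
    for op in pipeline_ops[:op_idx]:
        if op.get("type") in CHUNKING_OP_TYPES:
            inside_chunking = True
        elif op.get("type") == "reduce":
            inside_chunking = False
    if inside_chunking:
        return False, "map sits inside an existing split/gather/reduce pipeline"
    return True, "top-level map"
```

Returning a `(bool, str)` pair mirrors the `Tuple[bool, str]` return type shown in the Signature section, so the reason string can be surfaced to the caller when the directive is rejected.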

Outputs

Name	Type	Description
new_ops	List[Dict]	Transformed operation configs
new_steps	List[Dict]	Updated pipeline steps
explanation	str	Human-readable description of changes
metadata	dict	Additional metadata about the transformation

Usage Examples

# Directives are typically invoked automatically by the MOAR agent.
# Manual invocation, for illustration. The four arguments are the inputs listed
# under I/O Contract and must already be defined by the caller.
from docetl.reasoning_optimizer.directives.doc_chunking import DocumentChunkingDirective

directive = DocumentChunkingDirective()
applicable, reason = directive.check_applicability(
    op_config, pipeline_ops, op_idx, dataset_descriptions
)
if applicable:
    new_ops, new_steps, explanation, metadata = directive.apply(
        op_config, pipeline_ops, op_idx, dataset_descriptions
    )
else:
    print(f"doc_chunking not applicable: {reason}")
