Implementation:Ucbepic Docetl Directive DocChunking

From Leeroopedia


Knowledge Sources
Domains Pipeline_Optimization, LLM_Operations
Last Updated 2026-02-08 00:00 GMT

Overview

A concrete tool, provided by the DocETL reasoning optimizer, that transforms a single Map operation into a full chunking pipeline (Split, Gather, optional Sample, Map, Reduce).

Description

The DocumentChunkingDirective class transforms a single Map operation into a chunking pipeline. The pipeline splits long documents into chunks, gathers context around each chunk, optionally samples a subset of chunks for efficiency, processes the chunks with a new Map operation, and then reduces the results. Sampling is applied by default unless the task requires processing every chunk. The directive can be applied only to a top-level Map operation, not to a sub-map inside an existing split/gather/reduce pipeline.
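The stages described above can be illustrated with a minimal, library-free sketch. It does not use DocETL itself: every function name here (split_document, gather_context, sample_chunks, run_chunked, find_years, union_all) is hypothetical, word-count chunking stands in for whatever splitting method the real pipeline uses, and a plain Python function stands in for the LLM-backed Map and Reduce steps.

```python
import random
from typing import Callable, Dict, List, Optional, Set


def split_document(text: str, chunk_size: int) -> List[str]:
    """Split: cut the text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def gather_context(chunks: List[str], window: int = 1) -> List[Dict]:
    """Gather: attach up to `window` neighbouring chunks on each side as context."""
    return [
        {
            "chunk_num": i,
            "previous": chunks[max(0, i - window):i],
            "content": chunk,
            "next": chunks[i + 1:i + 1 + window],
        }
        for i, chunk in enumerate(chunks)
    ]


def sample_chunks(gathered: List[Dict], k: Optional[int], seed: int = 0) -> List[Dict]:
    """Sample (optional): keep k random chunks, preserving document order."""
    if k is None or k >= len(gathered):
        return gathered  # no sampling requested, or nothing to drop
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(gathered)), k))
    return [gathered[i] for i in picked]


def run_chunked(
    text: str,
    map_fn: Callable[[Dict], Set[str]],
    reduce_fn: Callable[[List[Set[str]]], List[str]],
    chunk_size: int = 50,
    window: int = 1,
    sample_size: Optional[int] = None,
) -> List[str]:
    """Run Split -> Gather -> [Sample] -> Map -> Reduce over one document."""
    chunks = split_document(text, chunk_size)
    gathered = gather_context(chunks, window)
    selected = sample_chunks(gathered, sample_size)
    mapped = [map_fn(item) for item in selected]
    return reduce_fn(mapped)


def find_years(item: Dict) -> Set[str]:
    """Stand-in for the LLM Map step: collect four-digit tokens from one chunk."""
    tokens = (w.strip(".,") for w in item["content"].split())
    return {t for t in tokens if t.isdigit() and len(t) == 4}


def union_all(results: List[Set[str]]) -> List[str]:
    """Stand-in for the Reduce step: merge per-chunk results into one sorted list."""
    merged: Set[str] = set()
    for r in results:
        merged |= r
    return sorted(merged)


doc = " ".join(["alpha"] * 10 + ["1998"] + ["beta"] * 10 + ["2004."] + ["gamma"] * 3)
print(run_chunked(doc, find_years, union_all, chunk_size=5))
```

Passing sample_size=None processes every chunk, which corresponds to the comprehensive-extraction case; passing a small integer trades completeness for fewer Map calls, which suits tasks such as categorization where a subset of chunks is usually enough.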

Usage

The MOAR agent applies this directive when a Map operation must extract information from documents that are too long to process in a single Map call. The agent decides automatically whether to sample chunks (for example, for categorization or theme extraction) or to process all chunks (for comprehensive extraction).

Code Reference

Source Location

Signature

class DocumentChunkingDirective(Directive):
    name = "doc_chunking"
    description = "Transforms a single Map operation into a chunking pipeline: Split -> Gather -> [Sample] -> Map -> Reduce."

    def check_applicability(self, ...) -> Tuple[bool, str]: ...
    def apply(self, ...) -> Tuple[List[Dict], List[Dict], str, dict]: ...

Import

from docetl.reasoning_optimizer.directives.doc_chunking import DocumentChunkingDirective

I/O Contract

Inputs

Name                  Type        Required  Description
op_config             Dict        Yes       Operation configuration to transform
pipeline_ops          List[Dict]  Yes       Full pipeline operations list
op_idx                int         Yes       Index of target operation
dataset_descriptions  Dict        Yes       Dataset schema descriptions

Outputs

Name         Type        Description
new_ops      List[Dict]  Transformed operation configs
new_steps    List[Dict]  Updated pipeline steps
explanation  str         Human-readable description of changes
metadata     dict        Additional metadata about the transformation
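To make the shape of new_ops more concrete, the following sketch builds one plausible list of operation configs for the Split -> Gather -> [Sample] -> Map -> Reduce sequence. It is an illustration only, not the directive's actual output: sketch_chunking_ops is a hypothetical helper, and the field names (split_key, method_kwargs, content_key, doc_id_key, order_key, peripheral_chunks, samples, reduce_key) are assumptions modelled on DocETL's operation configs that should be checked against the DocETL documentation. The chunk size, sample count, and reduce prompt are placeholder values.

```python
from typing import Dict, List


def sketch_chunking_ops(map_op: Dict, text_key: str, sample: bool = True) -> List[Dict]:
    """Build an illustrative op list for one Map op (hypothetical helper)."""
    name = map_op["name"]
    split_name = f"split_{name}"
    doc_id_key = f"{split_name}_id"  # assumed name of the per-document id field

    ops: List[Dict] = [
        {   # Split: break the long field into chunks (placeholder chunk size)
            "name": split_name,
            "type": "split",
            "split_key": text_key,
            "method": "token_count",
            "method_kwargs": {"num_tokens": 1000},
        },
        {   # Gather: add one neighbouring chunk on each side as context
            "name": f"gather_{name}",
            "type": "gather",
            "content_key": f"{text_key}_chunk",
            "doc_id_key": doc_id_key,
            "order_key": f"{split_name}_chunk_num",
            "peripheral_chunks": {
                "previous": {"tail": {"count": 1}},
                "next": {"head": {"count": 1}},
            },
        },
    ]
    if sample:
        # Sample (optional): keep a subset of chunks (placeholder count)
        ops.append({"name": f"sample_{name}", "type": "sample",
                    "method": "uniform", "samples": 5})

    # Map: reuse the original op's settings under a new name; in practice the
    # prompt would also need to reference the gathered chunk, not the full text
    ops.append({**map_op, "name": f"{name}_per_chunk"})

    # Reduce: combine per-chunk outputs back into one result per document
    ops.append({
        "name": f"reduce_{name}",
        "type": "reduce",
        "reduce_key": doc_id_key,
        "prompt": "Combine the per-chunk results into a single answer.",
        "output": map_op.get("output", {}),
    })
    return ops


original = {
    "name": "extract_facts",
    "type": "map",
    "prompt": "List the key facts in this document.",
    "output": {"schema": {"facts": "list[str]"}},
}
for op in sketch_chunking_ops(original, text_key="text"):
    print(op["type"], op["name"])
```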

Usage Examples

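# Directives are typically invoked automatically by the MOAR agent.
# Manual invocation, assuming op_config, pipeline_ops, op_idx, and
# dataset_descriptions are already defined as described in the I/O Contract:
from docetl.reasoning_optimizer.directives.doc_chunking import DocumentChunkingDirective

directive = DocumentChunkingDirective()
applicable, reason = directive.check_applicability(op_config, pipeline_ops, op_idx, dataset_descriptions)
if applicable:
    new_ops, new_steps, explanation, metadata = directive.apply(op_config, pipeline_ops, op_idx, dataset_descriptions)
    print(explanation)  # human-readable summary of the transformation
else:
    print(f"doc_chunking not applicable: {reason}")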
