Implementation:Ucbepic Docetl Directive DocChunking
| Knowledge Sources | |
|---|---|
| Domains | Pipeline_Optimization, LLM_Operations |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool, provided by the DocETL reasoning optimizer, for transforming a single Map operation into a full chunking pipeline (Split, Gather, optional Sample, Map, Reduce).
Description
The DocumentChunkingDirective class rewrites a single Map operation as a chunking pipeline. The pipeline splits long documents into chunks, gathers context around each chunk, optionally samples a subset of chunks for efficiency, processes the chunks with a new Map operation, and then reduces the results. Sampling is applied by default unless the task requires processing every chunk. The directive applies only to a top-level Map operation, not to a sub-map inside an existing split/gather/reduce pipeline.
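The stage order described above can be illustrated with a small, library-free sketch. It is not DocETL code: the helper names (`split`, `gather`, `sample`, `run_chunked_map`), the word-based chunking, and the "previous chunks only" context window are all assumptions chosen to keep the example short.

```python
import random
from typing import Any, Callable, Dict, List, Optional


def split(doc_id: str, text: str, chunk_words: int) -> List[Dict[str, Any]]:
    """Split a document into consecutive chunks of at most `chunk_words` words."""
    words = text.split()
    return [
        {
            "doc_id": doc_id,
            "chunk_num": start // chunk_words,
            "chunk": " ".join(words[start:start + chunk_words]),
        }
        for start in range(0, len(words), chunk_words)
    ]


def gather(chunks: List[Dict[str, Any]], previous: int = 1) -> List[Dict[str, Any]]:
    """Attach up to `previous` preceding chunks to each chunk as context."""
    rendered = []
    for i, chunk in enumerate(chunks):
        context = [c["chunk"] for c in chunks[max(0, i - previous):i]]
        rendered.append({**chunk, "rendered": " ".join(context + [chunk["chunk"]])})
    return rendered


def sample(
    chunks: List[Dict[str, Any]], k: Optional[int], seed: int = 0
) -> List[Dict[str, Any]]:
    """Keep k chunks in document order; k=None means process every chunk."""
    if k is None or k >= len(chunks):
        return chunks
    picked = sorted(random.Random(seed).sample(range(len(chunks)), k))
    return [chunks[i] for i in picked]


def run_chunked_map(
    doc_id: str,
    text: str,
    map_fn: Callable[[str], Any],
    reduce_fn: Callable[[List[Any]], Any],
    chunk_words: int = 4,
    sample_size: Optional[int] = None,
) -> Any:
    """Split -> Gather -> [Sample] -> Map -> Reduce over one document."""
    chunks = gather(split(doc_id, text, chunk_words))
    chunks = sample(chunks, sample_size)
    mapped = [map_fn(c["rendered"]) for c in chunks]
    return reduce_fn(mapped)


if __name__ == "__main__":
    # Count words seen per rendered chunk, then sum across chunks.
    total = run_chunked_map(
        "d1", "a b c d e f g h i j", lambda s: len(s.split()), sum
    )
    print(total)
```

In the real directive the Map and Reduce steps are LLM operations defined by prompts; plain callables stand in for them here so the data flow is visible.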
Usage
The MOAR agent applies this directive when a Map operation must extract information from documents that are too long to process in a single call. The agent decides automatically whether to sample chunks (suitable for categorization or theme extraction) or to process all chunks (required for comprehensive extraction).
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/reasoning_optimizer/directives/doc_chunking.py
- Lines: 1-466
Signature
class DocumentChunkingDirective(Directive):
    name = "doc_chunking"
    description = "Transforms a single Map operation into a chunking pipeline: Split -> Gather -> [Sample] -> Map -> Reduce."
    def check_applicability(self, ...) -> Tuple[bool, str]: ...
    def apply(self, ...) -> Tuple[List[Dict], List[Dict], str, dict]: ...
Import
from docetl.reasoning_optimizer.directives.doc_chunking import DocumentChunkingDirective
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| op_config | Dict | Yes | Operation configuration to transform |
| pipeline_ops | List[Dict] | Yes | Full pipeline operations list |
| op_idx | int | Yes | Index of target operation |
| dataset_descriptions | Dict | Yes | Dataset schema descriptions |
Outputs
| Name | Type | Description |
|---|---|---|
| new_ops | List[Dict] | Transformed operation configs |
| new_steps | List[Dict] | Updated pipeline steps |
| explanation | str | Human-readable description of changes |
| metadata | dict | Additional metadata about the transformation |
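The `(bool, str)` shape returned by `check_applicability` and the top-level-Map restriction can be sketched as follows. This is a hypothetical stand-in, not the directive's actual logic: the function name `check_top_level_map` is invented for illustration, and treating any earlier `split` or `gather` operation as evidence that the Map is a sub-map is a simplifying assumption.

```python
from typing import Dict, List, Tuple

# Assumption: these operation types mark an existing chunking pipeline.
CHUNKING_TYPES = {"split", "gather"}


def check_top_level_map(pipeline_ops: List[Dict], op_idx: int) -> Tuple[bool, str]:
    """Return (applicable, reason) for the operation at op_idx."""
    op = pipeline_ops[op_idx]
    if op.get("type") != "map":
        return False, f"operation '{op.get('name')}' is not a map"
    # Simplification: any earlier split/gather means the map is already chunked.
    if any(prev.get("type") in CHUNKING_TYPES for prev in pipeline_ops[:op_idx]):
        return False, "map already sits inside a split/gather pipeline"
    return True, "top-level map; chunking can be applied"


if __name__ == "__main__":
    ops = [{"name": "extract", "type": "map"}]
    print(check_top_level_map(ops, 0))
```

Returning a reason string alongside the boolean lets a caller log why a directive was skipped without raising an exception.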
Usage Examples
# Directives are typically invoked by the MOAR agent automatically.
# Example of manual invocation; op_config, pipeline_ops, op_idx, and
# dataset_descriptions must be supplied by the caller (see I/O Contract).
from docetl.reasoning_optimizer.directives.doc_chunking import DocumentChunkingDirective
directive = DocumentChunkingDirective()
applicable, reason = directive.check_applicability(op_config, pipeline_ops, op_idx, dataset_descriptions)
if applicable:
    new_ops, new_steps, explanation, metadata = directive.apply(op_config, pipeline_ops, op_idx, dataset_descriptions)