Implementation:Ucbepic Docetl Directive ChunkHeaderSummary
| Knowledge Sources | |
|---|---|
| Domains | Pipeline_Optimization, LLM_Operations |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete tool, provided by the DocETL reasoning optimizer, for enhancing a Split -> Gather pipeline with header extraction and chunk summarization.
Description
The ChunkHeaderSummaryDirective class transforms an existing Split -> Gather pipeline into Split -> Map -> Gather. The inserted Map operation extracts headers and produces a summary for each chunk. The Gather operation is then modified so that the context it attaches to a chunk uses summaries for middle chunks and the extracted headers to convey document structure. The directive targets hierarchically structured documents.
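The rewrite can be pictured on plain operation dictionaries. The sketch below is not the directive's implementation; it only illustrates the Split -> Map -> Gather shape under stated assumptions. The function name `insert_header_summary_map`, the operation names, the prompt text, and the output fields `headers` and `summary` are hypothetical. The `doc_header_key` and `peripheral_chunks` fields are assumed to follow the DocETL Gather configuration and should be checked against the installed version.

```python
from copy import deepcopy
from typing import Dict, List

HEADER_KEY = "headers"   # hypothetical output field for extracted headers
SUMMARY_KEY = "summary"  # hypothetical output field for the chunk summary


def insert_header_summary_map(ops: List[Dict], split_idx: int) -> List[Dict]:
    """Return a copy of ops with a Map op inserted between Split and Gather."""
    if ops[split_idx]["type"] != "split" or ops[split_idx + 1]["type"] != "gather":
        raise ValueError("expected a Split op directly followed by a Gather op")

    new_ops = deepcopy(ops)  # leave the caller's pipeline untouched
    gather_op = new_ops[split_idx + 1]
    chunk_key = gather_op["content_key"]

    # Map op that reads each chunk and emits headers plus a summary.
    map_op = {
        "name": new_ops[split_idx]["name"] + "_header_summary",
        "type": "map",
        "prompt": (
            "Extract the section headers and write a short summary of this chunk:\n"
            "{{ input." + chunk_key + " }}"
        ),
        "output": {
            "schema": {
                HEADER_KEY: "list[{header: string, level: integer}]",
                SUMMARY_KEY: "string",
            }
        },
    }

    # Gather now uses headers for structure and summaries for middle chunks.
    gather_op["doc_header_key"] = HEADER_KEY
    peripheral = gather_op.setdefault("peripheral_chunks", {})
    peripheral.setdefault("previous", {})["middle"] = {"content_key": SUMMARY_KEY}

    new_ops.insert(split_idx + 1, map_op)
    return new_ops


pipeline = [
    {"name": "split_doc", "type": "split", "split_key": "text",
     "method": "token_count", "method_kwargs": {"num_tokens": 1000}},
    {"name": "gather_doc", "type": "gather", "content_key": "text_chunk",
     "doc_id_key": "split_doc_id", "order_key": "split_doc_chunk_num"},
]
rewritten = insert_header_summary_map(pipeline, 0)
print([op["type"] for op in rewritten])  # ['split', 'map', 'gather']
```

Working on a deep copy keeps the original operation list available, which is convenient if an optimizer wants to compare the rewritten pipeline against the original.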
Usage
The MOAR agent applies this directive when an existing chunking pipeline (Split -> Gather) processes documents with a clear hierarchical structure (legal contracts, technical manuals, research papers) and chunk-level analysis is inaccurate because each chunk needs headers and summaries from other chunks to make sense.
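The structural half of that condition, a Split operation directly followed by a Gather operation, is easy to test mechanically. The following is a minimal sketch with a hypothetical helper name, `find_split_gather_pairs`; it is not the directive's `check_applicability` logic, and it says nothing about whether the documents are hierarchical, which remains a judgment made by the agent.

```python
from typing import Dict, List, Tuple


def find_split_gather_pairs(ops: List[Dict]) -> List[Tuple[int, int]]:
    """Return index pairs (i, i + 1) where a Split op is directly followed by a Gather op."""
    pairs = []
    for i in range(len(ops) - 1):
        if ops[i].get("type") == "split" and ops[i + 1].get("type") == "gather":
            pairs.append((i, i + 1))
    return pairs


example_ops = [
    {"name": "split_contract", "type": "split"},
    {"name": "gather_context", "type": "gather"},
    {"name": "extract_clauses", "type": "map"},
]
print(find_split_gather_pairs(example_ops))  # prints [(0, 1)]
```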
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/reasoning_optimizer/directives/chunk_header_summary.py
- Lines: 1-348
Signature
class ChunkHeaderSummaryDirective(Directive):
    name = "chunk_header_summary"
    description = "Transforms Split -> Gather into Split -> Map -> Gather with header extraction and chunk summarization."

    def check_applicability(self, ...) -> Tuple[bool, str]: ...
    def apply(self, ...) -> Tuple[List[Dict], List[Dict], str, dict]: ...
Import
from docetl.reasoning_optimizer.directives.chunk_header_summary import ChunkHeaderSummaryDirective
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| op_config | Dict | Yes | Operation configuration to transform |
| pipeline_ops | List[Dict] | Yes | Full pipeline operations list |
| op_idx | int | Yes | Index of target operation |
| dataset_descriptions | Dict | Yes | Dataset schema descriptions |
Outputs
| Name | Type | Description |
|---|---|---|
| new_ops | List[Dict] | Transformed operation configs |
| new_steps | List[Dict] | Updated pipeline steps |
| explanation | str | Human-readable description of changes |
| metadata | dict | Additional metadata about the transformation |
Usage Examples
# Directives are typically invoked by the MOAR agent automatically.
# Example of manual invocation; op_config, pipeline_ops, op_idx, and
# dataset_descriptions are assumed to be defined by the caller (see Inputs).
from docetl.reasoning_optimizer.directives.chunk_header_summary import ChunkHeaderSummaryDirective

directive = ChunkHeaderSummaryDirective()
applicable, reason = directive.check_applicability(op_config, pipeline_ops, op_idx, dataset_descriptions)
if applicable:
    # apply() returns the four values listed under Outputs.
    new_ops, new_steps, explanation, metadata = directive.apply(op_config, pipeline_ops, op_idx, dataset_descriptions)