
Workflow:Ucbepic Docetl Long Document Chunking

From Leeroopedia
Revision as of 11:02, 16 February 2026 by Admin (Auto-imported from workflows/Ucbepic_Docetl_Long_Document_Chunking.md)


Knowledge Sources
Domains: LLM_Ops, Document_Processing, Chunking
Last Updated: 2026-02-08 03:00 GMT

Overview

End-to-end process for analyzing documents that exceed LLM context windows by splitting them into chunks, gathering peripheral context, processing each chunk independently, and reducing chunk-level results into a unified output.

Description

This workflow addresses the challenge of processing documents that are too long to fit within an LLM's context window. DocETL provides Split and Gather operations that break documents into token-counted or delimiter-based chunks, then enrich each chunk with surrounding context (previous/next chunks, document-level metadata, and hierarchical headers). After chunking, a map operation processes each chunk independently, and a reduce operation merges chunk-level results back into a single document-level output. This pattern (split-gather-map-reduce) is a core DocETL idiom for long document analysis, applicable to legal documents, research papers, lengthy transcripts, and any text exceeding context limits.

Usage

Execute this workflow when your documents exceed the LLM's context window (for example, 128K tokens for many current models; the exact limit depends on the model) or when processing the full document in one call produces low-quality results due to attention dilution. Common scenarios include analyzing lengthy legal filings, processing concatenated reviews spanning hundreds of pages, or extracting information from long-form transcripts. The V1 optimizer can also synthesize the split-gather pattern automatically when it detects that documents exceed context limits.

Execution Steps

Step 1: Assess Document Length

Determine whether documents in the dataset exceed the target LLM's context window. Check the token count of the text field that will be processed. If any documents approach or exceed the context limit, chunking is necessary. Also consider that even within context limits, very long inputs may degrade LLM output quality.

Key considerations:

  • Token counts vary by model and tokenizer
  • Datasets in which 12% or more of the documents exceed the context limit are strong candidates for chunking
  • Even documents within limits but above ~50K tokens may benefit from chunking for quality
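
The length check above can be sketched in a few lines of Python. This is a minimal illustration, not part of DocETL: the function names are hypothetical, and the whitespace word count is only a crude stand-in for a real model tokenizer, which will usually report more tokens for the same text.

```python
from typing import Callable, Iterable


def whitespace_token_count(text: str) -> int:
    """Crude stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def fraction_over_limit(
    texts: Iterable[str],
    context_limit: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> float:
    """Return the share of documents whose token count exceeds context_limit."""
    counts = [count_tokens(t) for t in texts]
    if not counts:
        return 0.0
    return sum(c > context_limit for c in counts) / len(counts)


# Three toy documents with 2, 50, and 200 "tokens" respectively.
docs = ["short text", "word " * 50, "word " * 200]
share = fraction_over_limit(docs, context_limit=100)
print(f"{share:.0%} of documents exceed the limit")
```

In practice, pass the tokenizer that matches the target model as count_tokens, since token counts vary by model and tokenizer.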

Step 2: Configure Split Operation

Define a split operation in the YAML pipeline that divides the document's text field into chunks. Choose the splitting method: token_count (fixed chunk size in tokens) or delimiter (split on a specific string pattern). Specify the split_key indicating which document field contains the text to split.

Key considerations:

  • Token count method provides consistent chunk sizes; typical values range from 3,000 to 90,000 tokens depending on model context
  • Delimiter method is useful for documents with natural section boundaries
  • The split operation automatically generates chunk ID and chunk number fields for downstream operations
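
To make the two splitting methods concrete, the following Python sketch mimics them on plain strings. It is an illustration under stated assumptions, not DocETL's implementation: tokens are approximated by whitespace-separated words, and the output field names (doc_id, chunk_num, chunk) are hypothetical stand-ins for the chunk ID and chunk number fields the split operation generates.

```python
from typing import Dict, List


def split_by_token_count(doc_id: str, text: str, num_tokens: int) -> List[Dict]:
    """Split text into chunks of at most num_tokens whitespace tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    tokens = text.split()
    chunks: List[Dict] = []
    for start in range(0, len(tokens), num_tokens):
        chunks.append({
            "doc_id": doc_id,              # hypothetical field name
            "chunk_num": len(chunks) + 1,  # hypothetical field name, 1-based
            "chunk": " ".join(tokens[start:start + num_tokens]),
        })
    return chunks


def split_by_delimiter(doc_id: str, text: str, delimiter: str) -> List[Dict]:
    """Split text on a literal delimiter, dropping empty pieces."""
    pieces = [p.strip() for p in text.split(delimiter)]
    non_empty = [p for p in pieces if p]
    return [
        {"doc_id": doc_id, "chunk_num": i, "chunk": p}
        for i, p in enumerate(non_empty, start=1)
    ]


token_chunks = split_by_token_count("doc-1", "a b c d e f g", num_tokens=3)
section_chunks = split_by_delimiter("doc-1", "Intro\n\nBody\n\nEnd", "\n\n")
print([c["chunk"] for c in token_chunks])
print([c["chunk"] for c in section_chunks])
```

The token-count variant yields evenly sized chunks with a shorter final chunk, while the delimiter variant yields chunks whose sizes follow the document's own section boundaries.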

Step 3: Extract Headers and Gather Context

Optionally, define a map operation to extract hierarchical headers from each chunk. Then configure a gather operation that enriches each chunk with peripheral context: content from adjacent chunks (previous tail, next head), document-level metadata, and the hierarchical header tree above the current position. This ensures each chunk retains enough context for accurate analysis.

Key considerations:

  • Header extraction uses an LLM to identify section headings and their nesting levels
  • Peripheral chunks are configured as fractions or counts of adjacent chunks to include
  • The gather operation produces a rendered chunk field that combines the main chunk with its context
  • The doc_id_key and order_key must match the fields generated by the split operation
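
The idea behind gathering can be sketched in plain Python as follows. This is a simplified illustration, not the DocETL gather operation: it includes only whole neighbouring chunks, omits header trees and document-level metadata, and the field names and marker strings are hypothetical.

```python
from typing import Dict, List


def gather_context(chunks: List[Dict], previous: int = 1, following: int = 1) -> List[Dict]:
    """Attach a rendered view with neighbouring chunks to every chunk.

    All chunks must belong to one document; they are ordered by "chunk_num".
    """
    ordered = sorted(chunks, key=lambda c: c["chunk_num"])
    gathered = []
    for i, chunk in enumerate(ordered):
        before = ordered[max(0, i - previous):i]      # chunks just before this one
        after = ordered[i + 1:i + 1 + following]      # chunks just after this one
        parts = []
        if before:
            parts.append("--- Previous context ---")
            parts.extend(c["chunk"] for c in before)
        parts.append("--- Begin main chunk ---")
        parts.append(chunk["chunk"])
        parts.append("--- End main chunk ---")
        if after:
            parts.append("--- Next context ---")
            parts.extend(c["chunk"] for c in after)
        gathered.append({**chunk, "chunk_rendered": "\n".join(parts)})
    return gathered


# Input order does not matter; chunk_num plays the role of the order key.
chunks = [
    {"doc_id": "doc-1", "chunk_num": 2, "chunk": "Section B text"},
    {"doc_id": "doc-1", "chunk_num": 1, "chunk": "Section A text"},
    {"doc_id": "doc-1", "chunk_num": 3, "chunk": "Section C text"},
]
gathered = gather_context(chunks)
print(gathered[1]["chunk_rendered"])
```

Marking the main chunk explicitly in the rendered text lets the downstream prompt tell the LLM which part to analyze and which part is background only.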

Step 4: Process Each Chunk

Define a map operation that processes each enriched chunk independently. The prompt should instruct the LLM to analyze only the main chunk content while using the peripheral context for understanding. The output schema defines what to extract from each chunk.

Key considerations:

  • The prompt should explicitly instruct the LLM to process only the main chunk
  • Peripheral context provides background for understanding but should not itself be analyzed
  • The output schema should match what will be merged in the reduce step
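
One way to phrase such a prompt is sketched below in Python. The helper name, the instruction wording, and the main chunk markers are all hypothetical; the sketch only shows how the instruction to stay within the main chunk can be placed ahead of the rendered chunk text.

```python
def build_chunk_prompt(task: str, rendered_chunk: str) -> str:
    """Combine task instructions with a rendered chunk.

    Assumes the rendered chunk marks the main chunk with the hypothetical
    "--- Begin main chunk ---" / "--- End main chunk ---" delimiters.
    """
    return "\n".join([
        task.strip(),
        "",
        "Extract information ONLY from the text between the main chunk markers.",
        "Text outside the markers is background context: use it to resolve",
        "references, but do not extract anything from it.",
        "",
        rendered_chunk.strip(),
    ])


rendered = "\n".join([
    "--- Previous context ---",
    "The Tenant named above is Acme Corp.",
    "--- Begin main chunk ---",
    "The Tenant shall pay rent on the first day of each month.",
    "--- End main chunk ---",
])
prompt = build_chunk_prompt("List every obligation of the tenant.", rendered)
print(prompt)
```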

Step 5: Reduce Chunk Results

Define a reduce operation that merges results from all chunks of the same original document. The reduce key is the document ID generated by the split operation. This step combines, deduplicates, and synthesizes chunk-level extractions into a coherent document-level output.

Key considerations:

  • Set associative: true if the reduce operation can be applied incrementally
  • Set pass_through: true to preserve original document fields
  • Set synthesize_resolve: false if entity resolution across chunks is not needed
  • The reduce prompt should handle merging overlapping or duplicate information from different chunks
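
The grouping and merging logic of the reduce step can be illustrated without an LLM. The Python sketch below is an assumption-laden stand-in, not DocETL's reduce operation: the field names (doc_id, chunk_num, items) are hypothetical, and a simple case-insensitive deduplication replaces the LLM-driven merge. It also shows what makes a merge safe to apply incrementally: merge_items gives the same result however the chunk results are grouped.

```python
from collections import defaultdict
from typing import Dict, List


def merge_items(left: List[str], right: List[str]) -> List[str]:
    """Merge two lists, dropping case-insensitive duplicates in first-seen order."""
    seen = set()
    merged = []
    for item in left + right:
        key = item.strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append(item.strip())
    return merged


def reduce_chunks(chunk_results: List[Dict]) -> Dict[str, List[str]]:
    """Group chunk-level results by document and fold them in chunk order."""
    grouped = defaultdict(list)
    for result in chunk_results:
        grouped[result["doc_id"]].append(result)
    reduced = {}
    for doc_id, results in grouped.items():
        merged: List[str] = []
        for result in sorted(results, key=lambda r: r["chunk_num"]):
            merged = merge_items(merged, result["items"])  # incremental fold
        reduced[doc_id] = merged
    return reduced


chunk_results = [
    {"doc_id": "doc-1", "chunk_num": 2, "items": ["Indemnification", "termination"]},
    {"doc_id": "doc-1", "chunk_num": 1, "items": ["Termination", "Governing law"]},
    {"doc_id": "doc-2", "chunk_num": 1, "items": ["Confidentiality"]},
]
reduced = reduce_chunks(chunk_results)
print(reduced)
```

Here the duplicate "termination" from the second chunk is dropped because the first chunk already contributed "Termination", which mirrors the overlap handling the reduce prompt is expected to perform.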
