Implementation:Ucbepic Docetl ReduceOperation Execute
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Aggregation |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
Concrete operation in DocETL's operations module that groups document records and reduces each group into a single record via LLM-powered synthesis.
Description
ReduceOperation groups input documents by one or more reduce keys, then synthesizes each group into a single output record using an LLM prompt. It supports multiple reduction strategies: batch reduce (all items in one call), incremental fold (process items in batches with a fold prompt), and parallel fold with merge. The operation also supports value sampling for large groups and gleaning for quality refinement.
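The two structural steps described above, grouping by reduce keys and batching for incremental fold, can be sketched in plain Python. The helper names (`group_by_keys`, `fold_batches`) are illustrative, not part of the DocETL API:

```python
from itertools import islice

def group_by_keys(records: list[dict], reduce_keys: list[str]) -> dict:
    """Group records by the tuple of their reduce-key values,
    mirroring the grouping step described above (hypothetical helper)."""
    groups: dict[tuple, list[dict]] = {}
    for rec in records:
        key = tuple(rec[k] for k in reduce_keys)
        groups.setdefault(key, []).append(rec)
    return groups

def fold_batches(items: list, batch_size: int):
    """Yield successive batches of items, as incremental fold would
    process them one fold-prompt call at a time."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch
```

Batch reduce is the degenerate case where the whole group fits in one call; parallel fold runs the per-batch calls concurrently and then combines partial results with a merge prompt.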
Usage
Use ReduceOperation to merge chunk-level results back into per-document summaries. Set reduce_key to the document ID from SplitOperation. For very large groups, configure fold_prompt and fold_batch_size.
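For intuition on why `fold_batch_size` matters for large groups: incremental fold carries an accumulator across batches, so no single call has to see the whole group. A minimal sketch, with a plain function standing in for the LLM fold call:

```python
def incremental_fold(items: list, batch_size: int, fold_fn, initial):
    """Sketch of incremental fold reduction: combine an accumulator with
    each successive batch. fold_fn stands in for the fold-prompt LLM call;
    this is an illustration, not DocETL's implementation."""
    acc = initial
    for i in range(0, len(items), batch_size):
        acc = fold_fn(acc, items[i:i + batch_size])
    return acc
```

With `batch_size=2` and a summing `fold_fn`, `incremental_fold([1, 2, 3, 4, 5], 2, lambda acc, b: acc + sum(b), 0)` walks the group in three calls instead of one large one.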
Code Reference
Source Location
- Repository: docetl
- File: docetl/operations/reduce.py
- Lines: L42-1047
Signature
```python
class ReduceOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "reduce"
        reduce_key: str | list[str]
        output: dict[str, Any]
        prompt: str
        model: str | None = None
        fold_prompt: str | None = None
        fold_batch_size: int | None = None
        merge_prompt: str | None = None
        merge_batch_size: int | None = None
        pass_through: bool | None = None

    def execute(self, input_data: list[dict]) -> tuple[list[dict], float]:
        """Group and reduce documents. Returns (reduced_results, total_cost)."""
```
Import
```python
from docetl.operations.reduce import ReduceOperation
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| reduce_key | str or list[str] | Yes | Field(s) to group by (typically document ID) |
| prompt | str | Yes | Jinja2 template; the group's items are available as `inputs` |
| output.schema | dict | Yes | Expected output fields and types |
| fold_prompt | str | No | Template for incremental fold reduction |
| fold_batch_size | int | No | Items per fold batch |
| input_data | list[dict] | Yes | Per-chunk results from MapOperation |
Outputs
| Name | Type | Description |
|---|---|---|
| results | list[dict] | One merged result per group (per document) |
| cost | float | Total LLM API cost |
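The output contract above, one merged record per group plus an aggregate cost, can be sketched as follows. `synthesize` is a stand-in for the LLM synthesis call, assumed to return `(record_dict, call_cost)`; this is an illustration of the shape, not the library's code:

```python
def run_reduce(grouped: list[tuple[dict, list[dict]]], synthesize):
    """Produce one output record per group and the summed API cost.
    grouped pairs each group's reduce-key fields with its items."""
    results, total_cost = [], 0.0
    for key_fields, items in grouped:
        record, cost = synthesize(items)   # stand-in for the LLM call
        record.update(key_fields)          # reduce-key values carried into the result
        results.append(record)
        total_cost += cost
    return results, total_cost
```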
Usage Examples
```yaml
operations:
  - name: merge_chunks
    type: reduce
    reduce_key: split_docs_id
    prompt: |
      Combine the following chunk analyses into a single document summary:
      {% for item in inputs %}
      Chunk {{ item.split_docs_chunk_num }}:
      Findings: {{ item.key_findings }}
      {% endfor %}
    output:
      schema:
        combined_findings: "list[str]"
        document_summary: "string"
    model: gpt-4o
```
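For groups too large for one call, the Usage section suggests adding `fold_prompt` and `fold_batch_size`. A hedged sketch of how that might extend the operation above, assuming the fold template sees the running result as `output` and the current batch as `inputs` (field names follow the example and are otherwise assumptions):

```yaml
    fold_prompt: |
      Update the running document summary with these additional chunks:
      Current summary: {{ output.document_summary }}
      {% for item in inputs %}
      Chunk {{ item.split_docs_chunk_num }}: {{ item.key_findings }}
      {% endfor %}
    fold_batch_size: 10
```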