Implementation:Ucbepic Docetl SplitOperation Execute
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Processing |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
A concrete operation, provided by DocETL's operations module, that splits documents into chunks.
Description
SplitOperation divides documents into chunks using either token counting (via tiktoken) or text delimiters. Each chunk receives a UUID-based document ID and sequential chunk number. The original document fields are preserved in each chunk record.
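The behavior described above can be sketched in plain Python. This is a minimal illustration, not the DocETL source: `split_text` and `split_documents` are hypothetical helpers, whitespace-separated words stand in for tiktoken tokens so the block runs without extra dependencies, and both the 1-based chunk numbering and the dropping of empty delimiter pieces are assumptions.

```python
import uuid


def split_text(text, method, method_kwargs):
    """Return a list of chunk strings for one document's text."""
    if method == "delimiter":
        # Split on the delimiter; dropping blank pieces is an assumption.
        return [part for part in text.split(method_kwargs["delimiter"]) if part.strip()]
    if method == "token_count":
        # Stand-in tokenizer: whitespace words instead of tiktoken tokens.
        # Rejoining with single spaces does not preserve original whitespace.
        tokens = text.split()
        size = method_kwargs["num_tokens"]
        return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]
    raise ValueError(f"Unknown split method: {method}")


def split_documents(docs, name, split_key, method, method_kwargs):
    """Emit one record per chunk, copying the original fields into each."""
    results = []
    for doc in docs:
        doc_id = str(uuid.uuid4())  # one ID shared by every chunk of this document
        chunks = split_text(doc[split_key], method, method_kwargs)
        for chunk_num, chunk in enumerate(chunks, start=1):  # 1-based: an assumption
            record = dict(doc)  # preserve the original document fields
            record[f"{split_key}_chunk"] = chunk
            record[f"{name}_id"] = doc_id
            record[f"{name}_chunk_num"] = chunk_num
            results.append(record)
    return results
```

Swapping the whitespace stand-in for a real tokenizer would change only the `token_count` branch; the per-chunk record layout stays the same.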
Usage
Use SplitOperation in a YAML pipeline or through the Python API when documents are too long to process in one pass. It is typically followed by GatherOperation (to add surrounding context to each chunk) and MapOperation (for per-chunk processing).
Code Reference
Source Location
- Repository: docetl
- File: docetl/operations/split.py
- Lines: L10-120
Signature
class SplitOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "split"
        split_key: str
        method: str  # "token_count" or "delimiter"
        method_kwargs: dict[str, Any]
        model: str | None = None

    def execute(self, input_data: list[dict]) -> tuple[list[dict], float]:
        """Split documents into chunks. Returns (chunked_docs, cost=0.0)."""
Import
from docetl.operations.split import SplitOperation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| split_key | str | Yes | Document field containing text to split |
| method | str | Yes | "token_count" or "delimiter" |
| method_kwargs.num_tokens | int | Conditional | Tokens per chunk (for token_count method) |
| method_kwargs.delimiter | str | Conditional | Text delimiter (for delimiter method) |
| input_data | list[dict] | Yes | Documents to split |
Outputs
| Name | Type | Description |
|---|---|---|
| results | list[dict] | Chunk records: the original fields plus {split_key}_chunk (chunk text), {name}_id (document ID), and {name}_chunk_num (chunk number), where {name} is the operation name |
| cost | float | Always 0.0 (no LLM calls) |
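As a rough illustration of how the output fields can be consumed downstream, the sketch below regroups chunk records by document ID and orders them by chunk number. `regroup_chunks` is a hypothetical helper, not part of DocETL, and it assumes `{name}` in the field names is the operation's name.

```python
from collections import defaultdict


def regroup_chunks(results, name, split_key):
    """Map each document ID to its chunk texts in chunk-number order."""
    by_doc = defaultdict(list)
    for record in results:
        by_doc[record[f"{name}_id"]].append(record)

    return {
        doc_id: [
            r[f"{split_key}_chunk"]
            for r in sorted(records, key=lambda r: r[f"{name}_chunk_num"])
        ]
        for doc_id, records in by_doc.items()
    }
```

For an operation named `split_docs` with `split_key: content`, the call would be `regroup_chunks(results, "split_docs", "content")`.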
Usage Examples
operations:
  - name: split_docs
    type: split
    split_key: content
    method: token_count
    method_kwargs:
      num_tokens: 2000
    model: gpt-4o
Related Pages
Implements Principle
Requires Environment