
Implementation:Ucbepic Docetl SplitOperation Execute

From Leeroopedia


Knowledge Sources

  • Domains: NLP, Text_Processing
  • Last Updated: 2026-02-08 01:40 GMT

Overview

A concrete operation, provided by DocETL's operations module, that splits documents into chunks.

Description

SplitOperation divides documents into chunks using either token counting (via tiktoken) or text delimiters. Each chunk receives a UUID-based document ID and sequential chunk number. The original document fields are preserved in each chunk record.
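The chunk record shape described above can be sketched in plain Python. This is an illustrative sketch only, not DocETL's implementation: the helper name `make_chunks` and the default operation name `split_docs` are assumptions; the field names (`{split_key}_chunk`, `{name}_id`, `{name}_chunk_num`) follow the I/O contract below.

```python
import uuid

def make_chunks(doc: dict, split_key: str, delimiter: str,
                name: str = "split_docs") -> list[dict]:
    """Sketch of delimiter-based splitting: each chunk keeps the original
    document fields and gains a chunk field, a shared UUID-based document id,
    and a sequential 1-based chunk number."""
    doc_id = str(uuid.uuid4())  # one UUID per source document
    chunks = []
    for i, piece in enumerate(doc[split_key].split(delimiter), start=1):
        chunk = dict(doc)  # original fields are preserved in each chunk record
        chunk[f"{split_key}_chunk"] = piece
        chunk[f"{name}_id"] = doc_id
        chunk[f"{name}_chunk_num"] = i
        chunks.append(chunk)
    return chunks
```

For example, splitting `{"title": "t", "content": "a\n\nb"}` on blank lines yields two records that share one document id and carry chunk numbers 1 and 2.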

Usage

Use SplitOperation in a YAML pipeline or via the Python API when processing long documents. It is typically followed by GatherOperation (to add surrounding context to each chunk) and MapOperation (for per-chunk processing).
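The split → gather → map chain can be sketched as a pipeline fragment. This shows the ordering only; the operation names are placeholders and each operation's configuration keys are elided rather than guessed:

```yaml
operations:
  - name: split_docs
    type: split
    # ...split configuration...
  - name: gather_context
    type: gather
    # ...gather configuration...
  - name: process_chunks
    type: map
    # ...map configuration...
```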

Code Reference

Source Location

  • Repository: docetl
  • File: docetl/operations/split.py
  • Lines: L10-120

Signature

class SplitOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "split"
        split_key: str
        method: str           # "token_count" or "delimiter"
        method_kwargs: dict[str, Any]
        model: str | None = None

    def execute(self, input_data: list[dict]) -> tuple[list[dict], float]:
        """Split documents into chunks. Returns (chunked_docs, cost=0.0)."""

Import

from docetl.operations.split import SplitOperation

I/O Contract

Inputs

Name Type Required Description
split_key str Yes Document field containing text to split
method str Yes "token_count" or "delimiter"
method_kwargs.num_tokens int Conditional Tokens per chunk (for token_count method)
method_kwargs.delimiter str Conditional Text delimiter (for delimiter method)
input_data list[dict] Yes Documents to split

Outputs

Name Type Description
results list[dict] Chunked documents with {split_key}_chunk, {name}_id, {name}_chunk_num fields
cost float Always 0.0 (no LLM calls)
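The token_count path above can be sketched as follows. DocETL counts tokens with tiktoken; to keep this sketch dependency-free, a whitespace tokenizer stands in, and the helper name `chunk_by_tokens` is an assumption, not DocETL API:

```python
def chunk_by_tokens(text: str, num_tokens: int) -> list[str]:
    """Sketch of token_count splitting: emit consecutive windows of at most
    num_tokens tokens. Whitespace tokenisation is a stand-in for tiktoken."""
    tokens = text.split()  # stand-in tokenizer (assumption)
    return [
        " ".join(tokens[i : i + num_tokens])
        for i in range(0, len(tokens), num_tokens)
    ]
```

Note the returned cost is always 0.0 because splitting makes no LLM calls; only the tokenizer runs locally.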

Usage Examples

operations:
  - name: split_docs
    type: split
    split_key: content
    method: token_count
    method_kwargs:
      num_tokens: 2000
      model: gpt-4o
