Principle:FlowiseAI Flowise Text Splitter Configuration

From Leeroopedia
Revision as of 17:47, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/FlowiseAI_Flowise_Text_Splitter_Configuration.md)
Sources: packages/ui/src/api/nodes.js
Domains: Document_Store_Ingestion
Last Updated: 2026-02-12 14:00 GMT

Overview

Text_Splitter_Configuration is a technique for selecting and configuring text splitting strategies that break documents into semantically meaningful chunks for embedding. The quality of text splitting directly affects retrieval accuracy in a retrieval-augmented generation (RAG) pipeline, which makes splitter configuration a critical optimization step.

Description

Text splitters divide large documents into smaller chunks suitable for embedding and retrieval. Different strategies exist for how documents are divided:

  • Character-based splitting -- Splits text at fixed character counts. Simple but may break mid-sentence or mid-word.
  • Recursive character splitting -- Splits on a hierarchy of separators (paragraphs, then sentences, then words) to preserve semantic boundaries. The most commonly used strategy.
  • Token-based splitting -- Splits based on token count (aligned with the embedding model's tokenizer), ensuring chunks fit within embedding model context windows.
  • Semantic splitting -- Uses embedding similarity to identify natural topic boundaries within the text.

Key configuration parameters include:

  • Chunk size -- The maximum size of each chunk (in characters or tokens). Smaller chunks yield more precise retrieval but less context per result.
  • Chunk overlap -- The number of characters/tokens shared between adjacent chunks. Overlap ensures that information at chunk boundaries is not lost.
  • Separators -- For recursive splitting, the ordered list of separator strings, tried from coarsest to finest (e.g., "\n\n", "\n", " ", "").
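
To make the interaction between chunk size and the separator hierarchy concrete, here is a minimal sketch of recursive character splitting. It is a simplified illustration, not the splitter that Flowise ships: the function name splitRecursive is hypothetical, chunk overlap is left out for brevity, and sizes are measured in characters.

```javascript
// Simplified recursive splitter (illustrative only, overlap omitted).
// It tries each separator in order and falls back to a finer separator
// only for pieces that are still larger than chunkSize.
function splitRecursive(text, chunkSize, separators) {
  if (text.length <= chunkSize) return text.length > 0 ? [text] : []
  const [sep, ...rest] = separators
  // No separator left, or the empty separator: cut at fixed character counts.
  if (sep === undefined || sep === '') {
    const pieces = []
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize))
    }
    return pieces
  }
  const chunks = []
  let current = ''
  for (const part of text.split(sep)) {
    // Greedily merge neighbouring parts while the result still fits.
    const candidate = current === '' ? part : current + sep + part
    if (candidate.length <= chunkSize) {
      current = candidate
      continue
    }
    if (current !== '') chunks.push(current)
    if (part.length <= chunkSize) {
      current = part
    } else {
      // This part alone is too large: split it with the finer separators.
      chunks.push(...splitRecursive(part, chunkSize, rest))
      current = ''
    }
  }
  if (current !== '') chunks.push(current)
  return chunks
}

const sample = 'Alpha beta.\n\nGamma delta epsilon.\n\nZeta'
console.log(splitRecursive(sample, 20, ['\n\n', '\n', ' ', '']))
console.log(splitRecursive(sample, 12, ['\n\n', '\n', ' ', '']))
```

With a chunk size of 20 the sample splits cleanly at paragraph boundaries; lowering it to 12 forces the middle paragraph to be split again at word boundaries, while the other paragraphs stay intact.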

Usage

Apply this principle when deciding how documents should be chunked before embedding in the RAG pipeline. Typical scenarios include:

  • Long-form documents -- Splitting research papers or technical documentation where preserving paragraph and section boundaries matters.
  • Structured data -- Configuring splitters for code files, markdown, or HTML where format-aware splitting preserves structural semantics.
  • Embedding model alignment -- Setting chunk sizes that align with the embedding model's optimal input length (e.g., 512 tokens for many models).

The available splitter components can be fetched through the UI's nodes API client (see Sources):

// Fetch the available text splitter components (call from within an async function)
const response = await nodesApi.getNodesByCategory('Text Splitters')
const splitters = response.data
// Each splitter component has inputParams defining chunk_size, chunk_overlap, etc.
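
Once the components are fetched, their configurable parameters can be listed for inspection. The sketch below is illustrative only: summarizeSplitterParams is a hypothetical helper, and the payload shape it assumes (a parameter array under inputParams or inputs, each entry carrying name and an optional default) is an assumption that should be checked against the actual response.

```javascript
// Hypothetical helper: collect parameter names and defaults per splitter.
// Assumption: each component lists its parameters under `inputParams` or
// `inputs`, with `name` and optional `default` fields. The real payload
// returned by the API may be shaped differently.
function summarizeSplitterParams(splitters) {
  return splitters.map((splitter) => {
    const params = splitter.inputParams ?? splitter.inputs ?? []
    return {
      name: splitter.name,
      params: Object.fromEntries(params.map((p) => [p.name, p.default ?? null]))
    }
  })
}

// Mock data standing in for `response.data` (illustrative values only).
const mockSplitters = [
  {
    name: 'exampleSplitter',
    inputs: [
      { name: 'chunkSize', default: 1000 },
      { name: 'chunkOverlap', default: 200 },
      { name: 'separators' }
    ]
  }
]

console.log(JSON.stringify(summarizeSplitterParams(mockSplitters), null, 2))
```

Parameters without a default are reported as null, which makes it easy to see which values must be supplied explicitly.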

Theoretical Basis

Text splitter configuration is grounded in chunking strategy optimization for retrieval-augmented generation:

  • Precision vs. context tradeoff -- Chunk size directly affects this tradeoff. Smaller chunks (100-200 tokens) yield more precise retrieval because each chunk covers a narrow topic, but individual results carry less context. Larger chunks (500-1000 tokens) provide more context per result but may dilute relevance with off-topic content.
  • Boundary preservation -- Recursive splitting preserves semantic boundaries (paragraphs, sentences) better than fixed-size splitting. When a document is split at a paragraph boundary, each chunk is more likely to contain a complete thought, improving both embedding quality and retrieval relevance.
  • Overlap for continuity -- Overlap ensures that information spanning chunk boundaries is captured in at least one chunk. A typical overlap of 10-20% of chunk size (e.g., 200-character overlap for 1000-character chunks) balances redundancy against coverage.
  • Format-aware splitting -- Specialized splitters for code, markdown, or HTML use format-specific delimiters (function boundaries, heading levels, tag structures) to produce chunks that respect the document's inherent structure.
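
The overlap arithmetic can be checked with a small fixed-size sliding window. This is a sketch under stated assumptions, not Flowise code: splitWithOverlap is a hypothetical name, sizes are in characters, and no separator awareness is attempted. Each chunk starts chunkSize - chunkOverlap characters after the previous one, so adjacent chunks share exactly chunkOverlap characters.

```javascript
// Fixed-size sliding window with overlap (illustrative only).
function splitWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize')
  }
  const stride = chunkSize - chunkOverlap // distance between chunk starts
  const chunks = []
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + chunkSize))
    // Stop once the current chunk already reaches the end of the text.
    if (start + chunkSize >= text.length) break
  }
  return chunks
}

// 2600 characters of cycling letters, so overlapping regions are comparable.
const sampleText = Array.from({ length: 2600 }, (_, i) =>
  String.fromCharCode(97 + (i % 26))
).join('')

const overlapped = splitWithOverlap(sampleText, 1000, 200)
console.log(overlapped.length) // 3 chunks, starting at 0, 800 and 1600
console.log(overlapped[0].slice(-200) === overlapped[1].slice(0, 200)) // true
```

With 1000-character chunks and a 200-character overlap, every chunk after the first repeats 20% of its predecessor, which is the redundancy cost paid for keeping boundary-spanning content together in at least one chunk.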

The optimal configuration depends on the specific use case, document characteristics, and embedding model, making the preview-then-commit workflow (see Principle:FlowiseAI_Flowise_Chunk_Preview) essential for iterative tuning.
