Workflow:Unstructured IO Unstructured Chunking And Embedding

From Leeroopedia
Knowledge Sources
Domains: RAG, Data_Engineering, NLP, Vector_Search
Last Updated: 2026-02-12 09:30 GMT

Overview

End-to-end process for transforming partitioned document elements into appropriately sized chunks with vector embeddings, preparing content for retrieval-augmented generation (RAG) and semantic search pipelines.

Description

This workflow covers the post-partition processing stages that prepare structured elements for downstream consumption by LLMs and vector databases. After documents have been partitioned into typed elements, the chunking stage groups and splits those elements into appropriately sized text chunks that respect semantic boundaries. Two strategies are available: basic chunking (sequential fill) and by-title chunking (section-aware splitting). The embedding stage then generates a vector representation of each chunk using one of seven supported embedding providers (OpenAI, HuggingFace, AWS Bedrock, Google Vertex AI, Voyage AI, OctoAI, MixedBread AI).

Key capabilities:

  • Two chunking strategies: basic (sequential) and by-title (section-aware)
  • Character-based and token-based chunk size control
  • Soft max (new_after_n_chars) and hard max (max_characters) size limits
  • Overlap between chunks for context preservation
  • Original element preservation within chunks for metadata traceability
  • Seven embedding providers with a consistent interface
  • Embeddings attached directly to element metadata

Usage

Run this workflow after you have partitioned documents into structured elements and need to prepare them for storage in a vector database or for use in a RAG pipeline. It is most useful when element-level granularity is too fine for your retrieval system and you need consolidated, embedding-enriched chunks that respect document structure.

Execution Steps

Step 1: Element_Preparation

Start with a list of Element objects produced by the partition pipeline. Review the element types and sizes to determine appropriate chunking parameters. Elements that are already small (e.g., short titles or list items) will be combined into larger chunks. Oversized elements (e.g., very long narrative text blocks) will be split at sentence boundaries to fit within the maximum chunk size.

Key considerations:

  • Inspect element type distribution to understand your content structure
  • Identify whether your documents have clear section boundaries (titles) for by-title chunking
  • Consider your downstream system's input size limits when choosing chunk parameters
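The inspection described above can be sketched in a few lines. This is a minimal illustration, not the library's API: the `Element` dataclass and the `summarize` helper are hypothetical stand-ins for partitioned elements, and the 500-character threshold is simply the default max_characters mentioned later in this workflow.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (category + text)."""
    category: str
    text: str


def summarize(elements):
    """Return per-category counts and the longest text length per category."""
    counts = Counter(el.category for el in elements)
    longest = {}
    for el in elements:
        longest[el.category] = max(longest.get(el.category, 0), len(el.text))
    return counts, longest


elements = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "Chunking groups small elements together. " * 20),
    Element("ListItem", "First point"),
    Element("Title", "Methods"),
    Element("NarrativeText", "A short paragraph."),
]

counts, longest = summarize(elements)
# Several Title elements hint that by-title chunking may be a good fit.
has_sections = counts["Title"] >= 2
# Categories whose longest element would have to be split at a 500-character limit.
oversized = [cat for cat, n in longest.items() if n > 500]
```

A summary like this gives a quick read on whether section boundaries exist and which element types are likely to be split rather than combined.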

Step 2: Chunking_Strategy_Selection

Choose between basic chunking and by-title chunking based on your document structure and retrieval needs. Basic chunking fills each chunk sequentially with elements until the size limit is reached, maximizing chunk density. By-title chunking uses Title elements as section boundaries, creating chunks that respect the document's semantic structure and never mix content from different sections.

Key considerations:

  • Use basic chunking for documents without clear section structure or when maximum chunk density is desired
  • Use by-title chunking for well-structured documents where section boundaries are meaningful for retrieval
  • By-title chunking supports combine_text_under_n_chars to merge small sections
  • By-title chunking supports multipage_sections to control whether sections can span pages

Step 3: Chunk_Size_Configuration

Configure the chunk size parameters to match your embedding model and retrieval system requirements. Set max_characters (hard maximum) and new_after_n_chars (soft maximum that triggers a new chunk at the next element boundary). Alternatively, use token-based limits with max_tokens and new_after_n_tokens when working with specific tokenizers. Configure overlap to include trailing text from the previous chunk at the start of the next chunk.

Key considerations:

  • Default max_characters is 500; adjust based on your embedding model's context window
  • Set new_after_n_chars lower than max_characters to prefer breaking at element boundaries
  • Use overlap (character count) to preserve context across chunk boundaries
  • Token-based chunking uses tiktoken and supports custom tokenizer specification
  • Enable include_orig_elements to preserve original element references within each chunk
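The interaction between the soft and hard limits can be illustrated with a small sketch. It is an approximation written for this page, not the library's code: `fill_chunks` is a hypothetical helper, each input text is assumed to already fit within max_characters, and the `"\n\n"` separator is an assumption.

```python
def fill_chunks(texts, max_characters=500, new_after_n_chars=None):
    """Group texts into chunks using a soft limit and a hard limit."""
    if new_after_n_chars is None:
        # With no soft limit given, only the hard limit applies.
        new_after_n_chars = max_characters
    chunks, current = [], ""
    for text in texts:
        joined = text if not current else current + "\n\n" + text
        soft_reached = len(current) >= new_after_n_chars
        hard_exceeded = len(joined) > max_characters
        if current and (soft_reached or hard_exceeded):
            # Close the chunk at this element boundary and start a new one.
            chunks.append(current)
            current = text
        else:
            current = joined
    if current:
        chunks.append(current)
    return chunks


texts = ["a" * 200, "b" * 200, "c" * 50]
hard_only = [len(c) for c in fill_chunks(texts)]
with_soft = [len(c) for c in fill_chunks(texts, new_after_n_chars=300)]
```

With only the hard limit, all three texts fit into one 454-character chunk. With a soft limit of 300, the chunk is closed once it has grown past 300 characters, so the third text starts a new chunk even though it would still have fit under the hard limit.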

Step 4: Chunking_Execution

Execute the chunking function on your list of elements. The chunker processes elements sequentially, combining small elements into CompositeElement chunks and splitting oversized elements at sentence boundaries. Each output chunk contains the concatenated text, merged metadata from its constituent elements, and optionally references to the original elements.

Key considerations:

  • Output elements are CompositeElement instances; table content is kept separate and, when oversized, is split into TableChunk elements
  • Metadata from constituent elements is merged (e.g., page numbers span the range)
  • The is_continuation metadata flag marks chunks that continue an oversized element split across several chunks
  • Empty elements are filtered out during chunking
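The splitting of an oversized element, together with overlap and the continuation flag, might look roughly like the sketch below. `split_oversized` is a hypothetical helper, not the library's function: it prefers a sentence end, falls back to a word boundary, and hard-cuts only as a last resort. The overlap handling (prefixing the tail of the previous piece) is likewise an assumption about the general idea.

```python
def split_oversized(text, max_characters=500, overlap=0):
    """Split text into (piece, is_continuation) pairs of at most max_characters each."""
    if overlap < 0 or overlap * 2 >= max_characters:
        raise ValueError("overlap must be non-negative and under half of max_characters")
    pieces, remainder, prefix = [], text.strip(), ""
    while remainder:
        budget = max_characters - len(prefix)
        if len(remainder) <= budget:
            cut = len(remainder)
        else:
            window = remainder[:budget]
            cut = window.rfind(". ") + 1       # prefer the end of a sentence
            if cut <= 0:
                cut = window.rfind(" ")        # fall back to a word boundary
            if cut <= 0:
                cut = budget                   # last resort: hard cut
        body = remainder[:cut].strip()
        # Every piece after the first is flagged as a continuation.
        pieces.append((prefix + body, bool(pieces)))
        remainder = remainder[cut:].strip()
        # Carry the tail of this piece into the next one as overlap.
        prefix = (body[-overlap:] + " ") if overlap and remainder else ""
    return pieces


text = "One two three. Four five six. Seven eight nine."
pieces = split_oversized(text, max_characters=20)
overlapped = split_oversized(text, max_characters=20, overlap=4)
```

Without overlap, the sample splits cleanly into its three sentences. With an overlap of 4, each later piece begins with the last four characters of the text that preceded it, and no piece exceeds the hard limit.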

Step 5: Embedding_Provider_Selection

Select an embedding provider based on your infrastructure, cost, and quality requirements. OpenAI provides the widely used text-embedding-ada-002 model. HuggingFace offers local embedding with sentence-transformers models. Cloud providers (Bedrock, Vertex AI) integrate with existing cloud infrastructure. Specialized providers (Voyage AI, MixedBread AI, OctoAI) offer domain-optimized embeddings.

Key considerations:

  • OpenAI embeddings require an API key and incur per-token costs
  • HuggingFace embeddings run locally with no API costs but require model download
  • Cloud provider embeddings integrate with existing IAM and billing
  • Embedding dimensions vary by provider and model; ensure compatibility with your vector database
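Because embedding dimensions differ between providers and models, it can help to verify the dimension before writing to a vector database. The sketch below shows one way to do that behind a common interface. All names are hypothetical: `Embedder`, `embed_text`, and `check_dimension` are illustrative, and `ToyHashEmbedder` is a deterministic placeholder that produces no semantically meaningful vectors.

```python
import hashlib
from typing import List, Protocol


class Embedder(Protocol):
    """Assumed minimal interface shared by all providers in this sketch."""

    def embed_text(self, text: str) -> List[float]: ...


class ToyHashEmbedder:
    """Placeholder provider: hashes text into a fixed-length vector (no semantics)."""

    def __init__(self, dimension: int = 8):
        self.dimension = dimension

    def embed_text(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self.dimension)]


def check_dimension(embedder: Embedder, expected: int) -> int:
    """Embed a probe string and fail early if the vector size does not match the index."""
    actual = len(embedder.embed_text("dimension probe"))
    if actual != expected:
        raise ValueError(f"index expects {expected} dimensions, embedder produces {actual}")
    return actual


dimension = check_dimension(ToyHashEmbedder(dimension=8), expected=8)
```

Running a probe like this once at startup catches a mismatch between the chosen model and the vector index before any chunks are written.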

Step 6: Embedding_Generation

Generate vector embeddings for each chunk using the selected provider. The embedding encoder processes the text content of each element and attaches the resulting vector to the element's metadata. Elements can then be serialized to JSON with their embeddings included, or passed directly to a vector database connector.

Key considerations:

  • Embeddings are added to each element's metadata as a list of floats
  • Batch processing is used internally for efficiency with API-based providers
  • Empty text elements are skipped during embedding
  • The embed_documents method processes a list of elements and returns them with embeddings attached
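The batching and skip-empty behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the library's encoder: `Chunk` is a hypothetical stand-in that stores the vector in a single `embeddings` field, `fake_embed_batch` replaces a real provider call, and the batch size is arbitrary.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk with an attached embedding."""
    text: str
    embeddings: Optional[list] = None


def embed_documents(chunks, embed_batch, batch_size=2):
    """Embed non-empty chunks in batches and attach each vector to its chunk."""
    pending = [c for c in chunks if c.text.strip()]  # skip empty text
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        vectors = embed_batch([c.text for c in batch])
        for chunk, vector in zip(batch, vectors):
            chunk.embeddings = list(vector)
    return chunks


def fake_embed_batch(texts):
    """Placeholder for a provider call: returns a 2-dimensional toy vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


chunks = [Chunk("hello world"), Chunk("   "), Chunk("abc")]
embedded = embed_documents(chunks, fake_embed_batch)
# Serialize chunks together with their embeddings, ready for storage.
serialized = json.dumps([asdict(c) for c in embedded])
```

The whitespace-only chunk keeps `None` as its embedding, while the other two receive vectors and serialize to JSON alongside their text.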
