Implementation:Infiniflow Ragflow Parser Chunk Methods

From Leeroopedia
Revision as of 11:22, 16 February 2026 by Admin (auto-imported from implementations/Infiniflow_Ragflow_Parser_Chunk_Methods.md)
Knowledge Sources
Domains: RAG, Document_Processing
Last Updated: 2026-02-12 06:00 GMT

Overview

Concrete tool for format-specific text chunking provided by RAGFlow's parser module system.

Description

Each parser module (naive, paper, book, laws, etc.) in rag/app/ implements a chunk() function with a standardized interface. The function receives the filename, the file content as bytes, a page range, a language hint, a progress callback, knowledge base and tenant IDs, and a parser configuration dictionary; it returns a list of chunk dictionaries. The naive chunker is the most commonly used and supports a configurable token count per chunk (chunk_token_num), custom delimiters (delimiter), and layout-aware splitting (layout_recognize).

Usage

build_chunks dispatches to the appropriate parser automatically via the FACTORY dictionary. Each parser module is a separate Python file in rag/app/.

Code Reference

Source Location

  • Repository: ragflow
  • File: rag/app/naive.py (naive chunker), rag/app/paper.py, rag/app/book.py, rag/app/laws.py, etc.
  • Lines: Varies per parser; naive.py is the reference implementation

Signature

# Standard interface for all parser modules
def chunk(
    filename: str,
    binary: bytes,
    from_page: int = 0,
    to_page: int = 100000000,
    lang: str = "",
    callback: callable = None,
    kb_id: str = "",
    parser_config: dict = {},
    tenant_id: str = ""
) -> list[dict]:
    """Parse and chunk a document.

    Args:
        filename: Original filename (used for type detection).
        binary: File content as bytes.
        from_page: Start page (0-indexed).
        to_page: End page (exclusive).
        lang: Language hint.
        callback: Progress callback.
        kb_id: Knowledge base ID.
        parser_config: Parser-specific options (chunk_token_num, delimiter, etc.).
        tenant_id: Tenant ID.

    Returns:
        list[dict] - Chunks with content_with_weight and metadata.
    """
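
To make the interface concrete, here is a toy function that follows the same parameter list. It is a sketch under stated assumptions, not one of RAGFlow's parsers: it treats the input as UTF-8 text with form-feed characters between pages, it assumes the progress callback takes a float and a message, and the list-valued, 1-based page_num_int is an assumed shape.

```python
from typing import Callable, Optional


def chunk(
    filename: str,
    binary: bytes,
    from_page: int = 0,
    to_page: int = 100000000,
    lang: str = "",
    callback: Optional[Callable[[float, str], None]] = None,
    kb_id: str = "",
    parser_config: Optional[dict] = None,
    tenant_id: str = "",
) -> list[dict]:
    """Toy parser: emit one chunk per form-feed-separated page of a text file."""
    # Fall back to a no-op so the body can report progress unconditionally.
    report = callback or (lambda progress, message: None)
    report(0.0, f"Start parsing {filename}")

    pages = binary.decode("utf-8", errors="replace").split("\f")
    selected = pages[from_page:to_page]  # from_page is 0-indexed, to_page exclusive

    chunks = []
    for offset, page_text in enumerate(selected):
        text = page_text.strip()
        if not text:
            continue
        chunks.append({
            "content_with_weight": text,
            # Assumed shape: a list holding the 1-based page number.
            "page_num_int": [from_page + offset + 1],
        })

    report(1.0, "Done")
    return chunks
```

The sketch uses None instead of {} as the default for parser_config, because a mutable default in Python is shared across calls; whether the real modules do the same is not stated on this page.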

Import

from rag.app import naive, paper, book, laws, presentation, table, qa, picture, one, audio, email, tag

I/O Contract

Inputs

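| Name | Type | Required | Description |
|------|------|----------|-------------|
| filename | str | Yes | Original filename (used for type detection) |
| binary | bytes | Yes | File content |
| from_page | int | No | Start page, 0-indexed (default 0) |
| to_page | int | No | End page, exclusive (default 100000000) |
| lang | str | No | Language hint |
| callback | callable | No | Progress callback |
| kb_id | str | No | Knowledge base ID |
| parser_config | dict | No | Parser options (chunk_token_num, delimiter, layout_recognize) |
| tenant_id | str | No | Tenant ID |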

Outputs

| Name | Type | Description |
|------|------|-------------|
| chunks | list[dict] | Chunks with content_with_weight, page_num_int, top_int, position_int |
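
A caller may want to sanity-check the returned list before indexing it. The helper below is a hypothetical sketch: validate_chunks is not part of RAGFlow, the sample values are made up for illustration, and treating the three position fields as expected on every chunk is an assumption.

```python
POSITION_KEYS = ("page_num_int", "top_int", "position_int")


def validate_chunks(chunks: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means every chunk passed."""
    problems = []
    for i, ch in enumerate(chunks):
        content = ch.get("content_with_weight")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"chunk {i}: empty or missing content_with_weight")
        missing = [k for k in POSITION_KEYS if k not in ch]
        if missing:
            problems.append(f"chunk {i}: missing {', '.join(missing)}")
    return problems


# Illustrative data only; real chunks come from a parser's chunk() function.
sample = [
    {
        "content_with_weight": "Intro",
        "page_num_int": [1],
        "top_int": [0],
        "position_int": [(1, 0, 10, 0, 5)],
    },
    {"content_with_weight": "  "},
]
for problem in validate_chunks(sample):
    print(problem)
```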

Usage Examples

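from rag.app import naive

# Read the document as bytes; the context manager closes the file handle.
with open("report.pdf", "rb") as f:
    binary = f.read()

# Chunk the PDF with a 512-token budget and a custom delimiter.
chunks = naive.chunk(
    filename="report.pdf",
    binary=binary,
    parser_config={"chunk_token_num": 512, "delimiter": "\\n"},
)
print(f"Generated {len(chunks)} chunks")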

Related Pages

Implements Principle
