Implementation:Infiniflow Ragflow Parser Chunk Methods

Knowledge Sources	RAGFlow
Domains	RAG, Document_Processing
Last Updated	2026-02-12 06:00 GMT

Overview

A concrete tool for format-specific text chunking, provided by RAGFlow's parser modules.

Description

Each parser module in rag/app/ (naive, paper, book, laws, and others) implements a chunk() function with a standardized interface. The function receives the filename, the file's binary content, a page range, a language hint, and a parser configuration, and returns a list of chunk dictionaries. The naive chunker is the most commonly used; it supports a configurable token count, custom delimiters, and layout-aware splitting.
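The interaction between a token budget and custom delimiters can be illustrated with a minimal sketch. This is not RAGFlow's implementation: the function name split_and_merge is hypothetical, and tokens are approximated by whitespace-separated words, whereas the real chunker uses its own tokenizer and layout information.

```python
import re

def split_and_merge(text, chunk_token_num=128, delimiter="\n!?"):
    """Split text after any delimiter character, then greedily merge the
    pieces until adding another would exceed the token budget.

    Assumption: a "token" is a whitespace-separated word.
    """
    if delimiter:
        # The capture group keeps each delimiter so it can be re-attached.
        parts = re.split("([" + re.escape(delimiter) + "])", text)
        pieces = []
        for i in range(0, len(parts), 2):
            piece = parts[i] + (parts[i + 1] if i + 1 < len(parts) else "")
            if piece.strip():
                pieces.append(piece)
    else:
        pieces = [text] if text.strip() else []

    chunks, current, count = [], "", 0
    for piece in pieces:
        n = len(piece.split())
        # Flush the current chunk when the next piece would overflow it.
        if current and count + n > chunk_token_num:
            chunks.append(current)
            current, count = "", 0
        current += piece
        count += n
    if current:
        chunks.append(current)
    return chunks

sample = "one two three. four five!six seven?eight"
result = split_and_merge(sample, chunk_token_num=3, delimiter="!?")
```

A single piece longer than the budget is kept whole in this sketch rather than split mid-sentence; whether the real chunker behaves the same way is not stated here.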

Usage

build_chunks dispatches to the appropriate parser automatically by looking it up in the FACTORY dictionary. Each parser module is a separate Python file in rag/app/.
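Dictionary-based dispatch of this kind can be sketched as follows. The stub functions and the string keys "naive" and "paper" are assumptions made for illustration; the actual FACTORY keys and the body of build_chunks are defined in RAGFlow and are not reproduced here.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for rag.app.naive and rag.app.paper.
def _naive_chunk(filename, binary, **kwargs):
    return [{"content_with_weight": binary.decode("utf-8"), "parser": "naive"}]

def _paper_chunk(filename, binary, **kwargs):
    return [{"content_with_weight": binary.decode("utf-8"), "parser": "paper"}]

naive = SimpleNamespace(chunk=_naive_chunk)
paper = SimpleNamespace(chunk=_paper_chunk)

# Maps a parser identifier to the module that exposes chunk().
FACTORY = {"naive": naive, "paper": paper}

def dispatch(parser_id, filename, binary, **kwargs):
    """Look up the parser module and call its chunk() function.

    Raises KeyError for an unknown parser_id (an assumption of this sketch).
    """
    chunker = FACTORY[parser_id]
    return chunker.chunk(filename, binary, **kwargs)

dispatched = dispatch("paper", "study.txt", b"Abstract. Method.", lang="English")
```

Because every module exposes the same chunk() interface, adding a parser in this pattern only requires a new module and one new dictionary entry.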

Code Reference

Source Location

Repository: ragflow
File: rag/app/naive.py (naive chunker), rag/app/paper.py, rag/app/book.py, rag/app/laws.py, etc.
Lines: Varies per parser; naive.py is the reference implementation

Signature

# Standard interface for all parser modules
def chunk(
    filename: str,
    binary: bytes,
    from_page: int = 0,
    to_page: int = 100000000,
    lang: str = "",
    callback: callable = None,
    kb_id: str = "",
    parser_config: dict = {},
    tenant_id: str = ""
) -> list[dict]:
    """Parse and chunk a document.

    Args:
        filename: Original filename (used for type detection).
        binary: File content as bytes.
        from_page: Start page (0-indexed).
        to_page: End page (exclusive).
        lang: Language hint.
        callback: Progress callback.
        kb_id: Knowledge base ID.
        parser_config: Parser-specific options (chunk_token_num, delimiter, etc.).
        tenant_id: Tenant ID.

    Returns:
        list[dict] - Chunks with content_with_weight and metadata.
    """
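The page-range and callback parameters can be illustrated with a small sketch. Everything here is hypothetical: fake_chunk is a stand-in rather than a RAGFlow function, each line of the input plays the role of a "page", and the callback's (progress, message) signature is an assumption, since the interface above only states that it reports progress.

```python
from typing import Callable, Optional

def make_progress_recorder(events: list) -> Callable:
    """Return a callback that appends (progress, message) tuples to `events`."""
    def callback(prog: Optional[float] = None, msg: str = "") -> None:
        events.append((prog, msg))
    return callback

def fake_chunk(filename, binary, from_page=0, to_page=100000000,
               callback=None, **kwargs):
    """Hypothetical stand-in for a parser's chunk(); one line equals one page."""
    pages = binary.decode("utf-8").splitlines()
    selected = pages[from_page:to_page]  # 0-indexed start, exclusive end
    if callback:
        callback(0.1, "Start to parse.")
    chunks = [{"content_with_weight": p} for p in selected if p.strip()]
    if callback:
        callback(1.0, f"Finished with {len(chunks)} chunks.")
    return chunks

events = []
page_chunks = fake_chunk(
    "notes.txt",
    b"page zero\npage one\npage two",
    from_page=1,
    to_page=3,
    callback=make_progress_recorder(events),
)
```

With from_page=1 and to_page=3, pages 1 and 2 are processed and page 0 is skipped, which matches the 0-indexed, end-exclusive convention in the docstring.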

Import

from rag.app import naive, paper, book, laws, presentation, table, qa, picture, one, audio, email, tag

I/O Contract

Inputs

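Name	Type	Required	Description
filename	str	Yes	Original filename (used for type detection)
binary	bytes	Yes	File content as bytes
from_page	int	No	Start page, 0-indexed (default 0)
to_page	int	No	End page, exclusive (default 100000000)
lang	str	No	Language hint (default "")
callback	callable	No	Progress callback (default None)
kb_id	str	No	Knowledge base ID (default "")
parser_config	dict	No	Parser options (chunk_token_num, delimiter, layout_recognize)
tenant_id	str	No	Tenant ID (default "")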

Outputs

Name	Type	Description
chunks	list[dict]	Chunks with content_with_weight, page_num_int, top_int, position_int
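A caller will typically want to sanity-check the returned list before indexing it. The sketch below assumes only that each chunk is a dictionary with a content_with_weight string; the sample values are invented, and the helper name summarize_chunks is hypothetical.

```python
def summarize_chunks(chunks):
    """Count chunks, non-empty chunks, and total characters of chunk text."""
    texts = [c.get("content_with_weight", "") for c in chunks]
    non_empty = [t for t in texts if t.strip()]
    return {
        "total": len(chunks),
        "non_empty": len(non_empty),
        "chars": sum(len(t) for t in non_empty),
    }

# Invented sample data shaped like the output described above.
sample_chunks = [
    {"content_with_weight": "Introduction to the report."},
    {"content_with_weight": "   "},
    {"content_with_weight": "Results."},
]
summary = summarize_chunks(sample_chunks)
```

The positional fields (page_num_int, top_int, position_int) are left out of the sample because their exact types are not specified here.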

Usage Examples

from rag.app import naive

# Read the file inside a context manager so the handle is closed promptly.
with open("report.pdf", "rb") as f:
    binary = f.read()

chunks = naive.chunk(
    filename="report.pdf",
    binary=binary,
    parser_config={"chunk_token_num": 512, "delimiter": "\\n"}
)
print(f"Generated {len(chunks)} chunks")

Related Pages

Implements Principle

Principle:Infiniflow_Ragflow_Text_Chunking
