Implementation:Infiniflow Ragflow Parser Chunk Methods
| Knowledge Sources | |
|---|---|
| Domains | RAG, Document_Processing |
| Last Updated | 2026-02-12 06:00 GMT |
Overview
A concrete tool for format-specific text chunking, provided by the parser modules in RAGFlow's rag/app/ package.
Description
Each parser module (naive, paper, book, laws, etc.) in rag/app/ implements a chunk() function with a standardized interface. The function receives the filename, file binary, page range, language, and parser configuration, then returns a list of chunk dictionaries. The naive chunker is the most commonly used and supports configurable token count, custom delimiters, and layout-aware splitting.
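As a rough illustration of how a token budget and custom delimiters might interact, here is a minimal sketch. It is not RAGFlow's implementation: `split_and_merge` is a hypothetical helper, tokens are approximated by whitespace-separated words instead of a real tokenizer, and only the `content_with_weight` field is filled in.

```python
import re


def split_and_merge(text: str, chunk_token_num: int = 128, delimiter: str = "\n") -> list[dict]:
    """Split text on any delimiter character, then pack the pieces into chunks.

    Hypothetical helper for illustration only; token counts are approximated
    by whitespace-separated words rather than a real tokenizer.
    """
    if delimiter:
        # Treat every character in `delimiter` as a possible split point.
        pattern = "[" + re.escape(delimiter) + "]"
        raw_pieces = re.split(pattern, text)
    else:
        raw_pieces = [text]
    pieces = [p.strip() for p in raw_pieces if p.strip()]

    chunks: list[dict] = []
    current: list[str] = []
    current_tokens = 0
    for piece in pieces:
        n = len(piece.split())
        # Start a new chunk when adding this piece would exceed the budget.
        if current and current_tokens + n > chunk_token_num:
            chunks.append({"content_with_weight": "\n".join(current)})
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += n
    if current:
        chunks.append({"content_with_weight": "\n".join(current)})
    return chunks
```

A piece that is larger than the budget on its own is kept whole in this sketch; a real chunker may split it further.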
Usage
build_chunks dispatches to the appropriate chunk() function automatically through the FACTORY dictionary. Each parser module is a separate Python file in rag/app/.
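The dictionary-based dispatch described above can be sketched as follows. This is an assumption-laden toy, not the code of build_chunks: `naive_chunk`, `paper_chunk`, and `dispatch` are hypothetical stand-ins, the `parser` key exists only to make the result visible, and the fallback to the naive chunker is an assumed policy.

```python
from typing import Callable


def naive_chunk(filename: str, binary: bytes, **kwargs) -> list[dict]:
    # Stand-in for a general-purpose parser module's chunk().
    return [{"content_with_weight": binary.decode("utf-8", errors="ignore"), "parser": "naive"}]


def paper_chunk(filename: str, binary: bytes, **kwargs) -> list[dict]:
    # Stand-in for a paper-specific parser module's chunk().
    return [{"content_with_weight": binary.decode("utf-8", errors="ignore"), "parser": "paper"}]


# Maps a parser identifier to the callable that implements it.
FACTORY: dict[str, Callable[..., list[dict]]] = {
    "naive": naive_chunk,
    "paper": paper_chunk,
}


def dispatch(parser_id: str, filename: str, binary: bytes, **kwargs) -> list[dict]:
    """Look up the chunker for parser_id; fall back to naive (an assumed policy)."""
    chunker = FACTORY.get(parser_id, FACTORY["naive"])
    return chunker(filename, binary, **kwargs)
```

Because every entry shares one call signature, supporting a new format in this pattern only requires registering another callable in the dictionary.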
Code Reference
Source Location
- Repository: ragflow
- File: rag/app/naive.py (naive chunker), rag/app/paper.py, rag/app/book.py, rag/app/laws.py, etc.
- Lines: Varies per parser; naive.py is the reference implementation
Signature
# Standard interface for all parser modules
def chunk(
    filename: str,
    binary: bytes,
    from_page: int = 0,
    to_page: int = 100000000,
    lang: str = "",
    callback: callable = None,
    kb_id: str = "",
    parser_config: dict = {},
    tenant_id: str = ""
) -> list[dict]:
    """Parse and chunk a document.

    Args:
        filename: Original filename (used for type detection).
        binary: File content as bytes.
        from_page: Start page (0-indexed).
        to_page: End page (exclusive).
        lang: Language hint.
        callback: Progress callback.
        kb_id: Knowledge base ID.
        parser_config: Parser-specific options (chunk_token_num, delimiter, etc.).
        tenant_id: Tenant ID.

    Returns:
        list[dict]: Chunks with content_with_weight and metadata.
    """
Import
from rag.app import naive, paper, book, laws, presentation, table, qa, picture, one, audio, email, tag
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| filename | str | Yes | Original filename |
| binary | bytes | Yes | File content |
| from_page | int | No | Start page (default 0) |
| to_page | int | No | End page (default 100000000) |
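| lang | str | No | Language hint |
| callback | callable | No | Progress callback (default None) |
| kb_id | str | No | Knowledge base ID |
| parser_config | dict | No | Parser options (chunk_token_num, delimiter, layout_recognize) |
| tenant_id | str | No | Tenant ID |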
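The signature's docstring describes from_page as 0-indexed and to_page as exclusive, which corresponds to a half-open range. A minimal sketch of that convention follows; `select_pages` is a hypothetical helper, and `pages` stands in for whatever per-page objects a parser actually produces.

```python
def select_pages(pages: list[str], from_page: int = 0, to_page: int = 100000000) -> list[str]:
    """Return the pages in the half-open range [from_page, to_page).

    Hypothetical helper for illustration; not part of any parser module.
    """
    if from_page < 0 or to_page < from_page:
        raise ValueError("expected 0 <= from_page <= to_page")
    # Slicing clamps to_page to the document length, so the large
    # default value effectively means "through the last page".
    return pages[from_page:to_page]
```

With this convention, consecutive calls such as (0, 10) and (10, 20) cover a document without overlapping or skipping a page.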
Outputs
| Name | Type | Description |
|---|---|---|
| chunks | list[dict] | Chunks with content_with_weight, page_num_int, top_int, position_int |
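When consuming the returned list, it can help to check which of the listed fields each chunk actually carries. The sketch below assumes only the field names from the table above; `missing_fields` and `summarize` are hypothetical helpers, and whether every parser emits every field for every file type is not stated here, so the code merely reports what is absent.

```python
EXPECTED_FIELDS = ("content_with_weight", "page_num_int", "top_int", "position_int")


def missing_fields(chunk: dict) -> list[str]:
    """Return the expected field names that are absent from a chunk dictionary."""
    return [name for name in EXPECTED_FIELDS if name not in chunk]


def summarize(chunks: list[dict]) -> dict:
    """Count chunks and collect the indices of chunks lacking any expected field."""
    incomplete = [i for i, c in enumerate(chunks) if missing_fields(c)]
    return {"total": len(chunks), "incomplete": incomplete}
```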
Usage Examples
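from rag.app import naive

# Read the file up front so the handle is closed before parsing starts.
with open("report.pdf", "rb") as f:
    binary = f.read()

chunks = naive.chunk(
    filename="report.pdf",
    binary=binary,
    # No-op progress callback, passed defensively in case the parser invokes it unconditionally.
    callback=lambda *args, **kwargs: None,
    parser_config={"chunk_token_num": 512, "delimiter": "\\n"},
)
print(f"Generated {len(chunks)} chunks")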