Implementation:Infiniflow Ragflow Build Chunks

From Leeroopedia
Revision as of 11:19, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Infiniflow_Ragflow_Build_Chunks.md)
Knowledge Sources
Domains RAG, Document_Processing
Last Updated 2026-02-12 06:00 GMT

Overview

A concrete tool for converting raw documents into text chunks, provided by the build_chunks function in RAGFlow's task executor.

Description

The build_chunks function is the main document parsing orchestrator. It fetches the file binary from object storage, selects the appropriate parser from the FACTORY dictionary based on parser_id, and invokes the parser's chunk() method. It then handles RAPTOR summarization, GraphRAG processing, and auto-keyword extraction as post-processing steps. The function has an 80-minute timeout, set by the @timeout(60 * 80, 1) decorator (60 * 80 = 4800 seconds).
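The fetch, select, and parse steps described above can be sketched as a small dispatch pattern. This is a minimal illustration under stated assumptions, not RAGFlow's code: NaiveParser, OneParser, TOY_FACTORY, fetch_binary, and the in-memory storage dict are hypothetical stand-ins, and the timeout and post-processing steps are omitted.

```python
import asyncio

# Hypothetical stand-ins for parser modules; these are NOT RAGFlow's parsers.
class NaiveParser:
    @staticmethod
    def chunk(name, binary, callback=None):
        # Split the decoded text on blank lines, one chunk per paragraph.
        text = binary.decode("utf-8")
        pieces = [p.strip() for p in text.split("\n\n") if p.strip()]
        if callback:
            callback(0.8, f"Split {name} into {len(pieces)} pieces")
        return [{"content_with_weight": p} for p in pieces]


class OneParser:
    @staticmethod
    def chunk(name, binary, callback=None):
        # Keep the whole document as a single chunk.
        if callback:
            callback(0.8, f"Kept {name} as one piece")
        return [{"content_with_weight": binary.decode("utf-8").strip()}]


# Toy counterpart of the FACTORY mapping: parser_id -> parser.
TOY_FACTORY = {"general": NaiveParser, "naive": NaiveParser, "one": OneParser}


async def fetch_binary(storage, key):
    # Assumption: object storage is modelled as a plain dict of bytes.
    return storage[key]


async def build_chunks_sketch(task, progress_callback, storage):
    binary = await fetch_binary(storage, task["name"])
    parser = TOY_FACTORY[task["parser_id"].lower()]
    # Run the synchronous parser in a worker thread so the event loop stays free.
    return await asyncio.to_thread(
        parser.chunk, task["name"], binary, callback=progress_callback
    )


seen = []

def report(prog=None, msg=""):
    seen.append((prog, msg))

storage = {"notes.txt": b"First paragraph.\n\nSecond paragraph.\n"}
task = {"name": "notes.txt", "parser_id": "naive"}
chunks = asyncio.run(build_chunks_sketch(task, report, storage))
print(len(chunks))  # prints 2
```

Keeping the parser lookup in a dictionary means a new document type only needs a new entry in the mapping, which matches how the FACTORY dictionary is described on this page.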

Usage

Called automatically by the task executor worker for each task consumed from Redis.
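As a rough picture of a worker that processes each task it consumes from a queue, the sketch below uses an in-process asyncio.Queue as a stand-in for Redis. The names handle_task and worker, the None sentinel, and the task fields are hypothetical and chosen only for illustration; the real task executor's consumption logic is not shown on this page.

```python
import asyncio


async def handle_task(task):
    # Placeholder for the chunk-building step; returns one dummy chunk.
    return [{"content_with_weight": f"chunk of {task['name']}"}]


async def worker(queue, results):
    # Consume tasks until a None sentinel arrives.
    while True:
        task = await queue.get()
        try:
            if task is None:
                break
            results[task["id"]] = await handle_task(task)
        finally:
            # Mark the item as processed, including the sentinel.
            queue.task_done()


async def main():
    queue = asyncio.Queue()
    results = {}
    for i, name in enumerate(["a.pdf", "b.docx"]):
        queue.put_nowait({"id": i, "name": name})
    queue.put_nowait(None)  # tell the worker to stop
    await worker(queue, results)
    return results


results = asyncio.run(main())
print(sorted(results))  # prints [0, 1]
```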

Code Reference

Source Location

  • Repository: ragflow
  • File: rag/svr/task_executor.py
  • Lines: L244-515 (build_chunks), L84-101 (FACTORY mapping)

Signature

FACTORY = {
    "general": naive,
    ParserType.NAIVE.value: naive,
    ParserType.PAPER.value: paper,
    ParserType.BOOK.value: book,
    ParserType.PRESENTATION.value: presentation,
    ParserType.MANUAL.value: manual,
    ParserType.LAWS.value: laws,
    ParserType.QA.value: qa,
    ParserType.TABLE.value: table,
    ParserType.RESUME.value: resume,
    ParserType.PICTURE.value: picture,
    ParserType.ONE.value: one,
    ParserType.AUDIO.value: audio,
    ParserType.EMAIL.value: email,
    ParserType.KG.value: naive,
    ParserType.TAG.value: tag,
}

@timeout(60 * 80, 1)
async def build_chunks(task: dict, progress_callback: callable) -> list[dict]:
    """Parse a document into chunks.

    Args:
        task: dict - Hydrated task dict from TaskService.get_task (21 fields).
        progress_callback: callable - Progress update callback.

    Returns:
        list[dict] - Chunks with content_with_weight, page_num_int, position_int, img_id, etc.
    """

Import

from rag.svr.task_executor import build_chunks, FACTORY

I/O Contract

Inputs

Name | Type | Required | Description
task | dict | Yes | Hydrated task dictionary (21 fields from get_task)
progress_callback | callable | Yes | Function to report progress (0.0-1.0)
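A progress callback of the kind listed above might look like the following sketch. The (prog, msg) parameter shape, the record_progress name, and the use of functools.partial to bind a task ID are assumptions made for illustration; the page itself only states that the callback reports progress in the range 0.0-1.0.

```python
from functools import partial

progress_log = []


def record_progress(task_id, prog=None, msg=""):
    # Clamp numeric progress into the documented 0.0-1.0 range.
    if prog is not None:
        prog = min(max(float(prog), 0.0), 1.0)
    progress_log.append((task_id, prog, msg))


# Bind the task ID once so callers only pass progress and a message.
callback = partial(record_progress, "task-42")

callback(0.1, "Fetched file")
callback(msg="Parsing")   # message-only update, progress unchanged
callback(1.5, "Done")     # out-of-range value is clamped to 1.0

for entry in progress_log:
    print(entry)
```

Binding the task ID up front keeps the callback signature small, so code that reports progress does not need to know which task it is working on.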

Outputs

Name | Type | Description
chunks | list[dict] | Document chunks with content_with_weight, page_num_int, top_int, position_int, available_int, img_id, image (optional bytes)

Usage Examples

# Internal usage within task_executor.py
task = TaskService.get_task(task_id)
chunks = await build_chunks(task, progress_callback=set_progress)
print(f"Parsed {len(chunks)} chunks from {task['name']}")

Related Pages

Implements Principle
