Implementation: Marker-Inc-Korea/AutoRAG Chunker.start_chunking
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Information Retrieval, Text Segmentation |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool, provided by the AutoRAG framework, for splitting parsed documents into chunked passages.
Description
The Chunker class is the top-level entry point for the document chunking stage. It accepts a parsed DataFrame (produced by the Parser stage) and a project directory. The class can be instantiated directly with a DataFrame or via the from_parquet() class method that reads a parquet file produced by the parsing step.
The start_chunking() method reads a YAML configuration file specifying which chunking modules to apply and their parameters (such as chunk size and overlap), loads the modules via get_param_combinations(), and delegates to run_chunker() in autorag/data/chunk/run.py. The YAML configuration is copied to the project directory as chunk_config.yaml for reproducibility. The resulting chunked DataFrame is saved as parquet in the project directory with columns doc_id, contents, path, start_end_idx, and metadata.
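The chunk-size and overlap parameters referenced above can be illustrated with a minimal, self-contained sketch. This is not AutoRAG's implementation: real chunking modules count tokens with a tokenizer, whereas this stand-in splits on whitespace purely for illustration.

```python
def chunk_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into overlapping windows of at most chunk_size tokens.

    Simplified stand-in for a token-based chunking module; "tokens"
    here are whitespace-separated words, not tokenizer tokens.
    """
    assert 0 <= chunk_overlap < chunk_size
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

# 10 tokens, windows of 4 with overlap 1 -> windows start at 0, 3, 6
text = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9"
print(chunk_tokens(text, chunk_size=4, chunk_overlap=1))
# -> ['t0 t1 t2 t3', 't3 t4 t5 t6', 't6 t7 t8 t9']
```

Each window re-reads `chunk_overlap` tokens from the previous one, which is what keeps sentence fragments at chunk boundaries retrievable from both neighboring passages.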
Usage
Import and use the Chunker class after the parsing step has completed. It consumes the parsed parquet output and produces chunked passages for downstream sampling and QA generation.
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/chunker.py (lines 14-51)
Signature
class Chunker:
    def __init__(self, raw_df: pd.DataFrame, project_dir: Optional[str] = None):
        ...

    @classmethod
    def from_parquet(cls, parsed_data_path: str, project_dir: Optional[str] = None) -> "Chunker":
        ...

    def start_chunking(self, yaml_path: str):
        ...
Import
from autorag.chunker import Chunker
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| raw_df | pd.DataFrame | yes (constructor) | Parsed DataFrame with columns: texts, path, page, last_modified_datetime |
| parsed_data_path | str | yes (from_parquet) | Path to parsed parquet file. Must end with ".parquet" and must exist. |
| project_dir | Optional[str] | no | Directory where chunked output and config will be stored. Defaults to current working directory. |
| yaml_path | str | yes | Path to the YAML configuration file specifying chunking modules and parameters |
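The preconditions documented for parsed_data_path (the file must exist and end with ".parquet") can be sketched as a small validation helper. This is a hypothetical helper written for illustration, not a function from the AutoRAG codebase.

```python
import os

def validate_parsed_data_path(parsed_data_path: str) -> None:
    """Mirror the documented from_parquet preconditions (hypothetical helper)."""
    if not parsed_data_path.endswith(".parquet"):
        raise ValueError(f"Expected a .parquet file, got: {parsed_data_path}")
    if not os.path.exists(parsed_data_path):
        raise FileNotFoundError(parsed_data_path)
```

Checking both conditions up front gives a clear error before any parquet parsing is attempted.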
Outputs
| Name | Type | Description |
|---|---|---|
| chunked parquet | File (parquet) | Parquet file in project_dir containing chunked passages with columns: doc_id (str), contents (str), path (str), start_end_idx (tuple), metadata (dict) |
| chunk_config.yaml | File (YAML) | Copy of the input YAML configuration stored in project_dir for reproducibility |
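A downstream consumer can sanity-check the chunked parquet against the schema in the table above. The column list comes from this document; the one-row DataFrame below is fabricated purely for illustration.

```python
import pandas as pd

# Columns the chunk stage is documented to emit
EXPECTED_COLUMNS = ["doc_id", "contents", "path", "start_end_idx", "metadata"]

def has_chunk_schema(df: pd.DataFrame) -> bool:
    """Return True if df carries every documented chunk-output column."""
    return set(EXPECTED_COLUMNS).issubset(df.columns)

# Fabricated one-row example matching the documented schema
sample = pd.DataFrame({
    "doc_id": ["chunk-0"],
    "contents": ["First chunked passage."],
    "path": ["docs/guide.pdf"],
    "start_end_idx": [(0, 22)],
    "metadata": [{"page": 1}],
})
print(has_chunk_schema(sample))  # True
```

In practice you would call pd.read_parquet on the file in project_dir instead of building the DataFrame by hand.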
Usage Examples
Basic Usage with from_parquet
from autorag.chunker import Chunker

# Load from parsed parquet file
chunker = Chunker.from_parquet(
    parsed_data_path="./my_project/parse/parsed_result.parquet",
    project_dir="./my_project/chunk",
)

# Run chunking using a YAML configuration file
chunker.start_chunking(yaml_path="./config/chunk_config.yaml")
Basic Usage with DataFrame
import pandas as pd
from autorag.chunker import Chunker

# Load the parsed DataFrame directly
parsed_df = pd.read_parquet("./my_project/parse/parsed_result.parquet")
chunker = Chunker(
    raw_df=parsed_df,
    project_dir="./my_project/chunk",
)
chunker.start_chunking(yaml_path="./config/chunk_config.yaml")
YAML Configuration Example
# chunk_config.yaml
- module_type: token
  chunk_size: 512
  chunk_overlap: 64
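When a module parameter is given as a list in the YAML (for example, chunk_size: [256, 512]), the get_param_combinations() step mentioned earlier expands it into one run per combination. The sketch below shows that expansion under the assumption that scalar values stay fixed while list values vary; it is an illustration of the idea, not AutoRAG's actual function.

```python
from itertools import product

def param_combinations(module: dict) -> list[dict]:
    """Expand list-valued parameters into the cartesian product of settings.

    Assumed behavior for illustration: scalars are fixed, lists vary.
    """
    keys = list(module)
    value_lists = [v if isinstance(v, list) else [v] for v in module.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]

module = {"module_type": "token", "chunk_size": [256, 512], "chunk_overlap": 64}
for combo in param_combinations(module):
    print(combo)
# {'module_type': 'token', 'chunk_size': 256, 'chunk_overlap': 64}
# {'module_type': 'token', 'chunk_size': 512, 'chunk_overlap': 64}
```

Each resulting dict corresponds to one chunking run, which is how a single YAML file can drive a sweep over chunk sizes.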