
Implementation:Marker Inc Korea AutoRAG Chunker Start Chunking

Domains: Natural Language Processing, Information Retrieval, Text Segmentation
Last Updated: 2026-02-12 00:00 GMT

Overview

A concrete tool, provided by the AutoRAG framework, for splitting parsed documents into chunked passages.

Description

The Chunker class is the top-level entry point for the document chunking stage. It accepts a parsed DataFrame (produced by the Parser stage) and a project directory. The class can be instantiated directly with a DataFrame or via the from_parquet() class method that reads a parquet file produced by the parsing step.

The start_chunking() method reads a YAML configuration file specifying which chunking modules to apply and their parameters (such as chunk size and overlap), loads the modules via get_param_combinations(), and delegates to run_chunker() in autorag/data/chunk/run.py. The YAML configuration is copied to the project directory as chunk_config.yaml for reproducibility. The resulting chunked DataFrame is saved as parquet in the project directory with columns doc_id, contents, path, start_end_idx, and metadata.
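
A minimal sketch of checking the artifacts that start_chunking() leaves in the project directory, based only on the behavior described above (a copied chunk_config.yaml plus a parquet file of chunked passages). The exact parquet file name is not specified here, so the sketch globs for any parquet file; the project directory path is a placeholder.

import glob
import os

import pandas as pd

project_dir = "./my_project/chunk"  # placeholder project directory

# start_chunking() copies the input YAML here for reproducibility
config_copy = os.path.join(project_dir, "chunk_config.yaml")
print("config copy exists:", os.path.exists(config_copy))

# The chunked DataFrame is saved as parquet in the project directory;
# the file name is not documented here, so search for any parquet file.
for parquet_path in glob.glob(os.path.join(project_dir, "*.parquet")):
    chunked_df = pd.read_parquet(parquet_path)
    print(parquet_path, list(chunked_df.columns))
    # Expected columns per the I/O contract below
    assert {"doc_id", "contents", "path", "start_end_idx", "metadata"} <= set(chunked_df.columns)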

Usage

Import and use the Chunker class after the parsing step has completed. It consumes the parsed parquet output and produces chunked passages for downstream sampling and QA generation.

Code Reference

Source Location

  • Repository: AutoRAG
  • File: autorag/chunker.py (lines 14-51)

Signature

class Chunker:
    def __init__(self, raw_df: pd.DataFrame, project_dir: Optional[str] = None):
        ...

    @classmethod
    def from_parquet(cls, parsed_data_path: str, project_dir: Optional[str] = None) -> "Chunker":
        ...

    def start_chunking(self, yaml_path: str):
        ...

Import

from autorag.chunker import Chunker

I/O Contract

Inputs

  • raw_df (pd.DataFrame, required by the constructor): parsed DataFrame with columns texts, path, page, and last_modified_datetime
  • parsed_data_path (str, required by from_parquet): path to the parsed parquet file; must end with ".parquet" and must exist
  • project_dir (Optional[str], optional): directory where the chunked output and the config copy are stored; defaults to the current working directory
  • yaml_path (str, required by start_chunking): path to the YAML configuration file specifying chunking modules and their parameters
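
As an illustration of the raw_df contract, here is a minimal sketch of building a parsed DataFrame by hand and handing it to the Chunker. The row values and paths are hypothetical placeholders; only the column names follow the contract above.

from datetime import datetime

import pandas as pd
from autorag.chunker import Chunker

# Hypothetical one-row parsed DataFrame with the required columns
parsed_df = pd.DataFrame(
    {
        "texts": ["Full text of one parsed page ..."],
        "path": ["./raw_docs/report.pdf"],
        "page": [1],
        "last_modified_datetime": [datetime(2024, 1, 1)],
    }
)

chunker = Chunker(raw_df=parsed_df, project_dir="./my_project/chunk")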

Outputs

  • Chunked parquet (file): parquet file in project_dir containing the chunked passages, with columns doc_id (str), contents (str), path (str), start_end_idx (tuple), and metadata (dict)
  • chunk_config.yaml (file): copy of the input YAML configuration, stored in project_dir for reproducibility

Usage Examples

Basic Usage with from_parquet

from autorag.chunker import Chunker

# Load from parsed parquet file
chunker = Chunker.from_parquet(
    parsed_data_path="./my_project/parse/parsed_result.parquet",
    project_dir="./my_project/chunk"
)

# Run chunking using a YAML configuration file
chunker.start_chunking(yaml_path="./config/chunk_config.yaml")

Basic Usage with DataFrame

import pandas as pd
from autorag.chunker import Chunker

# Load parsed DataFrame directly
parsed_df = pd.read_parquet("./my_project/parse/parsed_result.parquet")

chunker = Chunker(
    raw_df=parsed_df,
    project_dir="./my_project/chunk"
)

chunker.start_chunking(yaml_path="./config/chunk_config.yaml")

YAML Configuration Example

# chunk_config.yaml
- module_type: token
  chunk_size: 512
  chunk_overlap: 64
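
Before calling start_chunking(), it can help to sanity-check that the configuration file parses and lists the intended modules. Below is a minimal sketch using PyYAML, assuming the top-level list-of-modules layout shown in the example above.

import yaml

with open("./config/chunk_config.yaml") as f:
    chunk_config = yaml.safe_load(f)

# Assuming the layout above: a top-level list of module entries
for module in chunk_config:
    print(module.get("module_type"), module)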

Related Pages

Implements Principle

Requires Environment
