Implementation: Marker-Inc-Korea/AutoRAG Chunker.start_chunking
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Information Retrieval, Text Segmentation |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool, provided by the AutoRAG framework, for splitting parsed documents into chunked passages.
Description
The Chunker class is the top-level entry point for the document chunking stage. It accepts a parsed DataFrame (produced by the Parser stage) and a project directory. The class can be instantiated directly with a DataFrame or via the from_parquet() class method that reads a parquet file produced by the parsing step.
The start_chunking() method reads a YAML configuration file specifying which chunking modules to apply and their parameters (such as chunk size and overlap), loads the modules via get_param_combinations(), and delegates to run_chunker() in autorag/data/chunk/run.py. The YAML configuration is copied to the project directory as chunk_config.yaml for reproducibility. The resulting chunked DataFrame is saved as parquet in the project directory with columns doc_id, contents, path, start_end_idx, and metadata.
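The chunk-size and overlap parameters referenced above can be illustrated with a minimal, self-contained sketch. This is not AutoRAG's implementation: real chunking modules count tokens with a tokenizer, whereas this stand-in splits on whitespace purely for illustration.

```python
def chunk_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into overlapping windows of at most chunk_size tokens.

    Simplified stand-in for a token-based chunking module; "tokens"
    here are whitespace-separated words, not tokenizer tokens.
    """
    assert 0 <= chunk_overlap < chunk_size
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks

# 10 tokens, windows of 4 with overlap 1 -> windows start at 0, 3, 6
text = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9"
print(chunk_tokens(text, chunk_size=4, chunk_overlap=1))
# -> ['t0 t1 t2 t3', 't3 t4 t5 t6', 't6 t7 t8 t9']
```

Each window re-reads `chunk_overlap` tokens from the previous one, which is what keeps sentence fragments at chunk boundaries retrievable from both neighboring passages.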
Usage
Import and use the Chunker class after the parsing step has completed. It consumes the parsed parquet output and produces chunked passages for downstream sampling and QA generation.
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/chunker.py (lines 14-51)
Signature
class Chunker:
    def __init__(self, raw_df: pd.DataFrame, project_dir: Optional[str] = None):
        ...

    @classmethod
    def from_parquet(cls, parsed_data_path: str, project_dir: Optional[str] = None) -> "Chunker":
        ...

    def start_chunking(self, yaml_path: str):
        ...
Import
from autorag.chunker import Chunker
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| raw_df | pd.DataFrame | yes (constructor) | Parsed DataFrame with columns: texts, path, page, last_modified_datetime |
| parsed_data_path | str | yes (from_parquet) | Path to parsed parquet file. Must end with ".parquet" and must exist. |
| project_dir | Optional[str] | no | Directory where chunked output and config will be stored. Defaults to current working directory. |
| yaml_path | str | yes | Path to the YAML configuration file specifying chunking modules and parameters |
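The preconditions documented for parsed_data_path (the file must exist and end with ".parquet") can be sketched as a small validation helper. This is a hypothetical helper written for illustration, not a function from the AutoRAG codebase.

```python
import os

def validate_parsed_data_path(parsed_data_path: str) -> None:
    """Mirror the documented from_parquet preconditions (hypothetical helper)."""
    if not parsed_data_path.endswith(".parquet"):
        raise ValueError(f"Expected a .parquet file, got: {parsed_data_path}")
    if not os.path.exists(parsed_data_path):
        raise FileNotFoundError(parsed_data_path)
```

Checking both conditions up front gives a clear error before any parquet parsing is attempted.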
Outputs
| Name | Type | Description |
|---|---|---|
| chunked parquet | File (parquet) | Parquet file in project_dir containing chunked passages with columns: doc_id (str), contents (str), path (str), start_end_idx (tuple), metadata (dict) |
| chunk_config.yaml | File (YAML) | Copy of the input YAML configuration stored in project_dir for reproducibility |
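A downstream consumer can sanity-check the chunked parquet against the schema in the table above. The column list comes from this document; the one-row DataFrame below is fabricated purely for illustration.

```python
import pandas as pd

# Columns the chunk stage is documented to emit
EXPECTED_COLUMNS = ["doc_id", "contents", "path", "start_end_idx", "metadata"]

def has_chunk_schema(df: pd.DataFrame) -> bool:
    """Return True if df carries every documented chunk-output column."""
    return set(EXPECTED_COLUMNS).issubset(df.columns)

# Fabricated one-row example matching the documented schema
sample = pd.DataFrame({
    "doc_id": ["chunk-0"],
    "contents": ["First chunked passage."],
    "path": ["docs/guide.pdf"],
    "start_end_idx": [(0, 22)],
    "metadata": [{"page": 1}],
})
print(has_chunk_schema(sample))  # True
```

In practice you would call pd.read_parquet on the file in project_dir instead of building the DataFrame by hand.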
Usage Examples
Basic Usage with from_parquet
from autorag.chunker import Chunker

# Load from parsed parquet file
chunker = Chunker.from_parquet(
    parsed_data_path="./my_project/parse/parsed_result.parquet",
    project_dir="./my_project/chunk",
)

# Run chunking using a YAML configuration file
chunker.start_chunking(yaml_path="./config/chunk_config.yaml")
Basic Usage with DataFrame
import pandas as pd
from autorag.chunker import Chunker

# Load the parsed DataFrame directly
parsed_df = pd.read_parquet("./my_project/parse/parsed_result.parquet")
chunker = Chunker(
    raw_df=parsed_df,
    project_dir="./my_project/chunk",
)
chunker.start_chunking(yaml_path="./config/chunk_config.yaml")
YAML Configuration Example
# chunk_config.yaml
- module_type: token
  chunk_size: 512
  chunk_overlap: 64
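When a module parameter is given as a list in the YAML (for example, chunk_size: [256, 512]), the get_param_combinations() step mentioned earlier expands it into one run per combination. The sketch below shows that expansion under the assumption that scalar values stay fixed while list values vary; it is an illustration of the idea, not AutoRAG's actual function.

```python
from itertools import product

def param_combinations(module: dict) -> list[dict]:
    """Expand list-valued parameters into the cartesian product of settings.

    Assumed behavior for illustration: scalars are fixed, lists vary.
    """
    keys = list(module)
    value_lists = [v if isinstance(v, list) else [v] for v in module.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]

module = {"module_type": "token", "chunk_size": [256, 512], "chunk_overlap": 64}
for combo in param_combinations(module):
    print(combo)
# {'module_type': 'token', 'chunk_size': 256, 'chunk_overlap': 64}
# {'module_type': 'token', 'chunk_size': 512, 'chunk_overlap': 64}
```

Each resulting dict corresponds to one chunking run, which is how a single YAML file can drive a sweep over chunk sizes.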