Implementation:Marker Inc Korea AutoRAG Parser Start Parsing

Knowledge Sources	AutoRAG
Domains	Natural Language Processing, Information Retrieval, Document Understanding
Last Updated	2026-02-12 00:00 GMT

Overview

Concrete tool, provided by the AutoRAG framework, for parsing raw document files into structured text DataFrames.

Description

The Parser class is the top-level entry point for the document parsing stage of AutoRAG's evaluation data creation workflow. It accepts a glob pattern identifying the raw document files to process and a project directory where outputs will be stored. The start_parsing() method reads a YAML configuration file that specifies which parser modules to use for each file type, loads the corresponding modules via get_param_combinations(), and delegates the actual parsing work to run_parser().
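The configuration step described above can be pictured with a minimal, self-contained sketch. It assumes each module entry is a plain dictionary holding a module_type key plus its parameters. The helper split_module_entries is hypothetical and written only for illustration; it is not AutoRAG's get_param_combinations(), which may behave differently (for example, by resolving module names to callables).

```python
from typing import Any, Dict, List, Tuple


def split_module_entries(
    entries: List[Dict[str, Any]],
) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Split module entries into parallel lists of module names and parameters.

    Hypothetical helper for illustration only; not part of AutoRAG.
    """
    names: List[str] = []
    params: List[Dict[str, Any]] = []
    for entry in entries:
        if "module_type" not in entry:
            raise ValueError(f"Module entry is missing 'module_type': {entry}")
        names.append(entry["module_type"])
        # Everything except module_type is treated as a parameter of that module.
        params.append({k: v for k, v in entry.items() if k != "module_type"})
    return names, params


# Entries mirroring a configuration with one parser per file type.
config_entries = [
    {"module_type": "langchain_parse", "file_type": "pdf", "parse_method": "pdfminer"},
    {"module_type": "langchain_parse", "file_type": "csv", "parse_method": "csv"},
]
module_names, module_params = split_module_entries(config_entries)
```

Keeping names and parameters in parallel lists lets a runner iterate over both together and record, per module, which parameters produced which result.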

The run_parser() function at autorag/data/parse/run.py orchestrates multi-module parsing. It supports a default parser mapping for common file types (PDF via pdfminer, CSV, Markdown, HTML, XML) and automatically assigns default parsers for file types present in the data but not explicitly configured. Each parser module is wrapped by the parser_node decorator (autorag/data/parse/base.py), which normalizes outputs to a consistent four-column schema: texts, path, page, and last_modified_datetime. Results are saved as parquet files in the project directory, with a summary.csv recording execution times.
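Two ideas from this paragraph, default assignment for unconfigured file types and normalization to a fixed four-column schema, can be sketched without AutoRAG itself. Everything below is an assumption made for illustration: fill_default_parsers, normalize_output, and toy_parser are hypothetical names, the result is a plain dictionary of columns rather than a DataFrame, and only the PDF default (pdfminer) is taken from the text while the other default values are placeholders.

```python
import functools
import os
from datetime import datetime
from typing import Callable, Dict, List

# Column order taken from the schema described above.
SCHEMA = ("texts", "path", "page", "last_modified_datetime")

# Only the PDF default (pdfminer) is named in the text; other values are placeholders.
DEFAULT_METHODS: Dict[str, str] = {
    "pdf": "pdfminer",
    "csv": "<csv default>",
    "md": "<markdown default>",
    "html": "<html default>",
    "xml": "<xml default>",
}


def fill_default_parsers(paths: List[str], configured: Dict[str, str]) -> Dict[str, str]:
    """Add a default method for each known file type found in paths but not configured."""
    result = dict(configured)
    for path in paths:
        ext = os.path.splitext(path)[1].lstrip(".").lower()
        if ext in DEFAULT_METHODS and ext not in result:
            result[ext] = DEFAULT_METHODS[ext]
    return result


def normalize_output(func: Callable[..., tuple]) -> Callable[..., Dict[str, list]]:
    """Wrap a parser returning (texts, paths, pages) so it yields the four schema columns."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> Dict[str, list]:
        texts, paths, pages = func(*args, **kwargs)
        if not (len(texts) == len(paths) == len(pages)):
            raise ValueError("Parser returned columns of unequal length")
        # Use the file's modification time when the file exists, otherwise None.
        modified = [
            datetime.fromtimestamp(os.path.getmtime(p)) if os.path.exists(p) else None
            for p in paths
        ]
        return dict(zip(SCHEMA, (list(texts), list(paths), list(pages), modified)))

    return wrapper


@normalize_output
def toy_parser(paths: List[str]):
    # Page -1 is an assumed marker for "no page information".
    return [f"text from {p}" for p in paths], paths, [-1] * len(paths)


methods = fill_default_parsers(["a.pdf", "b.csv", "c.md"], {"pdf": "pdfminer"})
table = toy_parser(["a.pdf", "b.csv"])
```

Wrapping each parser in one decorator keeps the schema logic in a single place, so individual parsers only need to return their raw texts, paths, and page numbers.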

Usage

Import and use the Parser class when starting a new evaluation data creation pipeline from raw document files. This is typically the first programmatic step before chunking.

Code Reference

Source Location

Repository: AutoRAG
File: autorag/parser.py (lines 12-37)
Supporting files: autorag/data/parse/run.py (lines 38-141), autorag/data/parse/base.py (lines 14-71)

Signature

class Parser:
    def __init__(self, data_path_glob: str, project_dir: Optional[str] = None):
        ...

    def start_parsing(self, yaml_path: str, all_files: bool = False):
        ...

Import

from autorag.parser import Parser

I/O Contract

Inputs

Name	Type	Required	Description
data_path_glob	str	yes	Glob pattern matching the raw document files to parse (e.g., "./data/documents/*")
project_dir	Optional[str]	no	Directory where parsed output and config will be stored. Defaults to the current working directory.
yaml_path	str	yes	Path to the YAML configuration file specifying parser modules and their parameters for each file type
all_files	bool	no	If True, uses a single parser for all files regardless of type. Defaults to False.
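Because data_path_glob decides which files reach the parser, it can help to preview what a pattern matches before starting a run. The sketch below uses only the Python standard library; count_by_extension is a hypothetical helper, not an AutoRAG function, and the temporary files exist only for the demonstration.

```python
import glob
import os
import tempfile
from collections import Counter


def count_by_extension(data_path_glob: str) -> Counter:
    """Count files matched by the glob, keyed by lowercase extension without the dot."""
    counts: Counter = Counter()
    for path in glob.glob(data_path_glob):
        if os.path.isfile(path):
            ext = os.path.splitext(path)[1].lstrip(".").lower()
            counts[ext or "(no extension)"] += 1
    return counts


# Demonstration on a throwaway directory with three empty files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("report.pdf", "notes.pdf", "table.csv"):
        open(os.path.join(tmp, name), "w").close()
    demo_counts = count_by_extension(os.path.join(tmp, "*"))
```

A preview like this shows which file types are present, which is useful when deciding which types to configure explicitly in the YAML file.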

Outputs

Name	Type	Description
parsed_result.parquet	File (parquet)	Parquet file in project_dir containing the combined parsed results with columns: texts (str), path (str), page (int), last_modified_datetime (datetime)
{file_type}.parquet	File (parquet)	Per-file-type parquet files (e.g., pdf.parquet, csv.parquet) when all_files is False
summary.csv	File (CSV)	Summary of each parser module's execution including filename, module name, parameters, and execution time
parse_config.yaml	File (YAML)	Copy of the input YAML configuration stored in project_dir for reproducibility
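A quick post-run check can confirm that the files listed above were written. The following is a minimal sketch assuming the file names from the table; missing_outputs is a hypothetical helper, and the demonstration directory is created only to show the behavior.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_outputs(project_dir: str, file_types: Iterable[str] = ()) -> List[str]:
    """Return the expected output file names that are absent from project_dir."""
    expected = ["parsed_result.parquet", "summary.csv", "parse_config.yaml"]
    # Per-file-type results follow the {file_type}.parquet naming from the table.
    expected += [f"{file_type}.parquet" for file_type in file_types]
    root = Path(project_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration: a directory that contains only two of the expected files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "summary.csv").touch()
    (Path(tmp) / "parse_config.yaml").touch()
    demo_missing = missing_outputs(tmp, file_types=["pdf"])
```

In practice the helper would be pointed at the same project_dir that was passed to Parser, with file_types set to the types configured for the run.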

Usage Examples

Basic Usage

from autorag.parser import Parser

# Initialize the parser with a glob pattern that matches every file in the directory
parser = Parser(
    data_path_glob="./data/raw_documents/*",
    project_dir="./my_project/parse"
)

# Run parsing using a YAML configuration file
parser.start_parsing(yaml_path="./config/parse_config.yaml")

# After parsing, the output parquet file is at:
# ./my_project/parse/parsed_result.parquet

YAML Configuration Example

# parse_config.yaml
modules:
  - file_type: pdf
    module_type: langchain_parse
    parse_method: pdfminer
  - file_type: csv
    module_type: langchain_parse
    parse_method: csv

Using all_files Mode

from autorag.parser import Parser

parser = Parser(
    data_path_glob="./data/mixed_files/*",
    project_dir="./my_project/parse"
)

# Parse all files with a single module (requires YAML with one module entry)
parser.start_parsing(yaml_path="./config/all_files_config.yaml", all_files=True)

Related Pages

Implements Principle

Principle:Marker_Inc_Korea_AutoRAG_Document_Parsing

Requires Environment

Environment:Marker_Inc_Korea_AutoRAG_Python_3_10_Runtime
