Implementation:Marker Inc Korea AutoRAG Parser Start Parsing

Knowledge Sources
Domains: Natural Language Processing, Information Retrieval, Document Understanding
Last Updated: 2026-02-12 00:00 GMT

Overview

Concrete tool, provided by the AutoRAG framework, for parsing raw document files into structured text DataFrames.

Description

The Parser class is the top-level entry point for the document parsing stage of AutoRAG's evaluation data creation workflow. It accepts a glob pattern identifying the raw document files to process and a project directory where outputs will be stored. The start_parsing() method reads a YAML configuration file that specifies which parser modules to use for each file type, loads the corresponding modules via get_param_combinations(), and delegates the actual parsing work to run_parser().
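The control flow described above can be sketched without AutoRAG installed. This is a minimal illustration, not the library's implementation: `split_modules`, `start_parsing_sketch`, and the keyword arguments passed to the `run_parser` callback are hypothetical names introduced here, and the module list is assumed to be already loaded from the YAML file.

```python
import os
import shutil
from typing import Any, Callable, Dict, List, Tuple


def split_modules(
    modules: List[Dict[str, Any]],
) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Separate each entry's module_type from its remaining parameters."""
    module_types = [entry["module_type"] for entry in modules]
    module_params = [
        {key: value for key, value in entry.items() if key != "module_type"}
        for entry in modules
    ]
    return module_types, module_params


def start_parsing_sketch(
    yaml_path: str,
    project_dir: str,
    modules: List[Dict[str, Any]],
    run_parser: Callable[..., Any],
    all_files: bool = False,
) -> Any:
    """Hypothetical outline: prepare the project directory, then delegate."""
    os.makedirs(project_dir, exist_ok=True)
    # Keep a copy of the configuration next to the outputs for reproducibility.
    shutil.copy(yaml_path, os.path.join(project_dir, "parse_config.yaml"))
    module_types, module_params = split_modules(modules)
    # The parsing work itself is delegated to whatever callable is passed in.
    return run_parser(
        modules=module_types,
        module_params=module_params,
        project_dir=project_dir,
        all_files=all_files,
    )
```

Passing the runner in as a callable keeps the sketch testable with a stub; in AutoRAG itself the delegation target is run_parser().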

The run_parser() function at autorag/data/parse/run.py orchestrates multi-module parsing. It supports a default parser mapping for common file types (PDF via pdfminer, CSV, Markdown, HTML, XML) and automatically assigns default parsers for file types present in the data but not explicitly configured. Each parser module is wrapped by the parser_node decorator (autorag/data/parse/base.py), which normalizes outputs to a consistent four-column schema: texts, path, page, and last_modified_datetime. Results are saved as parquet files in the project directory, with a summary.csv recording execution times.
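The default-assignment behaviour can be illustrated with a small standalone sketch. It is an assumption-laden illustration rather than the code in autorag/data/parse/run.py: `DEFAULT_PARSE_METHODS` and `assign_parse_methods` are hypothetical names, only the "pdfminer" default for PDF comes from this page, the other method names are placeholders, and raising an error for unknown file types is a choice made for this sketch.

```python
import os
from typing import Dict, Iterable

# Illustrative defaults. Only "pdfminer" for PDF is stated on this page;
# the remaining method names are placeholders, not confirmed AutoRAG values.
DEFAULT_PARSE_METHODS: Dict[str, str] = {
    "pdf": "pdfminer",
    "csv": "csv-default",
    "md": "markdown-default",
    "html": "html-default",
    "xml": "xml-default",
}


def assign_parse_methods(
    paths: Iterable[str], configured: Dict[str, str]
) -> Dict[str, str]:
    """Map every file type present in `paths` to a parse method.

    Explicitly configured file types win; file types that appear in the
    data but not in the configuration fall back to the default mapping.
    """
    assignments: Dict[str, str] = {}
    for path in paths:
        file_type = os.path.splitext(path)[1].lstrip(".").lower()
        if not file_type or file_type in assignments:
            continue  # no extension, or this type is already resolved
        if file_type in configured:
            assignments[file_type] = configured[file_type]
        elif file_type in DEFAULT_PARSE_METHODS:
            assignments[file_type] = DEFAULT_PARSE_METHODS[file_type]
        else:
            raise ValueError(
                f"No parser configured or defaulted for '.{file_type}' files"
            )
    return assignments
```

For example, calling `assign_parse_methods(["a.pdf", "c.csv"], {"csv": "csv"})` keeps the configured CSV method and fills in the PDF default.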

Usage

Import and use the Parser class when starting a new evaluation data creation pipeline from raw document files. This is typically the first programmatic step before chunking.

Code Reference

Source Location

  • Repository: AutoRAG
  • File: autorag/parser.py (lines 12-37)
  • Supporting files: autorag/data/parse/run.py (lines 38-141), autorag/data/parse/base.py (lines 14-71)

Signature

class Parser:
    def __init__(self, data_path_glob: str, project_dir: Optional[str] = None):
        ...

    def start_parsing(self, yaml_path: str, all_files: bool = False):
        ...

Import

from autorag.parser import Parser

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| data_path_glob | str | yes | Glob pattern matching the raw document files to parse (e.g., "./data/documents/*") |
| project_dir | Optional[str] | no | Directory where parsed output and config will be stored. Defaults to the current working directory. |
| yaml_path | str | yes | Path to the YAML configuration file specifying parser modules and their parameters for each file type |
| all_files | bool | no | If True, uses a single parser for all files regardless of type. Defaults to False. |

Outputs

| Name | Type | Description |
|------|------|-------------|
| parsed_result.parquet | File (parquet) | Parquet file in project_dir containing the combined parsed results with columns: texts (str), path (str), page (int), last_modified_datetime (datetime) |
| {file_type}.parquet | File (parquet) | Per-file-type parquet files (e.g., pdf.parquet, csv.parquet) when all_files is False |
| summary.csv | File (CSV) | Summary of each parser module's execution including filename, module name, parameters, and execution time |
| parse_config.yaml | File (YAML) | Copy of the input YAML configuration stored in project_dir for reproducibility |
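The four-column output schema can be pictured as a decorator that reshapes whatever a parser returns. The sketch below is not the parser_node code from autorag/data/parse/base.py: `normalize_output` and `toy_parser` are hypothetical names, a plain dict of lists stands in for a DataFrame, and the fixed timestamp is a placeholder value.

```python
import functools
from datetime import datetime
from typing import Any, Callable, Dict, Iterable, List, Tuple

COLUMNS = ("texts", "path", "page", "last_modified_datetime")
Row = Tuple[str, str, int, datetime]


def normalize_output(
    func: Callable[..., Iterable[Row]],
) -> Callable[..., Dict[str, List[Any]]]:
    """Wrap a parser so its rows always come back as the four named columns."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Dict[str, List[Any]]:
        table: Dict[str, List[Any]] = {column: [] for column in COLUMNS}
        for row in func(*args, **kwargs):
            if len(row) != len(COLUMNS):
                raise ValueError(f"Expected {len(COLUMNS)} fields, got {len(row)}")
            for column, value in zip(COLUMNS, row):
                table[column].append(value)
        return table

    return wrapper


@normalize_output
def toy_parser(paths: List[str]) -> List[Row]:
    """Stand-in parser that emits one row per file with placeholder values."""
    placeholder_time = datetime(2024, 1, 1)
    return [(f"contents of {path}", path, 1, placeholder_time) for path in paths]
```

Because every wrapped parser returns the same column names, results from different file types can be concatenated into one combined table.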

Usage Examples

Basic Usage

from autorag.parser import Parser

# Initialize the parser with a glob pattern that matches every file in the directory
parser = Parser(
    data_path_glob="./data/raw_documents/*",
    project_dir="./my_project/parse"
)

# Run parsing using a YAML configuration file
parser.start_parsing(yaml_path="./config/parse_config.yaml")

# After parsing, the output parquet file is at:
# ./my_project/parse/parsed_result.parquet

YAML Configuration Example

# parse_config.yaml
modules:
  - file_type: pdf
    module_type: langchain_parse
    parse_method: pdfminer
  - file_type: csv
    module_type: langchain_parse
    parse_method: csv

Using all_files Mode

from autorag.parser import Parser

parser = Parser(
    data_path_glob="./data/mixed_files/*",
    project_dir="./my_project/parse"
)

# Parse all files with a single module (requires YAML with one module entry)
parser.start_parsing(yaml_path="./config/all_files_config.yaml", all_files=True)
