Implementation:Marker Inc Korea AutoRAG Parser Start Parsing
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Information Retrieval, Document Understanding |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A concrete tool, provided by the AutoRAG framework, for parsing raw document files into structured text DataFrames.
Description
The Parser class is the top-level entry point for the document parsing stage of AutoRAG's evaluation data creation workflow. It accepts a glob pattern identifying the raw document files to process and a project directory where outputs will be stored. The start_parsing() method reads a YAML configuration file that specifies which parser modules to use for each file type, loads the corresponding modules via get_param_combinations(), and delegates the actual parsing work to run_parser().
The run_parser() function at autorag/data/parse/run.py orchestrates multi-module parsing. It supports a default parser mapping for common file types (PDF via pdfminer, CSV, Markdown, HTML, XML) and automatically assigns default parsers for file types present in the data but not explicitly configured. Each parser module is wrapped by the parser_node decorator (autorag/data/parse/base.py), which normalizes outputs to a consistent four-column schema: texts, path, page, and last_modified_datetime. Results are saved as parquet files in the project directory, with a summary.csv recording execution times.
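The default-assignment behavior described above can be pictured with a minimal sketch. It is not the code in autorag/data/parse/run.py: the function name `assign_parsers`, the `DEFAULT_PARSE_METHODS` table, and the placeholder values for Markdown, HTML, and XML are assumptions made for illustration. Only the pdfminer and csv entries come from this page.

```python
from pathlib import Path
from typing import Dict, Iterable

# Hypothetical default table. Only "pdfminer" and "csv" appear on this page;
# the other values are placeholders, not real AutoRAG parse_method names.
DEFAULT_PARSE_METHODS: Dict[str, str] = {
    "pdf": "pdfminer",
    "csv": "csv",
    "md": "placeholder-markdown",
    "html": "placeholder-html",
    "xml": "placeholder-xml",
}


def assign_parsers(paths: Iterable[str], configured: Dict[str, str]) -> Dict[str, str]:
    """Build a file_type -> parse_method plan for the file types present in paths.

    Explicitly configured file types win; file types that are present but not
    configured fall back to the default table.
    """
    present = {Path(p).suffix.lower().lstrip(".") for p in paths}
    plan: Dict[str, str] = {}
    for file_type in sorted(present):
        if file_type in configured:
            plan[file_type] = configured[file_type]
        elif file_type in DEFAULT_PARSE_METHODS:
            plan[file_type] = DEFAULT_PARSE_METHODS[file_type]
        # File types with neither a configured nor a default parser are
        # skipped in this sketch.
    return plan


# Only "pdf" is configured; "csv" and "md" pick up defaults.
plan = assign_parsers(["a.pdf", "b.csv", "notes.md"], configured={"pdf": "pdfminer"})
```

In this sketch a file type that has no entry in either table is silently left out of the plan; how the real implementation treats such files is not stated on this page.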
Usage
Import and use the Parser class when starting a new evaluation data creation pipeline from raw document files. This is typically the first programmatic step before chunking.
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/parser.py (lines 12-37)
- Supporting files: autorag/data/parse/run.py (lines 38-141), autorag/data/parse/base.py (lines 14-71)
Signature
class Parser:
    def __init__(self, data_path_glob: str, project_dir: Optional[str] = None):
        ...

    def start_parsing(self, yaml_path: str, all_files: bool = False):
        ...
Import
from autorag.parser import Parser
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_path_glob | str | yes | Glob pattern matching the raw document files to parse (e.g., "./data/documents/*") |
| project_dir | Optional[str] | no | Directory where parsed output and config will be stored. Defaults to current working directory. |
| yaml_path | str | yes | Path to the YAML configuration file specifying parser modules and their parameters for each file type |
| all_files | bool | no | If True, uses a single parser for all files regardless of type. Defaults to False. |
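Because data_path_glob determines which file types end up in the run, it can help to inspect what the pattern matches before calling start_parsing(). The following is a small standard-library sketch; the helper name `count_file_types` is hypothetical and not part of AutoRAG.

```python
import glob
import os
from collections import Counter


def count_file_types(data_path_glob: str) -> Counter:
    """Count files matched by the glob, keyed by lowercase extension without the dot.

    Directories matched by the pattern are ignored. Files without an
    extension are counted under the empty string.
    """
    matched = [p for p in glob.glob(data_path_glob) if os.path.isfile(p)]
    return Counter(os.path.splitext(p)[1].lower().lstrip(".") for p in matched)


if __name__ == "__main__":
    # Same pattern style as the data_path_glob example in the table above.
    for file_type, count in sorted(count_file_types("./data/documents/*").items()):
        print(f"{file_type or '(no extension)'}: {count}")
```

An empty result usually means the pattern is relative to a different working directory than expected.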
Outputs
| Name | Type | Description |
|---|---|---|
| parsed_result.parquet | File (parquet) | Parquet file in project_dir containing the combined parsed results with columns: texts (str), path (str), page (int), last_modified_datetime (datetime) |
| {file_type}.parquet | File (parquet) | Per-file-type parquet files (e.g., pdf.parquet, csv.parquet) when all_files is False |
| summary.csv | File (CSV) | Summary of each parser module's execution including filename, module name, parameters, and execution time |
| parse_config.yaml | File (YAML) | Copy of the input YAML configuration stored in project_dir for reproducibility |
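After a run, the files listed above can be checked with a short standard-library sketch. The helper names `check_parse_outputs` and `list_per_type_parquets` are hypothetical; the file names are taken from the Outputs table.

```python
from pathlib import Path
from typing import Dict, List

# File names taken from the Outputs table on this page.
EXPECTED_OUTPUTS = ("parsed_result.parquet", "summary.csv", "parse_config.yaml")


def check_parse_outputs(project_dir: str) -> Dict[str, bool]:
    """Report which of the expected output files exist in project_dir."""
    root = Path(project_dir)
    return {name: (root / name).is_file() for name in EXPECTED_OUTPUTS}


def list_per_type_parquets(project_dir: str) -> List[str]:
    """List per-file-type parquet files (e.g. pdf.parquet), excluding the combined result."""
    root = Path(project_dir)
    return sorted(
        p.name for p in root.glob("*.parquet") if p.name != "parsed_result.parquet"
    )
```

For example, `check_parse_outputs("./my_project/parse")` returns a mapping from each expected file name to whether it is present, which is a quick way to confirm that a run completed before moving on to chunking.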
Usage Examples
Basic Usage
from autorag.parser import Parser
# Initialize the parser with a glob pattern that matches every file in the
# directory (here, a mix of PDF and CSV files)
parser = Parser(
    data_path_glob="./data/raw_documents/*",
    project_dir="./my_project/parse",
)
# Run parsing using a YAML configuration file
parser.start_parsing(yaml_path="./config/parse_config.yaml")
# After parsing, the output parquet file is at:
# ./my_project/parse/parsed_result.parquet
YAML Configuration Example
# parse_config.yaml
modules:
  - file_type: pdf
    module_type: langchain_parse
    parse_method: pdfminer
  - file_type: csv
    module_type: langchain_parse
    parse_method: csv
Using all_files Mode
from autorag.parser import Parser
parser = Parser(
    data_path_glob="./data/mixed_files/*",
    project_dir="./my_project/parse",
)
# Parse all files with a single module (requires YAML with one module entry)
parser.start_parsing(yaml_path="./config/all_files_config.yaml", all_files=True)