Implementation:Marker Inc Korea AutoRAG Bm25 And Vectordb Ingest

From Leeroopedia


Knowledge Sources
Domains: Information_Retrieval, Indexing
Last Updated: 2026-02-08 06:00 GMT

Overview

Concrete tools, provided by AutoRAG's retrieval modules, for ingesting corpus passages into BM25 and vector database retrieval indexes.

Description

bm25_ingest tokenizes the corpus passages and stores the result as a pickle file for BM25 retrieval. It supports several tokenizers: porter_stemmer (the default), Korean (ko_kiwi, ko_kkma, ko_okt), Japanese (sudachipy), whitespace splitting (space), and HuggingFace tokenizers.

vectordb_ingest_api embeds passages through an asynchronous embedding API (OpenAI) and ingests them in batches. vectordb_ingest_huggingface embeds passages with a local HuggingFace model. Both vectordb functions check the vector store for documents that already exist, so re-running ingestion does not create duplicates.
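The tokenize-then-pickle step described above can be sketched with the standard library alone. This is a minimal illustration, not AutoRAG's implementation: the function name sketch_bm25_ingest, the whitespace tokenizer, and the pickle keys "tokens" and "passage_id" are assumptions made for the example, and plain lists stand in for the doc_id and contents columns of the corpus DataFrame.

```python
import os
import pickle
import tempfile
from typing import Callable, Dict, List


def space_tokenize(text: str) -> List[str]:
    """Split on whitespace; a stand-in for a 'space'-style tokenizer."""
    return text.split()


def sketch_bm25_ingest(
    corpus_path: str,
    doc_ids: List[str],
    contents: List[str],
    tokenizer: Callable[[str], List[str]] = space_tokenize,
) -> None:
    """Tokenize passages and pickle them with their IDs (hypothetical layout)."""
    if not corpus_path.endswith(".pkl"):
        raise ValueError("corpus_path must end with .pkl")
    if len(doc_ids) != len(contents):
        raise ValueError("doc_ids and contents must have the same length")

    # Assumed payload layout: one token list per passage, aligned with its ID.
    payload: Dict[str, list] = {
        "tokens": [tokenizer(text) for text in contents],
        "passage_id": list(doc_ids),
    }

    parent = os.path.dirname(corpus_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(corpus_path, "wb") as f:
        pickle.dump(payload, f)


# Round trip on a two-passage toy corpus written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "bm25_space.pkl")
sketch_bm25_ingest(
    demo_path,
    ["doc-1", "doc-2"],
    ["Hello BM25 world", "Vector stores differ"],
)
with open(demo_path, "rb") as f:
    stored = pickle.load(f)
```

Keeping the token lists and the passage IDs in parallel lists means a BM25 scorer can map a ranked position straight back to a doc_id. Swapping the tokenizer argument is where a stemmer or a language-specific tokenizer would plug in.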

Usage

These functions are called automatically by Evaluator.start_trial when the pipeline config includes retrieval modules. They can also be called directly for manual index management; note that vectordb_ingest_api is a coroutine and must be awaited.

Code Reference

Source Location

  • Repository: AutoRAG
  • File: autorag/nodes/lexicalretrieval/bm25.py (bm25_ingest), autorag/nodes/semanticretrieval/vectordb.py (vectordb_ingest_api, vectordb_ingest_huggingface)
  • Lines: bm25.py L324-354, vectordb.py L242-259 (API), vectordb.py L265-292 (HuggingFace)

Signature

def bm25_ingest(
    corpus_path: str,
    corpus_data: pd.DataFrame,
    bm25_tokenizer: str = "porter_stemmer"
) -> None:
    """
    Ingest corpus into BM25 pickle file.

    Args:
        corpus_path: Path for BM25 pickle file (must end with .pkl).
        corpus_data: Corpus DataFrame with doc_id and contents columns.
        bm25_tokenizer: Tokenizer name (porter_stemmer, ko_kiwi, space, etc.).
    """

async def vectordb_ingest_api(
    vectordb: BaseVectorStore,
    corpus_data: pd.DataFrame,
) -> None:
    """Ingest corpus into vectordb using API-based embedding (OpenAI)."""

def vectordb_ingest_huggingface(
    vectordb: BaseVectorStore,
    corpus_data: pd.DataFrame,
) -> None:
    """Ingest corpus into vectordb using local HuggingFace embedding model."""

Import

from autorag.nodes.lexicalretrieval.bm25 import bm25_ingest
from autorag.nodes.semanticretrieval.vectordb import vectordb_ingest_api, vectordb_ingest_huggingface

I/O Contract

Inputs

Name | Type | Required | Description
corpus_path | str | Yes (BM25) | Path for BM25 pickle file
corpus_data | pd.DataFrame | Yes | Corpus with doc_id and contents columns
bm25_tokenizer | str | No | Tokenizer name (default: porter_stemmer)
vectordb | BaseVectorStore | Yes (VectorDB) | Vector store instance
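
When calling bm25_ingest by hand, it can help to check the inputs against this contract first. The helper below is a hypothetical pre-flight check, not part of AutoRAG; it takes the column names as a plain iterable so that it runs without pandas (with a DataFrame, corpus_data.columns could be passed).

```python
from typing import Iterable, List

# Columns the I/O contract expects in the corpus.
REQUIRED_COLUMNS = ("doc_id", "contents")


def check_bm25_inputs(corpus_path: str, columns: Iterable[str]) -> List[str]:
    """Return a list of contract violations; an empty list means OK."""
    problems: List[str] = []
    if not corpus_path.endswith(".pkl"):
        problems.append("corpus_path must end with .pkl")
    present = set(columns)
    for name in REQUIRED_COLUMNS:
        if name not in present:
            problems.append(f"missing column: {name}")
    return problems


ok = check_bm25_inputs("./resources/bm25_porter_stemmer.pkl", ["doc_id", "contents"])
bad = check_bm25_inputs("./resources/bm25.pickle", ["doc_id", "text"])
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue at once before starting a potentially long tokenization run.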

Outputs

Name | Type | Description
BM25 pickle | File | Pickle at resources/bm25_*.pkl with tokenized corpus and passage IDs
Vector embeddings | VectorDB | Embeddings ingested into configured vector store

Usage Examples

Manual BM25 Ingestion

import pandas as pd
from autorag.nodes.lexicalretrieval.bm25 import bm25_ingest

corpus_data = pd.read_parquet("./data/corpus.parquet")
bm25_ingest(
    corpus_path="./resources/bm25_porter_stemmer.pkl",
    corpus_data=corpus_data,
    bm25_tokenizer="porter_stemmer"
)

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
