Implementation:Marker Inc Korea AutoRAG Bm25 And Vectordb Ingest
| Knowledge Sources | |
|---|---|
| Domains | Information_Retrieval, Indexing |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
Concrete tools, provided by AutoRAG's retrieval modules, for ingesting corpus passages into BM25 and vector database retrieval indexes.
Description
bm25_ingest tokenizes corpus passages and stores the result as a pickle file for BM25 retrieval. It supports several tokenizers: porter_stemmer, Korean (ko_kiwi, ko_kkma, ko_okt), Japanese (sudachipy), space-based, and HuggingFace tokenizers. vectordb_ingest_api embeds passages asynchronously through an embedding API (OpenAI) and ingests them in batches. vectordb_ingest_huggingface embeds passages with a local HuggingFace model. Both vector database functions check for documents that already exist in the store to avoid inserting duplicates.
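The BM25 step described above (tokenize each passage, then pickle the tokens together with the passage IDs) can be illustrated with a minimal standard-library sketch. This is not AutoRAG's implementation: `split_on_spaces` and `sketch_bm25_ingest` are hypothetical names, the corpus is a plain list of dicts rather than a DataFrame, and the payload keys `tokens` and `passage_id` are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def split_on_spaces(text: str) -> list[str]:
    """Split on whitespace; a stand-in for a space-based tokenizer."""
    return text.split()


def sketch_bm25_ingest(corpus_path: str, corpus: list[dict]) -> None:
    """Tokenize each passage and pickle the tokens next to their IDs.

    The payload layout (keys "tokens" and "passage_id") is an assumption
    made for this sketch, not a confirmed AutoRAG file format.
    """
    if not corpus_path.endswith(".pkl"):
        raise ValueError("corpus_path must end with .pkl")
    payload = {
        "tokens": [split_on_spaces(row["contents"]) for row in corpus],
        "passage_id": [row["doc_id"] for row in corpus],
    }
    # Create the target directory if it does not exist yet.
    Path(corpus_path).parent.mkdir(parents=True, exist_ok=True)
    with open(corpus_path, "wb") as f:
        pickle.dump(payload, f)


# Tiny in-memory corpus with the two columns the real function expects.
demo_corpus = [
    {"doc_id": "d1", "contents": "BM25 ranks passages"},
    {"doc_id": "d2", "contents": "Vectors rank by similarity"},
]
demo_path = str(Path(tempfile.mkdtemp()) / "bm25_space.pkl")
sketch_bm25_ingest(demo_path, demo_corpus)

# Read the file back to confirm tokens and IDs stay aligned by position.
with open(demo_path, "rb") as f:
    loaded = pickle.load(f)
```

Keeping tokens and IDs in parallel lists means a BM25 score computed for position i can be mapped straight back to the passage ID at position i.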
Usage
These functions are called automatically by Evaluator.start_trial when the pipeline config includes retrieval modules. They can also be called directly for manual index management.
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/nodes/lexicalretrieval/bm25.py (bm25_ingest), autorag/nodes/semanticretrieval/vectordb.py (vectordb_ingest_api, vectordb_ingest_huggingface)
- Lines: bm25.py L324-354, vectordb.py L242-259 (API), vectordb.py L265-292 (HuggingFace)
Signature
def bm25_ingest(
    corpus_path: str,
    corpus_data: pd.DataFrame,
    bm25_tokenizer: str = "porter_stemmer"
) -> None:
    """
    Ingest corpus into BM25 pickle file.

    Args:
        corpus_path: Path for BM25 pickle file (must end with .pkl).
        corpus_data: Corpus DataFrame with doc_id and contents columns.
        bm25_tokenizer: Tokenizer name (porter_stemmer, ko_kiwi, space, etc.).
    """

async def vectordb_ingest_api(
    vectordb: BaseVectorStore,
    corpus_data: pd.DataFrame,
) -> None:
    """Ingest corpus into vectordb using API-based embedding (OpenAI)."""

def vectordb_ingest_huggingface(
    vectordb: BaseVectorStore,
    corpus_data: pd.DataFrame,
) -> None:
    """Ingest corpus into vectordb using local HuggingFace embedding model."""
Import
from autorag.nodes.lexicalretrieval.bm25 import bm25_ingest
from autorag.nodes.semanticretrieval.vectordb import vectordb_ingest_api, vectordb_ingest_huggingface
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| corpus_path | str | Yes (BM25) | Path for BM25 pickle file |
| corpus_data | pd.DataFrame | Yes | Corpus with doc_id and contents columns |
| bm25_tokenizer | str | No | Tokenizer name (default: porter_stemmer) |
| vectordb | BaseVectorStore | Yes (VectorDB) | Vector store instance |
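The input requirements in the table can be checked before calling any ingest function. The helper below is a hypothetical pre-flight check, not part of AutoRAG: it assumes only what this page states, namely that the corpus needs doc_id and contents columns and that the BM25 path must end with .pkl.

```python
REQUIRED_COLUMNS = ("doc_id", "contents")


def check_ingest_inputs(columns, corpus_path=None) -> list[str]:
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = [
        f"missing column: {name}"
        for name in REQUIRED_COLUMNS
        if name not in columns
    ]
    # corpus_path only applies to BM25 ingestion, so it is optional here.
    if corpus_path is not None and not corpus_path.endswith(".pkl"):
        problems.append("corpus_path must end with .pkl")
    return problems


ok = check_ingest_inputs(
    ["doc_id", "contents", "metadata"],
    "./resources/bm25_porter_stemmer.pkl",
)
bad = check_ingest_inputs(["doc_id"], "./resources/bm25.pickle")
```

With a pandas corpus, `corpus_data.columns` can be passed as the first argument, since membership tests on a column index check the column labels.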
Outputs
| Name | Type | Description |
|---|---|---|
| BM25 pickle | File | Pickle at resources/bm25_*.pkl with tokenized corpus and passage IDs |
| Vector embeddings | VectorDB | Embeddings ingested into configured vector store |
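The vector database path (skip documents already in the store, then embed and add the rest in batches) can be sketched with the standard library alone. Everything here is illustrative: `InMemoryStore` is a hypothetical stand-in rather than AutoRAG's BaseVectorStore, its `is_exist` and `add` methods are assumed for this sketch, the "embedding" is a fake number per text instead of an API call, and the batch size is arbitrary.

```python
import asyncio


class InMemoryStore:
    """Hypothetical stand-in for a vector store; not AutoRAG's BaseVectorStore."""

    def __init__(self) -> None:
        self.vectors: dict[str, list[float]] = {}
        self.add_calls = 0  # counts batches, to make batching observable

    async def is_exist(self, ids: list[str]) -> list[bool]:
        return [doc_id in self.vectors for doc_id in ids]

    async def add(self, ids: list[str], texts: list[str]) -> None:
        # Fake "embedding": one number per text, standing in for an API call.
        self.add_calls += 1
        for doc_id, text in zip(ids, texts):
            self.vectors[doc_id] = [float(len(text))]


async def sketch_ingest(store: InMemoryStore, rows: list[dict], batch_size: int = 2) -> int:
    """Add only rows whose doc_id is missing from the store, in batches."""
    ids = [row["doc_id"] for row in rows]
    exists = await store.is_exist(ids)
    new_rows = [row for row, seen in zip(rows, exists) if not seen]
    for start in range(0, len(new_rows), batch_size):
        batch = new_rows[start:start + batch_size]
        await store.add(
            [row["doc_id"] for row in batch],
            [row["contents"] for row in batch],
        )
    return len(new_rows)


rows = [
    {"doc_id": "d1", "contents": "first passage"},
    {"doc_id": "d2", "contents": "second passage"},
    {"doc_id": "d3", "contents": "third passage"},
]
store = InMemoryStore()
first_added = asyncio.run(sketch_ingest(store, rows))   # all three rows are new
second_added = asyncio.run(sketch_ingest(store, rows))  # nothing left to add
```

Because the existence check runs before any embedding work, a second ingestion over the same corpus does no embedding at all in this sketch, which is the practical benefit of the duplicate check.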
Usage Examples
Manual BM25 Ingestion
import pandas as pd
from autorag.nodes.lexicalretrieval.bm25 import bm25_ingest

corpus_data = pd.read_parquet("./data/corpus.parquet")
bm25_ingest(
    corpus_path="./resources/bm25_porter_stemmer.pkl",
    corpus_data=corpus_data,
    bm25_tokenizer="porter_stemmer"
)
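After a manual ingestion it can be useful to confirm that a pickle was written and to see what it holds at the top level. The helper below is a hypothetical inspection utility that makes no assumption about the file's layout; the demo writes placeholder data to a temporary file rather than reading a real BM25 index.

```python
import pickle
import tempfile
from pathlib import Path


def describe_pickle(path: str) -> tuple[str, list[str]]:
    """Return the top-level type name and, for a dict, its sorted keys."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    keys = sorted(str(k) for k in obj) if isinstance(obj, dict) else []
    return type(obj).__name__, keys


# Demo on a throwaway file; this dict is placeholder data, not the real layout.
demo_path = Path(tempfile.mkdtemp()) / "demo.pkl"
with open(demo_path, "wb") as f:
    pickle.dump({"b": [1], "a": [2]}, f)

summary = describe_pickle(str(demo_path))
```

Unpickling executes code embedded in the file, so this kind of inspection should only be run on pickle files from a trusted source.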
Related Pages
Implements Principle
Requires Environment
- Environment:Marker_Inc_Korea_AutoRAG_Python_3_10_Runtime
- Environment:Marker_Inc_Korea_AutoRAG_GPU_PyTorch_Environment
- Environment:Marker_Inc_Korea_AutoRAG_Vector_Database_Backends