
Implementation:Vibrantlabsai Ragas LangChain Document Loader

From Leeroopedia
Source langchain_core.documents.Document (external)
Domain Testset Generation
Last Updated 2026-02-12 00:00 GMT
Type External Tool Doc

Overview

This page documents how Ragas uses LangChain and LlamaIndex document loaders as the primary mechanism for ingesting source material into the testset generation pipeline. Ragas does not implement its own document loading logic; instead, it accepts documents that conform to the langchain_core.documents.Document interface (or the equivalent LlamaIndex Document class) and converts them into internal Node objects for knowledge graph construction.

The relevant integration is found in the TestsetGenerator class at:

src/ragas/testset/synthesizers/generate.py

External Dependency

The LangChain document model is imported as:

from langchain_core.documents import Document as LCDocument

An LCDocument object has two primary attributes:

  • page_content (str): The textual content of the document.
  • metadata (dict): A dictionary of arbitrary metadata (source path, page number, author, etc.).

LlamaIndex documents are similarly accepted via their Document class, which exposes a text attribute and a metadata dictionary.
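Because Ragas only reads these two attributes, any object exposing them can serve as a document. The sketch below uses a hypothetical stand-in dataclass (not the real LCDocument class) to illustrate the minimal shape Ragas expects:

```python
from dataclasses import dataclass, field

@dataclass
class MinimalDoc:
    # Mirrors the two attributes Ragas reads from an LCDocument
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = MinimalDoc(
    page_content="Ragas generates synthetic test sets.",
    metadata={"source": "notes.md", "page": 1},
)
print(doc.page_content)
print(doc.metadata["source"])
```

In practice you would pass real `langchain_core.documents.Document` instances produced by a loader; the stand-in only shows the interface contract.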

Usage in Ragas

Loading Documents with LangChain

Users load documents using any LangChain-compatible loader and then pass them to TestsetGenerator.generate_with_langchain_docs:

from langchain_community.document_loaders import DirectoryLoader
from ragas.testset.synthesizers.generate import TestsetGenerator

# Load documents using LangChain
loader = DirectoryLoader("./my_documents", glob="**/*.pdf")
documents = loader.load()

# Create a TestsetGenerator and generate test data
generator = TestsetGenerator.from_langchain(llm=my_llm, embedding_model=my_embeddings)
testset = generator.generate_with_langchain_docs(
    documents=documents,
    testset_size=50,
)

Loading Documents with LlamaIndex

For LlamaIndex users, the equivalent entry point is generate_with_llamaindex_docs:

from llama_index.core import SimpleDirectoryReader
from ragas.testset.synthesizers.generate import TestsetGenerator

# Load documents using LlamaIndex
documents = SimpleDirectoryReader("./my_documents").load_data()

# Create a TestsetGenerator and generate test data
generator = TestsetGenerator.from_llama_index(llm=my_llm, embedding_model=my_embeddings)
testset = generator.generate_with_llamaindex_docs(
    documents=documents,
    testset_size=50,
)

Loading Pre-Chunked Text

If documents are already chunked, users can pass them directly:

chunks = ["First chunk of text...", "Second chunk of text..."]
testset = generator.generate_with_chunks(
    chunks=chunks,
    testset_size=50,
)

Internal Conversion

Inside generate_with_langchain_docs, each LangChain document is converted to a Ragas Node:

from ragas.testset.graph import KnowledgeGraph, Node, NodeType

nodes = []
for doc in documents:
    node = Node(
        type=NodeType.DOCUMENT,
        properties={
            "page_content": doc.page_content,
            "document_metadata": doc.metadata,
        },
    )
    nodes.append(node)

kg = KnowledgeGraph(nodes=nodes)

For LlamaIndex documents, the conversion is similar but reads from doc.text instead of doc.page_content, and filters out documents with empty or None text content.
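The LlamaIndex-side conversion can be sketched with plain dictionaries standing in for LlamaIndex documents (the dict shape is an assumption for illustration; the filtering behavior is as described above):

```python
# Each dict stands in for a LlamaIndex document with `text` and `metadata`.
docs = [
    {"text": "First document body.", "metadata": {"source": "a.md"}},
    {"text": "", "metadata": {"source": "b.md"}},    # filtered out
    {"text": None, "metadata": {"source": "c.md"}},  # filtered out
]

nodes = []
for doc in docs:
    if not doc["text"]:
        continue  # skip documents with empty or None text content
    nodes.append({
        "type": "DOCUMENT",
        "properties": {
            "page_content": doc["text"],
            "document_metadata": doc["metadata"],
        },
    })

print(len(nodes))  # only the non-empty document survives
```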

For pre-chunked input via generate_with_chunks, strings are converted directly and assigned NodeType.CHUNK rather than NodeType.DOCUMENT.
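The chunk path can be sketched the same way (again with dicts standing in for Ragas Node objects; the CHUNK-vs-DOCUMENT distinction is as described above):

```python
# Plain strings become CHUNK nodes with the string as page_content.
chunks = ["First chunk of text...", "Second chunk of text..."]

chunk_nodes = [
    {"type": "CHUNK", "properties": {"page_content": chunk}}
    for chunk in chunks
]

print([n["type"] for n in chunk_nodes])
```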

Key Method Signatures

def generate_with_langchain_docs(
    self,
    documents: t.Sequence[LCDocument],
    testset_size: int,
    transforms: t.Optional[Transforms] = None,
    transforms_llm: t.Optional[BaseRagasLLM] = None,
    transforms_embedding_model: t.Optional[BaseRagasEmbeddings] = None,
    query_distribution: t.Optional[QueryDistribution] = None,
    run_config: t.Optional[RunConfig] = None,
    callbacks: t.Optional[Callbacks] = None,
    token_usage_parser: t.Optional[TokenUsageParser] = None,
    with_debugging_logs=False,
    raise_exceptions: bool = True,
    return_executor: bool = False,
) -> t.Union[Testset, Executor]:
def generate_with_llamaindex_docs(
    self,
    documents: t.Sequence[LlamaIndexDocument],
    testset_size: int,
    transforms: t.Optional[Transforms] = None,
    transforms_llm: t.Optional[LlamaIndexLLM] = None,
    transforms_embedding_model: t.Optional[LlamaIndexEmbedding] = None,
    query_distribution: t.Optional[QueryDistribution] = None,
    run_config: t.Optional[RunConfig] = None,
    callbacks: t.Optional[Callbacks] = None,
    token_usage_parser: t.Optional[TokenUsageParser] = None,
    with_debugging_logs=False,
    raise_exceptions: bool = True,
):

Supported Loaders

Because Ragas accepts any conforming document object, the full range of LangChain and LlamaIndex loaders is supported. Common examples include:

  • LangChain: DirectoryLoader, PyPDFLoader, WebBaseLoader, CSVLoader, UnstructuredFileLoader
  • LlamaIndex: SimpleDirectoryReader, PDFReader, WikipediaReader, DatabaseReader
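Since any of these loaders can feed the same pipeline, a common pattern is routing files to a loader by extension. The helper below is purely illustrative (the routing convention is not part of Ragas; only the loader class names are real LangChain classes):

```python
from pathlib import Path

# Hypothetical extension-to-loader routing table for illustration.
LOADER_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".txt": "UnstructuredFileLoader",
}

def pick_loader(path: str) -> str:
    """Return the loader class name for a file, falling back to a generic one."""
    suffix = Path(path).suffix.lower()
    return LOADER_BY_SUFFIX.get(suffix, "UnstructuredFileLoader")

print(pick_loader("report.PDF"))
print(pick_loader("notes.unknown"))
```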
