
Implementation:Explodinggradients Ragas Document Loader Interface

From Leeroopedia


Knowledge Sources: explodinggradients/ragas
Domains: LLM Evaluation, Test Data Generation, Document Processing
Last Updated: 2026-02-10

Overview

Description

The Document Loader Interface is an external tool documentation page describing the document types accepted by Ragas from LangChain and LlamaIndex. Ragas does not implement its own document loaders. Instead, it consumes document objects from these external frameworks and converts them into internal Node objects for knowledge graph construction. This page documents the external APIs that Ragas depends on and how documents flow through the conversion process.

Usage

Users load documents with their preferred framework's loaders and pass the resulting objects to the appropriate TestsetGenerator method. Ragas handles the conversion internally.

Code Reference

Source Location

Component | File | Lines | Description
LangChain document conversion | src/ragas/testset/synthesizers/generate.py | L193-203 | Converts LCDocument objects to Node(type=DOCUMENT)
LlamaIndex document conversion | src/ragas/testset/synthesizers/generate.py | L272-283 | Converts LlamaIndexDocument objects to Node(type=DOCUMENT), filtering empty text
Pre-chunked document conversion | src/ragas/testset/synthesizers/generate.py | L377-395 | Converts strings or LCDocument objects to Node(type=CHUNK)

External Dependencies

Package | Module | Class | Key Fields
langchain-core | langchain_core.documents | Document | page_content: str, metadata: dict
llama-index-core | llama_index.core.schema | Document | text: str, metadata: dict

Import

# LangChain documents
from langchain_core.documents import Document

# LlamaIndex documents
from llama_index.core import Document
# or
from llama_index.core.schema import Document

Consuming Methods on TestsetGenerator

# LangChain path
TestsetGenerator.generate_with_langchain_docs(
    documents: Sequence[langchain_core.documents.Document],
    testset_size: int,
    ...
) -> Union[Testset, Executor]

# LlamaIndex path
TestsetGenerator.generate_with_llamaindex_docs(
    documents: Sequence[llama_index.core.schema.Document],
    testset_size: int,
    ...
) -> Testset

# Pre-chunked path (strings or LangChain Documents)
TestsetGenerator.generate_with_chunks(
    chunks: Sequence[Union[langchain_core.documents.Document, str]],
    testset_size: int,
    ...
) -> Union[Testset, Executor]

I/O Contract

LangChain Document

Field | Type | Description
page_content | str | The text content of the document
metadata | dict | Arbitrary metadata (source path, page number, etc.)
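The contract is small enough to mimic with a plain dataclass. The sketch below is a hypothetical stand-in, not the real langchain_core class, and exists only to make the two fields Ragas reads concrete:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for langchain_core.documents.Document,
# illustrating the two fields Ragas reads during conversion.
@dataclass
class Doc:
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = Doc(
    page_content="Ragas provides metrics for evaluating LLM applications.",
    metadata={"source": "ragas_docs", "page": 0},
)
print(doc.page_content[:5])       # "Ragas"
print(doc.metadata["source"])     # "ragas_docs"
```

Any object exposing these two attributes satisfies the shape Ragas consumes; in practice you would use the real langchain_core.documents.Document shown in the examples below.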

LlamaIndex Document

Field | Type | Description
text | str | The text content of the document (filtered: empty/whitespace-only text is skipped)
metadata | dict | Arbitrary metadata
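The empty-text filter noted above can be illustrated with plain strings. This is a sketch of the documented behavior, not Ragas's actual code: documents whose text is empty or whitespace-only never become Nodes.

```python
# Sketch of the documented filtering behavior for LlamaIndex
# documents: empty or whitespace-only text is skipped.
texts = ["Backpropagation computes gradients.", "", "   \n\t"]
kept = [t for t in texts if t.strip()]
print(len(kept))  # 1
```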

Internal Conversion Output

For both document types, Ragas creates:

Node(
    type=NodeType.DOCUMENT,  # or NodeType.CHUNK for generate_with_chunks()
    properties={
        "page_content": <document text>,
        "document_metadata": <document metadata dict>,
    },
)
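As a rough sketch of that conversion, with plain dicts standing in for Ragas's Node class (the property names are taken from the structure above; the helper name is hypothetical):

```python
# Plain-dict sketch of the document-to-node conversion. Ragas's
# real Node is a richer class, but the property names match the
# structure documented above.
def to_node(text: str, metadata: dict, node_type: str = "DOCUMENT") -> dict:
    return {
        "type": node_type,
        "properties": {
            "page_content": text,
            "document_metadata": metadata,
        },
    }

node = to_node("Neural networks consist of layers.", {"source": "ch1"})
print(node["type"])                        # "DOCUMENT"
print(node["properties"]["page_content"])  # "Neural networks consist of layers."
```

For generate_with_chunks(), the same structure is produced with type CHUNK instead of DOCUMENT.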

Usage Examples

Loading Documents With LangChain

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_core.documents import Document
from ragas.testset import TestsetGenerator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load PDFs from a directory
loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
)
documents = loader.load()
print(f"Loaded {len(documents)} document pages")

# Each document has page_content and metadata
print(documents[0].page_content[:100])
print(documents[0].metadata)
# {'source': './documents/report.pdf', 'page': 0}

# Pass to Ragas
generator = TestsetGenerator.from_langchain(
    llm=ChatOpenAI(model="gpt-4o"),
    embedding_model=OpenAIEmbeddings(),
)
testset = generator.generate_with_langchain_docs(
    documents=documents,
    testset_size=20,
)

Loading Documents With LlamaIndex

from llama_index.core import SimpleDirectoryReader, Document
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from ragas.testset import TestsetGenerator

# Load documents
documents = SimpleDirectoryReader("./data/").load_data()
print(f"Loaded {len(documents)} documents")

# Each document has text and metadata
print(documents[0].text[:100])
print(documents[0].metadata)

# Pass to Ragas
generator = TestsetGenerator.from_llama_index(
    llm=OpenAI(model="gpt-4o"),
    embedding_model=OpenAIEmbedding(),
)
testset = generator.generate_with_llamaindex_docs(
    documents=documents,
    testset_size=20,
)

Using Pre-Chunked Content

from langchain_core.documents import Document
from ragas.testset import TestsetGenerator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Pre-chunked as plain strings
chunks_as_strings = [
    "Neural networks consist of layers of interconnected nodes.",
    "Each connection has a weight that is adjusted during training.",
    "Backpropagation computes the gradient of the loss function.",
]

# Or as LangChain Documents with metadata
chunks_as_docs = [
    Document(page_content="Neural networks consist of layers.", metadata={"source": "ch1"}),
    Document(page_content="Backpropagation computes gradients.", metadata={"source": "ch2"}),
]

generator = TestsetGenerator.from_langchain(
    llm=ChatOpenAI(model="gpt-4o"),
    embedding_model=OpenAIEmbeddings(),
)

# Both formats work
testset = generator.generate_with_chunks(
    chunks=chunks_as_strings,  # or chunks_as_docs
    testset_size=10,
)

Creating Documents Manually

from langchain_core.documents import Document

# Create documents from custom sources
documents = [
    Document(
        page_content="Ragas provides metrics for evaluating LLM applications.",
        metadata={"source": "ragas_docs", "section": "overview"},
    ),
    Document(
        page_content="Faithfulness measures how well the response aligns with context.",
        metadata={"source": "ragas_docs", "section": "metrics"},
    ),
]
