| Knowledge Sources | Domains | Last Updated |
| --- | --- | --- |
| explodinggradients/ragas | LLM Evaluation, Test Data Generation, Document Processing | 2026-02-10 |
Overview
Description
The Document Loader Interface page describes the document types Ragas accepts from LangChain and LlamaIndex. Ragas does not implement its own document loaders; instead, it consumes document objects produced by these external frameworks and converts them into internal Node objects for knowledge graph construction. This page documents the external APIs that Ragas depends on and how documents flow through the conversion process.
Usage
Users load documents with their preferred framework's loaders and pass the resulting objects to the appropriate TestsetGenerator method. Ragas handles the conversion internally.
Code Reference
Source Location
| Component | File | Lines | Description |
| --- | --- | --- | --- |
| LangChain document conversion | src/ragas/testset/synthesizers/generate.py | L193-203 | Converts LCDocument objects to Node(type=DOCUMENT) |
| LlamaIndex document conversion | src/ragas/testset/synthesizers/generate.py | L272-283 | Converts LlamaIndexDocument objects to Node(type=DOCUMENT), filtering empty text |
| Pre-chunked document conversion | src/ragas/testset/synthesizers/generate.py | L377-395 | Converts strings or LCDocument objects to Node(type=CHUNK) |
External Dependencies
| Package | Module | Class | Key Fields |
| --- | --- | --- | --- |
| langchain-core | langchain_core.documents | Document | page_content: str, metadata: dict |
| llama-index-core | llama_index.core.schema | Document | text: str, metadata: dict |
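The two frameworks expose the document text under different field names (`page_content` vs. `text`), which is why Ragas needs a separate conversion path for each. A minimal sketch of normalizing over that difference, using hypothetical stand-in dataclasses rather than the real external classes:

```python
from dataclasses import dataclass, field

# Stand-ins for the two external Document classes (hypothetical,
# for illustration only; the real classes live in
# langchain_core.documents and llama_index.core.schema).
@dataclass
class LCDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)

@dataclass
class LlamaIndexDocument:
    text: str
    metadata: dict = field(default_factory=dict)

def get_text(doc) -> str:
    """Normalize over the two field names: page_content vs. text."""
    return doc.page_content if hasattr(doc, "page_content") else doc.text

print(get_text(LCDocument(page_content="hello")))
print(get_text(LlamaIndexDocument(text="world")))
```

The `get_text` helper is an assumption for illustration; Ragas itself handles each framework in its own dedicated method rather than with a shared accessor.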
Import
```python
# LangChain documents
from langchain_core.documents import Document

# LlamaIndex documents
from llama_index.core import Document
# or
from llama_index.core.schema import Document
```
Consuming Methods on TestsetGenerator
```python
# LangChain path
TestsetGenerator.generate_with_langchain_docs(
    documents: Sequence[langchain_core.documents.Document],
    testset_size: int,
    ...
) -> Union[Testset, Executor]

# LlamaIndex path
TestsetGenerator.generate_with_llamaindex_docs(
    documents: Sequence[llama_index.core.schema.Document],
    testset_size: int,
    ...
) -> Testset

# Pre-chunked path (strings or LangChain Documents)
TestsetGenerator.generate_with_chunks(
    chunks: Sequence[Union[langchain_core.documents.Document, str]],
    testset_size: int,
    ...
) -> Union[Testset, Executor]
```
I/O Contract
LangChain Document
| Field | Type | Description |
| --- | --- | --- |
| page_content | str | The text content of the document |
| metadata | dict | Arbitrary metadata (source path, page number, etc.) |
LlamaIndex Document
| Field | Type | Description |
| --- | --- | --- |
| text | str | The text content of the document (filtered: empty/whitespace-only text is skipped) |
| metadata | dict | Arbitrary metadata |
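The filtering rule for LlamaIndex documents can be sketched as a simple `strip()` check, a plausible reading of "empty/whitespace-only text is skipped" (plain dicts stand in for the real Document objects here):

```python
# Hypothetical input: dicts mimicking LlamaIndex Documents.
docs = [
    {"text": "Useful content.", "metadata": {"source": "a"}},
    {"text": "   ", "metadata": {"source": "b"}},  # whitespace-only: dropped
    {"text": "", "metadata": {"source": "c"}},     # empty: dropped
]

# Keep only documents whose text survives a strip().
kept = [d for d in docs if d["text"].strip()]
print([d["metadata"]["source"] for d in kept])  # ['a']
```

In practice this means silently losing blank pages or failed OCR output, so a count check (`len(documents)` before vs. after generation) is a useful sanity step.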
Internal Conversion Output
For both document types, Ragas creates:
```python
Node(
    type=NodeType.DOCUMENT,  # or NodeType.CHUNK for generate_with_chunks()
    properties={
        "page_content": <document text>,
        "document_metadata": <document metadata dict>,
    },
)
```
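The mapping above can be replicated in a few lines. This is a minimal sketch using stand-in `Node`/`NodeType` classes (hypothetical; the real ones live inside Ragas), not the actual implementation:

```python
from dataclasses import dataclass
from enum import Enum

# Minimal stand-ins mirroring the Node/NodeType shapes shown above.
class NodeType(Enum):
    DOCUMENT = "document"
    CHUNK = "chunk"

@dataclass
class Node:
    type: NodeType
    properties: dict

def to_node(text: str, metadata: dict, node_type: NodeType = NodeType.DOCUMENT) -> Node:
    # Mirrors the conversion output: document text becomes
    # "page_content", the metadata dict becomes "document_metadata".
    return Node(
        type=node_type,
        properties={"page_content": text, "document_metadata": metadata},
    )

node = to_node("Some text.", {"source": "report.pdf"})
print(node.properties["page_content"])  # Some text.
```

Because both frameworks' documents collapse to the same property names, everything downstream of this point in Ragas is framework-agnostic.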
Usage Examples
Loading Documents With LangChain
```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset import TestsetGenerator

# Load PDFs from a directory
loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
)
documents = loader.load()
print(f"Loaded {len(documents)} document pages")

# Each document has page_content and metadata
print(documents[0].page_content[:100])
print(documents[0].metadata)
# {'source': './documents/report.pdf', 'page': 0}

# Pass to Ragas
generator = TestsetGenerator.from_langchain(
    llm=ChatOpenAI(model="gpt-4o"),
    embedding_model=OpenAIEmbeddings(),
)
testset = generator.generate_with_langchain_docs(
    documents=documents,
    testset_size=20,
)
```
Loading Documents With LlamaIndex
```python
from llama_index.core import SimpleDirectoryReader, Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

from ragas.testset import TestsetGenerator

# Load documents
documents = SimpleDirectoryReader("./data/").load_data()
print(f"Loaded {len(documents)} documents")

# Each document has text and metadata
print(documents[0].text[:100])
print(documents[0].metadata)

# Pass to Ragas
generator = TestsetGenerator.from_llama_index(
    llm=OpenAI(model="gpt-4o"),
    embedding_model=OpenAIEmbedding(),
)
testset = generator.generate_with_llamaindex_docs(
    documents=documents,
    testset_size=20,
)
```
Using Pre-Chunked Content
```python
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset import TestsetGenerator

# Pre-chunked as plain strings
chunks_as_strings = [
    "Neural networks consist of layers of interconnected nodes.",
    "Each connection has a weight that is adjusted during training.",
    "Backpropagation computes the gradient of the loss function.",
]

# Or as LangChain Documents with metadata
chunks_as_docs = [
    Document(page_content="Neural networks consist of layers.", metadata={"source": "ch1"}),
    Document(page_content="Backpropagation computes gradients.", metadata={"source": "ch2"}),
]

generator = TestsetGenerator.from_langchain(
    llm=ChatOpenAI(model="gpt-4o"),
    embedding_model=OpenAIEmbeddings(),
)

# Both formats work
testset = generator.generate_with_chunks(
    chunks=chunks_as_strings,  # or chunks_as_docs
    testset_size=10,
)
```
Creating Documents Manually
```python
from langchain_core.documents import Document

# Create documents from custom sources
documents = [
    Document(
        page_content="Ragas provides metrics for evaluating LLM applications.",
        metadata={"source": "ragas_docs", "section": "overview"},
    ),
    Document(
        page_content="Faithfulness measures how well the response aligns with context.",
        metadata={"source": "ragas_docs", "section": "metrics"},
    ),
]
```
Related Pages