| Knowledge Sources | Domains | Last Updated |
| --- | --- | --- |
| explodinggradients/ragas | LLM Evaluation, Test Data Generation, Document Processing | 2026-02-10 |
Overview
Description
The Document Loader Interface page describes the document types Ragas accepts from LangChain and LlamaIndex. Ragas does not implement its own document loaders; instead, it consumes document objects produced by these external frameworks and converts them into internal Node objects for knowledge graph construction. This page documents the external APIs that Ragas depends on and how documents flow through the conversion process.
Usage
Users load documents with their preferred framework's loaders and pass the resulting objects to the appropriate TestsetGenerator method. Ragas handles the conversion internally.
Code Reference
Source Location
| Component | File | Lines | Description |
| --- | --- | --- | --- |
| LangChain document conversion | src/ragas/testset/synthesizers/generate.py | L193-203 | Converts LCDocument objects to Node(type=DOCUMENT) |
| LlamaIndex document conversion | src/ragas/testset/synthesizers/generate.py | L272-283 | Converts LlamaIndexDocument objects to Node(type=DOCUMENT), filtering empty text |
| Pre-chunked document conversion | src/ragas/testset/synthesizers/generate.py | L377-395 | Converts strings or LCDocument objects to Node(type=CHUNK) |
External Dependencies
| Package | Module | Class | Key Fields |
| --- | --- | --- | --- |
| langchain-core | langchain_core.documents | Document | page_content: str, metadata: dict |
| llama-index-core | llama_index.core.schema | Document | text: str, metadata: dict |
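The two frameworks expose the document text under different field names (`page_content` vs. `text`), which is why Ragas needs a separate conversion path for each. A minimal sketch of normalizing over that difference, using hypothetical stand-in dataclasses rather than the real external classes:

```python
from dataclasses import dataclass, field

# Stand-ins for the two external Document classes (hypothetical,
# for illustration only; the real classes live in
# langchain_core.documents and llama_index.core.schema).
@dataclass
class LCDocument:
    page_content: str
    metadata: dict = field(default_factory=dict)

@dataclass
class LlamaIndexDocument:
    text: str
    metadata: dict = field(default_factory=dict)

def get_text(doc) -> str:
    """Normalize over the two field names: page_content vs. text."""
    return doc.page_content if hasattr(doc, "page_content") else doc.text

print(get_text(LCDocument(page_content="hello")))
print(get_text(LlamaIndexDocument(text="world")))
```

The `get_text` helper is an assumption for illustration; Ragas itself handles each framework in its own dedicated method rather than with a shared accessor.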
Import
```python
# LangChain documents
from langchain_core.documents import Document

# LlamaIndex documents
from llama_index.core import Document
# or
from llama_index.core.schema import Document
```
Consuming Methods on TestsetGenerator
```python
# LangChain path
TestsetGenerator.generate_with_langchain_docs(
    documents: Sequence[langchain_core.documents.Document],
    testset_size: int,
    ...
) -> Union[Testset, Executor]

# LlamaIndex path
TestsetGenerator.generate_with_llamaindex_docs(
    documents: Sequence[llama_index.core.schema.Document],
    testset_size: int,
    ...
) -> Testset

# Pre-chunked path (strings or LangChain Documents)
TestsetGenerator.generate_with_chunks(
    chunks: Sequence[Union[langchain_core.documents.Document, str]],
    testset_size: int,
    ...
) -> Union[Testset, Executor]
```
I/O Contract
LangChain Document
| Field | Type | Description |
| --- | --- | --- |
| page_content | str | The text content of the document |
| metadata | dict | Arbitrary metadata (source path, page number, etc.) |
LlamaIndex Document
| Field | Type | Description |
| --- | --- | --- |
| text | str | The text content of the document (filtered: empty/whitespace-only text is skipped) |
| metadata | dict | Arbitrary metadata |
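The filtering rule for LlamaIndex documents can be sketched as a simple `strip()` check, a plausible reading of "empty/whitespace-only text is skipped" (plain dicts stand in for the real Document objects here):

```python
# Hypothetical input: dicts mimicking LlamaIndex Documents.
docs = [
    {"text": "Useful content.", "metadata": {"source": "a"}},
    {"text": "   ", "metadata": {"source": "b"}},  # whitespace-only: dropped
    {"text": "", "metadata": {"source": "c"}},     # empty: dropped
]

# Keep only documents whose text survives a strip().
kept = [d for d in docs if d["text"].strip()]
print([d["metadata"]["source"] for d in kept])  # ['a']
```

In practice this means silently losing blank pages or failed OCR output, so a count check (`len(documents)` before vs. after generation) is a useful sanity step.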
Internal Conversion Output
For both document types, Ragas creates:
```python
Node(
    type=NodeType.DOCUMENT,  # or NodeType.CHUNK for generate_with_chunks()
    properties={
        "page_content": <document text>,
        "document_metadata": <document metadata dict>,
    },
)
```
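The mapping above can be replicated in a few lines. This is a minimal sketch using stand-in `Node`/`NodeType` classes (hypothetical; the real ones live inside Ragas), not the actual implementation:

```python
from dataclasses import dataclass
from enum import Enum

# Minimal stand-ins mirroring the Node/NodeType shapes shown above.
class NodeType(Enum):
    DOCUMENT = "document"
    CHUNK = "chunk"

@dataclass
class Node:
    type: NodeType
    properties: dict

def to_node(text: str, metadata: dict, node_type: NodeType = NodeType.DOCUMENT) -> Node:
    # Mirrors the conversion output: document text becomes
    # "page_content", the metadata dict becomes "document_metadata".
    return Node(
        type=node_type,
        properties={"page_content": text, "document_metadata": metadata},
    )

node = to_node("Some text.", {"source": "report.pdf"})
print(node.properties["page_content"])  # Some text.
```

Because both frameworks' documents collapse to the same property names, everything downstream of this point in Ragas is framework-agnostic.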
Usage Examples
Loading Documents With LangChain
```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset import TestsetGenerator

# Load PDFs from a directory
loader = DirectoryLoader(
    "./documents/",
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
)
documents = loader.load()
print(f"Loaded {len(documents)} document pages")

# Each document has page_content and metadata
print(documents[0].page_content[:100])
print(documents[0].metadata)
# {'source': './documents/report.pdf', 'page': 0}

# Pass to Ragas
generator = TestsetGenerator.from_langchain(
    llm=ChatOpenAI(model="gpt-4o"),
    embedding_model=OpenAIEmbeddings(),
)
testset = generator.generate_with_langchain_docs(
    documents=documents,
    testset_size=20,
)
```
Loading Documents With LlamaIndex
```python
from llama_index.core import SimpleDirectoryReader, Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

from ragas.testset import TestsetGenerator

# Load documents
documents = SimpleDirectoryReader("./data/").load_data()
print(f"Loaded {len(documents)} documents")

# Each document has text and metadata
print(documents[0].text[:100])
print(documents[0].metadata)

# Pass to Ragas
generator = TestsetGenerator.from_llama_index(
    llm=OpenAI(model="gpt-4o"),
    embedding_model=OpenAIEmbedding(),
)
testset = generator.generate_with_llamaindex_docs(
    documents=documents,
    testset_size=20,
)
```
Using Pre-Chunked Content
```python
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset import TestsetGenerator

# Pre-chunked as plain strings
chunks_as_strings = [
    "Neural networks consist of layers of interconnected nodes.",
    "Each connection has a weight that is adjusted during training.",
    "Backpropagation computes the gradient of the loss function.",
]

# Or as LangChain Documents with metadata
chunks_as_docs = [
    Document(page_content="Neural networks consist of layers.", metadata={"source": "ch1"}),
    Document(page_content="Backpropagation computes gradients.", metadata={"source": "ch2"}),
]

generator = TestsetGenerator.from_langchain(
    llm=ChatOpenAI(model="gpt-4o"),
    embedding_model=OpenAIEmbeddings(),
)

# Both formats work
testset = generator.generate_with_chunks(
    chunks=chunks_as_strings,  # or chunks_as_docs
    testset_size=10,
)
```
Creating Documents Manually
```python
from langchain_core.documents import Document

# Create documents from custom sources
documents = [
    Document(
        page_content="Ragas provides metrics for evaluating LLM applications.",
        metadata={"source": "ragas_docs", "section": "overview"},
    ),
    Document(
        page_content="Faithfulness measures how well the response aligns with context.",
        metadata={"source": "ragas_docs", "section": "metrics"},
    ),
]
```
Related Pages