Implementation: Vibrantlabsai Ragas LangChain Document Loader
| Source | langchain_core.documents.Document (external) |
| Domain | Testset Generation |
| Last Updated | 2026-02-12 00:00 GMT |
| Type | External Tool Doc |
Overview
This page documents how Ragas uses LangChain and LlamaIndex document loaders as the primary mechanism for ingesting source material into the testset generation pipeline. Ragas does not implement its own document loading logic; instead, it accepts documents that conform to the langchain_core.documents.Document interface (or the equivalent LlamaIndex Document class) and converts them into internal Node objects for knowledge graph construction.
The relevant integration is found in the TestsetGenerator class at:
src/ragas/testset/synthesizers/generate.py
External Dependency
The LangChain document model is imported as:
from langchain_core.documents import Document as LCDocument
An LCDocument object has two primary attributes:
- page_content (str): The textual content of the document.
- metadata (dict): A dictionary of arbitrary metadata (source path, page number, author, etc.).
LlamaIndex documents are similarly accepted via their Document class, which exposes a text attribute and a metadata dictionary.
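To make the contract explicit, the interface Ragas relies on can be sketched as a minimal dataclass. This is a stand-in for illustration only, not the real langchain_core class; the name DocumentSketch is invented here:

```python
from dataclasses import dataclass, field


@dataclass
class DocumentSketch:
    """Minimal stand-in for langchain_core.documents.Document.

    Only the two attributes Ragas reads are modeled here.
    """

    page_content: str  # the document's textual content
    metadata: dict = field(default_factory=dict)  # source path, page number, etc.


doc = DocumentSketch(
    page_content="Ragas generates synthetic test sets.",
    metadata={"source": "notes.md", "page": 1},
)
```

Any object exposing these two attributes, however loaded, satisfies what the generation entry points consume.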
Usage in Ragas
Loading Documents with LangChain
Users load documents using any LangChain-compatible loader and then pass them to TestsetGenerator.generate_with_langchain_docs:
from langchain_community.document_loaders import DirectoryLoader
from ragas.testset.synthesizers.generate import TestsetGenerator
# Load documents using LangChain
loader = DirectoryLoader("./my_documents", glob="**/*.pdf")
documents = loader.load()
# Create a TestsetGenerator and generate test data
generator = TestsetGenerator.from_langchain(llm=my_llm, embedding_model=my_embeddings)
testset = generator.generate_with_langchain_docs(
documents=documents,
testset_size=50,
)
Loading Documents with LlamaIndex
For LlamaIndex users, the equivalent entry point is generate_with_llamaindex_docs:
from llama_index.core import SimpleDirectoryReader
from ragas.testset.synthesizers.generate import TestsetGenerator
# Load documents using LlamaIndex
documents = SimpleDirectoryReader("./my_documents").load_data()
# Create a TestsetGenerator and generate test data
generator = TestsetGenerator.from_llama_index(llm=my_llm, embedding_model=my_embeddings)
testset = generator.generate_with_llamaindex_docs(
documents=documents,
testset_size=50,
)
Loading Pre-Chunked Text
If documents are already chunked, users can pass them directly:
chunks = ["First chunk of text...", "Second chunk of text..."]
testset = generator.generate_with_chunks(
chunks=chunks,
testset_size=50,
)
Internal Conversion
Inside generate_with_langchain_docs, each LangChain document is converted to a Ragas Node:
from ragas.testset.graph import KnowledgeGraph, Node, NodeType

nodes = []
for doc in documents:
    node = Node(
        type=NodeType.DOCUMENT,
        properties={
            "page_content": doc.page_content,
            "document_metadata": doc.metadata,
        },
    )
    nodes.append(node)

kg = KnowledgeGraph(nodes=nodes)
For LlamaIndex documents, the conversion is similar but reads from doc.text instead of doc.page_content, and filters out documents with empty or None text content.
For pre-chunked input via generate_with_chunks, strings are converted directly and assigned NodeType.CHUNK rather than NodeType.DOCUMENT.
Key Method Signatures
def generate_with_langchain_docs(
self,
documents: t.Sequence[LCDocument],
testset_size: int,
transforms: t.Optional[Transforms] = None,
transforms_llm: t.Optional[BaseRagasLLM] = None,
transforms_embedding_model: t.Optional[BaseRagasEmbeddings] = None,
query_distribution: t.Optional[QueryDistribution] = None,
run_config: t.Optional[RunConfig] = None,
callbacks: t.Optional[Callbacks] = None,
token_usage_parser: t.Optional[TokenUsageParser] = None,
with_debugging_logs=False,
raise_exceptions: bool = True,
return_executor: bool = False,
) -> t.Union[Testset, Executor]:
def generate_with_llamaindex_docs(
self,
documents: t.Sequence[LlamaIndexDocument],
testset_size: int,
transforms: t.Optional[Transforms] = None,
transforms_llm: t.Optional[LlamaIndexLLM] = None,
transforms_embedding_model: t.Optional[LlamaIndexEmbedding] = None,
query_distribution: t.Optional[QueryDistribution] = None,
run_config: t.Optional[RunConfig] = None,
callbacks: t.Optional[Callbacks] = None,
token_usage_parser: t.Optional[TokenUsageParser] = None,
with_debugging_logs=False,
raise_exceptions: bool = True,
):
Supported Loaders
Because Ragas accepts any conforming document object, the full range of LangChain and LlamaIndex loaders is supported. Common examples include:
- LangChain: DirectoryLoader, PyPDFLoader, WebBaseLoader, CSVLoader, UnstructuredFileLoader
- LlamaIndex: SimpleDirectoryReader, PDFReader, WikipediaReader, DatabaseReader