# Principle: Explodinggradients Ragas Document Loading
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| explodinggradients/ragas | LLM Evaluation, Test Data Generation, Document Processing | 2026-02-10 |
## Overview
### Description
Document Loading is the principle of abstracting different source document formats (PDF, web pages, plain text, structured data) into a uniform document representation that can be consumed by the knowledge graph construction pipeline. Ragas does not implement its own document loaders. Instead, it relies on the established document loading ecosystems of LangChain and LlamaIndex, accepting their respective Document objects as input. This design choice maximizes compatibility with existing tooling while keeping the Ragas codebase focused on evaluation and test generation.
### Usage
Document loading is the entry point for the Ragas test generation pipeline. Users load their documents using their preferred framework's loaders and pass the resulting document objects to Ragas:
- **LangChain path**: Load documents using any LangChain document loader (e.g., `DirectoryLoader`, `WebBaseLoader`, `PyPDFLoader`), then pass to `TestsetGenerator.generate_with_langchain_docs()`.
- **LlamaIndex path**: Load documents using any LlamaIndex reader (e.g., `SimpleDirectoryReader`), then pass to `TestsetGenerator.generate_with_llamaindex_docs()`.
- **Pre-chunked path**: Pass pre-chunked strings or LangChain `Document` objects directly to `TestsetGenerator.generate_with_chunks()`.
In all cases, Ragas converts the external document objects into internal `Node` objects, extracting `page_content` (or `text`) into the node's `page_content` property and the document's metadata into the `document_metadata` property.
## Theoretical Basis
**Uniform Interface via External Standards**: Rather than defining its own document format, Ragas adopts the document interfaces of two widely-used LLM frameworks. This provides several advantages:
- Zero migration cost: Users who already use LangChain or LlamaIndex can pass their existing document objects directly without conversion.
- Ecosystem leverage: Both frameworks offer hundreds of document loaders covering diverse sources (files, databases, APIs, web scraping). Ragas gains access to all of these without implementing any loader logic.
- Format abstraction: Both frameworks abstract away format-specific parsing (PDF extraction, HTML parsing, etc.) behind a simple interface: a text field and a metadata dictionary.
**Content and Metadata Separation**: The document loading principle enforces a clean separation between content (`page_content` / `text`) and metadata (`metadata`). This separation is preserved through the entire pipeline:
- Content is used for chunking, summarization, embedding, and query generation.
- Metadata is carried along for traceability and can be used in filtering and reporting.
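A minimal sketch of how metadata can ride along while content is processed. This is illustrative only (`chunk_document` is a hypothetical helper, and Ragas' real chunking transforms are more sophisticated than fixed-width slicing): the point is that each chunk keeps a copy of the source metadata for traceability:

```python
def chunk_document(doc: dict, size: int = 100) -> list[dict]:
    """Split a document's content into fixed-size chunks while copying
    the original metadata onto every chunk for traceability."""
    content, metadata = doc["page_content"], doc["metadata"]
    return [
        {"page_content": content[i:i + size], "metadata": dict(metadata)}
        for i in range(0, len(content), size)
    ]
```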
**Flexible Entry Points**: The three `generate_with_*` methods provide different levels of document preprocessing:
- `generate_with_langchain_docs()` and `generate_with_llamaindex_docs()` accept full documents and apply internal chunking transforms.
- `generate_with_chunks()` accepts pre-chunked content, allowing users who have already performed custom chunking to skip redundant processing.
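The two preprocessing levels can be sketched as a single dispatcher. This is not the Ragas API — `prepare_inputs` is a hypothetical function and the chunking is deliberately naive — it only illustrates the branching: full documents get chunked internally, pre-chunked input passes through untouched:

```python
def prepare_inputs(docs=None, chunks=None, chunk_size=200):
    """Mimic the two preprocessing levels: pre-chunked strings pass
    through unchanged; full documents get a (naive) internal chunking."""
    if chunks is not None:
        # generate_with_chunks() path: skip internal chunking entirely.
        return list(chunks)
    # generate_with_*_docs() path: apply an internal chunking transform.
    return [d[i:i + chunk_size]
            for d in docs
            for i in range(0, len(d), chunk_size)]
```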
**Empty Content Filtering**: During conversion from external documents to internal nodes, LlamaIndex documents with empty or whitespace-only text are automatically filtered out. This defensive behavior prevents empty nodes from propagating through the pipeline and causing downstream errors.
## Related Pages
- Implementation:Explodinggradients_Ragas_Document_Loader_Interface
- Principle:Explodinggradients_Ragas_Knowledge_Graph_Construction -- the knowledge graph that documents are loaded into
- Principle:Explodinggradients_Ragas_Test_Query_Synthesis -- the downstream consumer of loaded and processed documents