
Principle:Explodinggradients Ragas Document Loading

From Leeroopedia


Knowledge Sources: explodinggradients/ragas
Domains: LLM Evaluation, Test Data Generation, Document Processing
Last Updated: 2026-02-10

Overview

Description

Document Loading is the principle of abstracting different source document formats (PDF, web pages, plain text, structured data) into a uniform document representation that can be consumed by the knowledge graph construction pipeline. Ragas does not implement its own document loaders. Instead, it relies on the established document loading ecosystems of LangChain and LlamaIndex, accepting their respective Document objects as input. This design choice maximizes compatibility with existing tooling while keeping the Ragas codebase focused on evaluation and test generation.

Usage

Document loading is the entry point for the Ragas test generation pipeline. Users load their documents using their preferred framework's loaders and pass the resulting document objects to Ragas:

  • LangChain path: Load documents using any LangChain document loader (e.g., DirectoryLoader, WebBaseLoader, PyPDFLoader), then pass to TestsetGenerator.generate_with_langchain_docs().
  • LlamaIndex path: Load documents using any LlamaIndex reader (e.g., SimpleDirectoryReader), then pass to TestsetGenerator.generate_with_llamaindex_docs().
  • Pre-chunked path: Pass pre-chunked strings or LangChain Document objects directly to TestsetGenerator.generate_with_chunks().

In all cases, Ragas converts the external document objects into internal Node objects, extracting page_content (or text) into the node's page_content property and the document's metadata into the document_metadata property.
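The conversion described above can be sketched in plain Python. The `Node` shape below is a simplified stand-in for Ragas's internal node class, not its actual definition; the duck-typed attribute lookup mirrors the fact that LangChain documents expose `page_content` while LlamaIndex documents expose `text`:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Node:
    # Simplified stand-in for Ragas's internal node: a bag of properties.
    properties: dict[str, Any] = field(default_factory=dict)

def to_node(doc: Any) -> Node:
    # LangChain documents carry `page_content`; LlamaIndex documents carry `text`.
    content = getattr(doc, "page_content", None) or getattr(doc, "text", "")
    return Node(properties={
        "page_content": content,
        "document_metadata": getattr(doc, "metadata", {}),
    })
```

Because the lookup is duck-typed, any object exposing either attribute plus a `metadata` dictionary converts cleanly, which is what makes both framework paths interchangeable downstream.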

Theoretical Basis

Uniform Interface via External Standards: Rather than defining its own document format, Ragas adopts the document interfaces of two widely-used LLM frameworks. This provides several advantages:

  • Zero migration cost: Users who already use LangChain or LlamaIndex can pass their existing document objects directly without conversion.
  • Ecosystem leverage: Both frameworks offer hundreds of document loaders covering diverse sources (files, databases, APIs, web scraping). Ragas gains access to all of these without implementing any loader logic.
  • Format abstraction: Both frameworks abstract away format-specific parsing (PDF extraction, HTML parsing, etc.) behind a simple interface: a text field and a metadata dictionary.
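The shared interface both ecosystems converge on can be modeled as a minimal shape (a simplification for illustration, not the real LangChain or LlamaIndex class):

```python
from dataclasses import dataclass, field

@dataclass
class UniformDocument:
    # The minimal contract both ecosystems provide after format-specific
    # parsing: raw text plus a metadata dictionary.
    text: str
    metadata: dict = field(default_factory=dict)
```

Everything format-specific (PDF extraction, HTML parsing) happens before this point, so the rest of the pipeline only ever sees text and metadata.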

Content and Metadata Separation: The document loading principle enforces a clean separation between content (page_content / text) and metadata (metadata). This separation is preserved through the entire pipeline:

  • Content is used for chunking, summarization, embedding, and query generation.
  • Metadata is carried along for traceability and can be used in filtering and reporting.
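This separation can be illustrated with a hypothetical chunking step: content is split, while each chunk keeps a copy of the parent document's metadata for traceability (the function name and fixed-size splitting are illustrative, not Ragas's actual chunking logic):

```python
def chunk_with_metadata(text: str, metadata: dict, size: int = 100) -> list[dict]:
    # Split content into fixed-size chunks; every chunk carries its own
    # copy of the parent document's metadata.
    return [
        {"page_content": text[i:i + size], "document_metadata": dict(metadata)}
        for i in range(0, len(text), size)
    ]
```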

Flexible Entry Points: The three generate_with_* methods provide different levels of document preprocessing:

  • generate_with_langchain_docs() and generate_with_llamaindex_docs() accept full documents and apply internal chunking transforms.
  • generate_with_chunks() accepts pre-chunked content, allowing users who have already performed custom chunking to skip redundant processing.
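The difference between the two preprocessing levels can be sketched as follows (function names and the fixed-size chunker are illustrative stand-ins, not Ragas's API):

```python
def nodes_from_documents(docs: list[dict], chunk_size: int = 100) -> list[dict]:
    # Full-document path: apply internal chunking, then build one node per chunk.
    nodes = []
    for doc in docs:
        text = doc["text"]
        for i in range(0, len(text), chunk_size):
            nodes.append({"page_content": text[i:i + chunk_size]})
    return nodes

def nodes_from_chunks(chunks: list[str]) -> list[dict]:
    # Pre-chunked path: each string is already a chunk; wrap it directly.
    return [{"page_content": c} for c in chunks]
```

Users who have already tuned their own chunking strategy take the second path and avoid having their chunks re-split.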

Empty Content Filtering: During conversion from external documents to internal nodes, LlamaIndex documents with empty or whitespace-only text are automatically filtered out. This defensive behavior prevents downstream errors from empty nodes propagating through the pipeline.
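The filtering behavior amounts to a whitespace check before node creation, which can be sketched as (a simplification operating on raw strings rather than document objects):

```python
def filter_empty(texts: list[str]) -> list[str]:
    # Drop documents whose text is empty or whitespace-only so that no
    # empty nodes enter the knowledge graph pipeline.
    return [t for t in texts if t.strip()]
```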
