Principle:Vibrantlabsai Ragas Document Loading

From Leeroopedia
Source: ragas
Domain: Testset Generation
Last Updated: 2026-02-12 00:00 GMT

Overview

Document Loading is the first stage in the Ragas testset generation workflow. It concerns the ingestion of raw source documents -- PDFs, web pages, plain text files, and other unstructured content -- into a normalized document representation that downstream pipeline stages can consume. Without a reliable and flexible document loading step, the entire test data generation pipeline cannot operate, because all subsequent stages (knowledge graph construction, persona generation, query synthesis) depend on having well-structured document objects as input.

In Ragas, the document loading principle is deliberately decoupled from any single loader implementation. Ragas does not ship its own document loaders; instead, it relies on the document abstractions provided by external frameworks such as LangChain and LlamaIndex. This design choice reflects a separation-of-concerns philosophy: Ragas focuses on evaluation and test generation logic, while document ingestion is delegated to mature, purpose-built libraries that already support hundreds of file formats and data sources.

Key Concepts

  • Normalized Document Representation: Regardless of the original file format, every loaded document must be converted into a standard object that carries at minimum a text payload (page_content) and an associated metadata dictionary. This normalization ensures that the rest of the Ragas pipeline can treat all documents uniformly.
  • Framework Agnosticism: The TestsetGenerator class provides separate entry points for LangChain documents (generate_with_langchain_docs) and LlamaIndex documents (generate_with_llamaindex_docs), as well as raw pre-chunked text (generate_with_chunks). This means users are not locked into a single ecosystem.
  • Metadata Preservation: During loading, any metadata attached to the original document (such as source URL, page number, or author) is preserved and stored alongside the content. This metadata flows through the pipeline and can appear in the final test set for traceability.
  • Conversion to Internal Nodes: After loading, each document is converted into a Ragas Node with NodeType.DOCUMENT, storing the page content and document metadata as node properties. This conversion bridges the external document format and the internal knowledge graph representation.
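The normalization and node-conversion ideas above can be illustrated with a minimal, self-contained sketch. It deliberately does not import Ragas or LangChain: `SimpleDocument`, `SketchNodeType`, `SketchNode`, and `document_to_node` are hypothetical stand-ins, and the property keys `page_content` and `document_metadata` are assumptions chosen to mirror the description, not a statement of what Ragas does internally.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain-style document."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class SketchNodeType(Enum):
    """Hypothetical stand-in for a node-type enum with a DOCUMENT member."""
    DOCUMENT = "document"


@dataclass
class SketchNode:
    """Hypothetical stand-in for a knowledge-graph node."""
    type: SketchNodeType
    properties: dict[str, Any] = field(default_factory=dict)


def document_to_node(doc: SimpleDocument) -> SketchNode:
    """Wrap a loaded document as a DOCUMENT node (property keys are assumed)."""
    return SketchNode(
        type=SketchNodeType.DOCUMENT,
        properties={
            "page_content": doc.page_content,
            # Copy the metadata so later edits to the document
            # do not change what the node recorded.
            "document_metadata": dict(doc.metadata),
        },
    )


doc = SimpleDocument(
    page_content="Ragas generates test sets from documents.",
    metadata={"source": "intro.txt", "page": 1},
)
node = document_to_node(doc)
```

Whatever the original file format, the node ends up holding the same two things: the text payload and the metadata that came with it. That is what lets later stages treat every document uniformly.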

How It Fits in the Pipeline

The testset generation workflow in Ragas proceeds through the following high-level stages:

  1. Document Loading -- Ingest source documents into normalized representations.
  2. Knowledge Graph Construction -- Transform loaded documents into a structured graph of nodes and relationships.
  3. Persona Generation -- Create synthetic user personas from the knowledge graph.
  4. Query Distribution Configuration -- Define the types and proportions of test queries.
  5. Test Sample Synthesis -- Generate test samples using LLMs and query synthesizers.
  6. Testset Export -- Convert the generated test set into evaluation-ready formats.

Document loading is the entry point. The quality and coverage of the loaded documents directly determine the richness of the knowledge graph and, consequently, the diversity of the generated test set.

Design Considerations

  • Scalability: Document loaders should handle both small collections (a handful of documents) and large corpora (thousands of files) without requiring changes to the downstream pipeline.
  • Error Handling: Malformed or empty documents should be filtered out before they enter the knowledge graph. The generate_with_llamaindex_docs method, for example, explicitly skips documents where the text is None or empty.
  • Extensibility: Because Ragas accepts any object conforming to the LangChain Document or LlamaIndex Document interface, users can write custom loaders for proprietary formats without modifying Ragas itself.
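As a rough illustration of the error-handling and extensibility points, the sketch below pairs a custom loader for a made-up record format with a filter that drops unusable documents before they reach the rest of the pipeline. `LoadedDocument`, `load_records`, `drop_empty`, and the `id`/`body` record fields are all hypothetical. Treating whitespace-only text as empty is also an assumption of this sketch, not documented Ragas behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Iterator, Optional


@dataclass
class LoadedDocument:
    """Hypothetical document shape: a text payload plus metadata."""
    text: Optional[str]
    metadata: dict[str, Any] = field(default_factory=dict)


def load_records(records: Iterable[dict[str, Any]]) -> Iterator[LoadedDocument]:
    """Hypothetical custom loader for a proprietary record format.

    Yields documents lazily, so the same code path serves a handful
    of records or a large corpus.
    """
    for record in records:
        yield LoadedDocument(
            text=record.get("body"),
            metadata={"source": record.get("id")},
        )


def drop_empty(docs: Iterable[LoadedDocument]) -> list[LoadedDocument]:
    """Keep only documents whose text is neither None nor blank."""
    return [d for d in docs if d.text is not None and d.text.strip() != ""]


records = [
    {"id": "a", "body": "Useful content."},
    {"id": "b", "body": ""},      # empty text
    {"id": "c"},                  # no body at all, so text is None
    {"id": "d", "body": "   "},   # whitespace only (dropped by assumption)
]
usable = drop_empty(load_records(records))
```

Only record `a` survives the filter. In practice the surviving objects would be instances of the LangChain or LlamaIndex document class, which are then passed to the matching `TestsetGenerator` entry point.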
