Principle:Run llama Llama index Document Loading

From Leeroopedia

Overview

Document Loading is the principle of ingesting data from heterogeneous sources -- files, databases, APIs, web pages -- and converting it into a standardized, framework-native representation that downstream components can process uniformly. In LlamaIndex, this standardized representation is the Document object, and the loading step is the critical first stage of every Retrieval-Augmented Generation (RAG) pipeline.

Without a robust document loading layer, the rest of the pipeline -- chunking, embedding, indexing, retrieval, and synthesis -- cannot function. Document loading is therefore the foundation upon which all subsequent RAG operations are built.

Data Ingestion · RAG Pipeline · LlamaIndex Core

Data Ingestion as the First Step in RAG

A RAG pipeline follows a well-defined sequence:

  1. Load raw data from sources into Document objects.
  2. Transform documents into smaller, indexable nodes (chunking).
  3. Embed nodes into dense vector representations.
  4. Index the embedded nodes for efficient retrieval.
  5. Retrieve relevant nodes given a query.
  6. Synthesize a response using the retrieved context and an LLM.

The loading step (step 1) determines what data is available to the entire pipeline. If documents are loaded incorrectly -- with missing content, corrupted metadata, or improper encoding -- every subsequent step inherits these defects. This makes document loading a quality gate for the entire system.
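
The six stages can be sketched end to end in plain Python. This is a toy illustration, not LlamaIndex code: every function name here is hypothetical, the "embedding" is a bag-of-words count rather than a dense vector, and the synthesis step only assembles context instead of calling an LLM. It is meant to show how a defect introduced in step 1 would flow through every later stage.

```python
from collections import Counter

def load(raw_sources):
    # Step 1: wrap each raw string in a document-like dict.
    return [{"id": i, "text": text} for i, text in enumerate(raw_sources)]

def chunk(documents, size=8):
    # Step 2: split each document into nodes of at most `size` words.
    nodes = []
    for doc in documents:
        words = doc["text"].split()
        for start in range(0, len(words), size):
            piece = " ".join(words[start:start + size])
            nodes.append({"doc_id": doc["id"], "text": piece})
    return nodes

def embed(text):
    # Step 3: stand-in "embedding" (word counts), assumed for illustration only.
    return Counter(text.lower().split())

def build_index(nodes):
    # Step 4: store each node next to its embedding.
    return [(embed(node["text"]), node) for node in nodes]

def retrieve(index, query, top_k=1):
    # Step 5: rank nodes by word overlap with the query.
    q = embed(query)
    ranked = sorted(index, key=lambda item: sum((item[0] & q).values()), reverse=True)
    return [node for _, node in ranked[:top_k]]

def synthesize(query, nodes):
    # Step 6: a real pipeline would prompt an LLM; here we only build the prompt.
    context = "\n".join(node["text"] for node in nodes)
    return f"Question: {query}\nContext: {context}"
```

Because `retrieve` can only return text that `load` produced, anything dropped or corrupted at load time is invisible to the later stages.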

The Document Abstraction

LlamaIndex's Document class provides a uniform container with:

  • text: The raw textual content of the document.
  • metadata: A dictionary of key-value pairs (file name, creation date, author, custom tags).
  • id_: A unique identifier for deduplication and tracking.
  • relationships: Links to other documents or nodes (e.g., parent-child, source).

This abstraction decouples the origin of the data from its representation, allowing the same downstream pipeline to process PDFs, CSVs, web pages, and database records without modification.
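
As a rough illustration of the container described above, the following dataclass mirrors the four listed fields. `SketchDocument` and `word_count` are hypothetical names for this example and are not the LlamaIndex `Document` class, which carries more fields and behavior.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class SketchDocument:
    """Hypothetical stand-in with the four fields listed above."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    # A random UUID is assumed here; any stable unique string would do.
    id_: str = field(default_factory=lambda: str(uuid.uuid4()))
    relationships: Dict[str, str] = field(default_factory=dict)

def word_count(doc: SketchDocument) -> int:
    # Downstream code depends only on the container, not on the data's origin.
    return len(doc.text.split())

# Two records from different origins share one representation.
pdf_doc = SketchDocument(text="Quarterly revenue rose.", metadata={"file_name": "report.pdf"})
row_doc = SketchDocument(text="id=7 status=open", metadata={"table": "tickets"})
```

The same `word_count` call works on both objects, which is the decoupling the abstraction is meant to provide.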

Format-Agnostic Reading with Automatic File Type Detection

A practical document loading system must handle the reality that data exists in many formats:

Category        | Formats            | Challenges
Plain text      | .txt, .md, .csv    | Encoding detection, delimiter handling
Rich documents  | .pdf, .docx, .pptx | Layout extraction, embedded images, tables
Web content     | .html, .xml        | Tag stripping, boilerplate removal
Structured data | .json, .jsonl      | Schema mapping, nested structure flattening
Media           | .jpg, .png, .mp3   | OCR, transcription, multimodal processing

The principle of format-agnostic reading means the loading interface should be the same regardless of the underlying file type. The caller should not need to know whether a file is a PDF or a Word document -- the loader detects the file type (typically by extension) and delegates to an appropriate file extractor (a specialized reader) automatically.

This detection-and-delegation pattern provides:

  • Simplicity: One API call loads any supported file type.
  • Extensibility: New file types are supported by registering new extractors.
  • Consistency: All formats produce the same Document output structure.
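
The detection-and-delegation pattern can be sketched with a small extension registry. The names `EXTRACTORS`, `register`, and `load_file` are assumptions made for this example, and the JSON extractor assumes a top-level object; a production loader would cover far more formats and edge cases.

```python
import json
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry mapping a file extension to an extractor function.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {}

def register(extension: str):
    """Decorator that registers an extractor for one extension."""
    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        EXTRACTORS[extension.lower()] = func
        return func
    return decorator

@register(".txt")
def read_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")

@register(".json")
def read_json(path: Path) -> str:
    # Assumes a top-level JSON object; flattens it into "key: value" lines.
    data = json.loads(path.read_text(encoding="utf-8"))
    return "\n".join(f"{key}: {value}" for key, value in data.items())

def load_file(path: Path) -> dict:
    # Detect the type by extension, then delegate to the matching extractor.
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"No extractor registered for {path.suffix!r}")
    # Every format yields the same output structure.
    return {"text": extractor(path), "metadata": {"file_name": path.name}}
```

Supporting a new format means adding one decorated function; callers keep using `load_file` unchanged.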

Design Considerations

Metadata Enrichment

Effective document loading goes beyond extracting text. Rich metadata -- file path, file name, creation date, file size, page numbers -- should be attached automatically. This metadata is invaluable for:

  • Filtering at retrieval time (e.g., "only search documents from 2024").
  • Citation in generated responses (e.g., "according to report.pdf, page 3...").
  • Deduplication when the same content appears in multiple files.

Users should also be able to supply a custom metadata function that computes additional metadata per file.
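
One way to combine automatic and user-supplied metadata is sketched below. `default_metadata` and `build_metadata` are hypothetical helpers, and the sketch records the last-modified date rather than a creation date because file creation time is not exposed portably by the standard library.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Callable, Dict, Optional

def default_metadata(path: Path) -> Dict[str, Any]:
    # Metadata derived automatically from the filesystem.
    stat = path.stat()
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_path": str(path),
        "file_name": path.name,
        "file_size": stat.st_size,
        "last_modified_date": modified.date().isoformat(),
    }

def build_metadata(
    path: Path,
    custom_fn: Optional[Callable[[Path], Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    metadata = default_metadata(path)
    if custom_fn is not None:
        # User-supplied keys are merged last, so they can override defaults.
        metadata.update(custom_fn(path))
    return metadata
```

A caller might pass `custom_fn=lambda p: {"team": "research"}` to tag every file from a given ingestion job.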

Recursive Directory Traversal

Real-world data is organized in directory hierarchies. A document loader should support:

  • Recursive traversal to walk nested directories.
  • Extension filtering to limit which file types are loaded.
  • Exclusion patterns to skip irrelevant directories (e.g., __pycache__, .git).
  • Hidden file exclusion to ignore system files by default.
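
The four traversal options above can be illustrated with `pathlib`. `find_files` and its parameter names are assumptions for this sketch and do not reproduce any particular loader's signature.

```python
from pathlib import Path
from typing import Collection, List, Optional

def find_files(
    root: Path,
    recursive: bool = True,
    required_exts: Optional[Collection[str]] = None,
    exclude_dirs: Collection[str] = ("__pycache__", ".git"),
    exclude_hidden: bool = True,
) -> List[Path]:
    pattern = "**/*" if recursive else "*"
    selected = []
    for path in sorted(root.glob(pattern)):
        if not path.is_file():
            continue
        # Inspect only the parts below `root`, so the root's own name is ignored.
        parts = path.relative_to(root).parts
        if any(part in exclude_dirs for part in parts[:-1]):
            continue  # inside an excluded directory
        if exclude_hidden and any(part.startswith(".") for part in parts):
            continue  # hidden file, or file inside a hidden directory
        if required_exts is not None and path.suffix.lower() not in required_exts:
            continue  # extension not requested
        selected.append(path)
    return selected
```

For example, `find_files(Path("data"), required_exts={".md", ".txt"})` would return only Markdown and text files under a hypothetical `data` directory, in sorted order.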

Scalability

For large document collections, the loader should support:

  • Progress reporting so users can monitor long-running ingestion jobs.
  • Parallel loading via multiple worker threads or processes.
  • File count limits for sampling or incremental loading.
  • Remote filesystem support (e.g., S3, GCS) via the fsspec abstraction.
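
A minimal sketch of the first three points, using a thread pool from the standard library, follows. `read_one`, `load_many`, and the `on_progress` callback are hypothetical names; remote filesystems are not covered here, and the files are assumed to be UTF-8 text.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Optional

def read_one(path: Path) -> dict:
    # Assumes UTF-8 text files; real loaders dispatch on file type.
    return {"text": path.read_text(encoding="utf-8"), "metadata": {"file_name": path.name}}

def load_many(
    paths: List[Path],
    num_workers: int = 4,
    num_files_limit: Optional[int] = None,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> List[dict]:
    if num_files_limit is not None:
        # File count limit: keep only the first N paths.
        paths = paths[:num_files_limit]
    documents = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields results in input order, even if reads finish out of order.
        for done, doc in enumerate(pool.map(read_one, paths), start=1):
            documents.append(doc)
            if on_progress is not None:
                on_progress(done, len(paths))
    return documents
```

Threads suit I/O-bound reads; CPU-heavy parsing (for example OCR) may benefit more from separate processes, at the cost of pickling the results.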

Error Handling

Not all files can be loaded successfully. A robust loader provides:

  • Graceful degradation: Skip unreadable files and continue.
  • Error reporting: Log which files failed and why.
  • Configurable strictness: Optionally raise on first error for debugging.
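
The three behaviors above can be combined in one loop, as in this sketch. `load_tolerant` and its `raise_on_error` flag are assumed names for illustration, and only read and decoding errors are handled.

```python
import logging
from pathlib import Path
from typing import List, Tuple

logger = logging.getLogger("loader_sketch")

def load_tolerant(
    paths: List[Path],
    raise_on_error: bool = False,
) -> Tuple[List[dict], List[Tuple[Path, str]]]:
    documents, failures = [], []
    for path in paths:
        try:
            text = path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            if raise_on_error:
                raise  # strict mode: stop at the first failure
            # Lenient mode: record which file failed and why, then continue.
            logger.warning("Skipping %s: %s", path, exc)
            failures.append((path, str(exc)))
            continue
        documents.append({"text": text, "metadata": {"file_name": path.name}})
    return documents, failures
```

Returning the failure list alongside the documents lets a caller decide afterwards whether the loss is acceptable, rather than discovering gaps at query time.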

Relationship to Other Principles

Document loading feeds directly into:

  • Node parsing / chunking: Documents are split into nodes using the configured node parser.
  • Embedding: Node text is converted to vectors using the configured embedding model.
  • Index construction: Embedded nodes are stored in an index for retrieval.

All of these downstream steps depend on the quality and completeness of the loaded documents.

Knowledge Sources

  • LlamaIndex Data Connectors Documentation
  • LlamaIndex SimpleDirectoryReader Guide

Implementation

Implementation:Run_llama_Llama_index_SimpleDirectoryReader_Load_Data

Metadata

2026-02-11 00:00 GMT
