Implementation:Run llama Llama index SimpleDirectoryReader Load Data

From Leeroopedia

Overview

SimpleDirectoryReader is the primary file-system-based document loader in LlamaIndex. It reads files from a directory (or an explicit list of file paths), automatically detects file types by extension, delegates to the appropriate file extractor, and returns a list of Document objects ready for indexing. It supports recursive traversal, extension filtering, hidden file exclusion, parallel loading, remote filesystems via fsspec, and custom metadata functions.

Data Ingestion, RAG Pipeline, LlamaIndex Core

Source File

  • File: llama-index-core/llama_index/core/readers/file/base.py, Lines 208-872
  • Class: SimpleDirectoryReader(BaseReader, ResourcesReaderMixin, FileSystemReaderMixin)

Import

from llama_index.core import SimpleDirectoryReader

Class Signature

class SimpleDirectoryReader(BaseReader, ResourcesReaderMixin, FileSystemReaderMixin):
    """Read files from a directory.

    Automatically detects file type and delegates to the appropriate
    file extractor. Returns a list of Document objects.
    """
    ...

Constructor Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input_dir | Optional[Union[Path, str]] | None | Path to the directory to read files from. Either input_dir or input_files must be provided. |
| input_files | Optional[list] | None | Explicit list of file paths to read. Either input_dir or input_files must be provided. |
| exclude | Optional[list] | None | List of glob patterns for files/directories to exclude. |
| exclude_hidden | bool | True | Whether to exclude hidden files (files starting with .). |
| recursive | bool | False | Whether to recursively traverse subdirectories. |
| encoding | str | "utf-8" | Text encoding to use when reading files. |
| filename_as_id | bool | False | Whether to derive the document ID from the file name instead of using an automatically generated ID. |
| required_exts | Optional[list[str]] | None | List of required file extensions (e.g., [".pdf", ".txt"]). Only files matching these extensions will be loaded. |
| file_extractor | Optional[dict[str, BaseReader]] | None | A mapping from file extension to a custom BaseReader instance for that type. Overrides the default extractors. |
| num_files_limit | Optional[int] | None | Maximum number of files to read. Useful for sampling or incremental loading. |
| file_metadata | Optional[Callable] | None | A callable that takes a file path and returns a metadata dictionary to attach to the document. |
| raise_on_error | bool | False | Whether to raise an exception on file read errors. If False, errors are logged and the file is skipped. |
| fs | Optional[fsspec.AbstractFileSystem] | None | An fsspec filesystem instance for reading from remote storage (S3, GCS, etc.). |
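
The filtering parameters above combine into a file-selection pass that runs before any file is parsed. The following is a minimal standard-library sketch of that kind of selection logic, not the library's actual implementation: select_files is a hypothetical helper, and the order in which it applies the checks is an assumption made for illustration.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Optional


def select_files(
    input_dir: str,
    recursive: bool = False,
    exclude_hidden: bool = True,
    required_exts: Optional[list[str]] = None,
    exclude: Optional[list[str]] = None,
    num_files_limit: Optional[int] = None,
) -> list[Path]:
    """Hypothetical helper: approximate which files a directory reader would pick up."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in sorted(candidates):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Treat a file as hidden if any component of its relative path starts with "."
        if exclude_hidden and any(part.startswith(".") for part in rel.parts):
            continue
        # Keep only files whose extension is explicitly allowed
        if required_exts is not None and path.suffix not in required_exts:
            continue
        # Drop files whose relative path matches any exclusion glob
        if exclude and any(fnmatch(rel.as_posix(), pattern) for pattern in exclude):
            continue
        selected.append(path)
    # Apply the limit last, after all filters
    if num_files_limit is not None:
        selected = selected[:num_files_limit]
    return selected
```

Running a helper like this against a directory before constructing the reader can be a cheap way to sanity-check a combination of recursive, required_exts, and exclude settings.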

Primary Method: load_data()

Signature

def load_data(
    self,
    show_progress: bool = False,
    num_workers: Optional[int] = None,
    fs: Optional[fsspec.AbstractFileSystem] = None,
) -> list[Document]:
    ...

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| show_progress | bool | False | Display a progress bar during file loading (requires tqdm). |
| num_workers | Optional[int] | None | Number of parallel worker processes for loading files. None means files are loaded sequentially in the calling process. |
| fs | Optional[fsspec.AbstractFileSystem] | None | Override the filesystem instance for this specific load_data call. |

Return Value

Returns a list[Document] where each Document contains:

  • text: The extracted textual content.
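  • metadata: A dictionary of file attributes. With the default metadata function it contains file_path, file_name, file_type, file_size, creation_date, and last_modified_date; if a custom file_metadata callable is supplied, that callable determines the contents instead.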
  • id_: A unique identifier (either generated or the filename if filename_as_id=True).
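
To make the shape of the default metadata concrete, here is a minimal standard-library sketch that builds a dictionary with the same six keys for a local file. It is an illustration only: sketch_file_metadata is a hypothetical name, and the date format (UTC, YYYY-MM-DD) and the use of MIME types for file_type are assumptions of this sketch rather than a statement of what the library does.

```python
import mimetypes
import os
from datetime import datetime, timezone


def sketch_file_metadata(file_path: str) -> dict:
    """Hypothetical stand-in that mirrors the keys of the default metadata."""
    stat = os.stat(file_path)

    def as_date(timestamp: float) -> str:
        # Assumption of this sketch: dates are rendered in UTC as YYYY-MM-DD
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")

    return {
        "file_path": file_path,
        "file_name": os.path.basename(file_path),
        # Guessed from the extension; None when the type is unknown
        "file_type": mimetypes.guess_type(file_path)[0],
        "file_size": stat.st_size,
        # st_ctime is creation time on Windows but metadata-change time on most Unix systems
        "creation_date": as_date(stat.st_ctime),
        "last_modified_date": as_date(stat.st_mtime),
    }
```

A function with this signature (file path in, dictionary out) is also the shape expected by the file_metadata constructor parameter.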

Usage Examples

Basic Directory Loading

from llama_index.core import SimpleDirectoryReader

# Load all supported files from a directory
reader = SimpleDirectoryReader(input_dir="./data")
documents = reader.load_data()

print(f"Loaded {len(documents)} documents")

Recursive Loading with Extension Filtering

from llama_index.core import SimpleDirectoryReader

# Recursively load only PDF and text files
reader = SimpleDirectoryReader(
    input_dir="./documents",
    recursive=True,
    required_exts=[".pdf", ".txt"],
    exclude=["drafts/*", "archive/*"],
)
documents = reader.load_data(show_progress=True)

Loading Specific Files

from llama_index.core import SimpleDirectoryReader

# Load specific files by path
reader = SimpleDirectoryReader(
    input_files=["./reports/q1_2024.pdf", "./reports/q2_2024.pdf"]
)
documents = reader.load_data()

Custom Metadata Function

from llama_index.core import SimpleDirectoryReader

def custom_metadata(file_path: str) -> dict:
    """Add custom metadata based on file path."""
    return {
        "department": "engineering" if "eng" in file_path else "general",
        "source": "internal_docs",
    }

reader = SimpleDirectoryReader(
    input_dir="./data",
    file_metadata=custom_metadata,
)
documents = reader.load_data()
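
If a custom file_metadata callable replaces the default file-derived fields rather than extending them (worth verifying against your installed version), one option is to compose several metadata callables into one. The sketch below uses only the standard library; basic_file_fields and merge_metadata are hypothetical helpers introduced for illustration, not part of LlamaIndex.

```python
import os
from typing import Callable


def basic_file_fields(file_path: str) -> dict:
    """Hypothetical baseline: a couple of file-derived fields."""
    return {
        "file_name": os.path.basename(file_path),
        "file_path": file_path,
    }


def merge_metadata(*funcs: Callable[[str], dict]) -> Callable[[str], dict]:
    """Combine several metadata callables; later ones win on key conflicts."""

    def combined(file_path: str) -> dict:
        merged: dict = {}
        for func in funcs:
            merged.update(func(file_path))
        return merged

    return combined
```

The combined callable can then be passed as file_metadata=merge_metadata(basic_file_fields, custom_metadata), so each document carries both the baseline fields and the custom ones.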

Custom File Extractor

from llama_index.core import SimpleDirectoryReader
# PDFReader is provided by the separate llama-index-readers-file package
from llama_index.readers.file import PDFReader

# Use a custom PDF reader instead of the default
reader = SimpleDirectoryReader(
    input_dir="./data",
    file_extractor={".pdf": PDFReader(return_full_document=True)},
)
documents = reader.load_data()

Parallel Loading with Progress

from llama_index.core import SimpleDirectoryReader

# Load files in parallel using 4 worker processes
reader = SimpleDirectoryReader(
    input_dir="./large_dataset",
    recursive=True,
    num_files_limit=1000,
)
documents = reader.load_data(show_progress=True, num_workers=4)

Loading from Remote Filesystem (S3)

import s3fs
from llama_index.core import SimpleDirectoryReader

# Read from an S3 bucket
s3 = s3fs.S3FileSystem(anon=False)
reader = SimpleDirectoryReader(
    input_dir="my-bucket/documents/",
    fs=s3,
    recursive=True,
)
documents = reader.load_data()

Inheritance Hierarchy

SimpleDirectoryReader inherits from three base classes:

| Base Class | Purpose |
| --- | --- |
| BaseReader | Provides the standard load_data() interface. |
| ResourcesReaderMixin | Adds resource listing and retrieval capabilities. |
| FileSystemReaderMixin | Adds filesystem-specific methods and the fs parameter. |

File Type Detection

SimpleDirectoryReader maintains a default mapping of file extensions to reader classes. When a file is encountered:

  1. The file extension is extracted.
  2. If file_extractor contains a custom reader for that extension, it is used.
  3. Otherwise, the default extractor for that extension is used.
  4. If no extractor is available, the file is read as plain text.
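
The four steps above amount to a lookup with two fallbacks. The following is a minimal standard-library sketch of that dispatch order, not the library's code: pick_reader and read_plain_text are hypothetical names, readers are modeled as plain callables rather than BaseReader instances, and lowercasing the extension before lookup is an assumption of this sketch.

```python
from pathlib import Path
from typing import Callable, Optional

# In this sketch a "reader" is just a callable from path to text
Reader = Callable[[Path], str]


def read_plain_text(path: Path, encoding: str = "utf-8") -> str:
    """Final fallback: treat the file as plain text."""
    return path.read_text(encoding=encoding)


def pick_reader(
    path: Path,
    custom: Optional[dict[str, Reader]],
    defaults: dict[str, Reader],
) -> Reader:
    """Hypothetical dispatch: custom mapping, then defaults, then plain text."""
    # Step 1: extract the extension (normalized to lowercase in this sketch)
    ext = path.suffix.lower()
    # Step 2: a user-supplied reader takes priority
    if custom and ext in custom:
        return custom[ext]
    # Step 3: otherwise use the built-in reader for this extension
    if ext in defaults:
        return defaults[ext]
    # Step 4: no reader registered, so read as plain text
    return read_plain_text
```

Because the custom mapping is consulted first, registering a reader for an extension that already has a default is enough to override it.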

Knowledge Sources

  • LlamaIndex SimpleDirectoryReader Guide
  • LlamaIndex GitHub Repository

Principle

Principle:Run_llama_Llama_index_Document_Loading

Metadata

2026-02-11 00:00 GMT
