Implementation:Run llama Llama index SimpleDirectoryReader Load Data

Overview

SimpleDirectoryReader is the primary file-system-based document loader in LlamaIndex. It reads files from a directory (or an explicit list of file paths), automatically detects file types by extension, delegates to the appropriate file extractor, and returns a list of Document objects ready for indexing. It supports recursive traversal, extension filtering, hidden file exclusion, parallel loading, remote filesystems via fsspec, and custom metadata functions.

Data Ingestion, RAG Pipeline, LlamaIndex Core

Source File

File: llama-index-core/llama_index/core/readers/file/base.py, Lines 208-872
Class: SimpleDirectoryReader(BaseReader, ResourcesReaderMixin, FileSystemReaderMixin)

Import

from llama_index.core import SimpleDirectoryReader

Class Signature

class SimpleDirectoryReader(BaseReader, ResourcesReaderMixin, FileSystemReaderMixin):
    """Read files from a directory.

    Automatically detects file type and delegates to the appropriate
    file extractor. Returns a list of Document objects.
    """
    ...

Constructor Parameters

Parameter	Type	Default	Description
`input_dir`	`Optional[Union[Path, str]]`	`None`	Path to the directory to read files from. Either `input_dir` or `input_files` must be provided.
`input_files`	`Optional[list]`	`None`	Explicit list of file paths to read. Either `input_dir` or `input_files` must be provided.
`exclude`	`Optional[list]`	`None`	List of glob patterns for files/directories to exclude.
`exclude_hidden`	`bool`	`True`	Whether to exclude hidden files (files starting with `.`).
`recursive`	`bool`	`False`	Whether to recursively traverse subdirectories.
`encoding`	`str`	`"utf-8"`	Text encoding to use when reading files.
`filename_as_id`	`bool`	`False`	Whether to derive the document ID from the file name instead of using an automatically generated ID.
`required_exts`	`Optional[list[str]]`	`None`	List of required file extensions (e.g., `[".pdf", ".txt"]`). Only files matching these extensions will be loaded.
`file_extractor`	`Optional[dict[str, BaseReader]]`	`None`	A mapping from file extension to a custom `BaseReader` instance for that type. Overrides the default extractors.
`num_files_limit`	`Optional[int]`	`None`	Maximum number of files to read. Useful for sampling or incremental loading.
`file_metadata`	`Optional[Callable]`	`None`	A callable that takes a file path and returns a metadata dictionary to attach to the document.
`raise_on_error`	`bool`	`False`	Whether to raise an exception on file read errors. If `False`, errors are logged and the file is skipped.
`fs`	`Optional[fsspec.AbstractFileSystem]`	`None`	An `fsspec` filesystem instance for reading from remote storage (S3, GCS, etc.).
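How the filtering parameters combine is easier to see in code. The following is a minimal standard-library sketch, not the library's implementation: `select_files` is a hypothetical helper, and the exact matching rules (for example, matching `exclude` patterns against paths relative to the input directory) are assumptions made for illustration.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Optional


def select_files(
    input_dir: str,
    recursive: bool = False,
    exclude_hidden: bool = True,
    required_exts: Optional[list[str]] = None,
    exclude: Optional[list[str]] = None,
    num_files_limit: Optional[int] = None,
) -> list[Path]:
    """Hypothetical helper that mirrors the documented filters."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in sorted(candidates):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Assumption: any path component starting with "." counts as hidden.
        if exclude_hidden and any(part.startswith(".") for part in rel.parts):
            continue
        # Keep only the required extensions, if any were given.
        if required_exts is not None and path.suffix not in required_exts:
            continue
        # Assumption: glob patterns are matched against the relative path.
        if exclude and any(fnmatch(rel.as_posix(), pat) for pat in exclude):
            continue
        selected.append(path)
    # Apply the limit last, after all other filters.
    if num_files_limit is not None:
        selected = selected[:num_files_limit]
    return selected
```

Calling `select_files("./documents", recursive=True, required_exts=[".pdf", ".txt"], exclude=["drafts/*"])` would return the sorted paths that pass every filter, which is roughly the set of files the reader would then hand to its extractors.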

Primary Method: load_data()

Signature

def load_data(
    self,
    show_progress: bool = False,
    num_workers: Optional[int] = None,
    fs: Optional[fsspec.AbstractFileSystem] = None,
) -> list[Document]:
    ...

Parameters

Parameter	Type	Default	Description
`show_progress`	`bool`	`False`	Display a progress bar during file loading (requires `tqdm`).
`num_workers`	`Optional[int]`	`None`	Number of parallel worker processes for loading files. `None` loads files sequentially.
`fs`	`Optional[fsspec.AbstractFileSystem]`	`None`	Override the filesystem instance for this specific `load_data` call.

Return Value

Returns a list[Document] where each Document contains:

text: The extracted textual content.
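metadata: A dictionary that, with the default metadata function, contains file_path, file_name, file_type, file_size, creation_date, and last_modified_date (entries whose value cannot be determined are omitted). A custom file_metadata callable supplies its own dictionary instead.
id_: A unique identifier, automatically generated by default or derived from the file name when filename_as_id=True.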
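The metadata keys listed above can be approximated with the standard library alone, which may help when a custom `file_metadata` callable should also carry them. This is a sketch under assumptions: `basic_file_metadata` is a hypothetical name, the `YYYY-MM-DD` date format and the extra `source` key are illustrative choices, and none of it is the library's own code.

```python
import mimetypes
import os
from datetime import datetime, timezone


def basic_file_metadata(file_path: str) -> dict:
    """Hypothetical metadata callable that approximates the documented keys."""
    stat = os.stat(file_path)
    file_type, _ = mimetypes.guess_type(file_path)

    def as_date(timestamp: float) -> str:
        # Assumption: dates are rendered as UTC YYYY-MM-DD strings.
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")

    metadata = {
        "file_path": file_path,
        "file_name": os.path.basename(file_path),
        "file_type": file_type,
        "file_size": stat.st_size,
        # st_ctime is creation time on Windows, metadata-change time on Unix.
        "creation_date": as_date(stat.st_ctime),
        "last_modified_date": as_date(stat.st_mtime),
        # Illustrative custom field, not one of the documented keys.
        "source": "internal_docs",
    }
    # Drop keys with no value, e.g. file_type for an unknown extension.
    return {key: value for key, value in metadata.items() if value is not None}
```

A function with this shape takes a file path string and returns a dictionary, so it could be passed as `file_metadata=basic_file_metadata` when constructing the reader.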

Usage Examples

Basic Directory Loading

from llama_index.core import SimpleDirectoryReader

# Load all supported files from a directory
reader = SimpleDirectoryReader(input_dir="./data")
documents = reader.load_data()

print(f"Loaded {len(documents)} documents")

Recursive Loading with Extension Filtering

from llama_index.core import SimpleDirectoryReader

# Recursively load only PDF and text files
reader = SimpleDirectoryReader(
    input_dir="./documents",
    recursive=True,
    required_exts=[".pdf", ".txt"],
    exclude=["drafts/*", "archive/*"],
)
documents = reader.load_data(show_progress=True)

Loading Specific Files

from llama_index.core import SimpleDirectoryReader

# Load specific files by path
reader = SimpleDirectoryReader(
    input_files=["./reports/q1_2024.pdf", "./reports/q2_2024.pdf"]
)
documents = reader.load_data()

Custom Metadata Function

from llama_index.core import SimpleDirectoryReader

def custom_metadata(file_path: str) -> dict:
    """Add custom metadata based on file path."""
    return {
        "department": "engineering" if "eng" in file_path else "general",
        "source": "internal_docs",
    }

reader = SimpleDirectoryReader(
    input_dir="./data",
    file_metadata=custom_metadata,
)
documents = reader.load_data()

Custom File Extractor

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PDFReader

# Use a custom PDF reader instead of the default
# (PDFReader is provided by the llama-index-readers-file package)
reader = SimpleDirectoryReader(
    input_dir="./data",
    file_extractor={".pdf": PDFReader(return_full_document=True)},
)
documents = reader.load_data()

Parallel Loading with Progress

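from llama_index.core import SimpleDirectoryReader

# Load files in parallel using 4 workers.
# The workers are separate processes, so keep the call under a
# __main__ guard when running this as a script.
if __name__ == "__main__":
    reader = SimpleDirectoryReader(
        input_dir="./large_dataset",
        recursive=True,
        num_files_limit=1000,
    )
    documents = reader.load_data(show_progress=True, num_workers=4)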

Loading from Remote Filesystem (S3)

import s3fs
from llama_index.core import SimpleDirectoryReader

# Read from an S3 bucket
s3 = s3fs.S3FileSystem(anon=False)
reader = SimpleDirectoryReader(
    input_dir="my-bucket/documents/",
    fs=s3,
    recursive=True,
)
documents = reader.load_data()

Inheritance Hierarchy

SimpleDirectoryReader inherits from three base classes:

Base Class	Purpose
`BaseReader`	Provides the standard `load_data()` interface.
`ResourcesReaderMixin`	Adds resource listing and retrieval capabilities.
`FileSystemReaderMixin`	Adds filesystem-specific methods and the `fs` parameter.

File Type Detection

SimpleDirectoryReader maintains a default mapping of file extensions to reader classes. When a file is encountered:

1. The file extension is extracted.
2. If file_extractor contains a custom reader for that extension, it is used.
3. Otherwise, the default extractor for that extension is used.
4. If no extractor is available, the file is read as plain text.
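The lookup order described above can be sketched as a small dispatch function. This is an illustration only: `Reader`, `DEFAULT_READERS`, `read_plain_text`, and `pick_reader` are hypothetical names, the single `.csv` entry stands in for the real default mapping, and lowercasing the extension is an assumption.

```python
from pathlib import Path
from typing import Callable, Optional

# A "reader" here is any callable that takes a path and returns extracted text.
Reader = Callable[[Path], str]


def read_plain_text(path: Path, encoding: str = "utf-8") -> str:
    """Fallback used when no extractor matches the extension."""
    return path.read_text(encoding=encoding)


# Hypothetical stand-in for the built-in extension-to-reader mapping.
DEFAULT_READERS: dict[str, Reader] = {
    ".csv": lambda path: path.read_text(encoding="utf-8").replace(",", " "),
}


def pick_reader(
    path: Path, file_extractor: Optional[dict[str, Reader]] = None
) -> Reader:
    # Step 1: extract the extension (lowercased here by assumption).
    ext = path.suffix.lower()
    # Step 2: a user-supplied extractor takes precedence.
    if file_extractor and ext in file_extractor:
        return file_extractor[ext]
    # Step 3: otherwise use the default extractor for that extension.
    if ext in DEFAULT_READERS:
        return DEFAULT_READERS[ext]
    # Step 4: no extractor available, so read the file as plain text.
    return read_plain_text
```

With this shape, `pick_reader(Path("data.csv"))` returns the default CSV stand-in, while `pick_reader(Path("data.csv"), {".csv": my_reader})` returns `my_reader`, which is the precedence the `file_extractor` parameter is documented to have.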

Knowledge Sources

LlamaIndex SimpleDirectoryReader Guide
LlamaIndex GitHub Repository

Principle

Principle:Run_llama_Llama_index_Document_Loading

Metadata

2026-02-11 00:00 GMT
