Implementation:Run llama Llama index SimpleDirectoryReader Load Data
Overview
SimpleDirectoryReader is the primary file-system-based document loader in LlamaIndex. It reads files from a directory (or an explicit list of file paths), automatically detects file types by extension, delegates to the appropriate file extractor, and returns a list of Document objects ready for indexing. It supports recursive traversal, extension filtering, hidden file exclusion, parallel loading, remote filesystems via fsspec, and custom metadata functions.
Source File
- File: llama-index-core/llama_index/core/readers/file/base.py, Lines 208-872
- Class: SimpleDirectoryReader(BaseReader, ResourcesReaderMixin, FileSystemReaderMixin)
Import
from llama_index.core import SimpleDirectoryReader
Class Signature
class SimpleDirectoryReader(BaseReader, ResourcesReaderMixin, FileSystemReaderMixin):
    """Read files from a directory.

    Automatically detects file type and delegates to the appropriate
    file extractor. Returns a list of Document objects.
    """
    ...
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input_dir | Optional[Union[Path, str]] | None | Path to the directory to read files from. Either input_dir or input_files must be provided. |
| input_files | Optional[list] | None | Explicit list of file paths to read. Either input_dir or input_files must be provided. |
| exclude | Optional[list] | None | List of glob patterns for files/directories to exclude. |
| exclude_hidden | bool | True | Whether to exclude hidden files (files whose names start with a dot). |
| recursive | bool | False | Whether to recursively traverse subdirectories. |
| encoding | str | "utf-8" | Text encoding to use when reading files. |
| filename_as_id | bool | False | Whether to use the file name as the document ID instead of an automatically generated one. |
| required_exts | Optional[list[str]] | None | List of required file extensions (e.g., [".pdf", ".txt"]). Only files matching these extensions will be loaded. |
| file_extractor | Optional[dict[str, BaseReader]] | None | A mapping from file extension to a custom BaseReader instance for that type. Overrides the default extractors. |
| num_files_limit | Optional[int] | None | Maximum number of files to read. Useful for sampling or incremental loading. |
| file_metadata | Optional[Callable] | None | A callable that takes a file path and returns a metadata dictionary to attach to the document. |
| raise_on_error | bool | False | Whether to raise an exception on file read errors. If False, errors are logged and the file is skipped. |
| fs | Optional[fsspec.AbstractFileSystem] | None | An fsspec filesystem instance for reading from remote storage (S3, GCS, etc.). |
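The filtering parameters above combine into a single file-selection pass. The following is a minimal sketch of that idea using only the standard library. It is not LlamaIndex's implementation: `select_files` is a hypothetical helper that borrows the parameter names for illustration, and it assumes that exclude patterns are matched against paths relative to the input directory.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Optional


def select_files(
    input_dir: str,
    recursive: bool = False,
    exclude_hidden: bool = True,
    required_exts: Optional[list[str]] = None,
    exclude: Optional[list[str]] = None,
    num_files_limit: Optional[int] = None,
) -> list[Path]:
    """Hypothetical helper: pick files the way the constructor options describe."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in sorted(candidates):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Skip hidden files and files inside hidden directories.
        if exclude_hidden and any(part.startswith(".") for part in rel.parts):
            continue
        # Keep only the requested extensions, if any were given.
        if required_exts is not None and path.suffix not in required_exts:
            continue
        # Drop anything matching an exclude glob (assumed relative to root).
        if exclude and any(fnmatch(rel.as_posix(), pat) for pat in exclude):
            continue
        selected.append(path)
    # Apply the cap last, after all filters have run.
    if num_files_limit is not None:
        selected = selected[:num_files_limit]
    return selected
```

Calling `select_files("./documents", recursive=True, required_exts=[".pdf", ".txt"])` would return the sorted list of matching paths under that directory; the real reader may order, match, or limit files differently.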
Primary Method: load_data()
Signature
def load_data(
    self,
    show_progress: bool = False,
    num_workers: Optional[int] = None,
    fs: Optional[fsspec.AbstractFileSystem] = None,
) -> list[Document]:
    ...
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| show_progress | bool | False | Display a progress bar during file loading (requires tqdm). |
| num_workers | Optional[int] | None | Number of parallel worker processes for loading files. None means files are loaded sequentially. |
| fs | Optional[fsspec.AbstractFileSystem] | None | Override the filesystem instance for this specific load_data call. |
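The effect of num_workers can be pictured as a choice between a plain loop and a worker pool. The sketch below is an illustration only: `load_all` and `load_one` are hypothetical names, and a thread pool from the standard library is used purely to keep the example self-contained. It makes no claim about how LlamaIndex schedules its own workers.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Optional


def load_all(
    paths: list[Path],
    load_one: Callable[[Path], str],
    num_workers: Optional[int] = None,
) -> list[str]:
    """Hypothetical helper: load files sequentially or with a worker pool."""
    if not num_workers or num_workers <= 1:
        # No pool requested: process files one at a time, in order.
        return [load_one(p) for p in paths]
    # Stand-in pool; map() yields results in the same order as `paths`.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(load_one, paths))
```

Because the results keep the input order in both branches, callers can switch num_workers on or off without changing downstream code.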
Return Value
Returns a list[Document] where each Document contains:
- text: The extracted textual content.
- metadata: A dictionary with at minimum file_path, file_name, file_type, file_size, creation_date, last_modified_date.
- id_: A unique identifier (either generated or the filename if filename_as_id=True).
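To make the metadata keys concrete, here is a minimal sketch of how a dictionary with those six keys could be assembled from the standard library. `sketch_file_metadata` is a hypothetical function, not the library's default, and the date format (a UTC calendar date) and the use of a MIME type for file_type are assumptions made for illustration.

```python
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def sketch_file_metadata(file_path: str) -> dict:
    """Hypothetical helper: build a metadata dict with the keys listed above."""
    path = Path(file_path)
    stat = path.stat()

    def as_date(timestamp: float) -> str:
        # Assumed format: an ISO calendar date in UTC.
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")

    return {
        "file_path": str(path),
        "file_name": path.name,
        # guess_type returns (type, encoding); type is None for unknown extensions.
        "file_type": mimetypes.guess_type(path.name)[0],
        "file_size": stat.st_size,
        # st_ctime is creation time on Windows but metadata-change time on Unix.
        "creation_date": as_date(stat.st_ctime),
        "last_modified_date": as_date(stat.st_mtime),
    }
```

A function of this shape (file path in, dictionary out) matches the contract described for the file_metadata constructor parameter.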
Usage Examples
Basic Directory Loading
from llama_index.core import SimpleDirectoryReader
# Load all supported files from a directory
reader = SimpleDirectoryReader(input_dir="./data")
documents = reader.load_data()
print(f"Loaded {len(documents)} documents")
Recursive Loading with Extension Filtering
from llama_index.core import SimpleDirectoryReader
# Recursively load only PDF and text files
reader = SimpleDirectoryReader(
    input_dir="./documents",
    recursive=True,
    required_exts=[".pdf", ".txt"],
    exclude=["drafts/*", "archive/*"],
)
documents = reader.load_data(show_progress=True)
Loading Specific Files
from llama_index.core import SimpleDirectoryReader
# Load specific files by path
reader = SimpleDirectoryReader(
    input_files=["./reports/q1_2024.pdf", "./reports/q2_2024.pdf"]
)
documents = reader.load_data()
Custom Metadata Function
from llama_index.core import SimpleDirectoryReader
def custom_metadata(file_path: str) -> dict:
    """Add custom metadata based on file path."""
    return {
        "department": "engineering" if "eng" in file_path else "general",
        "source": "internal_docs",
    }

reader = SimpleDirectoryReader(
    input_dir="./data",
    file_metadata=custom_metadata,
)
documents = reader.load_data()
Custom File Extractor
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PDFReader
# Use a custom PDF reader instead of the default
reader = SimpleDirectoryReader(
    input_dir="./data",
    file_extractor={".pdf": PDFReader(return_full_document=True)},
)
documents = reader.load_data()
Parallel Loading with Progress
from llama_index.core import SimpleDirectoryReader
# Load files in parallel using 4 workers
reader = SimpleDirectoryReader(
    input_dir="./large_dataset",
    recursive=True,
    num_files_limit=1000,
)
documents = reader.load_data(show_progress=True, num_workers=4)
Loading from Remote Filesystem (S3)
import s3fs
from llama_index.core import SimpleDirectoryReader
# Read from an S3 bucket
s3 = s3fs.S3FileSystem(anon=False)
reader = SimpleDirectoryReader(
    input_dir="my-bucket/documents/",
    fs=s3,
    recursive=True,
)
documents = reader.load_data()
Inheritance Hierarchy
SimpleDirectoryReader inherits from three base classes:
| Base Class | Purpose |
|---|---|
| BaseReader | Provides the standard load_data() interface. |
| ResourcesReaderMixin | Adds resource listing and retrieval capabilities. |
| FileSystemReaderMixin | Adds filesystem-specific methods and the fs parameter. |
File Type Detection
SimpleDirectoryReader maintains a default mapping of file extensions to reader classes. When a file is encountered:
- The file extension is extracted.
- If file_extractor contains a custom reader for that extension, it is used.
- Otherwise, the default extractor for that extension is used.
- If no extractor is available, the file is read as plain text.
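The lookup order in the steps above can be sketched as a small dispatch function. This is an illustration under stated assumptions rather than the library's code: `pick_reader`, `read_plain_text`, and the `Reader` alias are hypothetical names, readers are modeled as plain callables instead of BaseReader instances, and extensions are assumed to match case-insensitively.

```python
from pathlib import Path
from typing import Callable, Optional

# A reader here is any callable that turns a file path into text.
Reader = Callable[[Path], str]


def read_plain_text(path: Path) -> str:
    """Fallback used when no extractor is registered for an extension."""
    return path.read_text(encoding="utf-8")


def pick_reader(
    path: Path,
    file_extractor: Optional[dict[str, Reader]] = None,
    default_extractors: Optional[dict[str, Reader]] = None,
) -> Reader:
    """Hypothetical helper: custom readers win, then defaults, then plain text."""
    # Assumption: extensions are compared in lowercase.
    ext = path.suffix.lower()
    custom = file_extractor or {}
    defaults = default_extractors or {}
    if ext in custom:
        return custom[ext]
    if ext in defaults:
        return defaults[ext]
    return read_plain_text
```

Keeping the custom mapping separate from the defaults means a caller can override a single extension without having to restate every other mapping.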
Principle
Principle:Run_llama_Llama_index_Document_Loading