Implementation:Apache Airflow File Discovery
| Knowledge Sources | |
|---|---|
| Domains | Core_Infrastructure, File_System |
| Last Updated | 2026-02-08 21:00 GMT |
Overview
Provides file discovery utilities that recursively search directories while respecting gitignore-style ignore patterns, supporting both glob and regexp syntax for filtering files.
Description
The file_discovery.py module implements a recursive directory walker that honors ignore files (similar to .gitignore). This is the mechanism used by Airflow for DAG file discovery, allowing users to control which files are scanned by placing ignore files in their DAG directories.
The module is built around an internal protocol and two concrete implementations:
- _IgnoreRule (Protocol) -- Defines the interface for ignore rules with two static methods:
  - compile(pattern, base_dir, definition_file) -- Builds an ignore rule from a pattern string.
  - match(path, rules) -- Tests whether a candidate path matches any rule in a list.
- _RegexpIgnoreRule (NamedTuple) -- Implements ignore rules using Python regular expressions. Each rule stores a compiled re.Pattern and the base directory. Matching is performed against the path relative to the rule's base directory. Invalid regex patterns are logged as warnings and skipped.
- _GlobIgnoreRule (NamedTuple) -- Implements ignore rules using gitignore-style glob patterns via the pathspec library's GitWildMatchPattern. Respects the gitignore convention that patterns containing a path separator are matched relative to the ignore file's location, while other patterns match at any level. Supports negation patterns (exclusion rules), where later patterns can override earlier ones.
- find_path_from_directory(base_dir_path, ignore_file_name, ignore_file_syntax="glob") -- The public API. Recursively walks base_dir_path, reading ignore files (named ignore_file_name) at each directory level. Returns a generator of file paths that are not ignored. Supports "glob" (default) and "regexp" syntax modes. Raises ValueError for unsupported syntax. Includes infinite recursion detection when following symlinks.
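As a rough illustration of the protocol-plus-NamedTuple layout described above, the following sketch models a regexp-style rule using only the standard library. The names IgnoreRule and RegexpRule are hypothetical stand-ins, not the module's private classes. The use of re.search is an assumption, and where the real module is described as logging a warning for an invalid pattern, this sketch simply returns None.

```python
from __future__ import annotations

import re
from pathlib import Path
from typing import NamedTuple, Protocol


class IgnoreRule(Protocol):
    """Hypothetical stand-in for the internal ignore-rule interface."""

    @staticmethod
    def compile(pattern: str, base_dir: Path, definition_file: Path) -> IgnoreRule | None: ...

    @staticmethod
    def match(path: Path, rules: list[IgnoreRule]) -> bool: ...


class RegexpRule(NamedTuple):
    """A compiled regular expression plus the directory it is relative to."""

    pattern: re.Pattern[str]
    base_dir: Path

    @staticmethod
    def compile(pattern: str, base_dir: Path, definition_file: Path) -> RegexpRule | None:
        try:
            return RegexpRule(re.compile(pattern), base_dir)
        except re.error:
            # Assumption: an invalid pattern is skipped (no logging in this sketch).
            return None

    @staticmethod
    def match(path: Path, rules: list[RegexpRule]) -> bool:
        for rule in rules:
            try:
                relative = path.relative_to(rule.base_dir)
            except ValueError:
                # Sketch choice: a path outside the rule's base directory cannot match.
                continue
            if rule.pattern.search(relative.as_posix()):
                return True
        return False
```

Because both classes expose the same two static methods, a walker can accept either rule type and call compile and match without knowing which syntax is in use.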
Key implementation details:
- Ignore files themselves are excluded from the output.
- Patterns accumulate as the walker descends: child directories inherit parent ignore rules.
- Symlinks are followed (followlinks=True), with cycle detection to prevent infinite recursion.
- Comments in ignore files (text after #) are stripped before pattern compilation.
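The details listed above can be sketched with os.walk. This is a simplified, regexp-only approximation under stated assumptions, not the module's code: walk_with_ignores and strip_comment are hypothetical names, patterns are matched against paths relative to the top-level directory, and an invalid regular expression would raise here rather than be skipped.

```python
import os
import re
from collections.abc import Iterator
from pathlib import Path


def strip_comment(line: str) -> str:
    """Drop a trailing '# ...' comment and surrounding whitespace."""
    return re.sub(r"\s*#.*", "", line).strip()


def walk_with_ignores(base_dir: str, ignore_file_name: str) -> Iterator[str]:
    """Yield files under base_dir whose relative path matches no inherited regex."""
    base = Path(base_dir)
    # Rules each directory inherits, keyed by resolved path; also used to spot cycles.
    inherited: dict[Path, list[re.Pattern[str]]] = {}

    for root, dirs, files in os.walk(base, followlinks=True):
        root_path = Path(root)
        rules = list(inherited.get(root_path.resolve(), []))

        # Add this directory's own patterns to the ones inherited from its parents.
        ignore_file = root_path / ignore_file_name
        if ignore_file.is_file():
            for line in ignore_file.read_text().splitlines():
                pattern = strip_comment(line)
                if pattern:
                    rules.append(re.compile(pattern))

        def ignored(path: Path) -> bool:
            relative = path.relative_to(base).as_posix()
            return any(rule.search(relative) for rule in rules)

        # Prune ignored subdirectories in place so os.walk never enters them.
        dirs[:] = [d for d in dirs if not ignored(root_path / d)]

        for d in dirs:
            resolved = (root_path / d).resolve()
            if resolved in inherited:
                # Seeing the same real directory twice means a symlink loop.
                raise RuntimeError(f"Recursive loop detected at {resolved}")
            inherited[resolved] = list(rules)

        for name in files:
            # The ignore file itself is never part of the output.
            if name != ignore_file_name and not ignored(root_path / name):
                yield str(root_path / name)
```

Storing a copy of the accumulated rules for each subdirectory is what lets children inherit parent patterns, and the same dictionary doubles as the visited set for cycle detection.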
Usage
Primarily used for DAG file discovery. Users place .airflowignore files in their DAG directories to exclude certain files or subdirectories from being parsed as DAGs.
Code Reference
Source Location
- Repository: Apache_Airflow
- File: shared/module_loading/src/airflow_shared/module_loading/file_discovery.py (197 lines)
Signature
def find_path_from_directory(
    base_dir_path: str | os.PathLike[str],
    ignore_file_name: str,
    ignore_file_syntax: str = "glob",
) -> Generator[str, None, None]:
    """
    Recursively search the base path for a list of file paths that should not be ignored.

    :param base_dir_path: the base path to be searched
    :param ignore_file_name: the file name in which specifies the patterns of files/dirs to be ignored
    :param ignore_file_syntax: the syntax of patterns in the ignore file: regexp or glob (default: glob)
    :return: a generator of file paths.
    """
    ...
Import
from airflow_shared.module_loading.file_discovery import find_path_from_directory
I/O Contract
| Function | Input | Output | Side Effects / Errors |
|---|---|---|---|
| find_path_from_directory | base_dir_path: str or os.PathLike, ignore_file_name: str, ignore_file_syntax: str (default "glob") | Generator[str, None, None] -- yields absolute file paths not matched by ignore rules | Reads ignore files from disk; raises ValueError for unsupported syntax; raises RuntimeError on symlink cycles |
| _IgnoreRule.compile | pattern: str, base_dir: Path, definition_file: Path | _IgnoreRule or None | Logs warning on invalid patterns |
| _IgnoreRule.match | path: Path, rules: list[_IgnoreRule] | bool | None |
Usage Examples
Basic DAG File Discovery
from airflow_shared.module_loading.file_discovery import find_path_from_directory

# Find all DAG files, respecting .airflowignore
dag_files = list(find_path_from_directory(
    base_dir_path="/opt/airflow/dags",
    ignore_file_name=".airflowignore",
    ignore_file_syntax="glob",
))

for path in dag_files:
    print(path)
Using Regexp Ignore Syntax
from airflow_shared.module_loading.file_discovery import find_path_from_directory

# Use regexp-style ignore patterns
files = find_path_from_directory(
    base_dir_path="/opt/airflow/dags",
    ignore_file_name=".airflowignore",
    ignore_file_syntax="regexp",
)

# The generator is consumed lazily, one path at a time.
# process_dag_file is a placeholder for your own handler; it is not part of the module.
for path in files:
    process_dag_file(path)
Example .airflowignore File (glob syntax)
# Ignore all test files
test_*.py
# Ignore a specific subdirectory
archive/
# Ignore compiled Python files
__pycache__/
*.pyc
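Negation does not appear in the file above. As a rough illustration of how a later pattern can override an earlier one, the sketch below evaluates patterns in order and lets the last match decide. It uses the standard library's fnmatch as a stand-in for pathspec, so it does not reproduce gitignore rules such as trailing-slash directory patterns or path anchoring. The helper name is_ignored and the "!" prefix handling are assumptions made for this example.

```python
from fnmatch import fnmatch


def is_ignored(name: str, patterns: list[str]) -> bool:
    """Return the verdict of the last pattern that matches name; '!' negates."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatch(name, glob):
            # A later match overrides whatever an earlier pattern decided.
            ignored = not negated
    return ignored


patterns = ["test_*.py", "!test_keep.py"]
print(is_ignored("test_a.py", patterns))     # True
print(is_ignored("test_keep.py", patterns))  # False
print(is_ignored("dag.py", patterns))        # False
```

Reversing the two patterns would ignore test_keep.py again, which is why the order of lines in an ignore file matters when negation is used.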