Implementation:Apache Airflow File Discovery

From Leeroopedia


Knowledge Sources
Domains Core_Infrastructure, File_System
Last Updated 2026-02-08 21:00 GMT

Overview

Provides file discovery utilities that recursively search directories while respecting ignore files, whose patterns can be written in either gitignore-style glob syntax or regexp syntax.

Description

The file_discovery.py module implements a recursive directory walker that honors ignore files (similar to .gitignore). This is the mechanism used by Airflow for DAG file discovery, allowing users to control which files are scanned by placing ignore files in their DAG directories.

The module is built around an internal protocol, two concrete implementations of it, and one public function:

  • _IgnoreRule (Protocol) -- Defines the interface for ignore rules with two static methods:
    • compile(pattern, base_dir, definition_file) -- Builds an ignore rule from a pattern string.
    • match(path, rules) -- Tests whether a candidate path matches any rule in a list.
  • _RegexpIgnoreRule(NamedTuple) -- Implements ignore rules using Python regular expressions. Each rule stores a compiled re.Pattern and the base directory. Matching is performed against the path relative to the rule's base directory. Invalid regex patterns are logged as warnings and skipped.
  • _GlobIgnoreRule(NamedTuple) -- Implements ignore rules using gitignore-style glob patterns via the pathspec library's GitWildMatchPattern. Respects the gitignore convention that patterns containing a path separator are matched relative to the ignore file's location, while other patterns match at any level. Supports negation patterns (prefixed with !), which re-include paths matched by earlier patterns; when several patterns match, the later one takes precedence.
  • find_path_from_directory(base_dir_path, ignore_file_name, ignore_file_syntax="glob") -- The public API. Recursively walks base_dir_path, reading ignore files (named ignore_file_name) at each directory level. Returns a generator of file paths that are not ignored. Supports "glob" (default) and "regexp" syntax modes. Raises ValueError for unsupported syntax. Includes infinite recursion detection when following symlinks.
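
The regexp rule described above can be sketched with the standard library alone. The class name RegexpRuleSketch and the choice of Pattern.search are assumptions made for illustration; the actual _RegexpIgnoreRule in file_discovery.py may differ in detail. The sketch also assumes every candidate path lies under the rule's base directory.

```python
import logging
import re
from pathlib import Path
from typing import NamedTuple, Optional

log = logging.getLogger(__name__)


class RegexpRuleSketch(NamedTuple):
    """Hypothetical stand-in for a regexp ignore rule; not the Airflow class."""

    pattern: re.Pattern
    base_dir: Path

    @staticmethod
    def compile(
        pattern: str, base_dir: Path, definition_file: Path
    ) -> Optional["RegexpRuleSketch"]:
        """Build a rule, or log a warning and return None if the regex is invalid."""
        try:
            return RegexpRuleSketch(re.compile(pattern), base_dir)
        except re.error as exc:
            log.warning(
                "Ignoring invalid regex %r from %s: %s", pattern, definition_file, exc
            )
            return None

    @staticmethod
    def match(path: Path, rules: list["RegexpRuleSketch"]) -> bool:
        """Return True if the path, taken relative to a rule's base_dir, matches any rule."""
        for rule in rules:
            relative = str(path.relative_to(rule.base_dir))
            if rule.pattern.search(relative):
                return True
        return False


base = Path("/dags")
rule = RegexpRuleSketch.compile(r"test_.*\.py$", base, base / ".airflowignore")
```

Returning None from compile lets the caller drop a bad pattern and keep the rest of the ignore file in effect, which mirrors the "logged as warnings and skipped" behaviour noted above.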

Key implementation details:

  • Ignore files themselves are excluded from the output.
  • Patterns accumulate as the walker descends: child directories inherit parent ignore rules.
  • Symlinks are followed (followlinks=True), with cycle detection to prevent infinite recursion.
  • Comments in ignore files (text after #) are stripped before pattern compilation.
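
A minimal sketch of how these four behaviours could fit together, assuming regexp-style patterns for brevity. The function name walk_with_ignores and its internals are hypothetical and are not the module's code. In particular, this sketch treats any directory reached a second time through symlinks as a cycle.

```python
import os
import re
from collections.abc import Iterator
from pathlib import Path


def walk_with_ignores(base_dir: str, ignore_file_name: str) -> Iterator[str]:
    """Yield files under base_dir that no accumulated regexp pattern matches."""
    rules_by_dir: dict[Path, list[tuple[re.Pattern, Path]]] = {}
    visited_real_dirs: set[str] = set()

    for root, dirs, files in os.walk(base_dir, followlinks=True):
        root_path = Path(root)

        # Cycle detection: a symlink loop leads back to an already-seen real directory
        real_root = os.path.realpath(root)
        if real_root in visited_real_dirs:
            raise RuntimeError(f"Detected recursive loop when walking {root}")
        visited_real_dirs.add(real_root)

        # Inherit the parent's rules, then append rules defined in this directory
        rules = list(rules_by_dir.get(root_path.parent, []))
        ignore_file = root_path / ignore_file_name
        if ignore_file.is_file():
            for line in ignore_file.read_text().splitlines():
                pattern = line.split("#", 1)[0].strip()  # strip trailing comments
                if pattern:
                    rules.append((re.compile(pattern), root_path))
        rules_by_dir[root_path] = rules

        def ignored(path: Path) -> bool:
            # Each rule is tested against the path relative to where it was defined
            return any(rx.search(str(path.relative_to(base))) for rx, base in rules)

        # Prune ignored subdirectories in place so os.walk never descends into them
        dirs[:] = [d for d in dirs if not ignored(root_path / d)]

        for name in files:
            candidate = root_path / name
            # The ignore file itself is never reported
            if name != ignore_file_name and not ignored(candidate):
                yield str(candidate)
```

Because the function is a generator, the RuntimeError for a symlink cycle surfaces only once iteration reaches the offending directory, not when the function is called.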

Usage

Primarily used for DAG file discovery. Users place .airflowignore files in their DAG directories to exclude certain files or subdirectories from being parsed as DAGs.

Code Reference

Source Location

  • Repository: Apache_Airflow
  • File: shared/module_loading/src/airflow_shared/module_loading/file_discovery.py (197 lines)

Signature

def find_path_from_directory(
    base_dir_path: str | os.PathLike[str],
    ignore_file_name: str,
    ignore_file_syntax: str = "glob",
) -> Generator[str, None, None]:
    """
    Recursively search the base path for a list of file paths that should not be ignored.

    :param base_dir_path: the base path to be searched
    :param ignore_file_name: the file name in which specifies the patterns of files/dirs to be ignored
    :param ignore_file_syntax: the syntax of patterns in the ignore file: regexp or glob (default: glob)
    :return: a generator of file paths.
    """
    ...

Import

from airflow_shared.module_loading.file_discovery import find_path_from_directory

I/O Contract

Function | Input | Output | Side Effects / Errors
--- | --- | --- | ---
find_path_from_directory | base_dir_path: str or os.PathLike[str], ignore_file_name: str, ignore_file_syntax: str (default "glob") | Generator[str, None, None] -- yields absolute file paths not matched by ignore rules | Reads ignore files from disk; raises ValueError for unsupported syntax; raises RuntimeError on symlink cycles
_IgnoreRule.compile | pattern: str, base_dir: Path, definition_file: Path | _IgnoreRule or None | Logs warning on invalid patterns
_IgnoreRule.match | path: Path, rules: list[_IgnoreRule] | bool | None

Usage Examples

Basic DAG File Discovery

from airflow_shared.module_loading.file_discovery import find_path_from_directory

# Find all DAG files, respecting .airflowignore
dag_files = list(find_path_from_directory(
    base_dir_path="/opt/airflow/dags",
    ignore_file_name=".airflowignore",
    ignore_file_syntax="glob",
))
for path in dag_files:
    print(path)

Using Regexp Ignore Syntax

from airflow_shared.module_loading.file_discovery import find_path_from_directory

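# Placeholder handler for illustration; replace with your own processing logic
def process_dag_file(path: str) -> None:
    print(f"Discovered: {path}")

# Use regexp-style ignore patterns; the call returns a lazy generator
files = find_path_from_directory(
    base_dir_path="/opt/airflow/dags",
    ignore_file_name=".airflowignore",
    ignore_file_syntax="regexp",
)
for path in files:
    process_dag_file(path)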

Example .airflowignore File (glob syntax)

# Ignore all test files
test_*.py

# Ignore a specific subdirectory
archive/

# Ignore compiled Python files
__pycache__/
*.pyc
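
To make the evaluation order of a glob-style ignore file concrete, the following sketch applies an ordered pattern list using a "last matching pattern wins" rule, including one negated pattern. It is a deliberately simplified approximation built on fnmatch from the standard library, not the pathspec-based matching the module uses. The function is_ignored and the !test_keep.py pattern are hypothetical, and patterns containing a path separator other than a trailing slash are not handled.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# The patterns from the example ignore file, plus a hypothetical negation
PATTERNS = ["test_*.py", "archive/", "__pycache__/", "*.pyc", "!test_keep.py"]


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Return True if rel_path is ignored; the last matching pattern decides."""
    parts = PurePosixPath(rel_path).parts
    ignored = False
    for raw in patterns:
        negated = raw.startswith("!")
        pattern = raw[1:] if negated else raw
        if pattern.endswith("/"):
            # Directory pattern: hit when any parent directory has a matching name
            hit = any(fnmatchcase(part, pattern[:-1]) for part in parts[:-1])
        else:
            # Separator-free file pattern: compared with the file name at any depth
            hit = fnmatchcase(parts[-1], pattern)
        if hit:
            ignored = not negated
    return ignored


for candidate in ("etl/dag.py", "etl/test_load.py", "archive/old.py", "test_keep.py"):
    print(candidate, is_ignored(candidate, PATTERNS))
```

Unlike real gitignore semantics, this sketch would let a negated pattern re-include a file inside an ignored directory, so it should be read only as an illustration of pattern ordering.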
