
Heuristic:deepset-ai Haystack Pipeline Deep Copy Safety

From Leeroopedia
Knowledge Sources
Domains Pipeline_Execution, Debugging
Last Updated 2026-02-11 20:00 GMT

Overview

Pipeline inputs are deep-copied before execution to prevent side effects, but Components, Tools, and Toolsets are excluded because they contain un-copyable objects like models and clients.

Description

When a Haystack pipeline runs, it deep-copies all input data before distributing it to components. This prevents unexpected mutations when the same input reference is passed to multiple components in the graph. However, certain object types are explicitly excluded from deep copying: `Component`, `Tool`, and `Toolset` instances, which often contain heavy objects (loaded ML models, API clients, database connections) that cannot or should not be duplicated. If deep copy fails for any other type, the system logs an info message and returns the original object rather than crashing.
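The effect of this per-component copying can be shown with a minimal sketch using only the standard library. The dictionaries standing in for documents are illustrative, not Haystack's `Document` type; the sketch simulates the pipeline handing each component its own deep copy of a shared input.

```python
from copy import deepcopy

# A mutable input that, conceptually, fans out to two components.
shared_input = [{"content": "doc-1", "meta": {"score": 0.0}}]

# Simulate what the pipeline does before execution: each component
# receives its own deep copy rather than the shared reference.
copy_for_a = deepcopy(shared_input)
copy_for_b = deepcopy(shared_input)

# "Component A" mutates its input in place.
copy_for_a[0]["meta"]["score"] = 0.99

# "Component B" is unaffected; it still sees the original value.
print(copy_for_b[0]["meta"]["score"])  # 0.0
```

Without the copies, both names would point at the same nested dict, and A's mutation would silently change what B receives.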

Usage

Be aware of this heuristic when passing mutable objects as pipeline inputs or when debugging unexpected data mutations. Without this protection, a component that modifies its input in place would cause other components receiving the same input to see mutated data. Conversely, do not assume Component objects passed between pipeline steps are independent copies; they are always shared references.
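The two-sided behavior described above can be sketched as follows. `FakeComponent` and `FakeModel` are hypothetical stand-ins (Haystack's real exclusion checks `Component`, `Tool`, and `Toolset` instances); the helper mirrors the heuristic without importing the library.

```python
from copy import deepcopy


class FakeModel:
    """Stand-in for a heavy object (e.g. loaded weights) that must not be duplicated."""


class FakeComponent:
    """Hypothetical stand-in for a Haystack Component holding a heavy model."""

    def __init__(self):
        self.model = FakeModel()


def deepcopy_with_exceptions(obj, excluded_types):
    # Excluded types pass through as shared references; everything else is copied.
    if isinstance(obj, excluded_types):
        return obj
    try:
        return deepcopy(obj)
    except Exception:
        return obj


component = FakeComponent()
documents = [{"content": "hello"}]

copied_component = deepcopy_with_exceptions(component, (FakeComponent,))
copied_documents = deepcopy_with_exceptions(documents, (FakeComponent,))

print(copied_component is component)  # True: components stay shared
print(copied_documents is documents)  # False: ordinary data is copied
```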

The Insight (Rule of Thumb)

  • Action: Pipeline inputs are automatically deep-copied; no user action needed for basic protection.
  • Exception: Components, Tools, and Toolsets are never deep-copied. They are shared references.
  • Trade-off: Deep copying adds overhead proportional to input size; very large document lists may see measurable latency.
  • Fallback: If deep copy fails on any object, the original reference is used silently (logged at INFO level).
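The fallback in the last bullet can be demonstrated in isolation. `Uncopyable` is a contrived class whose `__deepcopy__` raises, standing in for objects like open sockets or client handles; the real helper additionally logs the failure at INFO level, which this sketch omits.

```python
from copy import deepcopy


class Uncopyable:
    """Contrived object that refuses to be deep-copied."""

    def __deepcopy__(self, memo):
        raise TypeError("cannot deepcopy this object")


def safe_deepcopy(obj):
    # On failure, fall back to the original reference instead of crashing.
    # (The real implementation also logs an INFO message here.)
    try:
        return deepcopy(obj)
    except Exception:
        return obj


obj = Uncopyable()
result = safe_deepcopy(obj)
print(result is obj)  # True: the original reference is returned
```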

Reasoning

Haystack pipelines pass data through directed graphs where a single input may fan out to multiple components. Without deep copying, one component modifying a Document's metadata would affect all downstream components sharing the same reference. Components are excluded because:

  1. They contain loaded models (e.g., BERT weights) that would consume huge memory if duplicated
  2. They contain API clients (e.g., OpenAI client) with connection pools that should not be copied
  3. They contain state (e.g., warm_up flag) that must remain shared across the pipeline

The deep copy with exceptions pattern from `haystack/core/pipeline/utils.py:17-54`:

def _deepcopy_with_exceptions(obj: Any) -> Any:
    # Components and Tools often contain objects that we do not want to deepcopy
    # or are not deepcopyable (e.g. models, clients, etc.).
    if isinstance(obj, (Component, Tool, Toolset)):
        return obj

    try:
        return deepcopy(obj)
    except Exception as e:
        logger.info(
            "Deepcopy failed for object of type '{obj_type}'. Error: {error}. "
            "Returning original object instead.",
            obj_type=type(obj).__name__,
            error=e,
        )
        return obj

Input deep copy invocation from `haystack/core/pipeline/base.py:1014-1017`:

# deepcopying the inputs prevents the Pipeline run logic from being altered unexpectedly
# when the same input reference is passed to multiple components.
for component_name, component_inputs in data.items():
    data[component_name] = {k: _deepcopy_with_exceptions(v) for k, v in component_inputs.items()}
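The invocation loop above can be exercised end to end with a simplified stand-in for the helper (no `Component`/`Tool`/`Toolset` exclusion, just the copy-or-fall-back behavior) and a toy input mapping; the component names and inputs are illustrative only.

```python
from copy import deepcopy


def _deepcopy_with_exceptions(obj):
    # Simplified stand-in for the real helper: copy if possible,
    # otherwise fall back to the original reference.
    try:
        return deepcopy(obj)
    except Exception:
        return obj


# Pipeline-style input mapping: component name -> that component's kwargs.
data = {
    "retriever": {"query": "what is haystack?"},
    "ranker": {"documents": [{"content": "a"}]},
}
original_docs = data["ranker"]["documents"]

# Mirror the loop from base.py: rebuild each component's inputs from copies.
for component_name, component_inputs in data.items():
    data[component_name] = {k: _deepcopy_with_exceptions(v) for k, v in component_inputs.items()}

print(data["ranker"]["documents"] is original_docs)  # False: independent copy
```

After the loop, mutating `data["ranker"]["documents"]` inside one component leaves `original_docs` untouched, which is exactly the side-effect isolation the heuristic provides.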
