Heuristic: deepset-ai Haystack Pipeline Deep Copy Safety
| Knowledge Sources | |
|---|---|
| Domains | Pipeline_Execution, Debugging |
| Last Updated | 2026-02-11 20:00 GMT |
Overview
Pipeline inputs are deep-copied before execution to prevent side effects, but Components, Tools, and Toolsets are excluded because they contain un-copyable objects like models and clients.
Description
When a Haystack pipeline runs, it deep-copies all input data before distributing it to components. This prevents unexpected mutations when the same input reference is passed to multiple components in the graph. However, certain object types are explicitly excluded from deep copying: `Component`, `Tool`, and `Toolset` instances, which often contain heavy objects (loaded ML models, API clients, database connections) that cannot or should not be duplicated. If deep copy fails for any other type, the system logs an info message and returns the original object rather than crashing.
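The effect of this copy-before-dispatch step can be illustrated without Haystack at all. The sketch below (plain Python, hypothetical "components" represented by two dicts) shows why handing each consumer a deep copy isolates it from in-place mutations made by another consumer:

```python
from copy import deepcopy

# Hypothetical illustration (not Haystack code): the same input would be
# routed to two components; deep-copying before dispatch isolates them.
shared_input = {"meta": {"source": "a.txt"}}

copy_for_a = deepcopy(shared_input)
copy_for_b = deepcopy(shared_input)

# "Component A" mutates its input in place.
copy_for_a["meta"]["source"] = "mutated"

# "Component B" still sees the original value, because it got its own copy.
print(copy_for_b["meta"]["source"])  # "a.txt"
```

With a shallow copy (or no copy), both dicts would share the nested `meta` dict and B would observe A's mutation.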
Usage
Be aware of this heuristic when passing mutable objects as pipeline inputs or when debugging unexpected data mutations. Without this protection, a component that modified its input in place would change the data seen by every other component receiving the same reference. Conversely, do not assume Component objects passed between pipeline steps are independent copies; they are always shared references.
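The shared-reference caveat can be sketched in plain Python. `FakeComponent` below is a hypothetical stand-in for a Haystack `Component` (the excluded types pass through the copy step untouched), so any in-place state change is visible everywhere the reference appears:

```python
# Hypothetical sketch: types excluded from deep copy remain shared, so
# in-place changes are visible to every holder of the reference.
class FakeComponent:  # stand-in for a Haystack Component (assumption)
    def __init__(self):
        self.warmed_up = False

comp = FakeComponent()
inputs_for_step1 = {"component": comp}  # excluded types pass through as-is
inputs_for_step2 = {"component": comp}

inputs_for_step1["component"].warmed_up = True
print(inputs_for_step2["component"].warmed_up)  # True: same object, shared state
```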
The Insight (Rule of Thumb)
- Action: Pipeline inputs are automatically deep-copied; no user action needed for basic protection.
- Exception: Components, Tools, and Toolsets are never deep-copied. They are shared references.
- Trade-off: Deep copying adds overhead proportional to input size; very large document lists may see measurable latency.
- Fallback: If deep copy fails on any object, the original reference is used and the failure is logged at INFO level; execution does not stop.
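The fallback branch is easy to reproduce standalone. The mimic below (a simplified `safe_deepcopy`, not the library function) uses a `threading.Lock`, which `copy.deepcopy` cannot duplicate, to show that the original reference comes back instead of an exception:

```python
import threading
from copy import deepcopy

# Hypothetical mimic of the fallback behavior: if deepcopy raises,
# return the original object (Haystack additionally logs at INFO level).
def safe_deepcopy(obj):
    try:
        return deepcopy(obj)
    except Exception:
        return obj

lock = threading.Lock()  # locks are not deep-copyable
result = safe_deepcopy(lock)
print(result is lock)    # True: original reference returned, no crash
```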
Reasoning
Haystack pipelines pass data through directed graphs where a single input may fan out to multiple components. Without deep copying, one component modifying a Document's metadata in place would affect all downstream components sharing the same reference. Components are excluded because:
- They contain loaded models (e.g., BERT weights) that would consume huge memory if duplicated
- They contain API clients (e.g., OpenAI client) with connection pools that should not be copied
- They contain state (e.g., warm_up flag) that must remain shared across the pipeline
The deep-copy-with-exceptions pattern, from `haystack/core/pipeline/utils.py:17-54`:

```python
def _deepcopy_with_exceptions(obj: Any) -> Any:
    # Components and Tools often contain objects that we do not want to deepcopy
    # or are not deepcopyable (e.g. models, clients, etc.).
    if isinstance(obj, (Component, Tool, Toolset)):
        return obj
    try:
        return deepcopy(obj)
    except Exception as e:
        logger.info(
            "Deepcopy failed for object of type '{obj_type}'. Error: {error}. "
            "Returning original object instead.",
            obj_type=type(obj).__name__,
            error=e,
        )
        return obj
```
Input deep copy invocation, from `haystack/core/pipeline/base.py:1014-1017`:

```python
# deepcopying the inputs prevents the Pipeline run logic from being altered unexpectedly
# when the same input reference is passed to multiple components.
for component_name, component_inputs in data.items():
    data[component_name] = {k: _deepcopy_with_exceptions(v) for k, v in component_inputs.items()}
```
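Putting the two excerpts together, here is a self-contained sketch of the input-copy pass. `FakeTool` is a hypothetical stand-in for the excluded Haystack types (`Component`, `Tool`, `Toolset`); ordinary values get fresh copies while the excluded object passes through by reference:

```python
from copy import deepcopy

class FakeTool:  # stand-in for Tool/Component/Toolset (assumption)
    pass

def _deepcopy_with_exceptions_sketch(obj):
    # simplified mimic of the helper quoted above
    if isinstance(obj, FakeTool):
        return obj
    try:
        return deepcopy(obj)
    except Exception:
        return obj

tool = FakeTool()
data = {
    "retriever": {"query": "hello", "filters": {"year": 2024}},
    "agent": {"tool": tool},
}

# same loop shape as the base.py excerpt: copy each component's inputs
for component_name, component_inputs in data.items():
    data[component_name] = {
        k: _deepcopy_with_exceptions_sketch(v) for k, v in component_inputs.items()
    }

print(data["agent"]["tool"] is tool)  # True: excluded type stays shared
```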