Overview
A collection of utility functions for recursive dictionary manipulation, list normalization, and data structure transformation, provided by the DVC library.
Description
The dvc/utils/collections.py module provides six core functions that are used throughout the DVC codebase for safely merging, diffing, normalizing, and searching nested data structures.
apply_diff(src, dest) recursively updates dest in place so that its contents match src, while preserving the type and internal metadata of the dest containers. This matters when round-tripping YAML files through ruamel.yaml, which attaches comments, ordering, and line-folding information to its custom mapping and sequence types. Mappings and sequences (lists/tuples) are handled separately: for mappings, keys from src are added or updated and keys absent from src are removed; for sequences, the whole contents are replaced when the lengths differ, and otherwise elements are patched one by one. An AssertionError is raised if src and dest are not the same kind of container (for example, a mapping and a sequence).
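The in-place behaviour of apply_diff can be illustrated without DVC or ruamel.yaml installed. The following is a minimal standalone sketch, not DVC's actual implementation: apply_diff_sketch, _same_container, and AnnotatedDict are hypothetical names, AnnotatedDict only stands in for a metadata-carrying mapping of the kind ruamel.yaml produces, and the sketch handles lists but not tuples.

```python
from collections.abc import Mapping


class AnnotatedDict(dict):
    """Hypothetical stand-in for a mapping that carries hidden metadata."""

    def __init__(self, *args, comment=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.comment = comment


def _same_container(a, b):
    """True when a and b are both mappings or both lists."""
    return (isinstance(a, Mapping) and isinstance(b, Mapping)) or (
        isinstance(a, list) and isinstance(b, list)
    )


def apply_diff_sketch(src, dest):
    """Make dest equal to src in place, reusing dest's existing containers."""
    if isinstance(src, Mapping) and isinstance(dest, Mapping):
        for key, value in src.items():
            if key in dest and _same_container(value, dest[key]):
                apply_diff_sketch(value, dest[key])  # recurse, keep dest's container
            elif key not in dest or dest[key] != value:
                dest[key] = value  # add a new key or overwrite a changed value
        for key in set(dest) - set(src):
            del dest[key]  # drop keys that src no longer has
    elif isinstance(src, list) and isinstance(dest, list):
        if len(src) != len(dest):
            dest[:] = src  # lengths differ: replace the contents wholesale
        else:
            for i, value in enumerate(src):
                if _same_container(value, dest[i]):
                    apply_diff_sketch(value, dest[i])
                elif dest[i] != value:
                    dest[i] = value  # patch a single element
    else:
        raise AssertionError(
            f"Can't apply diff from {type(src).__name__} to {type(dest).__name__}"
        )


inner = AnnotatedDict({"lr": 0.001, "old": 1}, comment="tuned by hand")
dest = AnnotatedDict({"train": inner, "deps": ["a", "b"]}, comment="top level")
src = {"train": {"lr": 0.01, "epochs": 20}, "deps": ["a", "c"]}

apply_diff_sketch(src, dest)
```

After the call, dest compares equal to src, yet dest["train"] is still the original inner object and its comment attribute is untouched. Plain assignment (dest = src) would discard both.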
merge_dicts(src, to_update) recursively merges to_update into src in place. When both values for a given key are dictionaries, the merge recurses; otherwise, the value from to_update overwrites the existing value in src. The helper _merge_item handles the per-key merge logic.
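The recurse-or-overwrite rule can be sketched in a few lines. This is an illustrative standalone version under the assumptions stated in the paragraph above, not the library's code; merge_dicts_sketch is a hypothetical name.

```python
def merge_dicts_sketch(src: dict, to_update: dict) -> dict:
    """Recursively merge to_update into src in place and return src."""
    for key, value in to_update.items():
        existing = src.get(key)
        if isinstance(existing, dict) and isinstance(value, dict):
            merge_dicts_sketch(existing, value)  # both sides are dicts: recurse
        else:
            src[key] = value  # otherwise the incoming value wins
    return src


base = {"train": {"lr": 0.001, "opt": {"name": "sgd"}}, "seed": 1}
overrides = {"train": {"opt": "adam"}, "seed": {"value": 2}}

result = merge_dicts_sketch(base, overrides)
```

Here "train" is merged key by key, so "lr" survives, while "opt" (a dict replaced by a string) and "seed" (a scalar replaced by a dict) are overwritten outright because the two sides are not both dictionaries.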
remove_missing_keys(src, to_update) prunes src by deleting any keys that do not exist in to_update. It recurses into nested dictionaries, ensuring that the structure of src is a subset of to_update after the operation.
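A minimal sketch of this pruning logic, assuming that nested values have the same shape on both sides, might look as follows. remove_missing_keys_sketch is a hypothetical name used only for illustration.

```python
def remove_missing_keys_sketch(src: dict, to_update: dict) -> dict:
    """Delete from src every key absent from to_update, recursing into dicts."""
    for key in list(src):  # iterate over a copy: src is mutated in the loop
        if key not in to_update:
            del src[key]
        elif isinstance(src[key], dict):
            # Assumes to_update[key] is a mapping too (same shape on both sides).
            remove_missing_keys_sketch(src[key], to_update[key])
    return src


params = {"lr": 0.1, "model": {"depth": 3, "width": 64}, "debug": True}
reference = {"lr": 0.5, "model": {"depth": 9}}

pruned = remove_missing_keys_sketch(params, reference)
```

Only keys are compared: the surviving entries keep the values they had in src, and the values in the reference dictionary are never copied over.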
ensure_list(item) normalizes an input that may be a string, an iterable of strings, or None into a consistent list[str]. None produces an empty list, a bare string produces a single-element list, and any other iterable is converted directly to a list.
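The string case needs its own branch because a str is itself iterable. The sketch below, written as a standalone function with the hypothetical name ensure_list_sketch, contrasts it with a naive list() call.

```python
from collections.abc import Iterable
from typing import Union


def ensure_list_sketch(item: Union[Iterable[str], str, None]) -> list[str]:
    """Normalize None, a single string, or an iterable of strings to a list."""
    if item is None:
        return []
    if isinstance(item, str):
        return [item]  # checked before list(): a str would split into characters
    return list(item)


naive = list("train.csv")  # one entry per character
wrapped = ensure_list_sketch("train.csv")  # one entry for the whole string
from_tuple = ensure_list_sketch(("a.csv", "b.csv"))
from_generator = ensure_list_sketch(name for name in ["x", "y"])
```

Any non-string iterable, including tuples and generators, goes through list() and therefore yields a new list object.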
nested_contains(dictionary, phrase) searches a nested dictionary for a key equal to phrase whose value is truthy. It recurses into dict values and returns True on the first match, or False if none is found.
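The truthiness requirement has consequences that are easy to miss: a key that is present with a falsy value does not count as a match. The following standalone sketch (nested_contains_sketch is a hypothetical name, and the config keys are made up for illustration) shows this.

```python
def nested_contains_sketch(dictionary: dict, phrase: str) -> bool:
    """Return True if phrase is a key with a truthy value at any depth."""
    for key, val in dictionary.items():
        if key == phrase and val:
            return True
        # Only dict values are searched; lists and other containers are skipped.
        if isinstance(val, dict) and nested_contains_sketch(val, phrase):
            return True
    return False


config = {
    "remote": {"s3": {"url": "s3://bucket", "verify": False}},
    "cache": {},
    "stages": [{"url": "inside a list"}],
}

found_url = nested_contains_sketch(config, "url")
found_verify = nested_contains_sketch(config, "verify")
found_cache = nested_contains_sketch(config, "cache")
found_remote = nested_contains_sketch(config, "remote")
```

In this sketch found_url and found_remote are True, while found_verify and found_cache are False because their values (False and an empty dict) are falsy. The "url" key inside the list is never visited, since only dict values are recursed into.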
to_omegaconf(item) recursively converts custom container types (such as those returned by ruamel.yaml parsers) into plain Python dicts and lists. This sanitization is necessary before passing data to the OmegaConf library, which requires standard Python primitives.
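The conversion can be sketched without ruamel.yaml by using dict and list subclasses as stand-ins for parser-specific containers. This is an illustrative version only: to_plain_sketch and TaggedList are hypothetical names, and OrderedDict plays the role of a custom mapping type.

```python
from collections import OrderedDict


class TaggedList(list):
    """Hypothetical custom sequence, standing in for a parser-specific type."""


def to_plain_sketch(item):
    """Rebuild dict and list subclasses as plain dicts and lists, recursively."""
    if isinstance(item, dict):
        return {k: to_plain_sketch(v) for k, v in item.items()}
    if isinstance(item, list):
        return [to_plain_sketch(x) for x in item]
    return item  # primitives pass through unchanged


custom = OrderedDict(
    train=OrderedDict(lr=0.01),
    stages=TaggedList(["prepare", OrderedDict(x=1)]),
)

plain = to_plain_sketch(custom)
```

The result is a new structure with the same contents; the input is left untouched, so its custom types remain available for writing the file back.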
Usage
Import these functions when you need to manipulate nested configuration data, merge parameter dictionaries, or normalize inputs. apply_diff is central to the Stage load/dump cycle. merge_dicts is used for composing parameter overrides. ensure_list is used extensively in argument parsing throughout DVC commands.
Code Reference
Source Location
- Repository: DVC
- File: dvc/utils/collections.py
- Lines: 109 total
Signatures
def apply_diff(src, dest):
...
def to_omegaconf(item):
...
def remove_missing_keys(src, to_update):
...
def merge_dicts(src: dict, to_update: dict) -> dict:
...
def ensure_list(item: Union[Iterable[str], str, None]) -> list[str]:
...
def nested_contains(dictionary: dict, phrase: str) -> bool:
...
Import
from dvc.utils.collections import apply_diff, merge_dicts, ensure_list
I/O Contract
apply_diff
| Name | Type | Required | Description |
|------|------|----------|-------------|
| src | Mapping or Sequence | Yes | The source data structure containing the desired state. |
| dest | Mapping or Sequence | Yes | The destination data structure to be modified in place. Must be the same container kind (mapping or sequence) as src. |

| Name | Type | Description |
|------|------|-------------|
| return | None | Modifies dest in place to match src while preserving dest's type and internal metadata (comments, ordering). |
Exceptions:
- AssertionError -- raised when src and dest are incompatible types (e.g., dict vs list).
merge_dicts
| Name | Type | Required | Description |
|------|------|----------|-------------|
| src | dict | Yes | The base dictionary to merge into (modified in place). |
| to_update | dict | Yes | The dictionary whose key-value pairs will be merged into src. |

| Name | Type | Description |
|------|------|-------------|
| return | dict | The modified src dictionary with merged values from to_update. |
remove_missing_keys
| Name | Type | Required | Description |
|------|------|----------|-------------|
| src | dict | Yes | The dictionary to prune (modified in place). |
| to_update | dict | Yes | The reference dictionary. Keys in src not present in to_update are removed. |

| Name | Type | Description |
|------|------|-------------|
| return | dict | The pruned src dictionary containing only keys that exist in to_update. |
ensure_list
| Name | Type | Required | Description |
|------|------|----------|-------------|
| item | Union[Iterable[str], str, None] | Yes | The input to normalize. None returns an empty list, a string returns a single-element list, and an iterable is converted to a list. |

| Name | Type | Description |
|------|------|-------------|
| return | list[str] | A list of strings derived from the input. |
nested_contains
| Name | Type | Required | Description |
|------|------|----------|-------------|
| dictionary | dict | Yes | The nested dictionary to search. |
| phrase | str | Yes | The key name to search for. |

| Name | Type | Description |
|------|------|-------------|
| return | bool | True if phrase is found as a key with a truthy value anywhere in the nested dictionary, False otherwise. |
to_omegaconf
| Name | Type | Required | Description |
|------|------|----------|-------------|
| item | Any | Yes | A data structure (dict, list, or primitive) possibly containing custom container types from YAML parsers. |

| Name | Type | Description |
|------|------|-------------|
| return | Any | A new data structure with all custom containers replaced by plain Python dicts and lists. Primitive values are returned unchanged. |
Usage Examples
apply_diff for YAML Round-Tripping
from dvc.utils.collections import apply_diff
# Plain dicts are used here for brevity. In DVC, dest is typically the mapping
# loaded by ruamel.yaml, whose comments and formatting survive the update.
original_yaml_data = {"train": {"lr": 0.001, "epochs": 10}}
new_data = {"train": {"lr": 0.01, "epochs": 20, "batch_size": 32}}
apply_diff(new_data, original_yaml_data)  # src first, then dest
# original_yaml_data is now {"train": {"lr": 0.01, "epochs": 20, "batch_size": 32}}
# The existing dict objects are updated in place rather than replaced
Merging Parameter Dictionaries
from dvc.utils.collections import merge_dicts
base_params = {"train": {"lr": 0.001}, "data": {"path": "/data"}}
overrides = {"train": {"lr": 0.01, "epochs": 50}}
result = merge_dicts(base_params, overrides)
# result is {"train": {"lr": 0.01, "epochs": 50}, "data": {"path": "/data"}}
# base_params is also modified in place
Normalizing Inputs
from dvc.utils.collections import ensure_list
ensure_list(None) # []
ensure_list("train.csv") # ["train.csv"]
ensure_list(["a.csv", "b.csv"]) # ["a.csv", "b.csv"]
Searching Nested Configurations
from dvc.utils.collections import nested_contains
config = {"remote": {"s3": {"url": "s3://bucket", "endpointurl": "http://localhost"}}}
nested_contains(config, "endpointurl") # True
nested_contains(config, "missing_key") # False
Pruning and Converting
from dvc.utils.collections import remove_missing_keys, to_omegaconf
# Remove keys from src that are not in the reference
src = {"a": 1, "b": {"x": 2, "y": 3}, "c": 4}
ref = {"a": 10, "b": {"x": 20}}
remove_missing_keys(src, ref)
# src is now {"a": 1, "b": {"x": 2}}
# Convert custom YAML types to plain Python types for OmegaConf
from ruamel.yaml import YAML
yaml = YAML()
custom_data = yaml.load("key: value")
plain_data = to_omegaconf(custom_data)
# plain_data contains only standard dict/list/str types