Implementation:Iterative Dvc Utils Collections

From Leeroopedia


Knowledge Sources
Domains: Utilities, Data_Structures
Last Updated: 2026-02-10 10:00 GMT

Overview

Collection of utility functions for recursive dictionary manipulation, list normalization, and data structure transformations, provided by the DVC library.

Description

The dvc/utils/collections.py module provides six core functions, plus the private helper _merge_item, that are used throughout the DVC codebase to patch, merge, prune, normalize, search, and convert nested data structures.

apply_diff(src, dest) recursively updates dest in place so that its contents match src, while preserving the type and internal metadata of the dest container. This is critical for round-tripping YAML files through ruamel.yaml, which attaches comments, ordering, and line-folding information to its custom mapping and sequence types. Mappings and sequences (lists and tuples) are handled separately: for mappings, the function adds or updates keys from src and removes keys absent from src; for sequences, it replaces the entire contents when the lengths differ and otherwise patches element by element. An AssertionError is raised if src and dest are incompatible types, for example a mapping and a list.
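The behavior described above can be illustrated with a minimal sketch. This is a hypothetical re-implementation written from the description, not DVC's source, and the real function may differ in details. The names apply_diff_sketch, _same_kind, and TaggedDict are invented for illustration; TaggedDict stands in for a metadata-carrying mapping such as the ones ruamel.yaml produces.

```python
from collections.abc import Mapping

SEQ = (list, tuple)


def _same_kind(a, b):
    # True when both values are mappings, or both are list/tuple sequences.
    return (isinstance(a, Mapping) and isinstance(b, Mapping)) or (
        isinstance(a, SEQ) and isinstance(b, SEQ)
    )


def apply_diff_sketch(src, dest):
    """Hypothetical sketch: make dest equal to src while mutating dest in place."""
    if isinstance(src, Mapping) and isinstance(dest, Mapping):
        for key, value in src.items():
            if key in dest and _same_kind(value, dest[key]):
                apply_diff_sketch(value, dest[key])  # patch the nested container
            elif key not in dest or dest[key] != value:
                dest[key] = value  # add a new key or overwrite a changed value
        for key in set(dest) - set(src):
            del dest[key]  # drop keys that src does not have
    elif isinstance(src, SEQ) and isinstance(dest, SEQ):
        # Assumes dest is a mutable sequence (a list or list subclass).
        if len(src) != len(dest):
            dest[:] = src  # lengths differ: replace the contents wholesale
        else:
            for i, value in enumerate(src):
                if _same_kind(value, dest[i]):
                    apply_diff_sketch(value, dest[i])
                elif dest[i] != value:
                    dest[i] = value
    else:
        raise AssertionError(
            f"Can't apply diff from {type(src).__name__} to {type(dest).__name__}"
        )


class TaggedDict(dict):
    """Hypothetical stand-in for a mapping type that carries extra metadata."""


dest = TaggedDict(train=TaggedDict(lr=0.001, epochs=10), stale=True)
dest["train"].note = "comment attached to this node"  # simulated metadata
train_before = dest["train"]

apply_diff_sketch({"train": {"lr": 0.01, "epochs": 10, "batch_size": 32}}, dest)
```

Because the nested mapping is patched rather than replaced, dest["train"] is still the same TaggedDict object after the call, so the attribute attached to it survives. Assigning src to dest directly would discard it.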

merge_dicts(src, to_update) recursively merges to_update into src in place. When both values for a given key are dictionaries, the merge recurses; otherwise, the value from to_update overwrites the existing value in src. The helper _merge_item handles the per-key merge logic.
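As a rough sketch of the merge rule above (a hypothetical re-implementation, not DVC's code, with the per-key helper folded into one function named merge_dicts_sketch), note what happens to values that are not dictionaries on both sides: they are overwritten wholesale, so lists are replaced rather than concatenated.

```python
def merge_dicts_sketch(src: dict, to_update: dict) -> dict:
    """Hypothetical sketch: recursively merge to_update into src, in place."""
    for key, value in to_update.items():
        current = src.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merge_dicts_sketch(current, value)  # both sides are dicts: recurse
        else:
            src[key] = value  # anything else: to_update wins
    return src


# Made-up parameter names, chosen only to show the edge cases.
base = {"train": {"lr": 0.001, "layers": [64, 64]}, "seed": 1}
merged = merge_dicts_sketch(base, {"train": {"layers": [128]}, "seed": {"value": 7}})
# merged is {"train": {"lr": 0.001, "layers": [128]}, "seed": {"value": 7}}
```

The list under "layers" is replaced, the scalar under "seed" is replaced by a dict, and "lr" is kept because the update does not mention it.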

remove_missing_keys(src, to_update) prunes src in place by deleting any key that does not exist in to_update. It recurses into nested dictionaries, so that afterwards every key in src, at every nesting level, also exists in to_update. The values that remain in src are left unchanged.
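The pruning rule can be sketched as follows. This is a hypothetical re-implementation based on the description, and it assumes that wherever src holds a nested dict, the reference holds a dict at the same key. The function name and the sample keys are invented for illustration.

```python
def remove_missing_keys_sketch(src: dict, to_update: dict) -> dict:
    """Hypothetical sketch: delete keys from src that to_update lacks."""
    for key in list(src):  # iterate over a copy of the keys, since src is mutated
        if key not in to_update:
            del src[key]
        elif isinstance(src[key], dict):
            # Assumption: to_update[key] is also a dict here.
            remove_missing_keys_sketch(src[key], to_update[key])
    return src


params = {"lr": 0.1, "model": {"depth": 3, "debug": True}, "tmp": "x"}
reference = {"lr": None, "model": {"depth": None}, "extra": None}
pruned = remove_missing_keys_sketch(params, reference)
# pruned is {"lr": 0.1, "model": {"depth": 3}}
```

Only keys are compared. The values in the reference are ignored, and keys that exist only in the reference ("extra" here) are not added to src.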

ensure_list(item) normalizes an input that may be a string, an iterable of strings, or None into a consistent list[str]. None produces an empty list, a bare string produces a single-element list, and any other iterable is converted directly to a list.
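A small sketch of the three cases, written from the description rather than copied from DVC (ensure_list_sketch is a hypothetical name). The string check has to come before the generic iterable branch, because a str is itself iterable and would otherwise be split into characters.

```python
from collections.abc import Iterable
from typing import Union


def ensure_list_sketch(item: Union[Iterable[str], str, None]) -> list[str]:
    """Hypothetical sketch of the normalization described above."""
    if item is None:
        return []
    if isinstance(item, str):
        return [item]  # checked first: list("ab") would give ["a", "b"]
    return list(item)  # tuples, sets, generators, and lists all end up here


from_tuple = ensure_list_sketch(("a.csv", "b.csv"))         # ["a.csv", "b.csv"]
from_generator = ensure_list_sketch(p for p in ("x", "y"))  # ["x", "y"]
from_string = ensure_list_sketch("ab")                      # ["ab"]
```

In this sketch a list input is copied, so the caller's list is not shared with the result.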

nested_contains(dictionary, phrase) searches a nested dictionary for a key named phrase whose value is truthy. It recurses into dict values and returns True on the first match, or False if there is none.
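The search can be sketched as below. This is a hypothetical re-implementation from the description, and it assumes the key comparison is exact equality. The configuration keys are made up. Two consequences of the description are worth seeing in code: a key that is present but holds a falsy value does not count as a match, and only dict values are descended into, so dictionaries nested inside lists are not searched.

```python
def nested_contains_sketch(dictionary: dict, phrase: str) -> bool:
    """Hypothetical sketch: is there a key equal to phrase with a truthy value?"""
    for key, value in dictionary.items():
        if key == phrase and value:
            return True
        if isinstance(value, dict) and nested_contains_sketch(value, phrase):
            return True
    return False


config = {
    "core": {"verbose": False},
    "remote": {"s3": {"url": "s3://bucket"}},
    "jobs": [{"name": "nightly"}],
}

found_url = nested_contains_sketch(config, "url")          # True
found_verbose = nested_contains_sketch(config, "verbose")  # False: value is falsy
found_name = nested_contains_sketch(config, "name")        # False: inside a list
```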

to_omegaconf(item) recursively converts custom container types (such as those returned by ruamel.yaml parsers) into plain Python dicts and lists. This sanitization is necessary before passing data to the OmegaConf library, which requires standard Python primitives.
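The conversion can be sketched as a recursive rebuild. This is a hypothetical re-implementation that assumes the custom containers subclass dict and list; CustomMap and CustomSeq are invented stand-ins for parser-specific types, and OmegaConf itself is not needed to run the sketch.

```python
def to_omegaconf_sketch(item):
    """Hypothetical sketch: rebuild nested containers as plain dicts and lists."""
    if isinstance(item, dict):
        return {key: to_omegaconf_sketch(value) for key, value in item.items()}
    if isinstance(item, list):
        return [to_omegaconf_sketch(value) for value in item]
    return item  # primitives are returned unchanged


class CustomMap(dict):
    """Stand-in for a parser-specific mapping type."""


class CustomSeq(list):
    """Stand-in for a parser-specific sequence type."""


data = CustomMap(stages=CustomSeq([CustomMap(name="train", lr=0.01)]))
plain = to_omegaconf_sketch(data)
# plain is {"stages": [{"name": "train", "lr": 0.01}]}, built from exact dict and list types
```

The comprehensions always produce exact dict and list instances, so subclass types are dropped at every level. The input is left untouched because a new structure is built.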

Usage

Import these functions when you need to manipulate nested configuration data, merge parameter dictionaries, or normalize inputs. apply_diff is central to the Stage load/dump cycle. merge_dicts is used for composing parameter overrides. ensure_list is used extensively in argument parsing throughout DVC commands.

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/utils/collections.py
  • Lines: 109 total

Signatures

def apply_diff(src, dest):
    ...

def to_omegaconf(item):
    ...

def remove_missing_keys(src, to_update):
    ...

def merge_dicts(src: dict, to_update: dict) -> dict:
    ...

def ensure_list(item: Union[Iterable[str], str, None]) -> list[str]:
    ...

def nested_contains(dictionary: dict, phrase: str) -> bool:
    ...

Import

from dvc.utils.collections import apply_diff, merge_dicts, ensure_list

I/O Contract

apply_diff

Inputs:

  • src (Mapping or Sequence, required) -- The source data structure containing the desired state.
  • dest (Mapping or Sequence, required) -- The destination data structure to be modified in place. Must be the same container kind (mapping or sequence) as src.

Outputs:

  • return (None) -- Modifies dest in place to match src while preserving dest's type and internal metadata (comments, ordering).

Exceptions:

  • AssertionError -- raised when src and dest are incompatible types (e.g., dict vs list).

merge_dicts

Inputs:

  • src (dict, required) -- The base dictionary to merge into (modified in place).
  • to_update (dict, required) -- The dictionary whose key-value pairs will be merged into src.

Outputs:

  • return (dict) -- The modified src dictionary with merged values from to_update.

remove_missing_keys

Inputs:

  • src (dict, required) -- The dictionary to prune (modified in place).
  • to_update (dict, required) -- The reference dictionary. Keys in src not present in to_update are removed.

Outputs:

  • return (dict) -- The pruned src dictionary containing only keys that exist in to_update.

ensure_list

Inputs:

  • item (Union[Iterable[str], str, None], required) -- The input to normalize. None returns an empty list, a string returns a single-element list, and an iterable is converted to a list.

Outputs:

  • return (list[str]) -- A list of strings derived from the input.

nested_contains

Inputs:

  • dictionary (dict, required) -- The nested dictionary to search.
  • phrase (str, required) -- The key name to search for.

Outputs:

  • return (bool) -- True if phrase is found as a key with a truthy value anywhere in the nested dictionary, False otherwise.

to_omegaconf

Inputs:

  • item (Any, required) -- A data structure (dict, list, or primitive) possibly containing custom container types from YAML parsers.

Outputs:

  • return (Any) -- A new data structure with all custom containers replaced by plain Python dicts and lists. Primitive values are returned unchanged.

Usage Examples

apply_diff for YAML Round-Tripping

from dvc.utils.collections import apply_diff

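# dest is a plain dict here for brevity. In the YAML round-trip case it would be
# a ruamel.yaml mapping whose comments and formatting need to survive the update.
original_yaml_data = {"train": {"lr": 0.001, "epochs": 10}}
new_data = {"train": {"lr": 0.01, "epochs": 20, "batch_size": 32}}

apply_diff(new_data, original_yaml_data)  # argument order is (src, dest)
# original_yaml_data is now {"train": {"lr": 0.01, "epochs": 20, "batch_size": 32}}
# The outer dict and the nested "train" dict are updated in place rather than
# replaced, which is what lets metadata attached to those containers survive.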

Merging Parameter Dictionaries

from dvc.utils.collections import merge_dicts

base_params = {"train": {"lr": 0.001}, "data": {"path": "/data"}}
overrides = {"train": {"lr": 0.01, "epochs": 50}}

result = merge_dicts(base_params, overrides)
# result is {"train": {"lr": 0.01, "epochs": 50}, "data": {"path": "/data"}}
# base_params is also modified in place

Normalizing Inputs

from dvc.utils.collections import ensure_list

ensure_list(None)            # []
ensure_list("train.csv")     # ["train.csv"]
ensure_list(["a.csv", "b.csv"])  # ["a.csv", "b.csv"]

Searching Nested Configurations

from dvc.utils.collections import nested_contains

config = {"remote": {"s3": {"url": "s3://bucket", "endpointurl": "http://localhost"}}}

nested_contains(config, "endpointurl")  # True
nested_contains(config, "missing_key")  # False

Pruning and Converting

from ruamel.yaml import YAML

from dvc.utils.collections import remove_missing_keys, to_omegaconf

# Remove keys from src that are not in the reference
src = {"a": 1, "b": {"x": 2, "y": 3}, "c": 4}
ref = {"a": 10, "b": {"x": 20}}
remove_missing_keys(src, ref)
# src is now {"a": 1, "b": {"x": 2}}; the surviving values are unchanged

# Convert custom YAML types to plain Python types for OmegaConf
yaml = YAML()  # the default round-trip loader returns ruamel.yaml container types
custom_data = yaml.load("key: value")
plain_data = to_omegaconf(custom_data)
# plain_data contains only standard dict/list/str types

Related Pages

Implements Principle