
Heuristic:Iterative Dvc YAML Dual Parser Strategy

From Leeroopedia




Knowledge Sources
Domains: Optimization, File_Format
Last Updated: 2026-02-10 10:00 GMT

Overview

A performance optimization that reads YAML with a fast parser and re-parses it with a slower, format-preserving parser only when writing, so that comments and formatting in `.dvc` files survive updates.

Description

DVC parses `.dvc` and `dvc.yaml` files with two different YAML parsers, depending on the operation. For reading, it uses a fast parser that discards comments and formatting and returns plain in-memory data structures. For writing updates, it re-parses the original YAML text with `ruamel.yaml` (a slower but format-preserving parser), applies the changes to the parsed structure, and serializes that structure back to disk. As a result, user comments, key ordering, and formatting are preserved across DVC operations.

Usage

Apply this pattern when you need to update structured config files (YAML, TOML, etc.) while preserving user-added comments and formatting. This is especially important for DVC metafiles (`.dvc` files and `dvc.yaml`) that users edit manually and expect to remain human-readable.

The Insight (Rule of Thumb)

  • Action: Read YAML with a fast parser to build in-memory data structures. When writing back, re-parse the original text with `ruamel.yaml` and apply only the changed values to that structure.
  • Value: Comments, key ordering, and whitespace formatting are preserved in user-facing files.
  • Trade-off: The file is parsed twice on write operations. Writes are slightly slower, but files stay readable and Git diffs contain only the real changes.
  • Compatibility: Requires `ruamel.yaml >= 0.17.11` for the round-trip parser.

Reasoning

Users frequently add comments to their `.dvc` and `dvc.yaml` files to document data sources, parameter choices, or pipeline logic. A naive parse-modify-serialize cycle would strip all comments and reformat the file, creating noisy Git diffs and losing valuable documentation. The dual-parser approach solves this by using `ruamel.yaml`'s round-trip mode only when writing, avoiding the performance cost during reads.

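The `apply_diff` function merges the new state into the preserved structure, updating only the values that changed and leaving everything else intact. When no original text is available (`stage._stage_text` is `None` in the code below), there is nothing to preserve, so the freshly dumped state is returned unchanged.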
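
To make the merge step concrete, here is a simplified, standard-library-only sketch of an in-place merge. It is not DVC's `apply_diff`; the name `merge_into` and its exact rules are assumptions chosen for illustration. The key idea is that existing container objects in the destination are updated in place rather than replaced, which is what would let a comment-carrying structure keep its annotations.

```python
from collections.abc import Mapping, MutableMapping, MutableSequence


def _both_containers(a, b):
    """True when a and b are both mappings or both lists, so we can recurse."""
    return (isinstance(a, Mapping) and isinstance(b, MutableMapping)) or (
        isinstance(a, list) and isinstance(b, MutableSequence)
    )


def merge_into(src, dest):
    """Hypothetical helper: make dest equal to src, mutating dest in place."""
    if isinstance(src, Mapping) and isinstance(dest, MutableMapping):
        for key, value in src.items():
            if key in dest and _both_containers(value, dest[key]):
                merge_into(value, dest[key])  # keep the existing node
            elif key not in dest or dest[key] != value:
                dest[key] = value  # only changed or new values are written
        for key in [k for k in dest if k not in src]:
            del dest[key]  # drop keys that no longer exist
    elif isinstance(src, list) and isinstance(dest, MutableSequence):
        if len(src) != len(dest):
            dest[:] = src  # lengths differ: replace the contents wholesale
        else:
            for i, value in enumerate(src):
                if _both_containers(value, dest[i]):
                    merge_into(value, dest[i])
                elif dest[i] != value:
                    dest[i] = value
    else:
        raise TypeError("src and dest must both be mappings or both be lists")


# Plain dicts stand in for the round-trip structure in this demo.
saved = {"outs": [{"md5": "old", "path": "data.csv"}], "meta": {"owner": "alice"}}
inner = saved["outs"][0]
new_state = {"outs": [{"md5": "new", "path": "data.csv"}]}

merge_into(new_state, saved)
print(saved)
```

After the call, `saved` equals `new_state`, yet `saved["outs"][0]` is still the original `inner` object. With a round-trip structure in place of the plain dicts, that preserved identity is what would keep any attached comments alive.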

Code Evidence

Dual parser strategy from `dvc/stage/serialize.py:200-215`:

def to_single_stage_file(stage: "Stage", **kwargs):
    state = stage.dumpd(**kwargs)

    # When we load a stage we parse yaml with a fast parser, which strips
    # off all the comments and formatting. To retain those on update we do
    # a trick here:
    # - reparse the same yaml text with a slow but smart ruamel yaml parser
    # - apply changes to a returned structure
    # - serialize it
    text = stage._stage_text
    if text is None:
        return state

    saved_state = parse_yaml_for_update(text, stage.path)
    apply_diff(state, saved_state)
    return saved_state
