Heuristic:Iterative Dvc YAML Dual Parser Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, File_Format |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
A performance optimization that reads YAML with a fast parser and reserves a slower, format-preserving parser for writes, so that comments and formatting in `.dvc` and `dvc.yaml` files survive updates.
Description
DVC uses two different YAML parsing strategies for `.dvc` and `dvc.yaml` files. For reading, it uses a fast parser (standard YAML loading) that discards comments and formatting and returns plain data structures. For updates, it re-parses the original YAML text with `ruamel.yaml` (slower, but format-preserving), applies the changes to the parsed structure, and serializes that structure back. This preserves user comments, key ordering, and formatting across DVC operations.
Usage
Apply this pattern when you need to update structured config files (YAML, TOML, etc.) while preserving user-added comments and formatting. This is especially important for DVC metafiles (`.dvc` files and `dvc.yaml`) that users edit manually and expect to remain human-readable.
The Insight (Rule of Thumb)
- Action: Read YAML with a fast parser to build in-memory data structures. When writing back, re-parse the original text with `ruamel.yaml` and apply only the changed values to that structure.
- Value: Comments, key ordering, and whitespace formatting are preserved in user-facing files.
- Trade-off: Each write parses the file twice, so writes are slightly slower, but the file stays readable and Git diffs contain only the real changes.
- Compatibility: Requires `ruamel.yaml >= 0.17.11` for the round-trip parser.
Reasoning
Users frequently add comments to their `.dvc` and `dvc.yaml` files to document data sources, parameter choices, or pipeline logic. A naive parse-modify-serialize cycle would strip all comments and reformat the file, creating noisy Git diffs and losing valuable documentation. The dual-parser approach solves this by using `ruamel.yaml`'s round-trip mode only when writing, avoiding the performance cost during reads.
The `apply_diff` function merges the new state into the preserved structure, updating only the values that changed while keeping everything else intact.
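The merge step can be sketched in plain Python. This is a simplified illustration of the idea, not DVC's `apply_diff`: the name `merge_into` and its exact rules are assumptions made for the example. The key point is that existing containers are updated in place rather than replaced, so any metadata a round-trip parser attached to them stays attached.

```python
from collections.abc import MutableMapping, MutableSequence


def _both_containers(new, old):
    """True when both values are mappings, or both are sequences."""
    return (isinstance(new, dict) and isinstance(old, MutableMapping)) or (
        isinstance(new, list) and isinstance(old, MutableSequence)
    )


def merge_into(new, preserved):
    """Hypothetical sketch: make `preserved` equal to `new`, mutating in place."""
    if isinstance(new, dict) and isinstance(preserved, MutableMapping):
        for key, value in new.items():
            if key in preserved and _both_containers(value, preserved[key]):
                # Recurse so the existing child object (and its comments) is kept.
                merge_into(value, preserved[key])
            elif key not in preserved or preserved[key] != value:
                preserved[key] = value
        # Drop keys that no longer exist in the new state.
        for key in [k for k in preserved if k not in new]:
            del preserved[key]
    elif isinstance(new, list) and isinstance(preserved, MutableSequence):
        if len(new) != len(preserved):
            # Lengths differ: items cannot be matched by position, so replace all.
            preserved[:] = new
        else:
            for i, value in enumerate(new):
                if _both_containers(value, preserved[i]):
                    merge_into(value, preserved[i])
                elif preserved[i] != value:
                    preserved[i] = value
    else:
        raise TypeError("new and preserved must both be mappings or both be lists")


# Usage: plain dicts stand in for the structure a round-trip parser would return.
preserved = {"outs": [{"md5": "old", "path": "data.csv"}], "frozen": True}
inner = preserved["outs"][0]
new_state = {"outs": [{"md5": "new", "path": "data.csv"}]}
merge_into(new_state, preserved)
```

After the call, `preserved` equals `new_state`, and `preserved["outs"][0]` is still the same object as before, which is what lets a format-preserving parser re-emit the comments attached to it.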
Code Evidence
Dual parser strategy from `dvc/stage/serialize.py:200-215`:
```python
def to_single_stage_file(stage: "Stage", **kwargs):
    state = stage.dumpd(**kwargs)

    # When we load a stage we parse yaml with a fast parser, which strips
    # off all the comments and formatting. To retain those on update we do
    # a trick here:
    # - reparse the same yaml text with a slow but smart ruamel yaml parser
    # - apply changes to a returned structure
    # - serialize it
    text = stage._stage_text
    if text is None:
        return state

    saved_state = parse_yaml_for_update(text, stage.path)
    apply_diff(state, saved_state)
    return saved_state
```