Principle:Iterative Dvc Plot Definition Collection
| Knowledge Sources | |
|---|---|
| Domains | Visualization, Configuration_Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Plot definition collection is the process of aggregating visualization specifications from multiple declarative configuration sources and merging them with user-provided property overrides to produce a unified set of plot definitions.
Description
In pipeline-driven data science workflows, visualization definitions are not centralized in a single location. Instead, they are scattered across multiple pipeline configuration files (such as dvc.yaml) and individual output-level annotations on tracked files. Plot definition collection solves the problem of gathering these dispersed specifications into a single, coherent set of definitions that downstream rendering stages can consume.
The collection process has two parts. First, it harvests definitions from all available sources: pipeline-level plot blocks declared in pipeline configuration files, output-level plot annotations attached to individual stage outputs, and any bare file or directory paths passed as targets. Second, as each definition is collected, it merges properties so that user-provided overrides (such as custom axis labels, templates, or titles) take precedence over the defaults found in the configuration files. This layered merge lets users customize any visualization without modifying the original pipeline definitions.
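The precedence rule can be illustrated with a plain dictionary merge. This is a minimal sketch, not the project's implementation; `merge_props` and the sample property names are hypothetical and chosen only for illustration.

```python
def merge_props(config_props, user_props):
    """Return a new dict in which user overrides win over configuration defaults."""
    merged = dict(config_props or {})   # copy so the original definition is untouched
    merged.update(user_props or {})     # later (user) values replace earlier ones
    return merged


# Hypothetical defaults as they might appear in a pipeline file.
defaults = {"template": "linear", "x": "step", "title": "Loss"}
# Hypothetical overrides supplied on the command line.
overrides = {"title": "Validation loss", "x_label": "Epoch"}

merged = merge_props(defaults, overrides)
print(merged)
```

Because the merge builds a new dictionary, the defaults read from the configuration file stay unchanged, which matches the goal of customizing a plot without editing the pipeline definition.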
A critical design consideration is that plot identifiers can refer either to data file paths (when the plot ID maps directly to a file on disk) or to composite definitions where multiple data sources contribute to a single chart. The collection logic must distinguish between these two cases, resolve file paths relative to their originating configuration directory, and handle directory unpacking when a plot target points to a directory of data files rather than a single file.
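One way to tell the two kinds of identifier apart is to inspect the `y` property: a dictionary there signals a composite definition, anything else means the identifier itself is the data path. The sketch below assumes that rule; the function names `is_composite` and `data_sources` are hypothetical.

```python
def is_composite(definition):
    """True when the definition maps y to explicit data sources (a dict)."""
    return isinstance(definition, dict) and isinstance(definition.get("y"), dict)


def data_sources(plot_id, definition):
    """List the data files a plot reads from, for either kind of identifier."""
    if is_composite(definition):
        # Composite: the ID is a logical name; the files are the keys of y.
        return sorted(definition["y"])
    # Path-based: the ID itself is the data file (or directory).
    return [plot_id]


path_based = data_sources("logs/loss.csv", {"x": "step", "y": "loss"})
composite = data_sources(
    "train_vs_test",
    {"x": "step", "y": {"train.csv": "loss", "test.csv": "loss"}},
)
no_props = data_sources("logs/acc.csv", None)
print(path_based, composite, no_props)
```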
Usage
Use plot definition collection when:
- A pipeline defines visualizations across multiple dvc.yaml files in a monorepo or nested project structure, and all definitions must be gathered into a single registry.
- Users invoke plot commands (show, diff) with optional property overrides (--template, -x, -y) that must be merged on top of existing definitions.
- Plot targets may refer to directories that need to be expanded into individual file-level definitions.
- A system must support both path-based plot identifiers (where the plot ID is a file path) and dictionary-based identifiers (where the plot ID is a logical name with explicit x and y source mappings).
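Directory expansion can be sketched with the standard library alone. The helper below is an assumption about the general shape of such a step, not the project's function: it walks a local directory and gives every file its own copy of the properties, while a plain file path maps to itself. The flat `{path: props}` result is a simplification chosen for the example.

```python
import os
import tempfile


def unpack_if_directory(path, props):
    """Map each file under `path` to a copy of `props`; a non-directory maps to itself."""
    if not os.path.isdir(path):
        return {path.replace(os.sep, "/"): dict(props)}
    result = {}
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            full = os.path.join(root, name)
            # Use forward slashes so keys look the same on every platform.
            result[full.replace(os.sep, "/")] = dict(props)
    return result


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "plots", "nested"))
    for rel in ("plots/a.csv", "plots/nested/b.csv"):
        with open(os.path.join(tmp, *rel.split("/")), "w") as fh:
            fh.write("x,y\n0,1\n")
    unpacked = unpack_if_directory(os.path.join(tmp, "plots"), {"template": "linear"})
    names = sorted(os.path.relpath(p, tmp).replace(os.sep, "/") for p in unpacked)

single = unpack_if_directory("metrics.csv", {})
print(names, single)
```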
Theoretical Basis
The core algorithm for plot definition collection follows a multi-source aggregation and merge pattern:
    FUNCTION collect_definitions(repo, targets, user_props):
        result = empty nested dictionary

        // Phase 1: Collect from pipeline configuration files
        FOR each dvc_file in repo.pipeline_files:
            FOR each plot_definition in dvc_file.plots:
                IF plot_definition matches any target (or no targets specified):
                    resolved = resolve_paths(plot_definition, dvc_file.directory)
                    // user_props win over configuration defaults on key conflicts
                    merged_props = plot_definition.properties UNION user_props
                    INSERT resolved WITH merged_props INTO result[dvc_file.path]

        // Phase 2: Collect from output-level annotations
        FOR each output in repo.tracked_outputs:
            IF output.is_plot:
                IF output matches any target (or no targets specified):
                    plot_props = extract_plot_properties(output)
                    merged_props = plot_props UNION user_props
                    unpacked = unpack_if_directory(output.path, merged_props)
                    MERGE unpacked INTO result[""]

        // Phase 3: Handle bare file targets
        FOR each target in targets:
            IF target is a filesystem path AND (no results yet OR path exists):
                unpacked = unpack_if_directory(target, user_props)
                MERGE unpacked INTO result[""]

        RETURN result
The merging strategy uses dictionary union where later values (user-provided properties) override earlier values (configuration defaults). The dpath library is employed for deep dictionary merging across nested structures. Path normalization ensures consistent behavior across operating systems by converting all paths to POSIX-style forward slashes.
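The deep merge and the path normalization can be approximated with the standard library. The sketch below is a stand-in for what a deep-merge library such as dpath provides, not a reproduction of its API; `deep_merge` and `to_posix` are hypothetical helpers, and the nested `"data"` layout is assumed for the example.

```python
from pathlib import PureWindowsPath


def deep_merge(dst, src):
    """Recursively merge src into dst in place; values from src win on conflicts."""
    for key, value in src.items():
        if isinstance(value, dict) and isinstance(dst.get(key), dict):
            deep_merge(dst[key], value)   # descend into nested dictionaries
        else:
            dst[key] = value              # leaf or type mismatch: later value replaces
    return dst


def to_posix(path):
    """Convert backslash-separated path strings to forward slashes."""
    return PureWindowsPath(path).as_posix()


result = {"dvc.yaml": {"data": {"loss.csv": {"x": "step", "title": "Loss"}}}}
incoming = {
    "dvc.yaml": {"data": {"loss.csv": {"title": "Train loss"}}},
    "": {"data": {to_posix("logs\\acc.csv"): {}}},
}
deep_merge(result, incoming)
print(result)
```

Only the conflicting leaf (`title`) is replaced; sibling keys such as `x` survive the merge, which is the difference between a deep merge and a top-level dictionary update.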
When a plot identifier is not a direct file path (i.e., it defines explicit y axis source mappings as a dictionary), the source paths referenced in the x and y properties are adjusted relative to the configuration file's directory. This ensures that relative paths within a pipeline file resolve correctly regardless of the working directory from which the command is invoked.
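The source-path adjustment described above can be sketched as follows. This is an illustrative version under stated assumptions: `adjust_sources` is a hypothetical name, paths are treated as POSIX-style strings, and an axis given as a plain field name (rather than a dictionary of files) is left untouched.

```python
import posixpath
from copy import deepcopy


def adjust_sources(plot_props, config_dir):
    """Rewrite file keys in the x and y mappings relative to config_dir."""
    adjusted = deepcopy(plot_props)       # never mutate the original definition
    for axis in ("x", "y"):
        sources = adjusted.get(axis)
        if not isinstance(sources, dict):
            continue                      # plain field name or missing axis: no paths
        adjusted[axis] = {
            posixpath.normpath(posixpath.join(config_dir, path)): field
            for path, field in sources.items()
        }
    return adjusted


props = {
    "x": "step",
    "y": {"train.csv": "loss", "../eval/test.csv": "loss"},
    "title": "Loss",
}
adjusted = adjust_sources(props, "experiments/run1")
print(adjusted["y"])
```

Normalizing after the join collapses segments such as `..`, so two configuration files that point at the same data file produce the same key in the collected result.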