Implementation:Iterative Dvc Resolve Data Sources

From Leeroopedia


Knowledge Sources
Domains: Visualization, Data_Processing
Last Updated: 2026-02-10 00:00 GMT

Overview

A concrete tool, provided by the DVC library, that resolves lazy data source callables into parsed data by reading and parsing files in multiple formats.

Description

The _resolve_data_sources function is the concurrent data loading step of the DVC plot visualization pipeline. It walks the nested plots_data dictionary to find every entry containing a "data_source" key (which holds a functools.partial callable pointing to the parse function), invokes each callable in a thread pool so that file I/O overlaps, and mutates each entry in place, replacing the callable with the parsed data.
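The walk-and-resolve behavior described above can be sketched with the standard library alone. This is a minimal illustration, not DVC's implementation: fake_loader, find_lazy_entries, and resolve_all are hypothetical names, concurrent.futures.ThreadPoolExecutor is an assumed stand-in for whatever executor DVC uses, and the progress bar is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fake_loader(path, cache=False):
    # Hypothetical stand-in for a file parser; a real loader would read `path`.
    return [{"path": path, "cached": cache}]


def find_lazy_entries(tree):
    # Iteratively walk nested dicts and collect every dict holding "data_source".
    found, stack = [], list(tree.values())
    while stack:
        value = stack.pop()
        if isinstance(value, dict):
            if "data_source" in value:
                found.append(value)
            stack.extend(value.values())
    return found


def resolve_all(tree, cache=False, max_workers=4):
    def resolve(entry):
        source = entry.pop("data_source")    # remove the lazy callable
        entry["data"] = source(cache=cache)  # store the loaded content in place

    entries = find_lazy_entries(tree)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Draining the iterator waits for all tasks and re-raises any exception.
        list(pool.map(resolve, entries))


plots_data = {
    "workspace": {
        "sources": {
            "data": {
                "loss.csv": {"data_source": partial(fake_loader, "loss.csv"), "props": {}},
                "acc.json": {"data_source": partial(fake_loader, "acc.json")},
            }
        }
    }
}
resolve_all(plots_data, cache=True)
```

After the call, each entry keeps its other keys (such as "props") and holds the loaded records under "data". Collecting the entries first and resolving them afterwards keeps the traversal single-threaded, so only the I/O-bound loader calls run concurrently.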

The companion parse function is the format-aware file reader. It detects the file type by extension and dispatches to the appropriate parser: csv.DictReader for CSV/TSV files, the dvc.utils.serialize.PARSERS registry for JSON and YAML files, and raw binary reading for image files. The CSV/TSV path supports an optional header property that controls whether the first row is treated as column names or numeric column indices ("0", "1", ...) are generated as keys instead.
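The extension-based dispatch can be illustrated with a small standard-library sketch. It is an approximation under stated assumptions, not the DVC source: parse_payload and parse_text_table are hypothetical helpers that work on in-memory bytes instead of a filesystem object, IMAGE_EXTENSIONS is an assumed subset, the header default of True is assumed, and YAML is left out because it needs a third-party parser.

```python
import csv
import io
import json
import os

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # assumed subset, for illustration


def parse_text_table(text, delimiter=",", header=True):
    if header:
        # First row supplies the column names.
        reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    else:
        # No header row: name the columns "0", "1", ... from the first row's width.
        first_row = next(csv.reader(io.StringIO(text), delimiter=delimiter), [])
        names = [str(i) for i in range(len(first_row))]
        reader = csv.DictReader(
            io.StringIO(text), delimiter=delimiter, fieldnames=names
        )
    return list(reader)


def parse_payload(path, payload, props=None):
    props = props or {}
    ext = os.path.splitext(path)[1].lower()
    if ext in IMAGE_EXTENSIONS:
        return payload  # images are returned as raw bytes
    text = payload.decode("utf-8")
    if ext in (".csv", ".tsv"):
        delimiter = "\t" if ext == ".tsv" else ","
        return parse_text_table(text, delimiter, props.get("header", True))
    if ext == ".json":
        return json.loads(text)
    raise ValueError(f"unsupported plot file type: {path}")


with_header = parse_payload("loss.csv", b"epoch,loss\n1,0.95\n2,0.80\n")
without_header = parse_payload("raw.csv", b"1,0.95\n2,0.80\n", {"header": False})
```

Note that csv.DictReader yields every value as a string, so numeric columns need converting before any arithmetic.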

Together, these functions bridge the gap between the lazy data source references established during collection and the concrete data records needed for Vega-Lite conversion.

Usage

Use _resolve_data_sources after Plots.collect has yielded per-revision data containing lazy "data_source" callables. It is called internally by Plots.show for each revision's data. Use parse directly when you need to read and parse a single plot data file outside the standard collection workflow.

Code Reference

Source Location

  • Repository: DVC
  • File: dvc/repo/plots/__init__.py
  • Lines: L280-314 (_resolve_data_sources), L547-566 (parse)

Signature

def _resolve_data_sources(
    plots_data: dict,
    rev: str,
    cache: bool = False,
) -> None:
    ...


@error_handler
def parse(
    fs: "FileSystem",
    path: str,
    props: Optional[dict] = None,
    **fs_kwargs,
) -> Union[bytes, list[dict]]:
    ...

Import

# Internal module functions - typically accessed through Plots.show()
from dvc.repo.plots import _resolve_data_sources
from dvc.repo.plots import parse

I/O Contract

Inputs

Name | Type | Required | Description
plots_data | dict | Yes | Nested dictionary (as yielded by Plots.collect) containing "data_source" callable entries. Mutated in place during resolution.
rev | str | Yes | The revision identifier string, used for progress bar display (e.g., "workspace" or a 7-character Git SHA).
cache | bool | No | Passed as the cache keyword to each data source callable to control filesystem caching. Defaults to False.
fs | FileSystem | Yes | (For parse) The filesystem abstraction to read from (LocalFileSystem for workspace, GitFileSystem for revisions).
path | str | Yes | (For parse) The file path to read and parse.
props | Optional[dict] | No | (For parse) Display properties dict; the "header" key controls CSV/TSV header handling. Defaults to None, which is treated as an empty dict.

Outputs

Name | Type | Description
(none) | None | _resolve_data_sources returns None; it mutates plots_data in place. Each "data_source" key is removed and replaced with a "data" key containing parsed content.
result | bytes | (From parse) Raw binary content when the file is a supported image format (PNG, JPEG, etc.).
result | list[dict] | (From parse) List of dictionaries where each dict represents one row/record, with keys as column/field names. Returned for CSV, TSV, JSON, and YAML formats.
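Because parsed content is either raw bytes or a list of records, downstream code usually has to branch on the type. The helper below is a hypothetical sketch (summarize is not part of DVC) that shows one way to tell the two shapes apart.

```python
def summarize(result):
    # Image content arrives as raw bytes.
    if isinstance(result, (bytes, bytearray)):
        return f"image, {len(result)} bytes"
    # Tabular content arrives as a list of row dicts; collect the column names.
    columns = sorted({key for row in result for key in row})
    return f"{len(result)} records, columns: {', '.join(columns)}"


image_summary = summarize(b"\x89PNG")
table_summary = summarize([{"epoch": "1", "loss": "0.95"}])
```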

Usage Examples

Basic Usage

from dvc.repo import Repo
from dvc.repo.plots import Plots, _resolve_data_sources

repo = Repo()
plots = Plots(repo)

# Collect yields lazy data
for rev_data in plots.collect(targets=None, revs=["workspace"]):
    # At this point, data sources are unresolved callables
    # Resolve them in parallel
    _resolve_data_sources(rev_data, rev="workspace", cache=True)

    # Now the data is loaded in-place
    for rev, content in rev_data.items():
        sources = content.get("sources", {}).get("data", {})
        for filepath, source in sources.items():
            # "data_source" key is gone, replaced by "data"
            if "data" in source:
                records = source["data"]
                print(f"{filepath}: {len(records)} records loaded")

Direct File Parsing

from dvc.repo.plots import parse
from dvc.fs import LocalFileSystem

fs = LocalFileSystem()

# Note: parse is decorated with @error_handler (see Signature), so the value
# returned by a direct call may be wrapped by that decorator. The comments
# below show the parsed content itself.

# Parse a CSV file with headers
csv_data = parse(fs, "metrics/train_loss.csv", props={"header": True})
# Parsed content: [{"epoch": "1", "loss": "0.95"}, {"epoch": "2", "loss": "0.80"}, ...]

# Parse a CSV file without headers (columns are keyed by index)
csv_data = parse(fs, "metrics/raw_output.csv", props={"header": False})
# Parsed content: [{"0": "1", "1": "0.95"}, {"0": "2", "1": "0.80"}, ...]

# Parse a JSON metrics file
json_data = parse(fs, "metrics/results.json")
# Parsed content: [{"accuracy": 0.92, "f1": 0.89}]

# Parse an image file
image_bytes = parse(fs, "plots/confusion_matrix.png")
# Parsed content: b'\x89PNG\r\n...' (raw bytes)

TSV Parsing

from dvc.repo.plots import parse
from dvc.fs import LocalFileSystem

fs = LocalFileSystem()

# Parse a TSV file (tab-separated)
tsv_data = parse(fs, "metrics/experiment_log.tsv", props={"header": True})
# Returns: [{"step": "0", "train_loss": "1.2", "val_loss": "1.5"}, ...]

Related Pages

Implements Principle
