Principle:Iterative Dvc Vega Lite Conversion

Knowledge Sources	DVC Documentation
Domains	Visualization, Data_Transformation
Last Updated	2026-02-10 00:00 GMT

Overview

Vega-Lite conversion is the process of transforming parsed data records into flat datapoint lists with standardized field names suitable for Vega-Lite chart specifications, including automatic inference of x and y axis fields.

Description

After raw data files have been parsed into lists of dictionaries, the records are not yet in a form that Vega-Lite templates can consume. Vega-Lite expects data as a flat array of objects where each object contains all the fields needed for a single mark on the chart, including metadata fields that identify which revision and source file each data point came from. Vega-Lite conversion bridges this gap by transforming heterogeneous parsed data into a standardized datapoint format.

The conversion process has three responsibilities. First, it must infer which data fields map to the x and y axes when the user has not specified them. If no x field is given, an auto-generated step index is used; if no y field is given, the converter inspects the data and selects the last field of the first record as the default. Second, when a plot definition references multiple source files (e.g., training loss from one file and validation loss from another), the converter must unify them into a single flat list of datapoints with consistent field names. If different files contribute different y field names, their values are copied into a synthetic dvc_inferred_y_value field; differing x field names are normalized the same way under dvc_inferred_x_value. Third, each datapoint is annotated with metadata fields: rev (the revision), filename (the source file), and field (the original y field name). These metadata fields enable Vega-Lite templates to color-code, facet, or filter data by revision or source.

A separate ImageConverter handles image plot types. Instead of producing flat datapoints, it converts raw image bytes into either base64-encoded data URIs (for inline HTML rendering) or writes them to disk files (for file-based output), returning datapoints that reference the image source location.
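
The two image output modes described above can be illustrated with a minimal Python sketch. It is not DVC's ImageConverter: the function names, the "src" key, the PNG MIME type default, and the output file naming scheme are all assumptions made for illustration.

```python
import base64
from pathlib import Path
from typing import Optional


def image_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI for inline HTML use."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_to_file(image_bytes: bytes, out_dir: Path, name: str) -> str:
    """Write raw image bytes to out_dir/name and return the path as a string."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    path.write_bytes(image_bytes)
    return str(path)


def convert_image(
    image_bytes: bytes,
    revision: str,
    filename: str,
    out_dir: Optional[Path] = None,
) -> list:
    """Return one datapoint that references the image source location.

    Hypothetical helper: with no out_dir the image is inlined as a data URI,
    otherwise it is written to disk under an assumed "<rev>_<name>" file name.
    """
    if out_dir is None:
        src = image_to_data_uri(image_bytes)
    else:
        src = image_to_file(image_bytes, out_dir, f"{revision}_{Path(filename).name}")
    return [{"rev": revision, "filename": filename, "src": src}]
```

The inline mode keeps an HTML report self-contained, while the file mode keeps the report small and lets other tools reuse the images.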

Usage

Use Vega-Lite conversion when:

Parsed data records from CSV, JSON, or YAML files must be transformed into a Vega-Lite compatible datapoint format with revision and filename metadata.
The x and y axis fields must be inferred from data structure when not explicitly provided by the user.
Multiple data sources contribute to a single chart and their field names must be unified under a common schema.
Image data must be converted to base64-encoded URIs or written to output files for rendering.

Theoretical Basis

The conversion algorithm operates in two phases: property inference and datapoint flattening.

Phase 1: Property Inference (convert)

FUNCTION convert(plot_id, data, properties):
    inferred = {}

    // Infer x axis mapping
    x = properties.get("x")
    y = properties.get("y")
    IF x is a simple string:
        IF y is a dict (multi-file):
            // Duplicate x for each y file
            inferred["x"] = {file: x FOR each file in y}
        ELSE:
            inferred["x"] = {plot_id: x}

    // Infer y axis mapping
    IF y is None:
        // Auto-detect: use last field of first data record
        inferred["y"] = {plot_id: last_field(data[plot_id])}
    ELSE IF y is not a dict:
        // Simple field name: assume plot_id as file
        inferred["y"] = {plot_id: y}

    // Build file-to-datapoints mapping
    file2datapoints = {}
    FOR each file, content in data:
        file2datapoints[file] = extract_flat_records(content)

    // Infer axis labels
    properties["y_label"] = infer_y_label(properties)
    properties["x_label"] = infer_x_label(properties)

    RETURN file2datapoints, properties UNION inferred
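
The inference rules in Phase 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not DVC's implementation: the function name infer_axes is hypothetical, records are assumed to be already flat lists of dictionaries, and label inference is left out.

```python
from __future__ import annotations

from typing import Optional


def last_field(records: list[dict]) -> Optional[str]:
    """Return the last key of the first record, or None if there is no data."""
    if not records:
        return None
    return list(records[0].keys())[-1]


def infer_axes(plot_id: str, data: dict[str, list[dict]], properties: dict) -> dict:
    """Return a copy of properties with x and y normalized to {file: field}."""
    inferred = {}
    x = properties.get("x")
    y = properties.get("y")

    # A plain string x applies to every y file, or to plot_id if y is not a dict.
    if isinstance(x, str):
        if isinstance(y, dict):
            inferred["x"] = {file: x for file in y}
        else:
            inferred["x"] = {plot_id: x}

    # A missing y defaults to the last field of the first record.
    if y is None:
        inferred["y"] = {plot_id: last_field(data.get(plot_id, []))}
    elif not isinstance(y, dict):
        inferred["y"] = {plot_id: y}

    return {**properties, **inferred}
```

For example, with data = {"metrics.csv": [{"epoch": 0, "loss": 0.9}]} and empty properties, infer_axes("metrics.csv", data, {}) yields a y mapping of {"metrics.csv": "loss"} and leaves x unset, so the step index would be used later.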

Phase 2: Datapoint Flattening (flat_datapoints)

FUNCTION flat_datapoints(plot_id, data, properties, revision):
    file2datapoints, properties = convert(plot_id, data, properties)

    xs = extract_x_sources(properties, file2datapoints)
    ys = extract_y_sources(properties, file2datapoints)

    // Determine x field name
    IF no x sources:
        x_field = INDEX  // auto-generated step counter
    ELSE IF multiple x fields with different names:
        x_field = "dvc_inferred_x_value"
    ELSE:
        x_field = first x field name

    // Determine y field name
    IF multiple y fields with different names:
        y_field = "dvc_inferred_y_value"
    ELSE:
        y_field = first y field name

    all_datapoints = []
    FOR each (y_file, y_field_name) in ys:
        datapoints = copy(file2datapoints[y_file])

        // Normalize y values if unified field needed
        IF y_field == "dvc_inferred_y_value":
            COPY y_field_name values to "dvc_inferred_y_value"

        // Set x values: either from index or from source field
        IF x_field == INDEX:
            SET sequential index (0, 1, 2, ...) as INDEX field
        ELSE:
            COPY x values from x source datapoints

        // Annotate with metadata
        FOR each datapoint:
            datapoint[REVISION] = revision
            datapoint[FILENAME] = short_filename(y_file)
            datapoint[FIELD] = y_field_name

        all_datapoints.extend(datapoints)

    RETURN all_datapoints, properties
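
A runnable Python sketch of the flattening phase is given below. It makes several simplifying assumptions that are not taken from DVC itself: the function name flatten is hypothetical, the index field is assumed to be named "step", each y file is assumed to carry its own x field, and the full file path is stored instead of a shortened filename.

```python
import copy

# "rev", "filename" and "field" follow the metadata names described above;
# "step" as the index field name is an assumption of this sketch.
INDEX, REVISION, FILENAME, FIELD = "step", "rev", "filename", "field"
INFERRED_X, INFERRED_Y = "dvc_inferred_x_value", "dvc_inferred_y_value"


def flatten(file2datapoints, xs, ys, revision):
    """Merge per-file records into one flat, annotated list.

    xs and ys map file name -> field name; xs may be empty.
    Returns (datapoints, x_field, y_field).
    """
    if not ys:
        raise ValueError("at least one y source is required")

    x_names, y_names = set(xs.values()), set(ys.values())
    if not xs:
        x_field = INDEX
    elif len(x_names) > 1:
        x_field = INFERRED_X
    else:
        x_field = next(iter(x_names))
    y_field = INFERRED_Y if len(y_names) > 1 else next(iter(y_names))

    all_points = []
    for y_file, y_name in ys.items():
        # Deep copy so the parsed input records are never mutated.
        points = copy.deepcopy(file2datapoints[y_file])
        for i, point in enumerate(points):
            if y_field == INFERRED_Y:
                point[INFERRED_Y] = point[y_name]
            if x_field == INDEX:
                point[INDEX] = i
            elif x_field == INFERRED_X:
                point[INFERRED_X] = point[xs[y_file]]
            point[REVISION] = revision
            point[FILENAME] = y_file
            point[FIELD] = y_name
        all_points.extend(points)
    return all_points, x_field, y_field
```

Calling flatten with two files whose y fields are named "loss" and "val_loss" and no x sources produces datapoints that all share the "step" and "dvc_inferred_y_value" fields, while the field entry preserves which original column each value came from.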

The key insight is that the REVISION, FILENAME, and FIELD metadata fields (emitted on each datapoint as rev, filename, and field) are what enable Vega-Lite templates to produce multi-series charts. A template can use REVISION as a color encoding to overlay data from different Git commits, or FILENAME as a facet to create small multiples comparing different data sources.
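
To make the role of the metadata fields concrete, the following sketch builds a small Vega-Lite specification as a Python dictionary. It is an illustrative spec written for this page, not one of DVC's bundled templates; the function name line_chart_spec and the sample datapoints are hypothetical.

```python
import json


def line_chart_spec(datapoints, x_field, y_field):
    """Build a Vega-Lite line chart colored by revision, one column per file."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": datapoints},
        "mark": "line",
        "encoding": {
            "x": {"field": x_field, "type": "quantitative"},
            "y": {"field": y_field, "type": "quantitative"},
            # One colored line per revision.
            "color": {"field": "rev", "type": "nominal"},
            # One small-multiple panel per source file.
            "column": {"field": "filename", "type": "nominal"},
        },
    }


# Sample datapoints in the flat, annotated shape described above.
sample = [
    {"step": 0, "loss": 0.9, "rev": "main", "filename": "train.json", "field": "loss"},
    {"step": 1, "loss": 0.5, "rev": "main", "filename": "train.json", "field": "loss"},
    {"step": 0, "loss": 0.8, "rev": "exp-1", "filename": "train.json", "field": "loss"},
]
spec = line_chart_spec(sample, "step", "loss")
spec_json = json.dumps(spec, indent=2)
```

Because every datapoint carries rev and filename, the spec needs no per-revision or per-file logic; grouping is expressed entirely through the encoding channels.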

Related Pages

Implemented By

Implementation:Iterative_Dvc_VegaConverter_Convert
