Principle:Iterative Dvc Vega Lite Conversion
| Knowledge Sources | |
|---|---|
| Domains | Visualization, Data_Transformation |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Vega-Lite conversion is the process of transforming parsed data records into flat datapoint lists with standardized field names suitable for Vega-Lite chart specifications, including automatic inference of x and y axis fields.
Description
After raw data files have been parsed into lists of dictionaries, the records are not yet in a form that Vega-Lite templates can consume. Vega-Lite expects data as a flat array of objects where each object contains all the fields needed for a single mark on the chart, including metadata fields that identify which revision and source file each data point came from. Vega-Lite conversion bridges this gap by transforming heterogeneous parsed data into a standardized datapoint format.
The conversion process has three responsibilities. First, it must infer which data fields map to the x and y axes when the user has not specified them. For the x axis, if no field is given, an auto-generated step index is used. For the y axis, if no field is given, the converter inspects the data and selects the last field of the first record as a reasonable default. Second, when a plot definition references multiple source files (e.g., training loss from one file and validation loss from another), the converter must unify them into a single flat list of datapoints with consistent field names. If different files contribute different y field names, a synthetic dvc_inferred_y_value field is created to normalize them; differing x field names are handled the same way with dvc_inferred_x_value. Third, each datapoint is annotated with metadata fields: rev (the revision), filename (the source file), and field (the original y field name). These metadata fields let Vega-Lite templates color-code, facet, or filter data by revision or source.
A separate ImageConverter handles image plot types. Instead of producing flat datapoints, it either encodes the raw image bytes as base64 data URIs (for inline HTML rendering) or writes them to files on disk (for file-based output). In both cases it returns datapoints that reference the image source location.
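The two image output modes can be illustrated with a minimal Python sketch. It is not the ImageConverter implementation: the function names, the `src` key, and the on-disk naming scheme (`<revision>_<filename>`) are assumptions made for the example; only the `rev` and `filename` metadata names come from the description above.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URI for inline HTML."""
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime, _ = mimetypes.guess_type(filename)
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


def image_datapoint(image_bytes, revision, filename, out_dir=None):
    """Return one datapoint that references the image, inline or on disk.

    The "src" key and the file naming scheme are hypothetical.
    """
    if out_dir is None:
        # Inline mode: embed the image directly in the datapoint.
        src = image_to_data_uri(image_bytes, filename)
    else:
        # File mode: write the bytes to disk and reference the path.
        out_path = Path(out_dir)
        out_path.mkdir(parents=True, exist_ok=True)
        target = out_path / f"{revision}_{Path(filename).name}"
        target.write_bytes(image_bytes)
        src = str(target)
    return {"rev": revision, "filename": filename, "src": src}
```

Calling `image_datapoint(data, "main", "plot.png")` yields an inline data URI, while passing `out_dir` writes the file and returns its path instead.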
Usage
Use Vega-Lite conversion when:
- Parsed data records from CSV, JSON, or YAML files must be transformed into a Vega-Lite compatible datapoint format with revision and filename metadata.
- The x and y axis fields must be inferred from data structure when not explicitly provided by the user.
- Multiple data sources contribute to a single chart and their field names must be unified under a common schema.
- Image data must be converted to base64-encoded URIs or written to output files for rendering.
Theoretical Basis
The conversion algorithm operates in two phases: property inference and datapoint flattening.
Phase 1: Property Inference (convert)
FUNCTION convert(plot_id, data, properties):
    inferred = {}
    // Infer x axis mapping
    x = properties.get("x")
    y = properties.get("y")
    IF x is a simple string:
        IF y is a dict (multi-file):
            // Duplicate x for each y file
            inferred["x"] = {file: x FOR each file in y}
        ELSE:
            inferred["x"] = {plot_id: x}
    // Infer y axis mapping
    IF y is None:
        // Auto-detect: use last field of first data record
        inferred["y"] = {plot_id: last_field(data[plot_id])}
    ELSE IF y is not a dict:
        // Simple field name: assume plot_id as file
        inferred["y"] = {plot_id: y}
    // Build file-to-datapoints mapping
    file2datapoints = {}
    FOR each file, content in data:
        file2datapoints[file] = extract_flat_records(content)
    // Infer axis labels
    properties["y_label"] = infer_y_label(properties)
    properties["x_label"] = infer_x_label(properties)
    RETURN file2datapoints, properties UNION inferred
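The property-inference rules above can be sketched in runnable Python. This is an illustration under simplifying assumptions, not the actual converter: `data[plot_id]` is assumed to already be a flat list of dictionaries, label inference and record extraction are omitted, and the names `infer_properties` and `last_field` are chosen for the example.

```python
def last_field(records):
    """Name of the last key in the first record (dicts keep insertion order)."""
    return list(records[0].keys())[-1]


def infer_properties(plot_id, data, properties):
    """Expand short-form x/y settings into {file: field} mappings."""
    inferred = {}
    x = properties.get("x")
    y = properties.get("y")

    if isinstance(x, str):
        if isinstance(y, dict):
            # Multi-file plot: reuse the same x field for every y file.
            inferred["x"] = {file: x for file in y}
        else:
            inferred["x"] = {plot_id: x}

    if y is None:
        # Default: last field of the first record of the plot's own data.
        inferred["y"] = {plot_id: last_field(data[plot_id])}
    elif not isinstance(y, dict):
        # Plain field name: treat the plot ID as the source file.
        inferred["y"] = {plot_id: y}

    # Inferred values take precedence over the short-form originals.
    return {**properties, **inferred}
```

For example, with `data = {"loss.csv": [{"step": 0, "loss": 0.9}]}` and empty properties, the sketch picks `loss` as the y field because it is the last key of the first record.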
Phase 2: Datapoint Flattening (flat_datapoints)
FUNCTION flat_datapoints(plot_id, data, properties, revision):
    file2datapoints, properties = convert(plot_id, data, properties)
    xs = extract_x_sources(properties, file2datapoints)
    ys = extract_y_sources(properties, file2datapoints)
    // Determine x field name
    IF no x sources:
        x_field = INDEX  // auto-generated step counter
    ELSE IF multiple x fields with different names:
        x_field = "dvc_inferred_x_value"
    ELSE:
        x_field = first x field name
    // Determine y field name
    IF multiple y fields with different names:
        y_field = "dvc_inferred_y_value"
    ELSE:
        y_field = first y field name
    all_datapoints = []
    FOR each (y_file, y_field_name) in ys:
        datapoints = copy(file2datapoints[y_file])
        // Normalize y values if unified field needed
        IF y_field == "dvc_inferred_y_value":
            COPY y_field_name values to "dvc_inferred_y_value"
        // Set x values: either from index or from source field
        IF x_field == INDEX:
            SET sequential index (0, 1, 2, ...) as INDEX field
        ELSE:
            COPY x values from x source datapoints
        // Annotate with metadata
        FOR each datapoint:
            datapoint[REVISION] = revision
            datapoint[FILENAME] = short_filename(y_file)
            datapoint[FIELD] = y_field_name
        all_datapoints.extend(datapoints)
    RETURN all_datapoints, properties
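A compact Python sketch of the flattening phase follows. It makes several assumptions that the pseudocode does not state: x and y are given as `{file: field}` mappings, x values live in the same file as the corresponding y values, the index field is named `step`, and `short_filename` is approximated by the file's base name. The function name `flatten` is likewise chosen for the example.

```python
import os

INDEX = "step"  # assumed name of the auto-generated index field
X_INFERRED = "dvc_inferred_x_value"
Y_INFERRED = "dvc_inferred_y_value"


def flatten(file2datapoints, x_map, y_map, revision):
    """Merge per-file records into one annotated list of datapoints."""
    # Choose the unified y field name.
    y_names = set(y_map.values())
    y_field = Y_INFERRED if len(y_names) > 1 else next(iter(y_names))

    # Choose the unified x field name.
    x_names = set(x_map.values())
    if not x_names:
        x_field = INDEX
    elif len(x_names) > 1:
        x_field = X_INFERRED
    else:
        x_field = next(iter(x_names))

    all_datapoints = []
    for y_file, y_name in y_map.items():
        # Copy each record so the parsed input is left untouched.
        points = [dict(point) for point in file2datapoints[y_file]]
        for i, point in enumerate(points):
            if y_field == Y_INFERRED:
                point[Y_INFERRED] = point[y_name]
            if x_field == INDEX:
                point[INDEX] = i
            elif x_field == X_INFERRED:
                point[X_INFERRED] = point[x_map[y_file]]
            # Metadata that templates use to group and color series.
            point["rev"] = revision
            point["filename"] = os.path.basename(y_file)
            point["field"] = y_name
        all_datapoints.extend(points)
    return all_datapoints, x_field, y_field
```

Note that the index restarts at 0 for every source file, so series from different files line up on the same x positions.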
The key insight is that the REVISION, FILENAME, and FIELD metadata fields are what enable Vega-Lite templates to produce multi-series charts. A template can use REVISION as a color encoding to overlay data from different Git commits, or FILENAME as a facet to create small multiples comparing different data sources.
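To make the role of the metadata fields concrete, the sketch below builds a small Vega-Lite specification as a Python dictionary. It is a hypothetical template, not one shipped with the tool: the function name, the `facet_by_file` parameter, and the choice of a line mark with quantitative axes are assumptions. The `color` and `column` encoding channels are standard Vega-Lite.

```python
import json


def line_chart_spec(datapoints, x_field, y_field, facet_by_file=False):
    """Build a Vega-Lite line chart that overlays one series per revision."""
    encoding = {
        "x": {"field": x_field, "type": "quantitative"},
        "y": {"field": y_field, "type": "quantitative"},
        # One color per revision overlays data from different commits.
        "color": {"field": "rev", "type": "nominal"},
    }
    if facet_by_file:
        # One column per source file produces small multiples.
        encoding["column"] = {"field": "filename", "type": "nominal"}
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": datapoints},
        "mark": "line",
        "encoding": encoding,
    }


if __name__ == "__main__":
    points = [
        {"step": 0, "loss": 0.9, "rev": "main", "filename": "train.csv"},
        {"step": 0, "loss": 0.8, "rev": "exp-1", "filename": "train.csv"},
    ]
    print(json.dumps(line_chart_spec(points, "step", "loss"), indent=2))
```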