Principle:Iterative Dvc Plot Data Parsing
| Knowledge Sources | |
|---|---|
| Domains | Visualization, Data_Processing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Plot data parsing is the process of reading raw data files in heterogeneous formats and converting them into normalized lists of dictionaries suitable for chart rendering.
Description
Data science workflows produce metrics and plot data in a variety of file formats. Training logs may be written as CSV files, configuration results as JSON, and experiment metadata as YAML. For a visualization system to render charts from these diverse sources, it must implement a unified parsing layer that detects each file's format, applies the appropriate parser, and produces a consistent output structure that downstream rendering components can consume without format-specific logic.
Plot data parsing addresses this by reading files in a format-aware way, with the format detected from the file extension. The supported text formats are CSV (comma-separated), TSV (tab-separated), JSON, and YAML. Binary image files (PNG, JPEG, etc.) are a special case: they are returned as raw bytes rather than parsed records. The parsing layer also handles edge cases such as CSV or TSV files without headers (where column names are generated as the string indices "0", "1", "2", and so on) and deeply nested JSON or YAML structures that must be flattened into tabular rows.
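The text does not specify how nested structures are flattened. As a minimal sketch of one plausible reading, the hypothetical helper `collect_rows` below walks a parsed JSON or YAML value and gathers every list of dictionaries it finds into a single list of rows. The name and the traversal rule are assumptions made for illustration, not the documented behaviour.

```python
def collect_rows(node):
    """Gather all lists of dicts found inside a nested structure (hypothetical helper)."""
    rows = []
    if isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            # A non-empty list of dicts is treated as a block of tabular rows.
            rows.extend(node)
        else:
            # Otherwise keep descending into each element.
            for item in node:
                rows.extend(collect_rows(item))
    elif isinstance(node, dict):
        for value in node.values():
            rows.extend(collect_rows(value))
    # Scalars contribute no rows.
    return rows


nested = {"train": {"history": [{"step": 0, "loss": 0.9}, {"step": 1, "loss": 0.7}]}}
rows = collect_rows(nested)
# rows == [{"step": 0, "loss": 0.9}, {"step": 1, "loss": 0.7}]
```

Under this reading, a renderer only ever sees a flat list of records, regardless of how deeply the list was nested in the original file.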
A critical performance consideration is that plot commands may reference many data files across multiple revisions. To avoid sequential I/O bottlenecks, the data resolution layer employs a thread pool to invoke the lazy data source callables in parallel. This parallel resolution is especially important when data files are served from Git's object store (for historical revisions), where individual file reads may incur decompression overhead.
Usage
Use plot data parsing when:
- Data files for visualization exist in multiple formats (CSV, TSV, JSON, YAML) and must be normalized to a common list-of-dictionaries structure.
- Image files must be distinguished from tabular data and returned as raw binary content for base64 encoding or file writing.
- CSV or TSV files may or may not include header rows, requiring configurable header handling.
- Multiple data files must be loaded concurrently to minimize I/O latency, particularly when reading from Git object stores across many revisions.
Theoretical Basis
The parsing algorithm combines format detection with format-specific readers:
    FUNCTION parse(filesystem, path, properties):
        extension = extract_extension(path)
        // Binary image handling
        IF extension IN supported_image_extensions:
            RETURN read_binary(filesystem, path)
        // Validate supported text formats
        IF extension NOT IN {".json", ".yaml", ".yml", ".csv", ".tsv"}:
            RAISE PlotMetricTypeError(path)
        content = read_text(filesystem, path, encoding="utf-8")
        // Delimiter-separated values
        IF extension IN {".csv", ".tsv"}:
            delimiter = TAB if extension == ".tsv" else COMMA
            header = properties.get("header", True)
            RETURN load_separated_values(content, delimiter, header)
        // Structured data (JSON/YAML)
        RETURN PARSERS[extension](content, path)
The separated-value loader handles the header/no-header distinction:
    FUNCTION load_separated_values(content, delimiter, header):
        IF header is True:
            reader = DictReader(content, delimiter=delimiter)
        ELSE:
            // Peek at first row to determine column count
            first_row = read_first_row(content, delimiter)
            field_names = ["0", "1", "2", ..., str(len(first_row) - 1)]
            reader = DictReader(content, delimiter=delimiter, fieldnames=field_names)
        RETURN list(reader)  // List[Dict[str, str]]
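The two pseudocode functions above can be sketched in Python with the standard library. This is an illustration under stated assumptions, not the actual implementation: it reads from the local disk instead of an abstract filesystem object, the `IMAGE_EXTENSIONS` set is a guess at what "PNG, JPEG, etc." covers, `PlotMetricTypeError` is a local stand-in class, and YAML parsing assumes the third-party PyYAML package is installed.

```python
import csv
import io
import json
import os

# Assumed set of image extensions; the text only names "PNG, JPEG, etc."
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif"}
TEXT_EXTENSIONS = {".json", ".yaml", ".yml", ".csv", ".tsv"}


class PlotMetricTypeError(Exception):
    """Local stand-in for the error raised on unsupported file types."""


def load_separated_values(content, delimiter=",", header=True):
    """Parse CSV/TSV text into a list of dicts with string values."""
    fieldnames = None  # None makes DictReader take names from the first row
    if not header:
        # Peek at the first row only to count columns, then name them "0", "1", ...
        first_row = next(csv.reader(io.StringIO(content), delimiter=delimiter), [])
        fieldnames = [str(i) for i in range(len(first_row))]
    reader = csv.DictReader(
        io.StringIO(content), fieldnames=fieldnames, delimiter=delimiter
    )
    return list(reader)


def load_yaml(content):
    import yaml  # PyYAML, imported lazily so CSV and JSON work without it

    return yaml.safe_load(content)


def parse(path, properties=None):
    """Return raw bytes for images and parsed data for supported text formats."""
    properties = properties or {}
    extension = os.path.splitext(path)[1]

    if extension in IMAGE_EXTENSIONS:
        with open(path, "rb") as fd:
            return fd.read()

    if extension not in TEXT_EXTENSIONS:
        raise PlotMetricTypeError(path)

    with open(path, encoding="utf-8") as fd:
        content = fd.read()

    if extension in {".csv", ".tsv"}:
        delimiter = "\t" if extension == ".tsv" else ","
        header = properties.get("header", True)
        return load_separated_values(content, delimiter, header)
    if extension == ".json":
        return json.loads(content)
    return load_yaml(content)
```

For example, `load_separated_values("0.9,0.1\n0.8,0.2\n", header=False)` returns `[{"0": "0.9", "1": "0.1"}, {"0": "0.8", "1": "0.2"}]`: the first row is kept as data because the field names are supplied explicitly. All values remain strings, so any numeric conversion is left to the consumer.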
The parallel resolution layer uses a thread pool executor to resolve all lazy data sources concurrently:
    FUNCTION resolve_data_sources(plots_data, revision):
        // Walk the nested dictionary to find all lazy callables
        to_resolve = find_all_entries_with_key("data_source", plots_data)
        FUNCTION resolve(entry):
            callable = entry.pop("data_source")
            result = callable()   // Invokes parse()
            entry.update(result)  // Mutates in-place with parsed data
        // Execute in parallel with bounded thread pool
        WITH ThreadPoolExecutor(max_workers=min(16, 4 * cpu_count())):
            parallel_map(resolve, to_resolve)
            display_progress(total=len(to_resolve), description=revision)
The in-place mutation pattern is deliberate: each entry in the nested plots_data dictionary starts with a "data_source" key holding a callable. After resolution, this key is removed and replaced with the actual parsed data (either "data" containing a list of dictionaries, or raw bytes for images). This avoids constructing a separate output structure and keeps the data flow simple for downstream consumers.
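The resolution step can be sketched in Python with `concurrent.futures`. This is a simplified illustration, not the actual implementation: `find_entries` is a hypothetical helper, progress display is omitted, and each lazy callable is assumed to return a dictionary such as `{"data": [...]}` that is merged into its entry.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def find_entries(node, key="data_source"):
    """Yield every dict inside a nested dict that contains `key` (hypothetical helper)."""
    if isinstance(node, dict):
        if key in node:
            yield node
        for value in node.values():
            yield from find_entries(value, key)


def resolve_data_sources(plots_data):
    """Replace each "data_source" callable with the data it produces, in place."""
    # Collect the entries first so no dict is mutated while it is being walked.
    to_resolve = list(find_entries(plots_data))

    def resolve(entry):
        source = entry.pop("data_source")
        entry.update(source())  # assumed to return a dict, e.g. {"data": [...]}

    max_workers = min(16, 4 * (os.cpu_count() or 1))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # list() drains the iterator so exceptions raised in workers surface here.
        list(executor.map(resolve, to_resolve))
    return plots_data


plots = {
    "workspace": {
        "loss.csv": {"data_source": lambda: {"data": [{"step": "0", "loss": "0.9"}]}},
        "acc.json": {"data_source": lambda: {"data": [{"acc": 0.8}]}},
    }
}
resolve_data_sources(plots)
# plots["workspace"]["loss.csv"] == {"data": [{"step": "0", "loss": "0.9"}]}
```

Because each worker only touches its own entry dictionary, the workers need no locking. Threads suit this workload because the callables are dominated by file I/O rather than CPU-bound work.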