Principle:Iterative Dvc Database Import

Knowledge Sources	DVC Documentation
Domains	Database, Data_Import
Last Updated	2026-02-10 00:00 GMT

Overview

Database import is the extraction of data from SQL databases into version-controlled file formats, bridging the gap between live relational data sources and the file-based, content-addressable storage model used by data version control systems.

Description

Many machine learning and data science workflows depend on data that lives in relational databases -- customer records, transaction logs, sensor readings, and other operational data. However, data version control systems operate on files: they hash file contents, store files in caches, and track file versions through metafiles. Database import bridges this impedance mismatch by executing SQL queries against a database, serializing the result sets into standard file formats (such as CSV), and registering the resulting files as version-controlled artifacts.

The import process involves several coordinated steps. First, a connection is established to the database using a connection string or DSN (Data Source Name) that specifies the database engine, host, credentials, and database name. Next, a SQL query is executed to extract the desired data. The query may range from a simple SELECT * FROM table to a complex analytical query involving joins, aggregations, and filters. The result set is then streamed into an output file, with appropriate serialization for the target format. Finally, the output file is registered with the data version control system, which computes its content hash and stores it in the cache.

A critical aspect of database import is provenance tracking. The system records not just the output file but also the connection parameters and query that produced it. This metadata is stored in the pipeline definition and lock files, enabling the system to detect when the database source has changed (by re-executing the query and comparing results) and to reproduce the import by re-running the same query. This provenance chain connects the version-controlled file back to its source in the live database.

Usage

Database import is used whenever:

A data pipeline needs to incorporate data from a relational database as a versioned input.
A periodic ETL process extracts database snapshots for training data preparation.
A data scientist needs a reproducible, versioned copy of a database query result for experimentation.
A compliance workflow requires auditable snapshots of database state at specific points in time.
An import dependency is declared in a pipeline to trigger re-execution when the source data changes.

Theoretical Basis

Impedance mismatch resolution. Relational databases and file-based version control systems operate on fundamentally different data models. Databases organize data as rows within tables with typed columns and referential integrity constraints, while version control systems operate on opaque byte streams identified by content hashes. Database import resolves this impedance mismatch through serialization:

Database Model:                 File Model:
    Table: customers                File: customers.csv
    Columns: id, name, email        Content: "id,name,email\n1,Alice,..."
    Rows: structured tuples         Hash: md5:a3f7b2c1...
    Access: SQL queries             Access: file path + content hash

Import Pipeline:
    SQL Query -> Result Set -> Serialization -> File -> Content Hash -> Cache

Connection abstraction. Database connectivity is managed through a layered abstraction that separates connection configuration from query execution. The connection string encodes the database engine, host, port, credentials, and database name in a URI format. This follows the connection string pattern common in database middleware:

Connection String Format:
    engine://user:password@host:port/database

Examples:
    postgresql://analyst:pass@db.example.com:5432/warehouse
    mysql://reader:pass@localhost:3306/app_data
    sqlite:///path/to/local.db

Import Function:
    function import_from_database(connection_string, query, output_path):
        connection = connect(connection_string)
        cursor = connection.execute(query)
        writer = open_csv_writer(output_path)
        writer.write_header(cursor.column_names)
        for row in cursor:
            writer.write_row(row)
        writer.close()
        return compute_hash(output_path)

Change detection for database sources. Unlike file-based dependencies where change detection uses content hashing, database dependencies require re-executing the query to determine if results have changed. The system stores a hash of the previous query result and compares it with a fresh execution to detect changes. This is more expensive than file-based change detection but necessary because databases are live, mutable systems without built-in content addressing.

Related Pages

Implemented By

Implementation:Iterative_Dvc_Database

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment