Principle:Iterative Dvc Database Import
| Knowledge Sources | |
|---|---|
| Domains | Database, Data_Import |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Database import is the extraction of data from SQL databases into version-controlled file formats, bridging the gap between live relational data sources and the file-based, content-addressable storage model used by data version control systems.
Description
Many machine learning and data science workflows depend on data that lives in relational databases -- customer records, transaction logs, sensor readings, and other operational data. However, data version control systems operate on files: they hash file contents, store files in caches, and track file versions through metafiles. Database import bridges this impedance mismatch by executing SQL queries against a database, serializing the result sets into standard file formats (such as CSV), and registering the resulting files as version-controlled artifacts.
The import process involves several coordinated steps. First, a connection is established to the database using a connection string or DSN (Data Source Name) that specifies the database engine, host, credentials, and database name. Next, a SQL query is executed to extract the desired data. The query may range from a simple SELECT * FROM table to a complex analytical query involving joins, aggregations, and filters. The result set is then streamed into an output file, with appropriate serialization for the target format. Finally, the output file is registered with the data version control system, which computes its content hash and stores it in the cache.
A critical aspect of database import is provenance tracking. The system records not just the output file but also the connection parameters and query that produced it. This metadata is stored in the pipeline definition and lock files, enabling the system to detect when the database source has changed (by re-executing the query and comparing results) and to reproduce the import by re-running the same query. This provenance chain connects the version-controlled file back to its source in the live database.
Usage
Database import is used whenever:
- A data pipeline needs to incorporate data from a relational database as a versioned input.
- A periodic ETL process extracts database snapshots for training data preparation.
- A data scientist needs a reproducible, versioned copy of a database query result for experimentation.
- A compliance workflow requires auditable snapshots of database state at specific points in time.
- An import dependency is declared in a pipeline to trigger re-execution when the source data changes.
Theoretical Basis
Impedance mismatch resolution. Relational databases and file-based version control systems operate on fundamentally different data models. Databases organize data as rows within tables with typed columns and referential integrity constraints, while version control systems operate on opaque byte streams identified by content hashes. Database import resolves this impedance mismatch through serialization:
Database Model: File Model:
Table: customers File: customers.csv
Columns: id, name, email Content: "id,name,email\n1,Alice,..."
Rows: structured tuples Hash: md5:a3f7b2c1...
Access: SQL queries Access: file path + content hash
Import Pipeline:
SQL Query -> Result Set -> Serialization -> File -> Content Hash -> Cache
Connection abstraction. Database connectivity is managed through a layered abstraction that separates connection configuration from query execution. The connection string encodes the database engine, host, port, credentials, and database name in a URI format. This follows the connection string pattern common in database middleware:
Connection String Format:
engine://user:password@host:port/database
Examples:
postgresql://analyst:pass@db.example.com:5432/warehouse
mysql://reader:pass@localhost:3306/app_data
sqlite:///path/to/local.db
Import Function:
function import_from_database(connection_string, query, output_path):
connection = connect(connection_string)
cursor = connection.execute(query)
writer = open_csv_writer(output_path)
writer.write_header(cursor.column_names)
for row in cursor:
writer.write_row(row)
writer.close()
return compute_hash(output_path)
Change detection for database sources. Unlike file-based dependencies where change detection uses content hashing, database dependencies require re-executing the query to determine if results have changed. The system stores a hash of the previous query result and compares it with a fresh execution to detect changes. This is more expensive than file-based change detection but necessary because databases are live, mutable systems without built-in content addressing.