Principle:Treeverse LakeFS Imported Data Verification

Knowledge Sources	lakeFS, lakeFS Documentation
Domains	Data_Import, Data_Engineering
Last Updated	2026-02-08 00:00 GMT

Overview

Imported data verification is the practice of confirming that expected objects are present at their expected paths and have correct metadata after an import or data mutation operation.

Description

After a data import into lakeFS completes, verification is the step that confirms data integrity. Import operations can fail partially, source locations can be misconfigured, and destination path mappings can contain errors. Without verification, downstream consumers may operate on incomplete or incorrect data, leading to silent data quality issues.

Verification in the context of lakeFS imports involves:

Object presence checks -- Confirming that specific expected files exist at their expected destination paths within the branch
Prefix enumeration -- Listing all objects under the import destination prefix to verify the total count matches expectations
Metadata validation -- Checking that object properties (size, checksum, content type) match the original source data
Path structure validation -- Ensuring that the directory hierarchy was correctly mapped from source to destination
Content spot-checking -- Optionally reading a sample of imported objects to verify that their content is accessible (that is, the zero-copy pointer resolves correctly)

The lakeFS object listing API provides the primary mechanism for verification. By listing objects with a prefix filter matching the import destination, the client can enumerate all imported objects and compare them against expected values.
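
As an illustration of prefix enumeration, the following is a minimal Python sketch that follows pagination until the listing is exhausted. The names `iter_objects` and `fake_fetch_page` are hypothetical, and the page fetcher is an in-memory stand-in rather than a real client call. The page shape (a `results` list plus a `pagination` object with `has_more` and `next_offset`) is assumed to mirror the listing response; check it against the API version in use.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical page fetcher: (prefix, after) -> one page of listing results.
# Wiring this to a real lakeFS client is intentionally left out.
FetchPage = Callable[[str, str], Dict]


def iter_objects(fetch_page: FetchPage, prefix: str) -> Iterator[Dict]:
    """Yield every object record under `prefix`, following pagination."""
    after = ""
    while True:
        page = fetch_page(prefix, after)
        yield from page["results"]
        pagination = page["pagination"]
        if not pagination["has_more"]:
            return
        # Resume the next request after the last key already seen.
        after = pagination["next_offset"]


# In-memory stand-in for a branch's contents (illustrative data only).
_STORE: List[Dict] = [
    {"path": "imported/a.parquet", "size_bytes": 10},
    {"path": "imported/b.parquet", "size_bytes": 20},
    {"path": "imported/c.parquet", "size_bytes": 30},
    {"path": "other/z.parquet", "size_bytes": 40},
]


def fake_fetch_page(prefix: str, after: str, page_size: int = 2) -> Dict:
    """Return one page of records under `prefix` with keys greater than `after`."""
    matching = sorted(
        (o for o in _STORE if o["path"].startswith(prefix) and o["path"] > after),
        key=lambda o: o["path"],
    )
    results = matching[:page_size]
    return {
        "results": results,
        "pagination": {
            "has_more": len(matching) > page_size,
            "next_offset": results[-1]["path"] if results else "",
        },
    }


imported = list(iter_objects(fake_fetch_page, "imported/"))
print(len(imported), [o["path"] for o in imported])
```

Injecting the page fetcher keeps the enumeration logic testable without a running lakeFS server; in a real pipeline the stand-in would be replaced by a call to the listing endpoint.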

Usage

Use imported data verification for:

Post-import validation -- As the final step of any import pipeline, before signaling downstream systems that new data is available
Data quality gates -- Implementing automated checks in CI/CD or Airflow DAGs that block pipeline progression if verification fails
Debugging import failures -- When an import completes but downstream queries return unexpected results, listing objects helps identify missing or misplaced files
Compliance and audit -- Documenting that imported data matches the source manifest for regulatory or governance purposes
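
A data quality gate can be as small as a function that raises when any check fails. The sketch below is one possible shape, not a prescribed one: `VerificationFailed` and `enforce_gate` are hypothetical names, and the check names in the example are illustrative. In orchestrators such as Airflow, an uncaught exception typically fails the task, which is what blocks downstream steps.

```python
from typing import Dict


class VerificationFailed(Exception):
    """Raised when one or more post-import checks fail."""


def enforce_gate(check_results: Dict[str, bool]) -> None:
    """Raise VerificationFailed if any named check did not pass."""
    failed = sorted(name for name, ok in check_results.items() if not ok)
    if failed:
        raise VerificationFailed("failed checks: " + ", ".join(failed))


# Example: the count check failed, so the gate blocks progression.
try:
    enforce_gate({"objects_present": True, "count_matches": False})
except VerificationFailed as exc:
    print(exc)
```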

Theoretical Basis

Data verification after import follows the general postcondition validation pattern in software engineering. The import operation has a defined contract: given a set of import locations, it should produce a specific set of objects in the repository. Verification checks this postcondition.

VERIFICATION STRATEGY:

  precondition:  import_locations = [(type, path, destination), ...]
  operation:     import_start(...) --> wait_for_completion(...)
  postcondition: for each import_location:
                     objects_at(destination) matches objects_at(source_path)
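
The postcondition can be sketched as a set comparison between the keys expected from the source and the keys actually listed at the destination. This is a minimal Python illustration under one assumption: the import maps a source prefix onto a destination prefix and preserves relative paths. `PostconditionReport` and `check_postcondition` are hypothetical names.

```python
from typing import Iterable, NamedTuple, Set


class PostconditionReport(NamedTuple):
    missing: Set[str]      # expected at the destination but not listed
    unexpected: Set[str]   # listed at the destination but not expected

    @property
    def ok(self) -> bool:
        return not self.missing and not self.unexpected


def check_postcondition(
    source_keys: Iterable[str],
    source_prefix: str,
    destination_paths: Iterable[str],
    destination_prefix: str,
) -> PostconditionReport:
    """Compare destination paths against source keys remapped by prefix."""
    expected = {
        destination_prefix + key[len(source_prefix):]
        for key in source_keys
        if key.startswith(source_prefix)
    }
    actual = {p for p in destination_paths if p.startswith(destination_prefix)}
    return PostconditionReport(missing=expected - actual, unexpected=actual - expected)


report = check_postcondition(
    source_keys=["data/2024/a.parquet", "data/2024/b.parquet"],
    source_prefix="data/",
    destination_paths=["imported/2024/a.parquet", "imported/extra.txt"],
    destination_prefix="imported/",
)
print(report.ok, sorted(report.missing), sorted(report.unexpected))
```

Reporting both directions of the difference distinguishes a partial import (missing keys) from a path mapping error (unexpected keys).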

There are multiple levels of verification, ordered by increasing thoroughness:

Level 1: EXISTENCE CHECK
  - List objects under the destination prefix
  - Verify the count is non-zero
  - Fast, catches total import failures

Level 2: COUNT VALIDATION
  - Compare the number of imported objects against the expected count
  - Catches partial imports where some objects were missed

Level 3: KEY VALIDATION
  - Check that specific expected object paths exist
  - Sample a known subset of keys from the source
  - Catches path mapping errors

Level 4: METADATA VALIDATION
  - For each sampled object, verify size_bytes, checksum, content_type
  - Catches data corruption or incorrect pointer resolution

Level 5: CONTENT VALIDATION
  - Read the actual content of sampled objects
  - Verify against known checksums or expected data patterns
  - Most thorough but most expensive
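
Levels 1 through 4 can be sketched as a single pass over a listing. The Python below is an illustration rather than the lakeFS implementation: `ObjectRecord` and `verify_levels` are hypothetical names, and the record fields simply reuse the `size_bytes`, `checksum`, and `content_type` names from Level 4. Level 5 is left out because it requires reading object content.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ObjectRecord:
    path: str
    size_bytes: int
    checksum: str
    content_type: str = "application/octet-stream"


def verify_levels(
    listed: List[ObjectRecord],
    expected_count: int,
    expected: Dict[str, ObjectRecord],
) -> List[str]:
    """Return failure messages for Levels 1-4; an empty list means all passed."""
    # Level 1: anything at all under the destination prefix?
    if not listed:
        return ["level 1: no objects found under destination prefix"]

    failures: List[str] = []
    # Level 2: does the total count match?
    if len(listed) != expected_count:
        failures.append(
            f"level 2: expected {expected_count} objects, found {len(listed)}"
        )

    by_path = {obj.path: obj for obj in listed}
    for path, want in expected.items():
        got = by_path.get(path)
        # Level 3: is the sampled key present?
        if got is None:
            failures.append(f"level 3: missing {path}")
            continue
        # Level 4: does the sampled object's metadata match?
        for field in ("size_bytes", "checksum", "content_type"):
            if getattr(got, field) != getattr(want, field):
                failures.append(f"level 4: {path} {field} mismatch")
    return failures


listed = [
    ObjectRecord("imported/a.parquet", 10, "abc"),
    ObjectRecord("imported/b.parquet", 20, "def"),
]
sample = {
    "imported/a.parquet": ObjectRecord("imported/a.parquet", 10, "abc"),
    "imported/c.parquet": ObjectRecord("imported/c.parquet", 30, "ghi"),
}
print(verify_levels(listed, expected_count=3, expected=sample))
```

Returning messages instead of raising on the first problem lets a single run report every failed level, which is useful when debugging a partial import.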

The lakeFS integration test suite (esti/import_test.go) implements Level 3 verification: it maintains a list of known file paths (importFilesToCheck) and verifies that each one exists at the expected destination after import, comparing the content length where an expected value is known. It also performs a full listing to validate path prefixes and ordering:

for each expected_file in known_files:
    response = get_object(repo, branch, destination_prefix + expected_file)
    assert response.status == 200
    if expected_content_length > 0:
        assert response.content_length == expected_content_length

# Full listing verification
objects = list_all_objects(repo, branch)
for each object in objects:
    assert object.path starts_with destination_prefix
# Ordering is a property of the whole listing, so check it once
assert objects are sorted by path (lexicographic order)

Prefix-based listing is the most practical approach for large imports where checking every individual object is infeasible. The prefix filter ensures only objects under the import destination are returned, reducing the result set and speeding up verification.
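
The prefix and ordering checks from the pseudocode above can be written in Python roughly as follows. This is a sketch over a plain list of paths, not a translation of the Go test; `check_listing` is a hypothetical name.

```python
from typing import List


def check_listing(paths: List[str], destination_prefix: str) -> List[str]:
    """Return problems found in a listing: stray paths and ordering violations."""
    problems: List[str] = []
    for path in paths:
        if not path.startswith(destination_prefix):
            problems.append(f"outside prefix: {path}")
    # Compare each path with its successor to confirm lexicographic order.
    for previous, current in zip(paths, paths[1:]):
        if previous > current:
            problems.append(f"out of order: {previous} before {current}")
    return problems


print(check_listing(["imported/a", "imported/c", "imported/b"], "imported/"))
```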

Related Pages

Implemented By

Implementation:Treeverse_LakeFS_ListObjects_For_Import
