Principle:Treeverse LakeFS Imported Data Verification

From Leeroopedia


Knowledge Sources
Domains: Data_Import, Data_Engineering
Last Updated: 2026-02-08 00:00 GMT

Overview

Imported data verification is the practice of confirming that expected objects are present at their expected paths and have correct metadata after an import or data mutation operation.

Description

After completing a data import into lakeFS, verification is an essential step to ensure data integrity. Import operations can fail partially, source data can be misconfigured, and destination path mappings can contain errors. Without verification, downstream consumers may operate on incomplete or incorrect data, leading to silent data quality issues.

Verification in the context of lakeFS imports involves:

  • Object presence checks -- Confirming that specific expected files exist at their expected destination paths within the branch
  • Prefix enumeration -- Listing all objects under the import destination prefix to verify the total count matches expectations
  • Metadata validation -- Checking that object properties (size, checksum, content type) match the original source data
  • Path structure validation -- Ensuring that the directory hierarchy was correctly mapped from source to destination
  • Content spot-checking -- Optionally reading a sample of imported objects to verify their content is accessible (the zero-copy pointer resolves correctly)

The lakeFS object listing API provides the primary mechanism for verification. By listing objects with a prefix filter matching the import destination, the client can enumerate all imported objects and compare them against expected values.
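
A minimal sketch of prefix-filtered enumeration with pagination is shown below. It does not call lakeFS directly: `fetch_page` is a hypothetical callable standing in for one listing request, and the page shape (`results`, `pagination.has_more`, `pagination.next_offset`) is an assumption about what such a request returns. `make_fake_fetch` is an in-memory stand-in used only to make the example runnable.

```python
from typing import Any, Callable, Dict, Iterator, List

# Assumed page shape (hypothetical):
# {"results": [{"path": ...}, ...],
#  "pagination": {"has_more": bool, "next_offset": str}}
Page = Dict[str, Any]


def list_prefix(fetch_page: Callable[[str, str], Page], prefix: str) -> Iterator[dict]:
    """Yield every object record under `prefix`, following pagination."""
    after = ""
    while True:
        page = fetch_page(prefix, after)
        yield from page["results"]
        pagination = page["pagination"]
        if not pagination["has_more"]:
            return
        after = pagination["next_offset"]


def make_fake_fetch(paths: List[str], page_size: int = 2) -> Callable[[str, str], Page]:
    """Build an in-memory stand-in for a listing request (illustration only)."""
    ordered = sorted(paths)

    def fetch_page(prefix: str, after: str) -> Page:
        # Keep paths under the prefix that sort strictly after the offset.
        matching = [p for p in ordered if p.startswith(prefix) and p > after]
        chunk = matching[:page_size]
        return {
            "results": [{"path": p} for p in chunk],
            "pagination": {
                "has_more": len(matching) > page_size,
                "next_offset": chunk[-1] if chunk else "",
            },
        }

    return fetch_page


fake = make_fake_fetch(
    ["imported/a.csv", "imported/b.csv", "imported/sub/c.csv", "other/d.csv"]
)
imported_paths = [obj["path"] for obj in list_prefix(fake, "imported/")]
```

Keeping the request behind a callable means the same loop can be pointed at a real client wrapper or at a fake in tests.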

Usage

Use imported data verification when:

  • Post-import validation -- As the final step of any import pipeline, before signaling downstream systems that new data is available
  • Data quality gates -- Implementing automated checks in CI/CD or Airflow DAGs that block pipeline progression if verification fails
  • Debugging import failures -- When an import completes but downstream queries return unexpected results, listing objects helps identify missing or misplaced files
  • Compliance and audit -- Documenting that imported data matches the source manifest for regulatory or governance purposes

Theoretical Basis

Data verification after import follows the general postcondition validation pattern in software engineering. The import operation has a defined contract: given a set of import locations, it should produce a specific set of objects in the repository. Verification checks this postcondition.

VERIFICATION STRATEGY:

  precondition:  import_locations = [(type, path, destination), ...]
  operation:     import_start(...) --> wait_for_completion(...)
  postcondition: for each import_location:
                     objects_at(destination) matches objects_at(source_path)
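
The postcondition above can be sketched as a set comparison. This is an illustration under stated assumptions, not the lakeFS implementation: the location type names (`common_prefix`, `object`) and the mapping rule (strip the source prefix, prepend the destination) are assumptions, and `ImportLocation`, `expected_destinations`, and `check_postcondition` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class ImportLocation:
    type: str         # assumed values: "common_prefix" or "object"
    path: str         # source path or source prefix
    destination: str  # destination path or destination prefix


def expected_destinations(location: ImportLocation, source_keys: Iterable[str]) -> Set[str]:
    """Map source keys to the destination paths the import should produce."""
    if location.type == "object":
        # Assumption: a single object lands exactly at `destination`.
        return {location.destination}
    # Assumption: keys under the source prefix are re-rooted under `destination`.
    return {
        location.destination + key[len(location.path):]
        for key in source_keys
        if key.startswith(location.path)
    }


def check_postcondition(
    location: ImportLocation,
    source_keys: Iterable[str],
    destination_keys: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) destination paths, each sorted."""
    expected = expected_destinations(location, source_keys)
    actual = set(destination_keys)
    return sorted(expected - actual), sorted(actual - expected)


location = ImportLocation("common_prefix", "s3://bucket/raw/", "imported/")
missing, unexpected = check_postcondition(
    location,
    source_keys=["s3://bucket/raw/a.csv", "s3://bucket/raw/2024/b.csv"],
    destination_keys=["imported/a.csv"],
)
```

Here `missing` holds the destination path for the object that was not imported, and `unexpected` is empty.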

There are multiple levels of verification, ordered by increasing thoroughness:

Level 1: EXISTENCE CHECK
  - List objects under the destination prefix
  - Verify the count is non-zero
  - Fast, catches total import failures

Level 2: COUNT VALIDATION
  - Compare the number of imported objects against the expected count
  - Catches partial imports where some objects were missed

Level 3: KEY VALIDATION
  - Check that specific expected object paths exist
  - Sample a known subset of keys from the source
  - Catches path mapping errors

Level 4: METADATA VALIDATION
  - For each sampled object, verify size_bytes, checksum, content_type
  - Catches data corruption or incorrect pointer resolution

Level 5: CONTENT VALIDATION
  - Read the actual content of sampled objects
  - Verify against known checksums or expected data patterns
  - Most thorough but most expensive
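
Levels 1 through 4 can be combined into a single pass over a listing, as in the sketch below. It assumes the listing has already been fetched into a list of dictionaries keyed by `path` plus the metadata fields named above (`size_bytes`, `checksum`, `content_type`); `verify_import` is a hypothetical helper. Level 5 is omitted because it requires reading object content.

```python
from typing import Dict, List, Optional


def verify_import(
    listed: List[dict],
    expected_count: Optional[int] = None,
    expected_objects: Optional[Dict[str, dict]] = None,
) -> List[str]:
    """Return failure messages; an empty list means every check passed."""
    by_path = {obj["path"]: obj for obj in listed}

    # Level 1: existence. Nothing else is meaningful if the prefix is empty.
    if not by_path:
        return ["level 1: no objects under destination prefix"]

    failures: List[str] = []

    # Level 2: count validation.
    if expected_count is not None and len(by_path) != expected_count:
        failures.append(
            f"level 2: expected {expected_count} objects, found {len(by_path)}"
        )

    # Levels 3 and 4: key presence, then metadata for each sampled key.
    for path, expected_meta in (expected_objects or {}).items():
        obj = by_path.get(path)
        if obj is None:
            failures.append(f"level 3: missing {path}")
            continue
        for field, want in expected_meta.items():
            if obj.get(field) != want:
                failures.append(
                    f"level 4: {path} {field}={obj.get(field)!r}, expected {want!r}"
                )
    return failures


listed = [
    {"path": "imported/a.csv", "size_bytes": 10, "checksum": "abc"},
    {"path": "imported/b.csv", "size_bytes": 20, "checksum": "def"},
]
failures = verify_import(
    listed,
    expected_count=3,
    expected_objects={
        "imported/a.csv": {"size_bytes": 10},
        "imported/b.csv": {"size_bytes": 21},
        "imported/c.csv": {},
    },
)
```

Returning messages instead of raising on the first failure lets a pipeline report every problem found in one run.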

The lakeFS integration test suite (esti/import_test.go) implements Level 3 verification, plus a size check wherever an expected content length is known: it maintains a list of known file paths (importFilesToCheck) and verifies that each one exists at the expected destination after import. It also performs a full listing to validate path prefixes and ordering:

for each expected_file in known_files:
    response = get_object(repo, branch, destination_prefix + expected_file)
    assert response.status == 200
    if expected_content_length > 0:
        assert response.content_length == expected_content_length

# Full listing verification
objects = list_all_objects(repo, branch)
for each object in objects:
    assert object.path starts_with destination_prefix
# Ordering is a property of the whole listing, so check it once, outside the loop
assert objects are sorted by path (lexicographic order)
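
The full-listing checks in the pseudocode above might be rendered in Python roughly as follows. This sketch works on a plain list of paths that is assumed to have been fetched already; `check_listing` is a hypothetical name, and paths are compared as UTF-8 bytes on the assumption that the listing is ordered bytewise.

```python
from typing import List


def check_listing(paths: List[str], destination_prefix: str) -> List[str]:
    """Return problems found in a listing; an empty list means it is valid."""
    # Every listed path must sit under the import destination.
    problems = [
        f"outside prefix: {p}" for p in paths if not p.startswith(destination_prefix)
    ]
    # Ordering is checked once over adjacent pairs of the whole listing.
    encoded = [p.encode("utf-8") for p in paths]
    if any(a > b for a, b in zip(encoded, encoded[1:])):
        problems.append("listing is not sorted by path")
    return problems


ok = check_listing(["imported/a.csv", "imported/b.csv"], "imported/")
bad = check_listing(["imported/b.csv", "imported/a.csv", "other/c.csv"], "imported/")
```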

Prefix-based listing is the most practical approach for large imports where checking every individual object is infeasible. The prefix filter ensures only objects under the import destination are returned, reducing the result set and speeding up verification.

Related Pages

Implemented By
