Principle:Treeverse LakeFS Imported Data Verification
| Knowledge Sources | |
|---|---|
| Domains | Data_Import, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Imported data verification is the practice of confirming that expected objects are present at their expected paths and have correct metadata after an import or data mutation operation.
Description
After completing a data import into lakeFS, verification is an essential step to ensure data integrity. Import operations can fail partially, source data can be misconfigured, and destination path mappings can contain errors. Without verification, downstream consumers may operate on incomplete or incorrect data, leading to silent data quality issues.
Verification in the context of lakeFS imports involves:
- Object presence checks -- Confirming that specific expected files exist at their expected destination paths within the branch
- Prefix enumeration -- Listing all objects under the import destination prefix to verify the total count matches expectations
- Metadata validation -- Checking that object properties (size, checksum, content type) match the original source data
- Path structure validation -- Ensuring that the directory hierarchy was correctly mapped from source to destination
- Content spot-checking -- Optionally reading a sample of imported objects to verify that their content is accessible, that is, that the zero-copy pointer resolves to the source data
The lakeFS object listing API provides the primary mechanism for verification. By listing objects with a prefix filter matching the import destination, the client can enumerate all imported objects and compare them against expected values.
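The comparison step can be sketched as follows. This is a minimal illustration that assumes the listing has already been fetched and reduced to a collection of object paths; `diff_listing`, `ListingDiff`, and their parameter names are hypothetical helpers and not part of the lakeFS API.

```python
from typing import Iterable, List, NamedTuple


class ListingDiff(NamedTuple):
    missing: List[str]      # expected paths absent from the listing
    unexpected: List[str]   # listed paths under the prefix that were not expected


def diff_listing(listed_paths: Iterable[str],
                 expected_paths: Iterable[str],
                 prefix: str) -> ListingDiff:
    """Compare listed object paths under `prefix` with the expected set."""
    # Keep only paths under the import destination, mirroring a prefix filter.
    listed = {p for p in listed_paths if p.startswith(prefix)}
    expected = set(expected_paths)
    return ListingDiff(
        missing=sorted(expected - listed),
        unexpected=sorted(listed - expected),
    )
```

An empty `missing` list together with an empty `unexpected` list indicates that the destination prefix holds exactly the expected set of paths.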
Usage
Use imported data verification when:
- Post-import validation -- As the final step of any import pipeline, before signaling downstream systems that new data is available
- Data quality gates -- Implementing automated checks in CI/CD or Airflow DAGs that block pipeline progression if verification fails
- Debugging import failures -- When an import completes but downstream queries return unexpected results, listing objects helps identify missing or misplaced files
- Compliance and audit -- Documenting that imported data matches the source manifest for regulatory or governance purposes
Theoretical Basis
Data verification after import follows the general postcondition validation pattern in software engineering. The import operation has a defined contract: given a set of import locations, it should produce a specific set of objects in the repository. Verification checks this postcondition.
VERIFICATION STRATEGY:
    precondition:  import_locations = [(type, path, destination), ...]
    operation:     import_start(...) --> wait_for_completion(...)
    postcondition: for each import_location:
                       objects_at(destination) matches objects_at(source_path)
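One way to make this postcondition checkable is to derive the expected destination paths from each import location and compare them with what is actually present. The sketch below is illustrative only: it assumes two location types named `common_prefix` and `object`, assumes that a prefix import preserves the key suffix under the destination, and uses hypothetical helper names (`expected_destinations`, `unmet_postconditions`).

```python
from typing import Iterable, List, Tuple

# (type, source path, destination), as in the precondition above
ImportLocation = Tuple[str, str, str]


def expected_destinations(location: ImportLocation,
                          source_keys: Iterable[str]) -> List[str]:
    """Map source keys to the destination paths an import should produce."""
    kind, source_path, destination = location
    if kind == "object":
        # Assumption: a single-object import lands exactly at `destination`.
        return [destination]
    if kind == "common_prefix":
        # Assumption: the part of the key after the source prefix is kept.
        return sorted(destination + key[len(source_path):]
                      for key in source_keys
                      if key.startswith(source_path))
    raise ValueError(f"unknown import location type: {kind!r}")


def unmet_postconditions(locations: Iterable[ImportLocation],
                         source_keys: Iterable[str],
                         actual_paths: Iterable[str]) -> List[str]:
    """Return expected destination paths that are absent after the import."""
    keys = list(source_keys)
    actual = set(actual_paths)
    missing: List[str] = []
    for location in locations:
        missing.extend(p for p in expected_destinations(location, keys)
                       if p not in actual)
    return missing
```

An empty result from `unmet_postconditions` means every expected destination path was found; it says nothing about metadata or content, which the higher verification levels cover.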
There are multiple levels of verification, ordered by increasing thoroughness:
Level 1: EXISTENCE CHECK
- List objects under the destination prefix
- Verify the count is non-zero
- Fast, catches total import failures
Level 2: COUNT VALIDATION
- Compare the number of imported objects against the expected count
- Catches partial imports where some objects were missed
Level 3: KEY VALIDATION
- Check that specific expected object paths exist
- Sample a known subset of keys from the source
- Catches path mapping errors
Level 4: METADATA VALIDATION
- For each sampled object, verify size_bytes, checksum, content_type
- Catches data corruption or incorrect pointer resolution
Level 5: CONTENT VALIDATION
- Read the actual content of sampled objects
- Verify against known checksums or expected data patterns
- Most thorough but most expensive
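The levels can be applied as a ladder in which each check runs only if the cheaper ones passed. The following is a rough sketch under several assumptions: object metadata has already been collected into dictionaries keyed by path, `ObjectInfo` and `highest_level_passed` are hypothetical names, and SHA-256 stands in for whatever checksum the deployment actually reports, which may differ.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class ObjectInfo:
    size_bytes: int
    checksum: str  # assumed here to be a SHA-256 hex digest of the content


def highest_level_passed(expected: Dict[str, ObjectInfo],
                         actual: Dict[str, ObjectInfo],
                         sample: Sequence[str],
                         read_content: Optional[Callable[[str], bytes]] = None) -> int:
    """Return the highest verification level (0-5) that succeeds.

    `sample` must contain only paths that are keys of `expected`.
    """
    if not actual:                                     # Level 1: existence
        return 0
    if len(actual) != len(expected):                   # Level 2: count
        return 1
    if any(p not in actual for p in sample):           # Level 3: keys
        return 2
    if any(actual[p] != expected[p] for p in sample):  # Level 4: metadata
        return 3
    if read_content is None:                           # Level 5 is optional
        return 4
    for p in sample:                                   # Level 5: content
        digest = hashlib.sha256(read_content(p)).hexdigest()
        if digest != expected[p].checksum:
            return 4
    return 5
```

Returning the level reached, rather than a boolean, lets a pipeline decide how much assurance it requires before proceeding.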
The lakeFS integration test suite (esti/import_test.go) implements Level 3 verification: it maintains a list of known file paths (importFilesToCheck) and verifies that each one exists at the expected destination after import, additionally comparing the content length when an expected length is known. It also performs a full listing to validate path prefixes and ordering:
for each expected_file in known_files:
    response = get_object(repo, branch, destination_prefix + expected_file)
    assert response.status == 200
    if expected_content_length > 0:
        assert response.content_length == expected_content_length

# Full listing verification
objects = list_all_objects(repo, branch)
for each object in objects:
    assert object.path starts_with destination_prefix
assert objects are sorted by path (lexicographic order)
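The full-listing portion of the pseudocode above translates naturally into a small pure function. This is a Python sketch, not the Go test itself; `listing_problems` is a hypothetical name, and ordering is checked with Python's default string comparison (by Unicode code point), which is assumed to match the listing order.

```python
from typing import List, Sequence


def listing_problems(paths: Sequence[str], destination_prefix: str) -> List[str]:
    """Report paths outside the prefix and adjacent pairs that are out of order."""
    problems: List[str] = []
    # Every listed object should live under the import destination.
    for path in paths:
        if not path.startswith(destination_prefix):
            problems.append(f"outside prefix: {path}")
    # The listing should be in ascending lexicographic order.
    for prev, curr in zip(paths, paths[1:]):
        if prev > curr:
            problems.append(f"out of order: {prev!r} before {curr!r}")
    return problems
```

Collecting problems instead of asserting on the first failure gives a complete report in a single pass, which tends to be more useful when debugging a misconfigured import.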
Prefix-based listing is the most practical approach for large imports where checking every individual object is infeasible. The prefix filter ensures only objects under the import destination are returned, reducing the result set and speeding up verification.
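For large imports the listing is typically consumed page by page. The sketch below shows one way to stream paths under a prefix without holding the whole result in memory. The `fetch_page` callable and its cursor-based shape are assumptions made for illustration; in practice it would wrap whatever listing call the client library provides.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# One page of results: (paths in this page, cursor for the next page or None)
Page = Tuple[List[str], Optional[str]]
# Hypothetical signature: fetch_page(prefix, cursor) -> Page
FetchPage = Callable[[str, Optional[str]], Page]


def iter_prefix(fetch_page: FetchPage, prefix: str) -> Iterator[str]:
    """Yield every path under `prefix`, following cursors until exhausted."""
    cursor: Optional[str] = None
    while True:
        paths, cursor = fetch_page(prefix, cursor)
        yield from paths
        if cursor is None:
            return


def count_under_prefix(fetch_page: FetchPage, prefix: str) -> int:
    """Count objects under `prefix` without materializing the full listing."""
    return sum(1 for _ in iter_prefix(fetch_page, prefix))
```

Because `iter_prefix` is a generator, a count check and a key check can each make a single streaming pass over the listing.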