Workflow:Treeverse LakeFS Write Audit Publish With Hooks

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Data_Quality, Data_Governance
Last Updated: 2026-02-08 10:00 GMT

Overview

End-to-end process for implementing data quality gates using lakeFS branches and action hooks to ensure only validated data reaches production.

Description

This workflow implements the Write-Audit-Publish pattern for data lakes. Data is written to an isolated branch, automated quality checks run via pre-commit or pre-merge hooks (using Lua scripts or webhooks), and only data that passes validation is published (merged) to the production branch. lakeFS action hooks can enforce schema validation, file format checks, PII detection, and custom business rules. Failed hooks block the operation, preventing bad data from reaching downstream consumers.

Usage

Execute this workflow when you need to enforce data quality and governance policies before data reaches production consumers. Common triggers include: building a data pipeline that must pass quality gates, complying with data governance policies requiring validation before publication, preventing schema drift in production data, or implementing automated PII detection and removal before data is shared.

Execution Steps

Step 1: Define Action Hooks

Create action configuration files that define when hooks should fire and what validation they should perform. Actions are defined as YAML files stored in the repository under the _lakefs_actions/ prefix. Each action specifies an event trigger (pre-commit, pre-merge, etc.), the branches it applies to, and one or more hooks (Lua scripts or webhook endpoints).

Key considerations:

  • Actions are defined as YAML configuration files
  • Hooks can be Lua scripts (executed server-side) or webhooks (external HTTP endpoints)
  • Supported events: pre-commit, post-commit, pre-merge, post-merge, pre-create-branch, pre-create-tag, and more
  • Branch patterns control which branches trigger the hooks
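The shape of an action file can be sketched by generating one. The snippet below is a minimal illustration, not a reference implementation: the helper `render_action`, the action name, the hook IDs, the script path, and the webhook URL are all hypothetical, and the YAML layout (`name`, `on`, `hooks`, with `script_path` and `url` properties) is assumed from the lakeFS actions format rather than taken from this page.

```python
# Events named in this document; lakeFS supports others as well.
KNOWN_EVENTS = {
    "pre-commit", "post-commit", "pre-merge", "post-merge",
    "pre-create-branch", "pre-create-tag",
}
HOOK_TYPES = {"lua", "webhook"}


def render_action(name, event, branches, hooks):
    """Render a minimal action definition as YAML text.

    hooks is a list of (hook_id, hook_type, properties) tuples.
    """
    if event not in KNOWN_EVENTS:
        raise ValueError(f"unsupported event: {event}")
    lines = [f'name: "{name}"', "on:", f"  {event}:", "    branches:"]
    # Quote branch patterns so wildcards such as "*" stay plain strings.
    lines += [f'      - "{pattern}"' for pattern in branches]
    lines.append("hooks:")
    for hook_id, hook_type, properties in hooks:
        if hook_type not in HOOK_TYPES:
            raise ValueError(f"unsupported hook type: {hook_type}")
        lines += [f"  - id: {hook_id}", f"    type: {hook_type}", "    properties:"]
        lines += [f'      {key}: "{value}"' for key, value in properties.items()]
    return "\n".join(lines) + "\n"


# Hypothetical action: two hooks that run before any merge into "main".
action_yaml = render_action(
    name="validate-staging",
    event="pre-merge",
    branches=["main"],
    hooks=[
        ("format_check", "lua", {"script_path": "scripts/format_check.lua"}),
        ("pii_scan", "webhook", {"url": "https://hooks.example.com/pii"}),
    ],
)
print(action_yaml)
```

The printed text would be saved as a file under _lakefs_actions/, for example _lakefs_actions/validate-staging.yaml (a hypothetical file name).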

Step 2: Upload Action Scripts

Upload the action YAML configuration and any associated Lua scripts to the repository. Lua hooks execute server-side and can validate file formats, check schemas, enforce naming conventions, or run custom business logic. Webhook hooks call external HTTP endpoints for more complex validation.

Key considerations:

  • Action files must be placed under _lakefs_actions/ in the repository
  • Lua scripts have access to lakeFS APIs for reading objects and metadata
  • Webhook hooks receive event context (repository, branch, commit info) as JSON payloads
  • Environment variables can be passed to hooks for configuration
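On the webhook side, the receiving service has to decode the JSON payload and answer with a status code. The sketch below shows one possible decision function. The field names (`repository_id`, `branch_id`, `commit_message`) are assumptions about the payload layout, the convention that a non-2xx response fails the hook is likewise assumed, and the empty-commit-message rule is a made-up example policy.

```python
import json

REQUIRED_FIELDS = ("repository_id", "branch_id")  # assumed payload keys


def evaluate_event(raw_body):
    """Return (http_status, message) for one hook event payload."""
    try:
        event = json.loads(raw_body)
    except ValueError:
        # Covers malformed JSON as well as bytes that are not valid UTF-8.
        return 400, "payload is not valid JSON"
    if not isinstance(event, dict):
        return 400, "payload must be a JSON object"
    missing = [key for key in REQUIRED_FIELDS if key not in event]
    if missing:
        return 400, "missing fields: " + ", ".join(missing)
    # Hypothetical policy: every commit needs a non-empty message.
    if not str(event.get("commit_message", "")).strip():
        return 422, "commit message must not be empty"
    return 200, "ok"


example = json.dumps({
    "repository_id": "example-repo",
    "branch_id": "staging-orders",
    "commit_message": "Load orders for 2026-02-07",
}).encode()
print(evaluate_event(example))
```

Keeping the decision in a pure function makes it easy to unit-test before wiring it into whatever HTTP framework serves the endpoint.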

Step 3: Write Data to Branch

Upload new or modified data objects to an isolated branch. This branch serves as a staging area where data changes accumulate before validation. Multiple files can be uploaded in a single session before committing.

Key considerations:

  • Use a dedicated branch (not the production branch) for staging data
  • Uploaded changes stay uncommitted until explicitly committed, and remain isolated on the staging branch until merged
  • Multiple data producers can work on separate branches simultaneously
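A staging upload helper might look like the following sketch. It is written against a branch object that exposes `id` and `object(path).upload(data=...)`, which is how the high-level `lakefs` Python SDK is assumed to behave. The `InMemoryBranch` stand-in, the branch names, and the object paths are hypothetical and exist only so the example runs without a server.

```python
class InMemoryObject:
    """Stand-in for an object handle; stores data in a dict."""

    def __init__(self, store, path):
        self._store = store
        self._path = path

    def upload(self, data):
        self._store[self._path] = data
        return self


class InMemoryBranch:
    """Stand-in for a branch handle with an uncommitted staging area."""

    def __init__(self, branch_id):
        self.id = branch_id
        self.uncommitted = {}

    def object(self, path):
        return InMemoryObject(self.uncommitted, path)


def stage_files(branch, files, protected=("main",)):
    """Upload every path/data pair to a staging branch without committing."""
    if branch.id in protected:
        raise ValueError(f"refusing to stage directly on '{branch.id}'")
    for path, data in files.items():
        branch.object(path).upload(data=data)
    return sorted(files)


staging = InMemoryBranch("staging-orders")
staged_paths = stage_files(staging, {
    "orders/2026-02-07/part-0.parquet": b"PAR1",
    "orders/2026-02-07/part-1.parquet": b"PAR1",
})
print(staged_paths)
```

Against a real server, the `staging` argument would presumably be a branch obtained from the SDK instead of the stand-in. The guard on `protected` encodes the rule that staging never happens on the production branch.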

Step 4: Commit With Pre-Commit Validation

Attempt to commit the staged changes. If pre-commit hooks are configured, they execute automatically before the commit is finalized. The hooks validate the staged data according to the defined rules. If any hook fails, the commit is blocked and the data remains uncommitted.

Key considerations:

  • Pre-commit hooks run synchronously: the commit waits for hook completion
  • A failed hook blocks the entire commit operation
  • Hook execution results (pass/fail, logs) are recorded in the action runs
  • Multiple hooks can be chained, and all must pass for the commit to succeed
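The gating behaviour described above can be modelled in a few lines. This is a toy in-memory simulation, not lakeFS code: the two checks are hypothetical rules, and marking the remaining hooks as skipped after the first failure is an assumption of the sketch.

```python
def check_parquet_only(staged):
    """Return an error string if any staged path is not a .parquet file."""
    bad = sorted(path for path in staged if not path.endswith(".parquet"))
    return f"non-parquet files: {', '.join(bad)}" if bad else None


def check_no_empty_files(staged):
    """Return an error string if any staged object has no content."""
    empty = sorted(path for path, data in staged.items() if not data)
    return f"empty files: {', '.join(empty)}" if empty else None


HOOKS = [("parquet_only", check_parquet_only),
         ("no_empty_files", check_no_empty_files)]


def commit_with_hooks(staged, committed, hooks):
    """Commit staged objects only if every hook passes, in order."""
    results = []
    failed = False
    for hook_id, check in hooks:
        if failed:
            results.append((hook_id, "skipped", None))
            continue
        error = check(staged)
        failed = error is not None
        results.append((hook_id, "failed" if failed else "completed", error))
    if not failed:
        committed.update(staged)  # the commit is finalized
        staged.clear()
    return not failed, results


staged = {"events/part-0.parquet": b"PAR1", "events/notes.txt": b"draft"}
committed = {}
ok, results = commit_with_hooks(staged, committed, HOOKS)
print(ok, results)
```

In this run the first hook rejects `events/notes.txt`, so nothing is committed and both objects stay in the staged set. Removing the offending file and retrying lets the commit through.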

Step 5: Review Hook Results

Inspect the action run results to understand which hooks passed or failed. The actions API provides detailed execution logs for each hook, including any error messages or validation failures. This information guides data corrections.

Key considerations:

  • Action runs are queryable via the lakeFS API
  • Each hook run includes status (completed, failed, skipped), duration, and logs
  • Failed hooks include error details for debugging
  • Post-commit hooks (if defined) run after successful commits for notifications or downstream triggers
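Once hook run records have been fetched, a small summary makes failures easy to spot. The sketch below assumes each record is a dictionary with `hook_id` and `status` keys, using the status values listed above. The exact response shape of the actions API is not specified here, so these keys and the sample records are illustrative.

```python
from collections import Counter


def summarize_hook_runs(hook_runs):
    """Count hook runs per status and list the hooks that failed."""
    counts = Counter(run["status"] for run in hook_runs)
    failed = [run["hook_id"] for run in hook_runs if run["status"] == "failed"]
    return {"counts": dict(counts), "failed": failed, "passed": not failed}


# Hypothetical records for one action run.
sample_runs = [
    {"hook_id": "format_check", "status": "completed"},
    {"hook_id": "pii_scan", "status": "failed"},
    {"hook_id": "naming_rules", "status": "skipped"},
]
summary = summarize_hook_runs(sample_runs)
print(summary)
```

The `failed` list tells you which hook logs to open first when correcting the data.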

Step 6: Merge to Production

After data passes all validation hooks, merge the staging branch into the production branch. Pre-merge hooks provide a second layer of validation at the merge boundary. Only data that passes both pre-commit and pre-merge checks reaches production.

Key considerations:

  • Pre-merge hooks can enforce additional cross-branch validation
  • The merge creates a commit on the production branch with full audit trail
  • Post-merge hooks can trigger downstream notifications or pipeline runs
  • Branch protection rules can restrict who can merge to production branches
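As an illustration of cross-branch validation at the merge boundary, the sketch below compares a staging schema against the production schema and blocks the merge on dropped columns or changed types. It is a plain-Python model of the kind of rule a pre-merge hook might enforce. The schema dictionaries and function names are hypothetical, and a real hook would read the schemas from the objects on each branch.

```python
def find_dropped_columns(production_schema, staging_schema):
    """List columns that exist in production but not in staging."""
    dropped = sorted(set(production_schema) - set(staging_schema))
    return [f"column dropped: {name}" for name in dropped]


def find_type_changes(production_schema, staging_schema):
    """List columns present on both sides whose type differs."""
    return [
        f"type changed: {name} ({production_schema[name]} -> {staging_schema[name]})"
        for name in sorted(production_schema)
        if name in staging_schema and staging_schema[name] != production_schema[name]
    ]


def pre_merge_gate(production_schema, staging_schema):
    """Return (allowed, problems); new columns are permitted."""
    problems = (find_dropped_columns(production_schema, staging_schema)
                + find_type_changes(production_schema, staging_schema))
    return not problems, problems


production = {"id": "int64", "email": "string", "amount": "float64"}
staging = {"id": "int64", "amount": "string", "signup_date": "date"}
allowed, problems = pre_merge_gate(production, staging)
print(allowed, problems)
```

Allowing added columns while rejecting drops and type changes is one possible policy for preventing schema drift. Stricter or looser rules fit the same structure.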
