Implementation: Pola-rs Polars Output Validation Pattern

From Leeroopedia


Knowledge Sources
Domains: Data_Quality, ETL_Validation, Data_Engineering
Last Updated: 2026-02-09 10:00 GMT
Type: Pattern Doc

Overview

A validation pattern using Polars read operations, schema inspection, and equality checks to verify data integrity after write operations through round-trip comparison.

Description

The Output Validation Pattern documents a reusable pattern (not a single API) for verifying that written data faithfully represents the original DataFrame. It combines pl.read_parquet (or other read functions) to re-read written files, collect_schema to inspect type metadata, shape to check dimensions, and equals to perform full content comparison. This pattern is composed from standard Polars APIs and can be adapted to any output format.

Usage

Apply this pattern after any critical write operation. Compose the individual checks (schema, shape, content) according to the validation level required. In performance-sensitive pipelines, the schema and shape checks alone give a fast structural validation because they skip the value-by-value comparison. For critical data, add the full content equality check.

Code Reference

Source Location

  • Repository: polars
  • Files:
    • docs/source/src/python/user-guide/io/parquet.py (Line: 8)
    • docs/source/src/python/user-guide/lazy/schema.py (Lines: 7-10)

Pattern Interface

This is a Pattern Doc documenting a composed validation approach rather than a single function signature. The pattern uses the following Polars APIs:

# Re-read written data (the first parameter is named `source`; other parameters omitted)
pl.read_parquet(source, ...) -> DataFrame
pl.read_csv(source, ...) -> DataFrame
pl.read_json(source, ...) -> DataFrame

# Schema inspection
DataFrame.collect_schema() -> Schema
LazyFrame.collect_schema() -> Schema

# Dimensional check (a property, not a method)
DataFrame.shape -> tuple[int, int]  # (n_rows, n_columns)

# Content equality
DataFrame.equals(other: DataFrame, *, null_equal: bool = True) -> bool

Import

import polars as pl

I/O Contract

Inputs

Name | Type | Required | Description
file | str | Yes | Path to the written output file to validate
df_original | polars.DataFrame | Yes | The original DataFrame that was written, used as the reference for comparison

Outputs

Name | Type | Description
schema_match | bool | Whether the schema of the round-trip DataFrame matches the original (via assert or comparison)
shape_match | bool | Whether the shape (rows, columns) matches the original
content_match | bool | Whether all values are identical (via DataFrame.equals)

Usage Examples

Basic Round-Trip Validation (Parquet)

import polars as pl

# Write data
df_original = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df_original.write_parquet("output.parquet")

# Read back and validate
df_roundtrip = pl.read_parquet("output.parquet")

# Schema validation
assert df_original.collect_schema() == df_roundtrip.collect_schema()

# Shape validation
assert df_original.shape == df_roundtrip.shape

# Content validation
assert df_original.equals(df_roundtrip)

Validation with Detailed Error Reporting

import polars as pl

def validate_output(df_original: pl.DataFrame, output_path: str) -> dict:
    """Validate a written file against the original DataFrame."""
    df_roundtrip = pl.read_parquet(output_path)

    results = {}

    # Level 1: Schema validation
    original_schema = df_original.collect_schema()
    roundtrip_schema = df_roundtrip.collect_schema()
    results["schema_match"] = original_schema == roundtrip_schema
    if not results["schema_match"]:
        results["schema_diff"] = {
            "original": dict(original_schema),
            "roundtrip": dict(roundtrip_schema),
        }

    # Level 2: Shape validation
    results["shape_match"] = df_original.shape == df_roundtrip.shape
    if not results["shape_match"]:
        results["shape_diff"] = {
            "original": df_original.shape,
            "roundtrip": df_roundtrip.shape,
        }

    # Level 3: Content validation
    results["content_match"] = df_original.equals(df_roundtrip)

    return results


# Usage
df = pl.DataFrame({
    "id": [1, 2, 3],
    "value": [10.5, 20.3, 30.1],
    "label": ["a", "b", "c"],
})
df.write_parquet("validated_output.parquet")

validation = validate_output(df, "validated_output.parquet")
assert all(validation[k] for k in ["schema_match", "shape_match", "content_match"])

CSV Round-Trip with Schema Awareness

import polars as pl

# CSV loses type information, so schema may differ on read-back
df_original = pl.DataFrame({
    "date": pl.Series(["2025-01-01", "2025-01-02"]).str.to_date("%Y-%m-%d"),
    "amount": [100, 200],
})
df_original.write_csv("output.csv")

# Read back with type parsing to restore schema
df_roundtrip = pl.read_csv("output.csv", try_parse_dates=True)

# Shape should always match
assert df_original.shape == df_roundtrip.shape

# For CSV, cast columns back to expected types before content comparison
df_roundtrip = df_roundtrip.with_columns(
    pl.col("date").cast(pl.Date),
    pl.col("amount").cast(pl.Int64),
)
assert df_original.equals(df_roundtrip)
