Implementation:Pola rs Polars Output Validation Pattern

Knowledge Sources	polars (repository); Polars User Guide - Lazy API Schema
Domains	Data_Quality, ETL_Validation, Data_Engineering
Last Updated	2026-02-09 10:00 GMT
Type	Pattern Doc

Overview

A validation pattern that verifies data integrity after a write operation by re-reading the output with Polars and comparing its schema, shape, and contents against the original DataFrame (a round-trip comparison).

Description

The Output Validation Pattern is a reusable pattern (not a single API) for verifying that written data faithfully represents the original DataFrame. It combines four steps: pl.read_parquet (or another read function) re-reads the written file, collect_schema inspects column names and dtypes, shape checks dimensions, and equals performs a full content comparison. The pattern is composed from standard Polars APIs and can be adapted to any output format that Polars can read back.

Usage

Apply this pattern after any critical write operation. Compose the individual checks (schema, shape, content) based on the required validation level. For performance-sensitive pipelines, schema and shape checks alone provide fast structural validation. For critical data, add the full content equality check.

Code Reference

Source Location

Repository: polars
Files:
- docs/source/src/python/user-guide/io/parquet.py (Line: 8)
- docs/source/src/python/user-guide/lazy/schema.py (Lines: 7-10)

Pattern Interface

This is a Pattern Doc documenting a composed validation approach rather than a single function signature. The pattern uses the following Polars APIs:

# Re-read written data (simplified signatures; in Polars the first parameter is named source)
pl.read_parquet(source: str | Path) -> DataFrame
pl.read_csv(source: str | Path) -> DataFrame
pl.read_json(source: str | Path) -> DataFrame

# Schema inspection
DataFrame.collect_schema() -> Schema
LazyFrame.collect_schema() -> Schema

# Dimensional check
DataFrame.shape -> tuple[int, int]  # (n_rows, n_columns)

# Content equality
DataFrame.equals(other: DataFrame) -> bool

Import

import polars as pl

I/O Contract

Inputs

Name	Type	Required	Description
file	str	Yes	Path to the written output file to validate
df_original	polars.DataFrame	Yes	The original DataFrame that was written, used as the reference for comparison

Outputs

Name	Type	Description
schema_match	bool	Whether the schema of the round-trip DataFrame matches the original (via assert or comparison)
shape_match	bool	Whether the shape (rows, columns) matches the original
content_match	bool	Whether all values are identical (via DataFrame.equals)

Usage Examples

Basic Round-Trip Validation (Parquet)

import polars as pl

# Write data
df_original = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df_original.write_parquet("output.parquet")

# Read back and validate
df_roundtrip = pl.read_parquet("output.parquet")

# Schema validation
assert df_original.collect_schema() == df_roundtrip.collect_schema()

# Shape validation
assert df_original.shape == df_roundtrip.shape

# Content validation
assert df_original.equals(df_roundtrip)

Validation with Detailed Error Reporting

import polars as pl

def validate_output(df_original: pl.DataFrame, output_path: str) -> dict:
    """Validate a written file against the original DataFrame."""
    df_roundtrip = pl.read_parquet(output_path)

    results = {}

    # Level 1: Schema validation
    original_schema = df_original.collect_schema()
    roundtrip_schema = df_roundtrip.collect_schema()
    results["schema_match"] = original_schema == roundtrip_schema
    if not results["schema_match"]:
        results["schema_diff"] = {
            "original": dict(original_schema),
            "roundtrip": dict(roundtrip_schema),
        }

    # Level 2: Shape validation
    results["shape_match"] = df_original.shape == df_roundtrip.shape
    if not results["shape_match"]:
        results["shape_diff"] = {
            "original": df_original.shape,
            "roundtrip": df_roundtrip.shape,
        }

    # Level 3: Content validation
    results["content_match"] = df_original.equals(df_roundtrip)

    return results


# Usage
df = pl.DataFrame({
    "id": [1, 2, 3],
    "value": [10.5, 20.3, 30.1],
    "label": ["a", "b", "c"],
})
df.write_parquet("validated_output.parquet")

validation = validate_output(df, "validated_output.parquet")
assert all(validation[k] for k in ["schema_match", "shape_match", "content_match"])

CSV Round-Trip with Schema Awareness

import polars as pl

# CSV loses type information, so schema may differ on read-back
df_original = pl.DataFrame({
    "date": pl.Series(["2025-01-01", "2025-01-02"]).str.to_date("%Y-%m-%d"),
    "amount": [100, 200],
})
df_original.write_csv("output.csv")

# Read back with type parsing to restore schema
df_roundtrip = pl.read_csv("output.csv", try_parse_dates=True)

# Shape should always match
assert df_original.shape == df_roundtrip.shape

# For CSV, cast columns back to expected types before content comparison
df_roundtrip = df_roundtrip.with_columns(
    pl.col("date").cast(pl.Date),
    pl.col("amount").cast(pl.Int64),
)
assert df_original.equals(df_roundtrip)

Related Pages

Implements Principle

Principle:Pola_rs_Polars_Data_Output_Validation
