Implementation: Pola-rs Polars Output Validation Pattern
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, ETL_Validation, Data_Engineering |
| Last Updated | 2026-02-09 10:00 GMT |
| Type | Pattern Doc |
Overview
A validation pattern that re-reads written output with Polars and compares its schema, shape, and content against the original DataFrame, verifying data integrity after a write through round-trip comparison.
Description
The Output Validation Pattern documents a reusable pattern (not a single API) for verifying that written data faithfully represents the original DataFrame. It combines pl.read_parquet (or another read function) to re-read the written file, collect_schema to inspect column names and data types, shape to check dimensions, and equals to compare the full content. The pattern is composed from standard Polars APIs and can be adapted to any output format; formats that do not store types, such as CSV, need their columns cast back to the expected types before the schema and content checks.
Usage
Apply this pattern after any critical write operation. Compose the individual checks (schema, shape, content) based on the required validation level. For performance-sensitive pipelines, schema and shape checks alone provide fast structural validation. For critical data, add the full content equality check.
Code Reference
Source Location
- Repository: polars
- Files:
  - docs/source/src/python/user-guide/io/parquet.py (Line: 8)
  - docs/source/src/python/user-guide/lazy/schema.py (Lines: 7-10)
Pattern Interface
This is a Pattern Doc documenting a composed validation approach rather than a single function signature. The pattern uses the following Polars APIs:
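# Re-read written data (simplified signatures; the first parameter is named
# source in Polars, and each function accepts further optional parameters)
pl.read_parquet(source: str) -> DataFrame
pl.read_csv(source: str) -> DataFrame
pl.read_json(source: str) -> DataFrame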
# Schema inspection
DataFrame.collect_schema() -> Schema
LazyFrame.collect_schema() -> Schema
# Dimensional check
DataFrame.shape -> tuple[int, int] # (n_rows, n_columns)
# Content equality
DataFrame.equals(other: DataFrame) -> bool
Import
import polars as pl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file | str | Yes | Path to the written output file to validate |
| df_original | polars.DataFrame | Yes | The original DataFrame that was written, used as the reference for comparison |
Outputs
| Name | Type | Description |
|---|---|---|
| schema_match | bool | Whether the round-trip schema (column names and data types) equals the original schema |
| shape_match | bool | Whether the shape (rows, columns) matches the original |
| content_match | bool | Whether all values are identical (via DataFrame.equals) |
Usage Examples
Basic Round-Trip Validation (Parquet)
import polars as pl
# Write data
df_original = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df_original.write_parquet("output.parquet")
# Read back and validate
df_roundtrip = pl.read_parquet("output.parquet")
# Schema validation
assert df_original.collect_schema() == df_roundtrip.collect_schema()
# Shape validation
assert df_original.shape == df_roundtrip.shape
# Content validation
assert df_original.equals(df_roundtrip)
Validation with Detailed Error Reporting
import polars as pl
def validate_output(df_original: pl.DataFrame, output_path: str) -> dict:
    """Validate a written file against the original DataFrame."""
    df_roundtrip = pl.read_parquet(output_path)
    results = {}

    # Level 1: Schema validation
    original_schema = df_original.collect_schema()
    roundtrip_schema = df_roundtrip.collect_schema()
    results["schema_match"] = original_schema == roundtrip_schema
    if not results["schema_match"]:
        results["schema_diff"] = {
            "original": dict(original_schema),
            "roundtrip": dict(roundtrip_schema),
        }

    # Level 2: Shape validation
    results["shape_match"] = df_original.shape == df_roundtrip.shape
    if not results["shape_match"]:
        results["shape_diff"] = {
            "original": df_original.shape,
            "roundtrip": df_roundtrip.shape,
        }

    # Level 3: Content validation
    results["content_match"] = df_original.equals(df_roundtrip)
    return results
# Usage
df = pl.DataFrame({
    "id": [1, 2, 3],
    "value": [10.5, 20.3, 30.1],
    "label": ["a", "b", "c"],
})
df.write_parquet("validated_output.parquet")
validation = validate_output(df, "validated_output.parquet")
assert all(validation[k] for k in ["schema_match", "shape_match", "content_match"])
CSV Round-Trip with Schema Awareness
import polars as pl
# CSV loses type information, so schema may differ on read-back
df_original = pl.DataFrame({
    "date": pl.Series(["2025-01-01", "2025-01-02"]).str.to_date("%Y-%m-%d"),
    "amount": [100, 200],
})
df_original.write_csv("output.csv")
# Read back with type parsing to restore schema
df_roundtrip = pl.read_csv("output.csv", try_parse_dates=True)
# Shape should always match
assert df_original.shape == df_roundtrip.shape
# For CSV, cast columns back to expected types before content comparison
df_roundtrip = df_roundtrip.with_columns(
    pl.col("date").cast(pl.Date),
    pl.col("amount").cast(pl.Int64),
)
assert df_original.equals(df_roundtrip)