Principle:Kubeflow Pipelines Cross Format Validation
XGBoost | Machine Learning | Validation
Last Updated: 2026-02-13
Overview
A validation technique that verifies model correctness by testing predictions across different data formats to ensure format-independent model behavior.
Description
Cross-format validation tests whether a model trained on one data format (CSV) produces consistent predictions when applied to data in another format (Parquet), and vice versa. This validates that serialization/deserialization does not affect model behavior and that the data processing pipeline preserves semantic equivalence across formats.
The technique works by:
- Training a model on format A (e.g., CSV)
- Running predictions on the same logical data stored in format B (e.g., Parquet)
- Comparing the two prediction vectors to confirm they are identical (or equal within a small floating-point tolerance, since text-based formats such as CSV can perturb low-order bits on round-trip)
If discrepancies are found, they indicate bugs in data loading, type coercion, or column mapping logic rather than model quality issues.
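The steps above can be sketched with only the standard library. The linear scorer stands in for a trained XGBoost model, and JSON stands in for Parquet so the sketch needs no extra dependencies; all names here are hypothetical, not the actual pipeline components.

```python
import csv
import io
import json

# Toy stand-in for a trained model: a fixed linear scorer.
# (Hypothetical; the real pipeline would load a trained XGBoost model.)
WEIGHTS = {"f0": 0.5, "f1": -1.25, "f2": 2.0}

def predict(rows):
    """Score each row as a dot product of features and weights."""
    return [sum(WEIGHTS[k] * row[k] for k in WEIGHTS) for row in rows]

# The same logical dataset, to be serialized in two formats.
records = [{"f0": 1.0, "f1": 2.0, "f2": 3.0},
           {"f0": 4.0, "f1": 5.0, "f2": 6.0}]

# Format A: CSV (values round-trip through text).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["f0", "f1", "f2"])
writer.writeheader()
writer.writerows(records)
buf.seek(0)
rows_csv = [{k: float(v) for k, v in r.items()} for r in csv.DictReader(buf)]

# Format B: JSON (stand-in for Parquet in this sketch).
rows_json = json.loads(json.dumps(records))

preds_a = predict(rows_csv)
preds_b = predict(rows_json)

# Cross-format check: any mismatch points at the loading layer, not the model.
mismatches = [i for i, (a, b) in enumerate(zip(preds_a, preds_b))
              if abs(a - b) > 1e-9]
assert not mismatches, f"format-dependent predictions at rows {mismatches}"
print("cross-format validation passed")
```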
Usage
Use after training models on multiple formats to verify that data format does not introduce prediction discrepancies.
Theoretical Basis
Model invariance under data representation. If data semantics are preserved across formats, model predictions should be identical regardless of the input format. Formally, given a model M trained on dataset D in format F1, and the same dataset D represented in format F2:
- M(D_F1) = M(D_F2)
Any deviation from this equality signals a defect in the format conversion or data loading layer, not in the model itself.
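In practice the invariance check reduces to an elementwise comparison that reports where M(D_F1) and M(D_F2) diverge. The helper below is an illustrative sketch (the function name and tolerance are assumptions, not part of the pipeline):

```python
import math

def find_format_defects(preds_f1, preds_f2, tol=1e-9):
    """Return indices where the two prediction vectors differ beyond tol.

    A non-empty result indicates a defect in the format conversion or
    data loading layer, not in the model itself.
    """
    if len(preds_f1) != len(preds_f2):
        raise ValueError("prediction vectors differ in length: "
                         f"{len(preds_f1)} vs {len(preds_f2)}")
    return [i for i, (a, b) in enumerate(zip(preds_f1, preds_f2))
            if not math.isclose(a, b, rel_tol=tol, abs_tol=tol)]

# M(D_F1) = M(D_F2): invariance holds, no defects reported.
assert find_format_defects([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]) == []
# A deviation at index 1 flags a loading/conversion bug.
assert find_format_defects([0.1, 0.2, 0.3], [0.1, 0.25, 0.3]) == [1]
```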
Related Pages
Implementation:Kubeflow_Pipelines_XGBoost_Cross_Format_Predict