
Principle:Kubeflow Pipelines Parquet Model Training

From Leeroopedia

XGBoost Apache Parquet Machine_Learning Data_Engineering Last Updated: 2026-02-13

Overview

Training machine learning models directly on columnar Parquet data for improved I/O performance and column-level access.

Description

When training data is in Parquet format, specialized training components can leverage columnar access patterns for efficient data loading. Unlike CSV-based training, where the label is identified by column index, Parquet training references columns by name for type-safe access, which makes it more robust to schema changes such as column reordering or added fields.

Usage

Use when training data is already in Parquet format or when working with large datasets where columnar I/O provides performance benefits.

Theoretical Basis

Same gradient boosting theory as CSV training, but data I/O leverages columnar format advantages:

  • Predicate pushdown — filter rows at the storage layer before loading into memory
  • Column pruning — read only the columns needed for training, skipping irrelevant fields
  • Efficient deserialization — columnar encoding (dictionary, run-length, delta) enables faster decode than row-oriented CSV parsing

These properties reduce both memory footprint and wall-clock time for the data-loading phase of training, which is often the bottleneck for large datasets.

Related Pages

Implementation:Kubeflow_Pipelines_XGBoost_Train_On_Parquet_Op
