
Heuristic:Kubeflow Pipelines Resource Sizing For Components

From Leeroopedia
Knowledge Sources
Domains Optimization, ML_Pipelines, Kubernetes
Last Updated 2026-02-13 13:35 GMT

Overview

Memory sizing guideline: V2 Python components require a memory limit of 650M+ due to KFP SDK installation overhead; XGBoost tasks empirically need 1Gi per step.

Description

KFP V2 Python components install the full kfp SDK package into each task container at runtime. This causes a significant memory overhead beyond the component's own data processing needs. For a simple component that allocates ~400Mi of data, the total memory requirement jumps to 650M+ when accounting for SDK installation. For ML workloads like XGBoost training and prediction, empirical testing shows 1Gi memory is needed per task. Additionally, resource requests (as opposed to limits) are not yet fully supported via the SDK API, and GPU/accelerator types cannot be dynamically parameterized from component outputs.

Usage

Use this heuristic when sizing pipeline task resources using .set_memory_limit() and .set_cpu_limit(). Apply it immediately when:

  • Writing any V2 Python component (always budget 650M+ for SDK overhead)
  • Running XGBoost training or prediction tasks (use 1Gi baseline)
  • Encountering OOMKilled errors in pipeline task pods
  • Planning resource quotas for pipeline namespaces
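
As a quick reference, the sizing rule above can be encoded as a small lookup helper. This is an illustrative sketch: `HEURISTIC_MEMORY_LIMITS` and `memory_limit_for` are hypothetical names, not part of the KFP SDK.

```python
# Hypothetical helper encoding this page's sizing heuristic.
# These names are illustrative and do not exist in the kfp package.
HEURISTIC_MEMORY_LIMITS = {
    'python_component': '650M',  # KFP SDK install overhead on top of task data
    'xgboost': '1Gi',            # empirical baseline for train/predict steps
}

def memory_limit_for(task_kind: str) -> str:
    """Return the rule-of-thumb memory limit string for a task kind."""
    try:
        return HEURISTIC_MEMORY_LIMITS[task_kind]
    except KeyError:
        raise ValueError(f'no sizing heuristic for task kind: {task_kind}')

# Intended usage inside a pipeline definition, e.g.:
#   training_task.set_memory_limit(memory_limit_for('xgboost'))
print(memory_limit_for('python_component'))  # 650M
print(memory_limit_for('xgboost'))           # 1Gi
```

Centralizing the values this way keeps the heuristic in one place if the empirical numbers are revised.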

The Insight (Rule of Thumb)

  • Action: Always call .set_memory_limit() on every task, especially V2 Python components.
  • Value: Minimum 650M for basic V2 Python components. 1Gi for XGBoost and data-intensive tasks.
  • Trade-off: Over-provisioning memory reduces cluster efficiency; under-provisioning causes OOMKilled failures.
  • Limitation: Resource limits cannot be set dynamically from component outputs (PipelineParameterChannel not supported). GPU accelerator types must also be hardcoded strings.
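
Note that Kubernetes distinguishes decimal suffixes (M, G) from binary ones (Mi, Gi), and the samples quoted below mix both. A minimal quantity parser (a sketch, not the real Kubernetes `resource.Quantity` implementation) makes the difference concrete:

```python
# Minimal Kubernetes-style quantity parser (decimal and binary suffixes only).
# A sketch for sanity-checking limits; not a full Quantity implementation.
SUFFIXES = {'K': 10**3, 'M': 10**6, 'G': 10**9,
            'Ki': 2**10, 'Mi': 2**20, 'Gi': 2**30}

def quantity_to_bytes(q: str) -> int:
    # Check longer suffixes first so 'Mi' is not misread as 'M'.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if q.endswith(suffix):
            return int(q[:-len(suffix)]) * SUFFIXES[suffix]
    return int(q)

# 650M (decimal) is about 5% smaller than 650Mi (binary), so confusing the
# two quietly under-provisions a task.
print(quantity_to_bytes('650M'))   # 650000000
print(quantity_to_bytes('650Mi'))  # 681574400
print(quantity_to_bytes('1Gi'))    # 1073741824
```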

Reasoning

V2 Python components use a mechanism where the KFP SDK is installed into the task container at runtime via pip install. This process consumes significant memory (200M+) on top of the base container footprint. The XGBoost component wrappers load datasets into memory for training and prediction, and the combination of data loading, model training, and SDK overhead consistently requires approximately 1Gi. These values were determined through experimentation by the KFP team, as noted in the code comments.
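
A back-of-envelope check, assuming the breakdown suggested by the sample code comments (roughly 400Mi of task data plus 200M+ of SDK-install overhead), shows the sum does sit just under the 650M limit:

```python
# Back-of-envelope check of the 650M figure. The breakdown is an assumption
# taken from the sample comments, not a measured profile.
data = 400 * 2**20          # ~400Mi of task data, in bytes
sdk_overhead = 200 * 10**6  # ~200M SDK-install overhead, in bytes
limit = 650 * 10**6         # the 650M limit from the sample

print(data + sdk_overhead)          # 619430400
print(data + sdk_overhead < limit)  # True, with ~30M of headroom
```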

The lack of dynamic resource parameterization means that memory limits must be statically defined in the pipeline code. This is a known limitation tracked in GitHub issue #6354.

Evidence from samples/core/resource_spec/resource_spec.py:31-35:

# 11234567 roughly needs 400Mi+ memory.
#
# Note, with v2 python components, there's a larger memory overhead caused
# by installing KFP SDK in the component, so we had to increase memory limit to 650M.
training_task = training_op(n=n).set_cpu_limit('1').set_memory_limit('650M')

Evidence from samples/core/XGBoost/xgboost_sample.py:26:

# Based on experimentation, many steps need 1Gi memory.

Evidence of applied 1Gi limits from samples/core/XGBoost/xgboost_sample.py:40,46,57,63,70,76:

model_trained_on_csv = xgboost_train_on_csv_op(
    training_data=training_data_csv,
    label_column=0,
    objective='reg:squarederror',
    num_iterations=200,
).set_memory_limit('1Gi').outputs['model']

Evidence of dynamic parameterization limitation from samples/core/resource_spec/runtime_resource_request.py:37-46:

# TODO: support PipelineParameterChannel for resource input
# TypeError: expected string or bytes-like object, got 'PipelineParameterChannel'
traning_task = training_op(n=n)\
    .set_memory_limit('500Mi')\
    .set_cpu_limit('200m')\
    .set_cpu_request('200m')
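
The TypeError quoted above can be reproduced without the SDK: Python's `re` module rejects non-string inputs, which is consistent with the SDK validating resource quantity strings against a regex at pipeline-compile time. `FakeParameterChannel` below is a stand-in object, not kfp's real `PipelineParameterChannel`, and the regex is an illustrative quantity pattern, not the SDK's actual one.

```python
import re

# Stand-in for kfp's PipelineParameterChannel (illustrative only).
class FakeParameterChannel:
    pass

# Illustrative quantity-string pattern; non-string inputs make re raise
# the same "expected string or bytes-like object" TypeError quoted above.
quantity_pattern = re.compile(r'^[0-9]+(Ki|Mi|Gi|K|M|G)?$')

try:
    quantity_pattern.match(FakeParameterChannel())
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

This is why limits must be plain strings known at pipeline-definition time until issue #6354 is resolved.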
