
Implementation:Kubeflow Katib Experiment CRD

From Leeroopedia
Knowledge Sources
Domains MLOps, Hyperparameter Optimization, Kubernetes
Last Updated 2026-02-13 00:00 GMT

Overview

A Kubernetes custom resource for automated hyperparameter tuning and neural architecture search, provided by the Kubeflow Katib component.

Description

The Katib Experiment CRD is the Kubernetes-native resource through which Katib manages hyperparameter optimization workflows. When an Experiment resource is created, the Katib controller iteratively generates Suggestion resources (hyperparameter proposals from the selected algorithm), creates Trial resources (each launching a training job with the proposed hyperparameters), collects the objective metric from completed trials, and feeds results back to the suggestion algorithm until the experiment's budget is exhausted or convergence criteria are met.
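The control loop above can be sketched conceptually in Python. This is a simplified illustration of the suggest → trial → collect → feed-back cycle, not Katib's actual controller code; the objective function and suggestion strategy here are hypothetical stand-ins.

```python
import random

def run_experiment(objective, max_trials, suggest):
    """Conceptual sketch of Katib's experiment loop:
    suggestion -> trial -> metric collection -> feedback."""
    history = []                            # stands in for Trial resources
    best = None
    for _ in range(max_trials):             # trial budget check
        params = suggest(history)           # Suggestion: proposed hyperparameters
        metric = objective(params)          # Trial: training job's objective metric
        history.append((params, metric))    # metric collection
        if best is None or metric > best[1]:
            best = (params, metric)         # tracked as currentOptimalTrial
    return best, history

# Hypothetical objective: peaks at lr = 0.01
def objective(p):
    return -(p["lr"] - 0.01) ** 2

# Simplest algorithm: random search, ignoring history
def random_suggest(history):
    return {"lr": random.uniform(0.0001, 0.1)}

random.seed(0)
best, history = run_experiment(objective, max_trials=20, suggest=random_suggest)
```

Real suggestion algorithms (TPE, Bayesian optimization) differ only in how `suggest` uses `history`; the surrounding loop is the same.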

Katib supports multiple search algorithms out of the box: random (Random Search), grid (Grid Search), bayesianoptimization (Gaussian Process), tpe (Tree-structured Parzen Estimators), cmaes (Covariance Matrix Adaptation), hyperband (HyperBand multi-fidelity), enas (Efficient Neural Architecture Search), and darts (Differentiable Architecture Search). Each trial template can reference a Kubeflow TrainJob, a bare Kubernetes Job, or any other workload CRD.


Usage

Use the Katib Experiment CRD when:

  • Automated search over a defined hyperparameter space is needed.
  • Multiple training trials must be orchestrated in parallel with result collection and comparison.
  • Early stopping strategies should be applied to reduce compute waste on unpromising configurations.
  • A complete record of all tried hyperparameter configurations and their metrics is required.
  • Neural architecture search is desired as part of the model development process.

Code Reference

Source Location

  • Repository: kubeflow/katib
  • File: config/crd/bases/katib.kubeflow.org_experiments.yaml (CRD schema)

Signature

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: <experiment-name>
  namespace: <namespace>
spec:
  algorithm:
    algorithmName: <random|grid|bayesianoptimization|tpe|cmaes|hyperband|enas|darts>
    algorithmSettings: []
  objective:
    type: <maximize|minimize>
    goal: <target-value>
    objectiveMetricName: <metric-name>
    additionalMetricNames: []
  parameters:
    - name: <param-name>
      parameterType: <int|double|categorical|discrete>
      feasibleSpace:
        min: "<min-value>"
        max: "<max-value>"
        step: "<step-value>"
        list: []
  parallelTrialCount: <max-parallel-trials>
  maxTrialCount: <total-trial-budget>
  maxFailedTrialCount: <max-failures>
  earlyStopping:
    algorithmName: <medianstop|...>
  trialTemplate:
    primaryContainerName: <container-name>
    trialParameters: []
    trialSpec: <Job|TrainJob spec>
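The feasibleSpace semantics of the four parameterType values can be mimicked with a small Python sampler. This is a sketch against the schema above, not Katib code; note that min, max, and step are strings in the CRD, as shown in the signature.

```python
import random

def sample(parameters, rng=random):
    """Draw one configuration from a Katib-style parameter list.
    Mirrors the CRD schema: min/max/step arrive as strings."""
    config = {}
    for p in parameters:
        fs = p["feasibleSpace"]
        if p["parameterType"] == "int":
            step = int(fs.get("step", "1"))
            lo, hi = int(fs["min"]), int(fs["max"])
            config[p["name"]] = rng.randrange(lo, hi + 1, step)
        elif p["parameterType"] == "double":
            config[p["name"]] = rng.uniform(float(fs["min"]), float(fs["max"]))
        else:  # categorical or discrete: choose from the explicit list
            config[p["name"]] = rng.choice(fs["list"])
    return config

params = [
    {"name": "learning_rate", "parameterType": "double",
     "feasibleSpace": {"min": "0.0001", "max": "0.1"}},
    {"name": "batch_size", "parameterType": "int",
     "feasibleSpace": {"min": "16", "max": "128", "step": "16"}},
    {"name": "optimizer", "parameterType": "categorical",
     "feasibleSpace": {"list": ["adam", "sgd", "adamw"]}},
]
cfg = sample(params)
```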

Import

# Install Katib components
kubectl apply -k "github.com/kubeflow/katib/manifests/v1beta1/installs/katib-standalone"

# Submit an Experiment
kubectl apply -f experiment.yaml

I/O Contract

Inputs

Name Type Required Description
metadata.name string Yes Name of the Katib Experiment resource
spec.algorithm.algorithmName string Yes Search algorithm to use (e.g., random, tpe, bayesianoptimization)
spec.objective.type string Yes Optimization direction: maximize or minimize
spec.objective.objectiveMetricName string Yes Name of the metric to optimize
spec.objective.goal float No Target metric value at which to stop the experiment
spec.parameters[] list Yes Hyperparameter search space definitions
spec.maxTrialCount integer Yes Maximum number of trials to run
spec.parallelTrialCount integer No Maximum number of trials to run concurrently
spec.trialTemplate object Yes Template for the training job launched per trial
spec.earlyStopping object No Early stopping algorithm configuration

Outputs

Name Type Description
Optimal hyperparameters Experiment.status.currentOptimalTrial The best hyperparameter set found across all trials
Trial history list of Trial resources Complete record of all trials with parameters and metrics
Best trial model artifacts files (storage URI) Model artifacts produced by the best-performing trial
Experiment status Experiment.status Succeeded, Failed, or Running condition, with trial counts
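Reading the results back follows the status fields above. A minimal Python sketch that pulls the best parameters out of an Experiment's status dict; the payload here is hand-written to mirror the shape of `status.currentOptimalTrial`, not fetched from a live cluster.

```python
def optimal_assignments(experiment):
    """Extract the best hyperparameters and their metrics from
    Experiment.status.currentOptimalTrial."""
    trial = experiment["status"]["currentOptimalTrial"]
    params = {a["name"]: a["value"] for a in trial["parameterAssignments"]}
    metrics = {m["name"]: m["latest"] for m in trial["observation"]["metrics"]}
    return params, metrics

# Hand-written example payload mirroring Katib's status schema
experiment = {
    "status": {
        "currentOptimalTrial": {
            "bestTrialName": "tune-learning-rate-abc123",
            "parameterAssignments": [
                {"name": "learning_rate", "value": "0.0042"},
                {"name": "batch_size", "value": "64"},
            ],
            "observation": {
                "metrics": [{"name": "accuracy", "latest": "0.957"}],
            },
        }
    }
}

params, metrics = optimal_assignments(experiment)
```

In practice the same dict can be obtained with `kubectl get experiment <name> -o json` or the Kubernetes Python client's custom-objects API.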

Usage Examples

Basic Usage

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tune-learning-rate
  namespace: ml-team
spec:
  algorithm:
    algorithmName: tpe
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
  parallelTrialCount: 3
  maxTrialCount: 20
  maxFailedTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
        step: "16"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - "adam"
          - "sgd"
          - "adamw"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: "Learning rate"
        reference: learning_rate
      - name: batchSize
        description: "Batch size"
        reference: batch_size
      - name: optimizer
        description: "Optimizer"
        reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: my-registry/trainer:latest
                command:
                  - "python"
                  - "train.py"
                  - "--lr=${trialParameters.learningRate}"
                  - "--batch-size=${trialParameters.batchSize}"
                  - "--optimizer=${trialParameters.optimizer}"
                resources:
                  requests:
                    cpu: "4"
                    memory: "8Gi"
                    nvidia.com/gpu: "1"
            restartPolicy: Never
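For the objective metric to be collected, the training script must report it where Katib's metrics collector can see it; with the default stdout collector, printing `<metric-name>=<value>` lines is sufficient. A minimal sketch of the reporting step a hypothetical train.py would run after evaluation:

```python
def report_metric(name, value):
    """Emit a metric line in the "name=value" format that Katib's
    default StdOut metrics collector parses from container logs."""
    line = f"{name}={value}"
    print(line)
    return line

# The metric name must match spec.objective.objectiveMetricName
report_metric("accuracy", 0.93)
```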

Related Pages

Implements Principle

Requires Environment
