
Implementation:Kubeflow Katib Experiment CRD

From Leeroopedia
Knowledge Sources
Domains MLOps, Hyperparameter Optimization, Kubernetes
Last Updated 2026-02-13 00:00 GMT

Overview

A Kubernetes custom resource for automated hyperparameter tuning and neural architecture search, provided by the Kubeflow Katib component.

Description

The Katib Experiment CRD is the Kubernetes-native resource through which Katib manages hyperparameter optimization workflows. When an Experiment resource is created, the Katib controller iteratively generates Suggestion resources (hyperparameter proposals from the selected algorithm), creates Trial resources (each launching a training job with the proposed hyperparameters), collects the objective metric from completed trials, and feeds results back to the suggestion algorithm until the experiment's budget is exhausted or convergence criteria are met.
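The control loop above can be sketched conceptually in Python. This is a simplified illustration of the suggest → trial → collect → feed-back cycle, not Katib's actual controller code; the objective function and suggestion strategy here are hypothetical stand-ins.

```python
import random

def run_experiment(objective, max_trials, suggest):
    """Conceptual sketch of Katib's experiment loop:
    suggestion -> trial -> metric collection -> feedback."""
    history = []                            # stands in for Trial resources
    best = None
    for _ in range(max_trials):             # trial budget check
        params = suggest(history)           # Suggestion: proposed hyperparameters
        metric = objective(params)          # Trial: training job's objective metric
        history.append((params, metric))    # metric collection
        if best is None or metric > best[1]:
            best = (params, metric)         # tracked as currentOptimalTrial
    return best, history

# Hypothetical objective: peaks at lr = 0.01
def objective(p):
    return -(p["lr"] - 0.01) ** 2

# Simplest algorithm: random search, ignoring history
def random_suggest(history):
    return {"lr": random.uniform(0.0001, 0.1)}

random.seed(0)
best, history = run_experiment(objective, max_trials=20, suggest=random_suggest)
```

Real suggestion algorithms (TPE, Bayesian optimization) differ only in how `suggest` uses `history`; the surrounding loop is the same.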

Katib supports multiple search algorithms out of the box: random (Random Search), grid (Grid Search), bayesianoptimization (Gaussian Process), tpe (Tree-structured Parzen Estimators), cmaes (Covariance Matrix Adaptation), hyperband (HyperBand multi-fidelity), enas (Efficient Neural Architecture Search), and darts (Differentiable Architecture Search). Each trial template can reference a Kubeflow TrainJob, a bare Kubernetes Job, or any other workload CRD.


Usage

Use the Katib Experiment CRD when:

  • Automated search over a defined hyperparameter space is needed.
  • Multiple training trials must be orchestrated in parallel with result collection and comparison.
  • Early stopping strategies should be applied to reduce compute waste on unpromising configurations.
  • A complete record of all tried hyperparameter configurations and their metrics is required.
  • Neural architecture search is desired as part of the model development process.

Code Reference

Source Location

  • Repository: kubeflow/katib
  • File: config/crd/bases/katib.kubeflow.org_experiments.yaml (CRD schema)

Signature

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: <experiment-name>
  namespace: <namespace>
spec:
  algorithm:
    algorithmName: <random|grid|bayesianoptimization|tpe|cmaes|hyperband|enas|darts>
    algorithmSettings: []
  objective:
    type: <maximize|minimize>
    goal: <target-value>
    objectiveMetricName: <metric-name>
    additionalMetricNames: []
  parameters:
    - name: <param-name>
      parameterType: <int|double|categorical|discrete>
      feasibleSpace:
        min: "<min-value>"
        max: "<max-value>"
        step: "<step-value>"
        list: []
  parallelTrialCount: <max-parallel-trials>
  maxTrialCount: <total-trial-budget>
  maxFailedTrialCount: <max-failures>
  earlyStopping:
    algorithmName: <medianstop|...>
  trialTemplate:
    primaryContainerName: <container-name>
    trialParameters: []
    trialSpec: <Job|TrainJob spec>
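The feasibleSpace semantics of the four parameterType values can be mimicked with a small Python sampler. This is a sketch against the schema above, not Katib code; note that min, max, and step are strings in the CRD, as shown in the signature.

```python
import random

def sample(parameters, rng=random):
    """Draw one configuration from a Katib-style parameter list.
    Mirrors the CRD schema: min/max/step arrive as strings."""
    config = {}
    for p in parameters:
        fs = p["feasibleSpace"]
        if p["parameterType"] == "int":
            step = int(fs.get("step", "1"))
            lo, hi = int(fs["min"]), int(fs["max"])
            config[p["name"]] = rng.randrange(lo, hi + 1, step)
        elif p["parameterType"] == "double":
            config[p["name"]] = rng.uniform(float(fs["min"]), float(fs["max"]))
        else:  # categorical or discrete: choose from the explicit list
            config[p["name"]] = rng.choice(fs["list"])
    return config

params = [
    {"name": "learning_rate", "parameterType": "double",
     "feasibleSpace": {"min": "0.0001", "max": "0.1"}},
    {"name": "batch_size", "parameterType": "int",
     "feasibleSpace": {"min": "16", "max": "128", "step": "16"}},
    {"name": "optimizer", "parameterType": "categorical",
     "feasibleSpace": {"list": ["adam", "sgd", "adamw"]}},
]
cfg = sample(params)
```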

Import

# Install Katib components
kubectl apply -k "github.com/kubeflow/katib/manifests/v1beta1/installs/katib-standalone"

# Submit an Experiment
kubectl apply -f experiment.yaml

I/O Contract

Inputs

Name Type Required Description
metadata.name string Yes Name of the Katib Experiment resource
spec.algorithm.algorithmName string Yes Search algorithm to use (e.g., random, tpe, bayesianoptimization)
spec.objective.type string Yes Optimization direction: maximize or minimize
spec.objective.objectiveMetricName string Yes Name of the metric to optimize
spec.objective.goal float No Target metric value at which to stop the experiment
spec.parameters[] list Yes Hyperparameter search space definitions
spec.maxTrialCount integer Yes Maximum number of trials to run
spec.parallelTrialCount integer No Maximum number of trials to run concurrently
spec.trialTemplate object Yes Template for the training job launched per trial
spec.earlyStopping object No Early stopping algorithm configuration

Outputs

Name Type Description
Optimal hyperparameters Experiment.status.currentOptimalTrial The best hyperparameter set found across all trials
Trial history list of Trial resources Complete record of all trials with parameters and metrics
Best trial model artifacts files (storage URI) Model artifacts produced by the best-performing trial
Experiment status Experiment.status Succeeded, Failed, or Running condition, with trial counts
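Reading the results back follows the status fields above. A minimal Python sketch that pulls the best parameters out of an Experiment's status dict; the payload here is hand-written to mirror the shape of `status.currentOptimalTrial`, not fetched from a live cluster.

```python
def optimal_assignments(experiment):
    """Extract the best hyperparameters and their metrics from
    Experiment.status.currentOptimalTrial."""
    trial = experiment["status"]["currentOptimalTrial"]
    params = {a["name"]: a["value"] for a in trial["parameterAssignments"]}
    metrics = {m["name"]: m["latest"] for m in trial["observation"]["metrics"]}
    return params, metrics

# Hand-written example payload mirroring Katib's status schema
experiment = {
    "status": {
        "currentOptimalTrial": {
            "bestTrialName": "tune-learning-rate-abc123",
            "parameterAssignments": [
                {"name": "learning_rate", "value": "0.0042"},
                {"name": "batch_size", "value": "64"},
            ],
            "observation": {
                "metrics": [{"name": "accuracy", "latest": "0.957"}],
            },
        }
    }
}

params, metrics = optimal_assignments(experiment)
```

In practice the same dict can be obtained with `kubectl get experiment <name> -o json` or the Kubernetes Python client's custom-objects API.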

Usage Examples

Basic Usage

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tune-learning-rate
  namespace: ml-team
spec:
  algorithm:
    algorithmName: tpe
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
  parallelTrialCount: 3
  maxTrialCount: 20
  maxFailedTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
        step: "16"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - "adam"
          - "sgd"
          - "adamw"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: "Learning rate"
        reference: learning_rate
      - name: batchSize
        description: "Batch size"
        reference: batch_size
      - name: optimizer
        description: "Optimizer"
        reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: my-registry/trainer:latest
                command:
                  - "python"
                  - "train.py"
                  - "--lr=${trialParameters.learningRate}"
                  - "--batch-size=${trialParameters.batchSize}"
                  - "--optimizer=${trialParameters.optimizer}"
                resources:
                  requests:
                    cpu: "4"
                    memory: "8Gi"
                    nvidia.com/gpu: "1"
            restartPolicy: Never
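For the objective metric to be collected, the training script must report it where Katib's metrics collector can see it; with the default stdout collector, printing `<metric-name>=<value>` lines is sufficient. A minimal sketch of the reporting step a hypothetical train.py would run after evaluation:

```python
def report_metric(name, value):
    """Emit a metric line in the "name=value" format that Katib's
    default StdOut metrics collector parses from container logs."""
    line = f"{name}={value}"
    print(line)
    return line

# The metric name must match spec.objective.objectiveMetricName
report_metric("accuracy", 0.93)
```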

Related Pages

Implements Principle

Requires Environment
