Implementation: Kubeflow Katib Experiment CRD
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Hyperparameter Optimization, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool for automated hyperparameter tuning and neural architecture search provided by the Kubeflow Katib component.
Description
The Katib Experiment CRD is the Kubernetes-native resource through which Katib manages hyperparameter optimization workflows. When an Experiment resource is created, the Katib controller runs an optimization loop: it generates Suggestion resources (hyperparameter proposals from the selected algorithm), creates Trial resources (each launching a training job with the proposed hyperparameters), collects the objective metric from completed trials, and feeds the results back to the suggestion algorithm. The loop repeats until the trial budget (maxTrialCount) is exhausted, the objective goal is reached, or the failed-trial limit is hit.
Katib supports multiple search algorithms out of the box: random (Random Search), grid (Grid Search), bayesianoptimization (Gaussian Process), tpe (Tree-structured Parzen Estimators), cmaes (Covariance Matrix Adaptation), hyperband (HyperBand multi-fidelity), enas (Efficient Neural Architecture Search), and darts (Differentiable Architecture Search). Each trial template can reference a Kubeflow TrainJob, a bare Kubernetes Job, or any other workload CRD.
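Because an Experiment is an ordinary custom resource, it can also be created programmatically. The sketch below uses the official Kubernetes Python client's generic CustomObjectsApi (the kubeflow-katib SDK offers a higher-level client as well); it assumes a reachable cluster, and the experiment content is illustrative, with the trialTemplate omitted for brevity:

```python
# Sketch: building and submitting a minimal Katib Experiment via the
# Kubernetes Python client. Names and values here are illustrative.

def build_experiment(name: str, namespace: str) -> dict:
    """Build a minimal Experiment manifest (random search, one parameter)."""
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "algorithm": {"algorithmName": "random"},
            "objective": {
                "type": "maximize",
                "objectiveMetricName": "accuracy",
            },
            "maxTrialCount": 10,
            "parallelTrialCount": 2,
            "parameters": [
                {
                    "name": "learning_rate",
                    "parameterType": "double",
                    "feasibleSpace": {"min": "0.001", "max": "0.1"},
                }
            ],
            # A real Experiment also needs spec.trialTemplate; see the
            # Basic Usage example for a complete manifest.
        },
    }

def submit(experiment: dict) -> None:
    """Create the Experiment in-cluster (requires `pip install kubernetes`)."""
    # Lazy import so the manifest-building half works without the client.
    from kubernetes import client, config
    config.load_kube_config()  # or load_incluster_config() inside a pod
    api = client.CustomObjectsApi()
    api.create_namespaced_custom_object(
        group="kubeflow.org",
        version="v1beta1",
        namespace=experiment["metadata"]["namespace"],
        plural="experiments",
        body=experiment,
    )
```

The controller picks up the created resource and begins generating Suggestions and Trials; no further client interaction is required.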
Usage
Use the Katib Experiment CRD when:
- Automated search over a defined hyperparameter space is needed.
- Multiple training trials must be orchestrated in parallel with result collection and comparison.
- Early stopping strategies should be applied to reduce compute waste on unpromising configurations.
- A complete record of all tried hyperparameter configurations and their metrics is required.
- Neural architecture search is desired as part of the model development process.
Code Reference
Source Location
- Repository: kubeflow/katib
- File: config/crd/bases/katib.kubeflow.org_experiments.yaml (CRD schema)
Signature
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: <experiment-name>
  namespace: <namespace>
spec:
  algorithm:
    algorithmName: <random|grid|bayesianoptimization|tpe|cmaes|hyperband|enas|darts>
    algorithmSettings: []
  objective:
    type: <maximize|minimize>
    goal: <target-value>
    objectiveMetricName: <metric-name>
    additionalMetricNames: []
  parameters:
    - name: <param-name>
      parameterType: <int|double|categorical|discrete>
      feasibleSpace:
        min: "<min-value>"
        max: "<max-value>"
        step: "<step-value>"
        list: []
  parallelTrialCount: <max-parallel-trials>
  maxTrialCount: <total-trial-budget>
  maxFailedTrialCount: <max-failures>
  earlyStopping:
    algorithmName: <medianstop|...>
  trialTemplate:
    primaryContainerName: <container-name>
    trialParameters: []
    trialSpec: <Job|TrainJob spec>
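In this schema, which feasibleSpace keys apply depends on parameterType: int and double use min/max (with an optional step), while categorical and discrete enumerate values via list. The following illustrative check (not part of Katib; written here only to make the pairing concrete) encodes that rule:

```python
# Illustrative check (not Katib code): which feasibleSpace keys each
# parameterType is expected to carry in katib.kubeflow.org/v1beta1.

RANGE_TYPES = {"int", "double"}           # use min/max (+ optional step)
LIST_TYPES = {"categorical", "discrete"}  # use list

def check_parameter(param: dict) -> bool:
    """Return True when the feasibleSpace shape matches the parameterType."""
    ptype = param.get("parameterType")
    space = param.get("feasibleSpace", {})
    if ptype in RANGE_TYPES:
        return "min" in space and "max" in space
    if ptype in LIST_TYPES:
        return bool(space.get("list"))
    return False

# A double parameter needs a min/max range:
assert check_parameter({
    "name": "lr",
    "parameterType": "double",
    "feasibleSpace": {"min": "0.001", "max": "0.1"},
})
```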
Import
# Install Katib components
kubectl apply -k "github.com/kubeflow/katib/manifests/v1beta1/installs/katib-standalone"
# Submit an Experiment
kubectl apply -f experiment.yaml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| metadata.name | string | Yes | Name of the Katib Experiment resource |
| spec.algorithm.algorithmName | string | Yes | Search algorithm to use (e.g., random, tpe, bayesianoptimization) |
| spec.objective.type | string | Yes | Optimization direction: maximize or minimize |
| spec.objective.objectiveMetricName | string | Yes | Name of the metric to optimize |
| spec.objective.goal | float | No | Target metric value at which to stop the experiment |
| spec.parameters[] | list | Yes | Hyperparameter search space definitions |
| spec.maxTrialCount | integer | Yes | Maximum number of trials to run |
| spec.parallelTrialCount | integer | No | Maximum number of trials to run concurrently |
| spec.trialTemplate | object | Yes | Template for the training job launched per trial |
| spec.earlyStopping | object | No | Early stopping algorithm configuration |
Outputs
| Name | Type | Description |
|---|---|---|
| Optimal hyperparameters | Experiment.status.currentOptimalTrial | The best hyperparameter set found across all trials |
| Trial history | list of Trial resources | Complete record of all trials with parameters and metrics |
| Best trial model artifacts | files (storage URI) | Model artifacts produced by the best-performing trial |
| Experiment status | Experiment.status | Conditions (e.g., Running, Succeeded, Failed) plus running/succeeded/failed trial counts |
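The best configuration is reported in Experiment.status.currentOptimalTrial as a list of parameterAssignments alongside an observation of the objective metric. A small sketch of extracting it from an already-fetched status dict (field names follow the v1beta1 status schema; the sample values are made up):

```python
# Sketch: pulling the optimal hyperparameters out of an Experiment's
# status dict, as reported under status.currentOptimalTrial.

def optimal_assignments(status: dict) -> dict:
    """Map parameter name -> best value from status.currentOptimalTrial."""
    best = status.get("currentOptimalTrial") or {}
    return {
        a["name"]: a["value"]
        for a in best.get("parameterAssignments", [])
    }

# Example status fragment shaped like Katib's output (values illustrative):
status = {
    "currentOptimalTrial": {
        "bestTrialName": "tune-learning-rate-abc123",
        "parameterAssignments": [
            {"name": "learning_rate", "value": "0.0042"},
            {"name": "batch_size", "value": "64"},
        ],
        "observation": {
            "metrics": [{"name": "accuracy", "latest": "0.96"}]
        },
    }
}
print(optimal_assignments(status))
# {'learning_rate': '0.0042', 'batch_size': '64'}
```

Note that Katib reports parameter values as strings; callers cast them back to int/float as needed.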
Usage Examples
Basic Usage
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tune-learning-rate
  namespace: ml-team
spec:
  algorithm:
    algorithmName: tpe
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
  parallelTrialCount: 3
  maxTrialCount: 20
  maxFailedTrialCount: 3
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
        step: "16"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - "adam"
          - "sgd"
          - "adamw"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: "Learning rate"
        reference: learning_rate
      - name: batchSize
        description: "Batch size"
        reference: batch_size
      - name: optimizer
        description: "Optimizer"
        reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: my-registry/trainer:latest
                command:
                  - "python"
                  - "train.py"
                  - "--lr=${trialParameters.learningRate}"
                  - "--batch-size=${trialParameters.batchSize}"
                  - "--optimizer=${trialParameters.optimizer}"
                resources:
                  requests:
                    cpu: "4"
                    memory: "8Gi"
                  limits:
                    nvidia.com/gpu: "1"
            restartPolicy: Never
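Once submitted, an experiment like the one above can be polled until it reaches a terminal condition. A sketch assuming the Kubernetes Python client and a reachable cluster; the condition types follow Katib's status conventions:

```python
# Sketch: poll an Experiment until Katib marks it Succeeded or Failed.
# Requires `pip install kubernetes` and cluster access for the polling half.
import time

def is_finished(status: dict) -> bool:
    """True once a Succeeded or Failed condition is reported as active."""
    return any(
        c.get("type") in ("Succeeded", "Failed") and c.get("status") == "True"
        for c in status.get("conditions", [])
    )

def wait_for_experiment(name: str, namespace: str, timeout_s: int = 3600) -> dict:
    # Lazy import so is_finished() is usable without the client installed.
    from kubernetes import client, config
    config.load_kube_config()
    api = client.CustomObjectsApi()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        exp = api.get_namespaced_custom_object(
            group="kubeflow.org", version="v1beta1",
            namespace=namespace, plural="experiments", name=name,
        )
        if is_finished(exp.get("status", {})):
            return exp
        time.sleep(30)
    raise TimeoutError(f"experiment {name} did not finish in {timeout_s}s")
```

The returned object carries status.currentOptimalTrial with the best hyperparameters found, per the I/O contract above.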