Implementation: Kubeflow TrainJob CRD Creation
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Distributed Training, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A Kubernetes-native resource, provided by the Kubeflow Trainer component, for submitting distributed model training jobs on Kubernetes.
Description
The TrainJob CRD is the Kubernetes-native resource through which Kubeflow Trainer manages distributed training workloads. When a TrainJob resource is created, the Trainer controller provisions the required training pods (initializer, launcher, trainer nodes), configures the distributed communication environment, monitors job health, and reports completion status. The Trainer V2.0 API (targeted for Kubeflow v1.11) introduces a simplified, unified interface that abstracts away framework-specific details behind a common modelConfig, datasetConfig, and trainer specification.
TrainJob supports multiple distributed training runtimes including PyTorch DistributedDataParallel, TensorFlow MultiWorkerMirroredStrategy, MPI-based Horovod, XGBoost distributed, and JAX multi-process training. The CRD specification allows users to declare the number of training nodes, resources per node, and training runtime without manually configuring rank assignment, master addresses, or communication backends.
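For the PyTorch runtime, for example, the rendezvous configuration that users would otherwise set by hand is delivered through the standard `env://` environment variables (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`), which `torchrun` and the Trainer runtime populate inside each trainer pod. A minimal trainer-side sketch that reads them (the fallback defaults are illustrative single-process values, not part of the TrainJob API):

```python
import os

def dist_env() -> dict:
    """Read the standard PyTorch 'env://' rendezvous variables.

    Under a TrainJob these are populated by the runtime rather than by
    hand; the defaults below are illustrative single-process fallbacks.
    """
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "rank": int(os.environ.get("RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
    }

# Inside train.py, PyTorch consumes the same variables via:
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
```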
Usage
Use TrainJob CRD creation when:
- A model training job must run as a distributed workload across multiple Kubernetes nodes or GPUs.
- Training must be submitted programmatically from a pipeline step or CI/CD trigger.
- The team requires managed fault tolerance, pod health monitoring, and automatic restart for training jobs.
- A unified API is preferred over framework-specific operator configurations.
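For the programmatic-submission case, the TrainJob manifest is an ordinary Kubernetes custom resource, so it can be assembled as a plain mapping and handed to any Kubernetes client. A minimal sketch under that assumption (the helper name is hypothetical; the field layout mirrors the Signature section below):

```python
def build_trainjob(name: str, namespace: str, image: str,
                   num_nodes: int, command: list[str],
                   gpus_per_node: int) -> dict:
    """Assemble a TrainJob manifest as a dict (hypothetical helper)."""
    return {
        "apiVersion": "kubeflow.org/v2alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "numNodes": num_nodes,
            "trainer": {
                "image": image,
                "command": command,
                "resourcesPerNode": {
                    # GPU counts are string quantities in Kubernetes manifests.
                    "requests": {"nvidia.com/gpu": str(gpus_per_node)},
                },
            },
        },
    }

# A pipeline step could then submit this via the official Kubernetes
# Python client's CustomObjectsApi, e.g.:
#   client.CustomObjectsApi().create_namespaced_custom_object(
#       group="kubeflow.org", version="v2alpha1", namespace=ns,
#       plural="trainjobs", body=build_trainjob(...))
```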
Code Reference
Source Location
- Repository: kubeflow/trainer
- File: config/crd/bases/kubeflow.org_trainjobs.yaml (CRD schema)
Signature
```yaml
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: <trainjob-name>
  namespace: <namespace>
spec:
  modelConfig:
    input:
      config: <model-config-reference>
  datasetConfig:
    storageUri: <dataset-storage-uri>
  numNodes: <number-of-training-nodes>
  trainer:
    image: <training-container-image>
    command:
      - "torchrun"
      - "train.py"
    resourcesPerNode:
      requests:
        cpu: "<cpu>"
        memory: "<memory>"
        nvidia.com/gpu: "<gpu-count>"
```
Import
```shell
# Install the Kubeflow Trainer operator
kubectl apply -k "github.com/kubeflow/trainer/manifests/overlays/standalone"

# Submit a TrainJob
kubectl apply -f trainjob.yaml
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| metadata.name | string | Yes | Name of the TrainJob resource |
| metadata.namespace | string | Yes | Kubernetes namespace for the training job |
| spec.modelConfig | object | No | Model configuration reference (pretrained model, config file) |
| spec.datasetConfig | object | No | Dataset storage URI and access configuration |
| spec.numNodes | integer | Yes | Number of distributed training nodes to provision |
| spec.trainer.image | string | Yes | Container image containing the training code and runtime |
| spec.trainer.command | list | No | Entrypoint command for the training container |
| spec.trainer.resourcesPerNode | object | Yes | CPU, memory, and GPU resource requests per training node |
Outputs
| Name | Type | Description |
|---|---|---|
| Trained model artifacts | files (storage URI) | Model weights and configuration saved to the output location |
| Training logs | Kubernetes pod logs | Stdout/stderr logs from all training pods |
| Job completion status | TrainJob.status | SUCCESS, FAILED, or RUNNING status with conditions |
| Training metrics | emitted metrics | Loss, accuracy, and throughput metrics emitted during training |
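The completion status can be consumed programmatically by inspecting the conditions list in `TrainJob.status`, the same way other Kubernetes workload resources are watched. A sketch of terminal-state extraction (the condition types `Complete` and `Failed` follow common Kubernetes job conventions and should be verified against the installed CRD; the helper is illustrative):

```python
def terminal_state(status: dict) -> str:
    """Map a TrainJob .status block to SUCCESS / FAILED / RUNNING.

    Assumes job-style condition types ("Complete", "Failed"); check the
    exact names against the installed trainjobs CRD schema.
    """
    for cond in status.get("conditions", []):
        if cond.get("status") != "True":
            continue  # only conditions that currently hold count
        if cond.get("type") == "Complete":
            return "SUCCESS"
        if cond.get("type") == "Failed":
            return "FAILED"
    return "RUNNING"
```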
Usage Examples
Basic Usage
```yaml
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: pytorch-train-resnet
  namespace: ml-team
spec:
  numNodes: 2
  trainer:
    image: my-registry/pytorch-trainer:latest
    command:
      - "torchrun"
      - "--nnodes=2"
      - "--nproc_per_node=4"
      - "train.py"
      - "--epochs=50"
      - "--batch-size=256"
    resourcesPerNode:
      requests:
        cpu: "8"
        memory: "32Gi"
        nvidia.com/gpu: "4"
      limits:
        cpu: "16"
        memory: "64Gi"
        nvidia.com/gpu: "4"
  datasetConfig:
    storageUri: "s3://ml-datasets/imagenet"
```
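Because resources are declared per node, the cluster must be able to schedule `numNodes` pods of that shape at once; for the example above that is 2 × 4 = 8 GPUs and 2 × 32Gi of requested memory in aggregate. A quick illustrative check (this only reads the plain-integer GPU quantity used here, not the full Kubernetes quantity grammar):

```python
def total_gpus(spec: dict) -> int:
    """Total GPU request across all training nodes (illustrative helper)."""
    per_node = int(spec["trainer"]["resourcesPerNode"]["requests"]["nvidia.com/gpu"])
    return spec["numNodes"] * per_node

# Shape of the example spec above, reduced to the fields we need.
spec = {
    "numNodes": 2,
    "trainer": {"resourcesPerNode": {"requests": {"nvidia.com/gpu": "4"}}},
}
print(total_gpus(spec))  # → 8
```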
LLM Fine-Tuning with Model Config
```yaml
apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: llm-finetune-job
  namespace: llm-team
spec:
  modelConfig:
    input:
      config: "hf://meta-llama/Llama-3-8b"
  datasetConfig:
    storageUri: "s3://ml-datasets/instruction-tuning"
  numNodes: 4
  trainer:
    image: my-registry/llm-trainer:latest
    command:
      - "torchrun"
      - "finetune.py"
      - "--lora-rank=16"
    resourcesPerNode:
      requests:
        cpu: "16"
        memory: "128Gi"
        nvidia.com/gpu: "8"
```