
Implementation:Kubeflow TrainJob CRD Creation

From Leeroopedia
Domains MLOps, Distributed Training, Kubernetes
Last Updated 2026-02-13 00:00 GMT

Overview

A concrete tool, provided by the Kubeflow Trainer component, for submitting distributed model training jobs on Kubernetes.

Description

The TrainJob CRD is the Kubernetes-native resource through which Kubeflow Trainer manages distributed training workloads. When a TrainJob resource is created, the Trainer controller provisions the required training pods (initializer, launcher, trainer nodes), configures the distributed communication environment, monitors job health, and reports completion status. The Trainer V2.0 API (targeted for Kubeflow v1.11) introduces a simplified, unified interface that abstracts away framework-specific details behind a common modelConfig, datasetConfig, and trainer specification.

TrainJob supports multiple distributed training runtimes including PyTorch DistributedDataParallel, TensorFlow MultiWorkerMirroredStrategy, MPI-based Horovod, XGBoost distributed, and JAX multi-process training. The CRD specification allows users to declare the number of training nodes, resources per node, and training runtime without manually configuring rank assignment, master addresses, or communication backends.


Usage

Use TrainJob CRD creation when:

  • A model training job must run as a distributed workload across multiple Kubernetes nodes or GPUs.
  • Training must be submitted programmatically from a pipeline step or CI/CD trigger.
  • The team requires managed fault tolerance, pod health monitoring, and automatic restart for training jobs.
  • A unified API is preferred over framework-specific operator configurations.
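The programmatic-submission case above can be sketched in Python: build the TrainJob manifest as a plain dictionary mirroring the v2alpha1 schema used on this page, then hand it to `kubectl apply` or the Kubernetes API. This is a hedged sketch, not Trainer SDK code; the image name and dataset bucket are placeholders.

```python
import json

def make_trainjob(name, namespace, image, num_nodes, dataset_uri, command):
    """Build a TrainJob manifest mirroring the v2alpha1 schema on this page."""
    return {
        "apiVersion": "kubeflow.org/v2alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "numNodes": num_nodes,
            "datasetConfig": {"storageUri": dataset_uri},
            "trainer": {
                "image": image,
                "command": command,
                "resourcesPerNode": {
                    "requests": {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "4"}
                },
            },
        },
    }

job = make_trainjob(
    name="pytorch-train-resnet",
    namespace="ml-team",
    image="my-registry/pytorch-trainer:latest",  # placeholder image
    num_nodes=2,
    dataset_uri="s3://ml-datasets/imagenet",     # placeholder bucket
    command=["torchrun", "train.py"],
)

manifest = json.dumps(job)  # kubectl accepts JSON manifests as well as YAML
# From a pipeline step with cluster access, this could be piped to:
#   subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, text=True, check=True)
```

Generating the manifest in code (rather than templating YAML strings) keeps field names in one place and lets a pipeline step validate the spec before submission.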

Code Reference

Source Location

  • Repository: kubeflow/trainer
  • File: config/crd/bases/kubeflow.org_trainjobs.yaml (CRD schema)

Signature

apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: <trainjob-name>
  namespace: <namespace>
spec:
  modelConfig:
    input:
      config: <model-config-reference>
  datasetConfig:
    storageUri: <dataset-storage-uri>
  numNodes: <number-of-training-nodes>
  trainer:
    image: <training-container-image>
    command:
      - "torchrun"
      - "train.py"
    resourcesPerNode:
      requests:
        cpu: "<cpu>"
        memory: "<memory>"
        nvidia.com/gpu: "<gpu-count>"

Import

# Install Kubeflow Trainer operator
kubectl apply -k "github.com/kubeflow/trainer/manifests/overlays/standalone"

# Submit a TrainJob
kubectl apply -f trainjob.yaml

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| metadata.name | string | Yes | Name of the TrainJob resource |
| metadata.namespace | string | Yes | Kubernetes namespace for the training job |
| spec.modelConfig | object | No | Model configuration reference (pretrained model, config file) |
| spec.datasetConfig | object | No | Dataset storage URI and access configuration |
| spec.numNodes | integer | Yes | Number of distributed training nodes to provision |
| spec.trainer.image | string | Yes | Container image containing the training code and runtime |
| spec.trainer.command | list | No | Entrypoint command for the training container |
| spec.trainer.resourcesPerNode | object | Yes | CPU, memory, and GPU resource requests per training node |
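The required/optional split in the table above can be enforced client-side before submission. A minimal validator (a sketch, not part of any Trainer SDK) that checks the required fields listed in the contract:

```python
# Required fields from the I/O contract, expressed as key paths.
REQUIRED = [
    ("metadata", "name"),
    ("metadata", "namespace"),
    ("spec", "numNodes"),
    ("spec", "trainer", "image"),
    ("spec", "trainer", "resourcesPerNode"),
]

def missing_fields(manifest):
    """Return dotted paths from the I/O contract that are absent in the manifest."""
    missing = []
    for path in REQUIRED:
        node = manifest
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing

incomplete = {"metadata": {"name": "job"}, "spec": {"numNodes": 2}}
result = missing_fields(incomplete)
# Flags metadata.namespace, spec.trainer.image, spec.trainer.resourcesPerNode
```

Running such a check in a pipeline step fails fast on malformed specs instead of waiting for the API server's admission error.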

Outputs

| Name | Type | Description |
|------|------|-------------|
| Trained model artifacts | files (storage URI) | Model weights and configuration saved to the output location |
| Training logs | Kubernetes pod logs | Stdout/stderr logs from all training pods |
| Job completion status | TrainJob.status | SUCCESS, FAILED, or RUNNING status with conditions |
| Training metrics | emitted metrics | Loss, accuracy, and throughput metrics emitted during training |
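Job completion status surfaces through `TrainJob.status` conditions, which a pipeline can poll with `kubectl get trainjob <name> -o json`. A hedged sketch of mapping that JSON to a terminal state; the condition type names (`Complete`, `Failed`) are assumptions here, not verified against the CRD schema:

```python
import json

def terminal_state(status):
    """Map TrainJob status conditions to a terminal state, or None if still running.

    Condition type names are assumed for illustration; check the installed
    CRD's status schema before relying on them.
    """
    for cond in status.get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return "SUCCESS" if cond["type"] == "Complete" else "FAILED"
    return None

# Example payload shaped like `kubectl get trainjob my-job -o jsonpath='{.status}'`
raw = '{"conditions": [{"type": "Complete", "status": "True"}]}'
state = terminal_state(json.loads(raw))  # "SUCCESS"
```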

Usage Examples

Basic Usage

apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: pytorch-train-resnet
  namespace: ml-team
spec:
  numNodes: 2
  trainer:
    image: my-registry/pytorch-trainer:latest
    command:
      - "torchrun"
      - "--nnodes=2"
      - "--nproc_per_node=4"
      - "train.py"
      - "--epochs=50"
      - "--batch-size=256"
    resourcesPerNode:
      requests:
        cpu: "8"
        memory: "32Gi"
        nvidia.com/gpu: "4"
      limits:
        cpu: "16"
        memory: "64Gi"
        nvidia.com/gpu: "4"
  datasetConfig:
    storageUri: "s3://ml-datasets/imagenet"

LLM Fine-Tuning with Model Config

apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: llm-finetune-job
  namespace: llm-team
spec:
  modelConfig:
    input:
      config: "hf://meta-llama/Llama-3-8b"
  datasetConfig:
    storageUri: "s3://ml-datasets/instruction-tuning"
  numNodes: 4
  trainer:
    image: my-registry/llm-trainer:latest
    command:
      - "torchrun"
      - "finetune.py"
      - "--lora-rank=16"
    resourcesPerNode:
      requests:
        cpu: "16"
        memory: "128Gi"
        nvidia.com/gpu: "8"
