
Implementation:Kubeflow TrainJob CRD Creation

From Leeroopedia
Domains MLOps, Distributed Training, Kubernetes
Last Updated 2026-02-13 00:00 GMT

Overview

A concrete tool, provided by the Kubeflow Trainer component, for submitting distributed model training jobs on Kubernetes.

Description

The TrainJob CRD is the Kubernetes-native resource through which Kubeflow Trainer manages distributed training workloads. When a TrainJob resource is created, the Trainer controller provisions the required training pods (initializer, launcher, trainer nodes), configures the distributed communication environment, monitors job health, and reports completion status. The Trainer V2.0 API (targeted for Kubeflow v1.11) introduces a simplified, unified interface that abstracts away framework-specific details behind a common modelConfig, datasetConfig, and trainer specification.

TrainJob supports multiple distributed training runtimes including PyTorch DistributedDataParallel, TensorFlow MultiWorkerMirroredStrategy, MPI-based Horovod, XGBoost distributed, and JAX multi-process training. The CRD specification allows users to declare the number of training nodes, resources per node, and training runtime without manually configuring rank assignment, master addresses, or communication backends.


Usage

Use TrainJob CRD creation when:

  • A model training job must run as a distributed workload across multiple Kubernetes nodes or GPUs.
  • Training must be submitted programmatically from a pipeline step or CI/CD trigger.
  • The team requires managed fault tolerance, pod health monitoring, and automatic restart for training jobs.
  • A unified API is preferred over framework-specific operator configurations.
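The programmatic-submission case above can be sketched in Python: build the TrainJob manifest as a plain dictionary mirroring the v2alpha1 schema used on this page, then hand it to `kubectl apply` or the Kubernetes API. This is a hedged sketch, not Trainer SDK code; the image name and dataset bucket are placeholders.

```python
import json

def make_trainjob(name, namespace, image, num_nodes, dataset_uri, command):
    """Build a TrainJob manifest mirroring the v2alpha1 schema on this page."""
    return {
        "apiVersion": "kubeflow.org/v2alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "numNodes": num_nodes,
            "datasetConfig": {"storageUri": dataset_uri},
            "trainer": {
                "image": image,
                "command": command,
                "resourcesPerNode": {
                    "requests": {"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "4"}
                },
            },
        },
    }

job = make_trainjob(
    name="pytorch-train-resnet",
    namespace="ml-team",
    image="my-registry/pytorch-trainer:latest",  # placeholder image
    num_nodes=2,
    dataset_uri="s3://ml-datasets/imagenet",     # placeholder bucket
    command=["torchrun", "train.py"],
)

manifest = json.dumps(job)  # kubectl accepts JSON manifests as well as YAML
# From a pipeline step with cluster access, this could be piped to:
#   subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, text=True, check=True)
```

Generating the manifest in code (rather than templating YAML strings) keeps field names in one place and lets a pipeline step validate the spec before submission.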

Code Reference

Source Location

  • Repository: kubeflow/trainer
  • File: config/crd/bases/kubeflow.org_trainjobs.yaml (CRD schema)

Signature

apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: <trainjob-name>
  namespace: <namespace>
spec:
  modelConfig:
    input:
      config: <model-config-reference>
  datasetConfig:
    storageUri: <dataset-storage-uri>
  numNodes: <number-of-training-nodes>
  trainer:
    image: <training-container-image>
    command:
      - "torchrun"
      - "train.py"
    resourcesPerNode:
      requests:
        cpu: "<cpu>"
        memory: "<memory>"
        nvidia.com/gpu: "<gpu-count>"

Import

# Install Kubeflow Trainer operator
kubectl apply -k "github.com/kubeflow/trainer/manifests/overlays/standalone"

# Submit a TrainJob
kubectl apply -f trainjob.yaml

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| metadata.name | string | Yes | Name of the TrainJob resource |
| metadata.namespace | string | Yes | Kubernetes namespace for the training job |
| spec.modelConfig | object | No | Model configuration reference (pretrained model, config file) |
| spec.datasetConfig | object | No | Dataset storage URI and access configuration |
| spec.numNodes | integer | Yes | Number of distributed training nodes to provision |
| spec.trainer.image | string | Yes | Container image containing the training code and runtime |
| spec.trainer.command | list | No | Entrypoint command for the training container |
| spec.trainer.resourcesPerNode | object | Yes | CPU, memory, and GPU resource requests per training node |
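The required/optional split in the table above can be enforced client-side before submission. A minimal validator (a sketch, not part of any Trainer SDK) that checks the required fields listed in the contract:

```python
# Required fields from the I/O contract, expressed as key paths.
REQUIRED = [
    ("metadata", "name"),
    ("metadata", "namespace"),
    ("spec", "numNodes"),
    ("spec", "trainer", "image"),
    ("spec", "trainer", "resourcesPerNode"),
]

def missing_fields(manifest):
    """Return dotted paths from the I/O contract that are absent in the manifest."""
    missing = []
    for path in REQUIRED:
        node = manifest
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing

incomplete = {"metadata": {"name": "job"}, "spec": {"numNodes": 2}}
result = missing_fields(incomplete)
# Flags metadata.namespace, spec.trainer.image, spec.trainer.resourcesPerNode
```

Running such a check in a pipeline step fails fast on malformed specs instead of waiting for the API server's admission error.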

Outputs

| Name | Type | Description |
|------|------|-------------|
| Trained model artifacts | files (storage URI) | Model weights and configuration saved to the output location |
| Training logs | Kubernetes pod logs | Stdout/stderr logs from all training pods |
| Job completion status | TrainJob.status | SUCCESS, FAILED, or RUNNING status with conditions |
| Training metrics | emitted metrics | Loss, accuracy, and throughput metrics emitted during training |
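Job completion status surfaces through `TrainJob.status` conditions, which a pipeline can poll with `kubectl get trainjob <name> -o json`. A hedged sketch of mapping that JSON to a terminal state; the condition type names (`Complete`, `Failed`) are assumptions here, not verified against the CRD schema:

```python
import json

def terminal_state(status):
    """Map TrainJob status conditions to a terminal state, or None if still running.

    Condition type names are assumed for illustration; check the installed
    CRD's status schema before relying on them.
    """
    for cond in status.get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return "SUCCESS" if cond["type"] == "Complete" else "FAILED"
    return None

# Example payload shaped like `kubectl get trainjob my-job -o jsonpath='{.status}'`
raw = '{"conditions": [{"type": "Complete", "status": "True"}]}'
state = terminal_state(json.loads(raw))  # "SUCCESS"
```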

Usage Examples

Basic Usage

apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: pytorch-train-resnet
  namespace: ml-team
spec:
  numNodes: 2
  trainer:
    image: my-registry/pytorch-trainer:latest
    command:
      - "torchrun"
      - "--nnodes=2"
      - "--nproc_per_node=4"
      - "train.py"
      - "--epochs=50"
      - "--batch-size=256"
    resourcesPerNode:
      requests:
        cpu: "8"
        memory: "32Gi"
        nvidia.com/gpu: "4"
      limits:
        cpu: "16"
        memory: "64Gi"
        nvidia.com/gpu: "4"
  datasetConfig:
    storageUri: "s3://ml-datasets/imagenet"

LLM Fine-Tuning with Model Config

apiVersion: kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: llm-finetune-job
  namespace: llm-team
spec:
  modelConfig:
    input:
      config: "hf://meta-llama/Llama-3-8b"
  datasetConfig:
    storageUri: "s3://ml-datasets/instruction-tuning"
  numNodes: 4
  trainer:
    image: my-registry/llm-trainer:latest
    command:
      - "torchrun"
      - "finetune.py"
      - "--lora-rank=16"
    resourcesPerNode:
      requests:
        cpu: "16"
        memory: "128Gi"
        nvidia.com/gpu: "8"
