Implementation:PacktPublishing LLM Engineers Handbook SFTTrainer Train

Field                Value
Implementation Name  SFTTrainer Train
Type                 Wrapper Doc (TRL library)
Source File          llm_engineering/model/finetuning/finetune.py:L117-199
Workflow             LLM_Finetuning
Repo                 PacktPublishing/LLM-Engineers-Handbook
Implements           Principle:PacktPublishing_LLM_Engineers_Handbook_Supervised_Finetuning

Function Signatures

SFT Training

SFTTrainer(
    model,
    tokenizer,
    train_dataset,
    eval_dataset,
    dataset_text_field: str,
    max_seq_length: int,
    args: TrainingArguments,
).train() -> TrainOutput

DPO Training

DPOTrainer(
    model,
    ref_model: Optional[Model],
    tokenizer,
    beta: float,
    train_dataset,
    eval_dataset,
    args: DPOConfig,
).train() -> TrainOutput

Imports

from trl import SFTTrainer, DPOTrainer, DPOConfig
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
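
is_bfloat16_supported() is imported from Unsloth here; if Unsloth is not installed, PyTorch ships an equivalent probe. A minimal fallback sketch (the try/except wrapper is an assumption, not part of the handbook's code):

try:
    from unsloth import is_bfloat16_supported
except ImportError:
    import torch

    def is_bfloat16_supported() -> bool:
        # PyTorch's own probe: True on Ampere (compute capability >= 8.0) and newer GPUs.
        return torch.cuda.is_available() and torch.cuda.is_bf16_supported()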

Description

The fine-tuning pipeline uses HuggingFace's TRL (Transformer Reinforcement Learning) library to perform both SFT and DPO training. SFTTrainer handles supervised fine-tuning on instruction-response pairs, while DPOTrainer handles preference optimization on chosen/rejected pairs. Both trainers manage the complete training loop including gradient computation, optimizer steps, logging, evaluation, and checkpointing.
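
Both trainers assume the dataset splits are already formatted: SFTTrainer reads a single text column (dataset_text_field="text" below), while DPOTrainer expects prompt, chosen, and rejected columns. A minimal sketch of building the SFT text column from instruction-response pairs (the Alpaca-style template and the column names are illustrative assumptions, not the handbook's exact schema):

from datasets import Dataset

def add_text_column(sample: dict) -> dict:
    # Assumed Alpaca-style template; the repo's actual prompt template may differ.
    sample["text"] = (
        f"### Instruction:\n{sample['instruction']}\n\n"
        f"### Response:\n{sample['output']}"
    )
    return sample

raw = Dataset.from_list([
    {
        "instruction": "Summarize LoRA in one sentence.",
        "output": "LoRA trains small low-rank adapter matrices instead of the full weights.",
    }
])
dataset = raw.map(add_text_column)  # produces the "text" column SFTTrainer reads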

SFT Training Implementation

Key Code

# From llm_engineering/model/finetuning/finetune.py

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=8,
        optim="adamw_8bit",
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        report_to="comet_ml",
    ),
)
trainer.train()

SFT Parameters

Parameter                    Type       Value             Description
model                        Model      --                The model with LoRA adapters injected.
tokenizer                    Tokenizer  --                The corresponding tokenizer.
train_dataset                Dataset    dataset["train"]  Training split of the formatted dataset.
eval_dataset                 Dataset    dataset["test"]   Evaluation split for validation loss monitoring.
dataset_text_field           str        "text"            Column name containing the formatted training text.
max_seq_length               int        2048              Maximum sequence length for tokenization/truncation.
learning_rate                float      3e-4              Learning rate for the AdamW optimizer.
num_train_epochs             int        3                 Number of training epochs.
per_device_train_batch_size  int        2                 Batch size per GPU device.
gradient_accumulation_steps  int        8                 Steps to accumulate gradients before each optimizer update; effective batch size = 2 * 8 = 16.
optim                        str        "adamw_8bit"      8-bit AdamW optimizer to reduce memory usage.
fp16 / bf16                  bool       Auto-detected     Mixed-precision training; BF16 preferred when supported (Ampere+ GPUs).
report_to                    str        "comet_ml"        Experiment tracking platform for logging metrics.
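
The batch-size arithmetic above fixes how many optimizer updates a run performs. A quick sanity check (the single-GPU setup and the 10,000-sample dataset size are illustrative assumptions, not figures from the handbook):

per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_train_epochs = 3
num_gpus = 1                   # assumption: single-GPU run
num_samples = 10_000           # illustrative dataset size

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
optimizer_steps = (num_samples // effective_batch_size) * num_train_epochs
print(effective_batch_size)    # 16
print(optimizer_steps)         # 1875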

DPO Training Implementation

Key Code

# From llm_engineering/model/finetuning/finetune.py

trainer = DPOTrainer(
    model=model,
    ref_model=None,
    tokenizer=tokenizer,
    beta=0.1,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=DPOConfig(
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        gradient_accumulation_steps=8,
        optim="adamw_8bit",
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        report_to="comet_ml",
    ),
)
trainer.train()

DPO-Specific Parameters

Parameter  Type           Value  Description
ref_model  Model or None  None   Reference model for DPO. When None, the reference is implicit: with a LoRA/PEFT model, TRL disables the adapters and scores with the frozen base weights, so no second model copy is kept in memory.
beta       float          0.1    Inverse temperature for the DPO loss. Higher values penalize deviation from the reference model more strongly (more conservative updates); lower values let the policy drift further from the reference (more aggressive preference optimization).
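
The role of beta is clearest in the loss itself, where it scales the policy-vs-reference log-ratio margin. A pedagogical PyTorch sketch of the sigmoid DPO loss (a reimplementation for illustration, not TRL's internal code):

import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-probability ratios of the policy against the frozen reference.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # beta scales the margin: higher beta penalizes drift from the reference harder.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

loss = dpo_sigmoid_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                        torch.tensor([-13.0]), torch.tensor([-14.0]))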

Returns

trainer.train() returns a TrainOutput object, but the primary effect is in-place modification of the model's LoRA adapter weights. The fine-tuned model is available via the same model reference after training completes.
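
A sketch of consuming the return value and persisting the result (the output path is illustrative; on a PEFT model, save_pretrained writes only the adapter weights):

train_output = trainer.train()
print(train_output.global_step)     # total optimizer steps performed
print(train_output.training_loss)   # average training loss
print(train_output.metrics)         # runtime, throughput, etc.

# `model` now carries the updated LoRA weights in place.
model.save_pretrained("output/finetuned-adapter")      # illustrative path
tokenizer.save_pretrained("output/finetuned-adapter")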

Training Metrics

Both trainers log the following metrics to Comet ML during training; a minimal credential-setup sketch follows the list:

  • Training loss: Per-step and per-epoch loss values.
  • Evaluation loss: Validation loss computed on the test split.
  • Learning rate schedule: Current learning rate at each step.
  • GPU memory usage: Peak memory utilization.
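
With report_to="comet_ml", the transformers integration reads Comet credentials from the environment. A minimal setup sketch using Comet's standard environment variables (the project name is illustrative):

import os

# Assumption: credentials are supplied via environment variables, as the Comet integration expects.
os.environ["COMET_API_KEY"] = "<your-api-key>"
os.environ["COMET_PROJECT_NAME"] = "llm-engineering"   # illustrative project name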

External Dependencies

Package       Purpose
trl           TRL library providing SFTTrainer and DPOTrainer
transformers  TrainingArguments configuration and training infrastructure
unsloth       BF16 support detection via is_bfloat16_supported()
comet_ml      Experiment tracking and metric logging
