Workflow:PacktPublishing LLM Engineers Handbook LLM Finetuning

Knowledge Sources	LLM Engineers Handbook, Unsloth Docs, TRL Docs, AWS SageMaker Docs
Domains	LLMs, Fine_Tuning, LLM_Ops
Last Updated	2026-02-08 07:45 GMT

Overview

End-to-end process for fine-tuning Llama-3.1-8B using a two-stage approach (SFT then DPO) with QLoRA on AWS SageMaker, producing a personalized "LLM Twin" model pushed to HuggingFace Hub.

Description

This workflow fine-tunes a base Llama-3.1-8B model in two sequential stages. First, Supervised Fine-Tuning (SFT) adapts the base model to follow instructions using the generated instruction dataset combined with a public fine-tuning dataset. Second, Direct Preference Optimization (DPO) aligns the SFT model with human preferences using the generated preference dataset. Both stages use Unsloth for memory-efficient 4-bit quantization and LoRA adapter injection, and are executed as SageMaker training jobs orchestrated by ZenML. The final merged models are pushed to HuggingFace Hub.

Usage

Execute this workflow after the Dataset Generation pipeline has published both the instruction and preference datasets to HuggingFace Hub. AWS SageMaker must be configured with access to GPU instances (ml.g5.2xlarge). Run the SFT stage first, then switch the configuration to DPO and run again, because the DPO stage loads the SFT model output as its starting point.

Execution Steps

Step 1: SageMaker Job Configuration

Configure and launch a SageMaker training job through ZenML. The pipeline creates a HuggingFace SageMaker estimator with the fine-tuning script, dependencies, and hyperparameters (learning rate, epochs, batch size, fine-tuning type). Environment variables for HuggingFace and Comet ML authentication are injected into the training container.

Key considerations:

Instance type is ml.g5.2xlarge (one NVIDIA A10G GPU with 24GB of VRAM)
The finetuning_type parameter controls whether SFT or DPO training runs
Hyperparameters are passed from the YAML config through ZenML to SageMaker
Comet ML experiment tracking is enabled via environment variables
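
The configuration flow above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the entry point name, source directory, container versions, and environment variable names are assumptions, and the function names are hypothetical. The SageMaker import is deferred so the helper functions can be inspected without AWS credentials.

```python
from typing import Any, Dict


def build_hyperparameters(finetuning_type: str, num_train_epochs: int,
                          per_device_train_batch_size: int,
                          learning_rate: float) -> Dict[str, Any]:
    """Collect the values that would travel from the YAML config to SageMaker."""
    if finetuning_type not in {"sft", "dpo"}:
        raise ValueError(f"unknown finetuning_type: {finetuning_type!r}")
    return {
        "finetuning_type": finetuning_type,
        "num_train_epochs": num_train_epochs,
        "per_device_train_batch_size": per_device_train_batch_size,
        "learning_rate": learning_rate,
    }


def build_estimator_kwargs(hyperparameters: Dict[str, Any], role_arn: str,
                           hf_token: str, comet_api_key: str) -> Dict[str, Any]:
    """Keyword arguments for sagemaker.huggingface.HuggingFace."""
    return {
        "entry_point": "finetune.py",    # assumption: training script name
        "source_dir": "./finetuning",    # assumption: script + requirements.txt
        "instance_type": "ml.g5.2xlarge",
        "instance_count": 1,
        "role": role_arn,
        "transformers_version": "4.36",  # assumption: check available containers
        "pytorch_version": "2.1",        # assumption
        "py_version": "py310",           # assumption
        "hyperparameters": hyperparameters,
        # Assumed variable names; injected into the training container.
        "environment": {
            "HUGGING_FACE_HUB_TOKEN": hf_token,
            "COMET_API_KEY": comet_api_key,
        },
    }


def launch_training_job(estimator_kwargs: Dict[str, Any]):
    """Create the estimator and start the job (requires AWS credentials)."""
    from sagemaker.huggingface import HuggingFace  # deferred import

    estimator = HuggingFace(**estimator_kwargs)
    estimator.fit()
    return estimator
```

Switching between the two stages would then only require changing the finetuning_type and learning rate values passed to build_hyperparameters.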

Step 2: Model Loading with Quantization

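Load the base model (Llama-3.1-8B for SFT, or the SFT output model for DPO) using Unsloth's FastLanguageModel with 4-bit quantization. This reduces the memory footprint of the weights from ~16GB at 16-bit precision to ~4GB, leaving room for adapters, activations, and optimizer state on a single 24GB GPU and making training feasible on consumer-grade GPUs.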

Key considerations:

For SFT: loads meta-llama/Llama-3.1-8B as the base model
For DPO: loads the previously trained TwinLlama-3.1-8B from HuggingFace Hub
A fallback mechanism checks whether the SFT model exists on HuggingFace Hub and falls back to a public model if it does not
Max sequence length is configurable (default 2048 tokens)
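
The memory estimate and the model selection logic can be illustrated with a short sketch. It assumes roughly 8 billion parameters and counts weights only; the helper names, the injected repo_exists callable, and the fallback_repo argument are hypothetical stand-ins for whatever the real script uses. The Unsloth import is deferred because it needs a GPU environment.

```python
from typing import Callable


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def resolve_model_name(finetuning_type: str, workspace: str,
                       repo_exists: Callable[[str], bool],
                       fallback_repo: str) -> str:
    """Pick the checkpoint to load for the given stage."""
    if finetuning_type == "sft":
        return "meta-llama/Llama-3.1-8B"
    # DPO starts from the SFT output in the user's workspace, if it exists.
    candidate = f"{workspace}/TwinLlama-3.1-8B"
    return candidate if repo_exists(candidate) else fallback_repo


def load_quantized(model_name: str, max_seq_length: int = 2048):
    """Load model and tokenizer in 4-bit with Unsloth (GPU required)."""
    from unsloth import FastLanguageModel  # deferred import

    return FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=max_seq_length,
        load_in_4bit=True,
    )


print(weight_memory_gb(8e9, 16))  # 16-bit weights
print(weight_memory_gb(8e9, 4))   # 4-bit weights
```

The figures cover weights only; actual usage during training is higher once adapters, activations, and optimizer state are added.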

Step 3: LoRA Adapter Injection

Inject Low-Rank Adaptation (LoRA) matrices into the model's attention and feed-forward layers using Unsloth's get_peft_model. Only these small adapter weights are trained (on the order of 1% of total parameters at the default rank of 32), while the quantized base weights stay frozen. This preserves the base model's knowledge while enabling domain adaptation.

Key considerations:

Target modules: q_proj, k_proj, v_proj, up_proj, down_proj, o_proj, gate_proj
LoRA rank and alpha are both set to 32 by default
Dropout is set to 0.0 (no dropout on adapter weights)
The chat template is applied to the tokenizer (ChatML format)
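
To make the adapter size concrete, the sketch below counts the trainable LoRA parameters for the target modules listed above. Each adapted linear layer of shape (in, out) gains two matrices of sizes (in x r) and (r x out), so it adds r * (in + out) parameters. The layer shapes and the total parameter count are assumptions based on the published Llama-3.1-8B architecture (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers), and the function names are hypothetical.

```python
LORA_RANK = 32
LORA_ALPHA = 32
NUM_LAYERS = 32                 # assumption: Llama-3.1-8B decoder layers
TOTAL_PARAMS = 8.03e9           # assumption: approximate total parameter count

HIDDEN, KV_DIM, MLP_DIM = 4096, 1024, 14336  # assumed layer widths

# (input features, output features) for each targeted projection.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank: int, shapes: dict, num_layers: int) -> int:
    """Trainable parameters added by LoRA: r * (in + out) per adapted layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


def inject_adapters(model):
    """Attach LoRA adapters with Unsloth (GPU environment required)."""
    from unsloth import FastLanguageModel  # deferred import

    return FastLanguageModel.get_peft_model(
        model,
        r=LORA_RANK,
        lora_alpha=LORA_ALPHA,
        lora_dropout=0.0,
        target_modules=list(TARGET_SHAPES),
    )


trainable = lora_param_count(LORA_RANK, TARGET_SHAPES, NUM_LAYERS)
print(f"{trainable:,} trainable parameters "
      f"({trainable / TOTAL_PARAMS:.2%} of the assumed total)")
```

Under these assumptions, rank 32 across all seven projections gives roughly 84 million trainable parameters, about 1% of the model; halving the rank halves that figure.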

Step 4: Dataset Preparation

Load the training dataset from HuggingFace Hub and format it for the chosen fine-tuning approach. For SFT, the instruction dataset is combined with a public dataset (FineTome-Alpaca-100k) and formatted into the Alpaca prompt template. For DPO, the preference dataset is loaded and formatted into prompt/chosen/rejected triples. Both are split into train/test sets.

Key considerations:

SFT concatenates the custom llmtwin dataset with FineTome-Alpaca-100k (10K samples)
DPO uses only the llmtwin-dpo preference dataset
A dummy mode limits datasets to 400 samples for testing
EOS tokens are appended to ensure proper sequence termination
Train/test split uses a 95/5 ratio
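
The formatting and splitting steps might look like the following sketch. It uses plain Python lists so it runs without the datasets library; the template text is the widely used Alpaca-style prompt and is an assumption about the exact wording, as are the instruction/output field names and the function names.

```python
import random
from typing import Dict, List, Tuple

# Assumption: a common Alpaca-style template; the real wording may differ.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{}\n\n### Response:\n{}"
)


def format_sft(sample: Dict[str, str], eos_token: str) -> Dict[str, str]:
    """Render one instruction/answer pair and terminate it with EOS."""
    text = ALPACA_TEMPLATE.format(sample["instruction"], sample["output"])
    return {"text": text + eos_token}


def format_dpo(sample: Dict[str, str], eos_token: str) -> Dict[str, str]:
    """Build a prompt/chosen/rejected triple; only the answers get EOS."""
    return {
        "prompt": ALPACA_TEMPLATE.format(sample["prompt"], ""),
        "chosen": sample["chosen"] + eos_token,
        "rejected": sample["rejected"] + eos_token,
    }


def split_rows(rows: List[dict], test_ratio: float = 0.05, seed: int = 42,
               is_dummy: bool = False) -> Tuple[List[dict], List[dict]]:
    """Shuffle deterministically and split 95/5; dummy mode keeps 400 rows."""
    if is_dummy:
        rows = rows[:400]
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

With the datasets library, the same split is usually expressed as dataset.train_test_split(test_size=0.05); in dummy mode this yields 380 training and 20 test samples.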

Step 5: Training Execution

Execute the training loop using TRL's SFTTrainer (for SFT) or DPOTrainer (for DPO). Training uses an 8-bit AdamW optimizer, a linear learning rate schedule, and automatic mixed precision (bf16 where supported, fp16 otherwise). Progress and metrics are logged to Comet ML for experiment tracking.

Key considerations:

SFT uses SFTTrainer with sequence packing enabled for efficiency
DPO uses DPOTrainer with a configurable beta parameter (default 0.5)
Learning rate: 3e-4 for SFT, 2e-6 for DPO
Gradient accumulation steps: 8 (effective batch size = 16, which implies a per-device batch size of 2 on a single GPU)
Training reports to Comet ML for monitoring loss, learning rate, and throughput
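
The per-stage settings and the role of beta can be summarized in a small sketch. The StageConfig class and its field names are hypothetical, and the per-device batch size of 2 is inferred from the stated effective batch size rather than given directly. The dpo_loss function shows the standard sigmoid form of the DPO objective for a single preference pair, given summed log-probabilities from the policy and the frozen reference model; it illustrates the formula and is not TRL's implementation.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    learning_rate: float
    per_device_train_batch_size: int = 2   # assumption: inferred from 16 / 8
    gradient_accumulation_steps: int = 8
    optim: str = "adamw_8bit"
    lr_scheduler_type: str = "linear"
    beta: Optional[float] = None           # only used by DPO

    def effective_batch_size(self, num_gpus: int = 1) -> int:
        """Samples contributing to each optimizer step."""
        return (self.per_device_train_batch_size
                * self.gradient_accumulation_steps * num_gpus)


SFT_CONFIG = StageConfig(learning_rate=3e-4)
DPO_CONFIG = StageConfig(learning_rate=2e-6, beta=0.5)


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float) -> float:
    """-log(sigmoid(beta * margin)) for one preference pair."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


print(SFT_CONFIG.effective_batch_size())
print(dpo_loss(-1.0, -3.0, -2.0, -2.0, DPO_CONFIG.beta))
```

A larger beta penalizes deviation from the reference model more strongly, which is one reason the DPO stage pairs it with a much smaller learning rate than SFT.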

Step 6: Inference Validation

Run a quick inference check on the trained model using a sample prompt to verify the model generates coherent output. This uses Unsloth's inference mode with text streaming.

Key considerations:

A fixed validation prompt tests basic generation capability
Text streaming provides real-time output for quick visual inspection
This is a smoke test, not a comprehensive evaluation
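
A smoke test along these lines could be sketched as below. The prompt template is assumed to be the same Alpaca-style format used for SFT, the function names and default token limit are hypothetical, and the Unsloth and transformers imports are deferred because generation needs a GPU.

```python
# Assumption: the same Alpaca-style template used during SFT formatting.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{}\n\n### Response:\n{}"
)


def build_validation_prompt(instruction: str) -> str:
    """Leave the response slot empty so the model has to fill it in."""
    return ALPACA_TEMPLATE.format(instruction, "")


def smoke_test(model, tokenizer, instruction: str,
               max_new_tokens: int = 256) -> None:
    """Stream one generation to stdout for visual inspection (GPU required)."""
    from unsloth import FastLanguageModel  # deferred imports
    from transformers import TextStreamer

    FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode
    inputs = tokenizer(
        [build_validation_prompt(instruction)], return_tensors="pt"
    ).to("cuda")
    streamer = TextStreamer(tokenizer)
    model.generate(**inputs, streamer=streamer,
                   max_new_tokens=max_new_tokens, use_cache=True)
```

Because the output is only streamed and not scored, a pass here indicates that the model loads and decodes, not that it answers well.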

Step 7: Model Saving and Publishing

Merge the LoRA adapter weights back into the base model at 16-bit precision and save the result. The merged model is both saved locally and pushed to HuggingFace Hub for downstream consumption (evaluation, deployment).

Key considerations:

Models are saved using Unsloth's save_pretrained_merged with merged_16bit method
SFT output is pushed as TwinLlama-3.1-8B
DPO output is pushed as TwinLlama-3.1-8B-DPO
The HuggingFace workspace is determined by the authenticated user's account
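
The naming and publishing logic can be sketched as follows. The function names are hypothetical; save_pretrained_merged and push_to_hub_merged are the Unsloth methods for merged exports, and whoami comes from huggingface_hub. The network and GPU dependent calls are kept inside functions so the naming helper can be checked on its own.

```python
OUTPUT_NAMES = {
    "sft": "TwinLlama-3.1-8B",
    "dpo": "TwinLlama-3.1-8B-DPO",
}


def output_repo_id(workspace: str, finetuning_type: str) -> str:
    """Hub repository that receives the merged model for a given stage."""
    return f"{workspace}/{OUTPUT_NAMES[finetuning_type]}"


def current_workspace() -> str:
    """Account name of the authenticated HuggingFace user (needs a token)."""
    from huggingface_hub import whoami  # deferred import

    return whoami()["name"]


def save_and_publish(model, tokenizer, output_dir: str,
                     finetuning_type: str) -> str:
    """Merge adapters at 16-bit, save locally, then push to the Hub."""
    repo_id = output_repo_id(current_workspace(), finetuning_type)
    model.save_pretrained_merged(output_dir, tokenizer,
                                 save_method="merged_16bit")
    model.push_to_hub_merged(repo_id, tokenizer,
                             save_method="merged_16bit")
    return repo_id
```

Keeping the SFT and DPO outputs in separate repositories lets the DPO run load the SFT result by name and lets downstream evaluation compare the two.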

Execution Diagram

GitHub URL

Workflow Repository