Workflow:PacktPublishing LLM Engineers Handbook LLM Finetuning

From Leeroopedia


Knowledge Sources
Domains: LLMs, Fine_Tuning, LLM_Ops
Last Updated: 2026-02-08 07:45 GMT

Overview

End-to-end process for fine-tuning Llama-3.1-8B using a two-stage approach (SFT then DPO) with QLoRA on AWS SageMaker, producing a personalized "LLM Twin" model pushed to HuggingFace Hub.

Description

This workflow fine-tunes a base Llama-3.1-8B model in two sequential stages. First, Supervised Fine-Tuning (SFT) adapts the base model to follow instructions using the generated instruction dataset combined with a public fine-tuning dataset. Second, Direct Preference Optimization (DPO) aligns the SFT model with human preferences using the generated preference dataset. Both stages use Unsloth for memory-efficient 4-bit quantization and LoRA adapter injection, and are executed as SageMaker training jobs orchestrated by ZenML. The final merged models are pushed to HuggingFace Hub.

Usage

Execute this workflow after the Dataset Generation pipeline has published both the instruction dataset and the preference dataset to HuggingFace Hub. You need AWS SageMaker configured with access to a GPU instance (ml.g5.2xlarge). Run the SFT stage first, then switch the configuration to DPO and run the workflow again, because DPO starts from the SFT model output.

Execution Steps

Step 1: SageMaker Job Configuration

Configure and launch a SageMaker training job through ZenML. The pipeline creates a HuggingFace SageMaker estimator with the fine-tuning script, dependencies, and hyperparameters (learning rate, epochs, batch size, fine-tuning type). Environment variables for HuggingFace and Comet ML authentication are injected into the training container.

Key considerations:

  • Instance type is ml.g5.2xlarge (24GB GPU VRAM)
  • The finetuning_type parameter controls whether SFT or DPO training runs
  • Hyperparameters are passed from the YAML config through ZenML to SageMaker
  • Comet ML experiment tracking is enabled via environment variables
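
As a rough illustration of how these settings might be assembled before they reach SageMaker, the sketch below builds the hyperparameter and environment dictionaries in plain Python. The helper name `build_training_job_config`, the script name, the environment variable names, and the epoch and batch-size values are assumptions made for illustration and are not taken from the workflow repository. The learning rates and instance type follow the values stated on this page.

```python
from typing import Any

# Learning rates per stage, as stated in Step 5 of this page.
LEARNING_RATES = {"sft": 3e-4, "dpo": 2e-6}


def build_training_job_config(
    finetuning_type: str,
    hf_token: str,
    comet_api_key: str,
    num_train_epochs: int = 3,       # assumed value, for illustration only
    per_device_batch_size: int = 2,  # assumed value, for illustration only
) -> dict[str, Any]:
    """Hypothetical helper: collect what a SageMaker training job would need."""
    if finetuning_type not in LEARNING_RATES:
        raise ValueError(
            f"finetuning_type must be 'sft' or 'dpo', got {finetuning_type!r}"
        )

    return {
        "entry_point": "finetune.py",  # assumed script name
        "instance_type": "ml.g5.2xlarge",
        "instance_count": 1,
        # Passed to the training script as command-line arguments.
        "hyperparameters": {
            "finetuning_type": finetuning_type,
            "learning_rate": LEARNING_RATES[finetuning_type],
            "num_train_epochs": num_train_epochs,
            "per_device_train_batch_size": per_device_batch_size,
        },
        # Injected into the training container as environment variables.
        "environment": {
            "HF_TOKEN": hf_token,            # assumed variable name
            "COMET_API_KEY": comet_api_key,  # assumed variable name
        },
    }


config = build_training_job_config("sft", hf_token="hf_xxx", comet_api_key="comet_xxx")
print(config["hyperparameters"])
```

The keys of the returned dictionary are chosen to line up with the `entry_point`, `instance_type`, `instance_count`, `hyperparameters`, and `environment` arguments of the SageMaker Python SDK's `HuggingFace` estimator, so a pipeline step could forward them largely unchanged. Switching from SFT to DPO then amounts to changing a single argument.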

Step 2: Model Loading with Quantization

Load the base model (Llama-3.1-8B for SFT, or the SFT output model for DPO) using Unsloth's FastLanguageModel with 4-bit quantization. This reduces the memory needed for the model weights from ~16GB (at 16-bit precision) to ~4GB, which makes training feasible on a single 24GB GPU, including consumer-grade cards.
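
The memory figures follow from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate for the weights only; the parameter count of 8.03 billion is an approximation, and the estimate ignores quantization constants, layers that stay in higher precision, activations, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8.03e9  # approximate parameter count of Llama-3.1-8B

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f" 4-bit weights: ~{int4_gb:.1f} GB")
```

Real usage during training is higher than the 4-bit figure, since the LoRA adapters, gradients, and activations all need room as well.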

Key considerations:

  • For SFT: loads meta-llama/Llama-3.1-8B as the base model
  • For DPO: loads the previously trained TwinLlama-3.1-8B from HuggingFace Hub
  • A fallback mechanism checks whether the SFT model exists on HuggingFace Hub and falls back to a publicly available model if it does not
  • Max sequence length is configurable (default 2048 tokens)
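
The model selection logic can be sketched as a small pure function. This is a minimal illustration, not the repository's code: the function name, the `repo_exists` callback, and the fallback repository id `example-org/TwinLlama-3.1-8B` are hypothetical placeholders.

```python
from typing import Callable

BASE_MODEL = "meta-llama/Llama-3.1-8B"
PUBLIC_FALLBACK = "example-org/TwinLlama-3.1-8B"  # hypothetical placeholder id


def resolve_model_name(
    finetuning_type: str,
    workspace: str,
    repo_exists: Callable[[str], bool],
    fallback: str = PUBLIC_FALLBACK,
) -> str:
    """Pick the checkpoint to load for the given stage (hypothetical helper)."""
    if finetuning_type == "sft":
        # SFT always starts from the base model.
        return BASE_MODEL
    if finetuning_type == "dpo":
        # DPO starts from the SFT output if it has been published.
        candidate = f"{workspace}/TwinLlama-3.1-8B"
        return candidate if repo_exists(candidate) else fallback
    raise ValueError(f"unknown finetuning_type: {finetuning_type!r}")


# Example with a stand-in existence check that always reports "not found".
print(resolve_model_name("dpo", "alice", repo_exists=lambda repo_id: False))
```

Passing the existence check in as a callback keeps the function testable offline; in a real run it could wrap a Hub client call such as `HfApi().repo_exists`, assuming the installed `huggingface_hub` version provides it.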

Step 3: LoRA Adapter Injection

Inject Low-Rank Adaptation (LoRA) matrices into the model's attention and feed-forward layers using Unsloth's get_peft_model. Only these small adapter weights (on the order of 1% of total parameters) are trained, preserving the base model's knowledge while enabling domain adaptation.

Key considerations:

  • Target modules: q_proj, k_proj, v_proj, up_proj, down_proj, o_proj, gate_proj
  • LoRA rank and alpha are both set to 32 by default
  • Dropout is set to 0.0 (no dropout on adapter weights)
  • The chat template is applied to the tokenizer (ChatML format)
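
To get a feel for how small the adapter is relative to the base model, the sketch below counts the trainable LoRA parameters for the seven target modules at rank 32. Each adapted linear layer with input size d_in and output size d_out adds rank × (d_in + d_out) parameters. The layer dimensions are assumptions taken from the publicly documented Llama-3.1-8B architecture (hidden size 4096, MLP size 14336, 32 layers, grouped-query attention with a 1024-wide key/value projection), and the total parameter count is approximate.

```python
HIDDEN = 4096         # model hidden size (assumed from the public config)
INTERMEDIATE = 14336  # feed-forward inner size (assumed from the public config)
KV_DIM = 1024         # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32
TOTAL_PARAMS = 8.03e9  # approximate size of the base model

# (input features, output features) of each targeted linear layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to every targeted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


trainable = lora_trainable_params(rank=32)
print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of base model: {trainable / TOTAL_PARAMS:.2%}")
```

Under these assumptions, rank 32 yields about 84 million trainable parameters, roughly 1% of the base model. The count scales linearly with rank, so halving the rank halves the adapter size.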

Step 4: Dataset Preparation

Load the training dataset from HuggingFace Hub and format it for the chosen fine-tuning approach. For SFT, the instruction dataset is combined with a public dataset (FineTome-Alpaca-100k) and formatted into the Alpaca prompt template. For DPO, the preference dataset is loaded and formatted into prompt/chosen/rejected triples. Both are split into train/test sets.

Key considerations:

  • SFT concatenates the custom llmtwin dataset with a 10K-sample subset of FineTome-Alpaca-100k
  • DPO uses only the llmtwin-dpo preference dataset
  • A dummy mode limits datasets to 400 samples for testing
  • EOS tokens are appended to ensure proper sequence termination
  • Train/test split uses a 95/5 ratio
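
The SFT formatting and split can be sketched without any external libraries. The template wording below is one common variant of the Alpaca prompt and may differ from the one in the repository; the field names `instruction` and `output`, the EOS string, and the seed are assumptions. In a real run the EOS string would come from the tokenizer.

```python
import random

# One common variant of the Alpaca prompt (assumed, may differ from the repository).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)
EOS_TOKEN = "<|end_of_text|>"  # assumed; in practice read it from the tokenizer


def format_sft_samples(samples: list[dict], eos_token: str = EOS_TOKEN) -> list[str]:
    """Render each sample into the prompt template and append the EOS token."""
    return [
        ALPACA_TEMPLATE.format(instruction=s["instruction"], output=s["output"]) + eos_token
        for s in samples
    ]


def train_test_split(items: list, test_fraction: float = 0.05, seed: int = 42):
    """Shuffle deterministically, then hold out `test_fraction` of the items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


samples = [{"instruction": f"Question {i}", "output": f"Answer {i}"} for i in range(100)]
texts = format_sft_samples(samples)
train, test = train_test_split(texts)
print(len(train), len(test))
```

Without the trailing EOS token the model never sees an example of where a response ends, which tends to produce generations that run on until the token limit.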

Step 5: Training Execution

Execute the training loop using TRL's SFTTrainer (for SFT) or DPOTrainer (for DPO). Training uses the 8-bit AdamW optimizer, a linear learning rate schedule, and automatic mixed precision (bf16 where supported). Progress and metrics are logged to Comet ML for experiment tracking.
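
A linear learning rate schedule ramps the rate up during a short warmup and then decays it linearly to zero by the final step. The sketch below reproduces that shape in plain Python; the warmup length of 10 steps and the total of 100 steps are assumptions for illustration.

```python
def linear_lr(step: int, total_steps: int, peak_lr: float, warmup_steps: int = 10) -> float:
    """Linear warmup to `peak_lr`, then linear decay to zero at `total_steps`."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# SFT peak learning rate from this page, over an assumed 100 optimizer steps.
for step in (0, 5, 10, 55, 100):
    print(step, f"{linear_lr(step, total_steps=100, peak_lr=3e-4):.2e}")
```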

Key considerations:

  • SFT uses SFTTrainer with sequence packing enabled for efficiency
  • DPO uses DPOTrainer with a configurable beta parameter (default 0.5)
  • Learning rate: 3e-4 for SFT, 2e-6 for DPO
  • Gradient accumulation steps: 8 (effective batch size = 16, which implies a per-device batch size of 2 on a single GPU)
  • Training reports to Comet ML for monitoring loss, learning rate, and throughput
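
To show what the beta parameter controls, the sketch below computes the standard sigmoid form of the DPO loss for a single preference pair from summed log-probabilities. It is a scalar illustration of the objective, not the trainer's implementation, and the log-probability values are made up.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.5,
) -> float:
    """-log(sigmoid(beta * margin)) for one (prompt, chosen, rejected) triple."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return softplus(-margin)  # equals -log(sigmoid(margin))


# Policy identical to the reference model: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen answer: loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

A larger beta scales the margin more aggressively, which penalizes drift from the reference model more strongly for the same change in log-probabilities.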

Step 6: Inference Validation

Run a quick inference check on the trained model using a sample prompt to verify the model generates coherent output. This uses Unsloth's inference mode with text streaming.

Key considerations:

  • A fixed validation prompt tests basic generation capability
  • Text streaming provides real-time output for quick visual inspection
  • This is a smoke test, not a comprehensive evaluation
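
The shape of such a smoke test can be sketched independently of the model. In the example below, `fake_model` is a stand-in generator that yields text pieces; in a real run the pieces would come from the fine-tuned model's generation call with a text streamer attached. The function names and the minimum-length check are assumptions for illustration.

```python
import sys
from typing import Callable, Iterable


def smoke_test(
    stream_tokens: Callable[[str], Iterable[str]],
    prompt: str,
    min_chars: int = 1,
) -> str:
    """Stream output to stdout and fail loudly if nothing visible is produced."""
    pieces = []
    for piece in stream_tokens(prompt):
        sys.stdout.write(piece)  # show text as soon as it is produced
        sys.stdout.flush()
        pieces.append(piece)
    sys.stdout.write("\n")

    text = "".join(pieces)
    if len(text.strip()) < min_chars:
        raise RuntimeError("model produced no visible output")
    return text


def fake_model(prompt: str) -> Iterable[str]:
    """Stand-in for a real model: yields a fixed sequence of text pieces."""
    yield from ["Hello", ",", " world"]


result = smoke_test(fake_model, "Write a short greeting.")
```

A check like this only catches gross failures such as empty or crashed generation; output quality still needs a separate evaluation step.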

Step 7: Model Saving and Publishing

Merge the LoRA adapter weights back into the base model at 16-bit precision and save the result. The merged model is both saved locally and pushed to HuggingFace Hub for downstream consumption (evaluation, deployment).

Key considerations:

  • Models are saved using Unsloth's save_pretrained_merged with the merged_16bit save method
  • SFT output is pushed as TwinLlama-3.1-8B
  • DPO output is pushed as TwinLlama-3.1-8B-DPO
  • The HuggingFace workspace is determined by the authenticated user's account
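
Merging folds each adapter into its base weight matrix as W_merged = W + (alpha / rank) × B × A, so the published model needs no adapter code at inference time. The toy example below checks on tiny hand-written matrices that the merged weight gives the same output as the base weight plus the adapter path. It illustrates the arithmetic only and is not how Unsloth implements the merge; the matrix values and the toy alpha and rank are made up.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: W is 2x3 (out x in), A is 1x3 (rank x in), B is 2x1 (out x rank).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
x = [[1.0], [1.0], [1.0]]  # input as a column vector

merged = merge_lora(W, A, B, alpha=2.0, rank=1)

# Output through the merged weight versus base path plus scaled adapter path.
y_merged = matmul(merged, x)
y_base = matmul(W, x)
y_adapter = matmul(B, matmul(A, x))
y_unmerged = [[b[0] + 2.0 * a[0]] for b, a in zip(y_base, y_adapter)]

print(y_merged, y_unmerged)
```

With the defaults described on this page (rank and alpha both 32) the scale factor alpha / rank is 1, so the merge reduces to adding B × A to each targeted weight.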
