Environment:PacktPublishing LLM Engineers Handbook Unsloth Finetuning Environment
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, LLMs, Finetuning |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
GPU-accelerated fine-tuning environment with Unsloth, Flash Attention, LoRA/QLoRA, and TRL for SFT and DPO training of Llama 3.1 8B.
Description
This environment provides the complete fine-tuning stack running inside a SageMaker training container. It uses Unsloth for optimized model loading and patching, Flash Attention 2 for efficient attention computation, PEFT for LoRA adapter injection, and TRL for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). The environment targets PyTorch 2.4.0 (different from the local development PyTorch 2.2.2) and includes bitsandbytes for quantization support.
Usage
Use this environment exclusively for the LLM Finetuning workflow. It runs inside a SageMaker `ml.g5.2xlarge` training job and is installed via a separate `requirements.txt` uploaded with the training script. The environment handles loading the base Llama 3.1 8B model, injecting LoRA adapters, running SFT or DPO training, and merging/saving the final model.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | NVIDIA GPU with CUDA support | Minimum 24GB VRAM (A10G via ml.g5.2xlarge) |
| CUDA | Compatible with PyTorch 2.4.0 | CUDA 11.8 or 12.x |
| Runtime | SageMaker Training Container | PyTorch 2.1 base image, Python 3.10; `requirements.txt` upgrades torch to 2.4.0 |
| Disk | ~30GB | Model weights + training artifacts |
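A rough back-of-envelope calculation shows why 24GB of VRAM is enough for QLoRA fine-tuning of an 8B model. All figures below are assumptions for illustration (4-bit NF4 base weights, an assumed adapter size, and a lump-sum overhead for activations and CUDA buffers), not measured values:

```python
# Back-of-envelope VRAM estimate for QLoRA fine-tuning of an 8B model.
# Every figure here is a rough assumption, not a measured value.

def qlora_vram_estimate_gb(
    n_params: float = 8e9,      # Llama 3.1 8B base weights
    weight_bits: int = 4,       # 4-bit quantized base model
    lora_params: float = 42e6,  # assumed adapter size (depends on rank/targets)
    overhead_gb: float = 6.0,   # assumed activations, CUDA context, buffers
) -> float:
    weights_gb = n_params * weight_bits / 8 / 1e9
    # LoRA adapters train in 16-bit; AdamW keeps two 32-bit states per param.
    adapters_gb = lora_params * (2 + 4 + 4) / 1e9
    return weights_gb + adapters_gb + overhead_gb

print(round(qlora_vram_estimate_gb(), 1))  # ~10.4 GB, well under the A10G's 24GB
```

The overhead term dominates the uncertainty: longer sequences or larger batches inflate activation memory quickly, which is why the OOM mitigations below target batch size first.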
Dependencies
Python Packages (requirements.txt)
- `accelerate==0.33.0`
- `torch==2.4.0`
- `transformers==4.43.3`
- `datasets==2.20.0`
- `peft==0.12.0`
- `trl==0.9.6`
- `bitsandbytes==0.43.3`
- `comet-ml==3.44.3`
- `flash-attn==2.3.6`
- `unsloth==2024.9.post2`
Credentials
The following environment variables are injected into the SageMaker training container:
- `HUGGING_FACE_HUB_TOKEN`: HuggingFace token for model downloads
- `COMET_API_KEY`: Comet ML key for experiment tracking
- `COMET_PROJECT_NAME`: Comet ML project name
- `SM_OUTPUT_DATA_DIR`: SageMaker output directory (auto-set)
- `SM_MODEL_DIR`: SageMaker model directory (auto-set)
- `SM_NUM_GPUS`: Number of GPUs (auto-set)
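The auto-set `SM_*` variables are typically consumed as argparse defaults so the training script needs no hard-coded paths. This sketch uses `os.environ.get` with illustrative local fallbacks (a variant of the hard-required `os.environ[...]` lookup shown in the code evidence) so the same script can also run outside SageMaker; the fallback paths are assumptions:

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    # SageMaker injects SM_OUTPUT_DATA_DIR, SM_MODEL_DIR, and SM_NUM_GPUS.
    # The .get() fallbacks are illustrative defaults for local runs.
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_data_dir", type=str,
                        default=os.environ.get("SM_OUTPUT_DATA_DIR", "./output"))
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--n_gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "1")))
    return parser

args = build_parser().parse_args([])
print(args.output_data_dir, args.model_dir, args.n_gpus)
```

Inside the container the environment variables always win, so the command line stays empty and SageMaker controls all three paths.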
Quick Install
# These packages are installed automatically inside the SageMaker container.
# For local testing (requires CUDA GPU):
pip install accelerate==0.33.0 torch==2.4.0 transformers==4.43.3 \
datasets==2.20.0 peft==0.12.0 trl==0.9.6 bitsandbytes==0.43.3 \
comet-ml==3.44.3 flash-attn==2.3.6 unsloth==2024.9.post2
Code Evidence
Unsloth imports from `llm_engineering/model/finetuning/finetune.py:5,17-18`:
from unsloth import PatchDPOTrainer
from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template
SageMaker environment variable access from `llm_engineering/model/finetuning/finetune.py:257-259`:
parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
CUDA device usage from `llm_engineering/model/finetuning/finetune.py:212`:
inputs = tokenizer([message], return_tensors="pt").to("cuda")
Requirements file from `llm_engineering/model/finetuning/requirements.txt`:
accelerate==0.33.0
torch==2.4.0
transformers==4.43.3
datasets==2.20.0
peft==0.12.0
trl==0.9.6
bitsandbytes==0.43.3
comet-ml==3.44.3
flash-attn==2.3.6
unsloth==2024.9.post2
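Because the whole stack rides on exact pins (e.g. torch 2.4.0 vs. the local 2.2.2), it can be worth validating that every line in the uploaded requirements file is pinned with `==` before launching a training job. A minimal sketch, inlining the file contents shown above:

```python
# Minimal check that a requirements list is fully pinned with '==',
# e.g. before uploading it with a SageMaker training job (sketch).

REQUIREMENTS = """\
accelerate==0.33.0
torch==2.4.0
transformers==4.43.3
datasets==2.20.0
peft==0.12.0
trl==0.9.6
bitsandbytes==0.43.3
comet-ml==3.44.3
flash-attn==2.3.6
unsloth==2024.9.post2
"""

def parse_pins(text: str) -> dict:
    """Map package name -> pinned version; reject unpinned lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        if not version:
            raise ValueError(f"unpinned requirement: {line}")
        pins[name] = version
    return pins

print(parse_pins(REQUIREMENTS)["torch"])  # 2.4.0
```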
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Model + training data exceeds VRAM | Reduce `per_device_train_batch_size` or use gradient checkpointing |
| `flash-attn installation failed` | Missing CUDA toolkit headers | Ensure CUDA development toolkit is installed on host |
| `FileNotFoundError: requirements.txt` | Requirements file path incorrect | Verify `finetuning_requirements_path` in sagemaker.py |
| `ImportError: unsloth` | Unsloth not installed in container | Check that requirements.txt is correctly uploaded with training job |
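For the OOM case, reducing `per_device_train_batch_size` usually goes hand in hand with raising `gradient_accumulation_steps`, so the optimizer still sees the same number of samples per update. A small sketch of that trade-off with hypothetical numbers (the real values live in the training configuration):

```python
# Keep the effective batch size constant while shrinking the per-device
# batch to fit in VRAM. Numbers are hypothetical.

def rebalance(per_device: int, grad_accum: int, shrink: int = 2) -> tuple:
    """Divide the per-device batch by `shrink` and multiply accumulation
    steps by the same factor, preserving per_device * grad_accum."""
    assert per_device % shrink == 0, "per-device batch must divide evenly"
    return per_device // shrink, grad_accum * shrink

pd, ga = rebalance(per_device=4, grad_accum=2)
print(pd, ga, pd * ga)  # 2 4 8 -> same effective batch of 8, less peak VRAM
```

Gradient checkpointing is the complementary lever: it trades recomputation time for activation memory without touching the batch math at all.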
Compatibility Notes
- PyTorch Version Mismatch: Local development uses PyTorch 2.2.2, but the SageMaker training container uses PyTorch 2.4.0. This is intentional: Unsloth and flash-attn require the newer version.
- flash-attn: Compiles from source when no pre-built wheel matches the installed CUDA/PyTorch combination, which requires the CUDA toolkit headers on the host. Pre-built wheels are available for common CUDA versions.
- bfloat16: The training code auto-detects bfloat16 support via `is_bfloat16_supported()` and falls back to fp16 if unavailable.
- Unsloth: Provides optimized forward/backward passes for Llama models, reducing memory usage and training time.
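The bfloat16 fallback described above reduces to a simple precision switch. In the real code the decision comes from Unsloth's `is_bfloat16_supported()`; here it is a plain boolean so the logic is visible without a GPU (a sketch, not the actual implementation):

```python
# Sketch of the bf16 -> fp16 fallback. The real check is unsloth's
# is_bfloat16_supported(); a plain boolean stands in for it here.

def pick_mixed_precision(bf16_supported: bool) -> dict:
    """Exactly one of bf16/fp16 is enabled, matching trainer-style flags."""
    return {"bf16": bf16_supported, "fp16": not bf16_supported}

print(pick_mixed_precision(True))   # {'bf16': True, 'fp16': False}
print(pick_mixed_precision(False))  # {'bf16': False, 'fp16': True}
```

bfloat16 is preferred on Ampere-class GPUs such as the A10G because its wider exponent range avoids the loss-scaling machinery fp16 needs.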
Related Pages
- Implementation:PacktPublishing_LLM_Engineers_Handbook_FastLanguageModel_From_Pretrained
- Implementation:PacktPublishing_LLM_Engineers_Handbook_FastLanguageModel_Get_Peft_Model
- Implementation:PacktPublishing_LLM_Engineers_Handbook_SFTTrainer_Train
- Implementation:PacktPublishing_LLM_Engineers_Handbook_FastLanguageModel_For_Inference
- Implementation:PacktPublishing_LLM_Engineers_Handbook_Save_Pretrained_Merged
- Implementation:PacktPublishing_LLM_Engineers_Handbook_HuggingFace_Load_Dataset