Implementation:PacktPublishing LLM Engineers Handbook Run Finetuning On Sagemaker

From Leeroopedia


Field | Value
--- | ---
Implementation Name | Run Finetuning On Sagemaker
Type | API Doc
Source File | llm_engineering/model/finetuning/sagemaker.py:L17-69
Workflow | LLM_Finetuning
Repo | PacktPublishing/LLM-Engineers-Handbook
Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_SageMaker_Training_Orchestration

Function Signature

def run_finetuning_on_sagemaker(
    finetuning_type: str,
    num_train_epochs: int,
    per_device_train_batch_size: int,
    learning_rate: float,
    dataset_huggingface_workspace: str,
    is_dummy: bool,
) -> None

Import

from llm_engineering.model.finetuning.sagemaker import run_finetuning_on_sagemaker

Description

This function submits an LLM fine-tuning job to AWS SageMaker. It builds a HuggingFace estimator from the SageMaker Python SDK, configured with the instance type, framework versions, hyperparameters, dependencies, and entry point, then calls .fit() to launch the managed training job.

The function does not perform any training itself; it delegates execution to SageMaker, which provisions a GPU instance, sets up the container, and runs the finetune.py entry point script.

Parameters

Parameter | Type | Default | Description
--- | --- | --- | ---
finetuning_type | str | "sft" | Type of fine-tuning to perform. Either "sft" (Supervised Fine-Tuning) or "dpo" (Direct Preference Optimization).
num_train_epochs | int | 3 | Number of training epochs.
per_device_train_batch_size | int | 2 | Batch size per GPU device.
learning_rate | float | 3e-4 | Learning rate for the optimizer.
dataset_huggingface_workspace | str | | HuggingFace workspace containing the training dataset.
is_dummy | bool | False | If True, runs a minimal training job for testing purposes.
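
The parameter descriptions restrict finetuning_type to two values, while the signature types it as a plain str, so an invalid value would only surface once the remote job starts. The following is a minimal sketch of a pre-submission check, assuming a hypothetical helper named build_hyperparameters that is not part of the repository. It validates the arguments locally and returns a dictionary with the same keys the estimator receives.

```python
from typing import Any, Dict

# Values the parameter descriptions list for finetuning_type.
ALLOWED_FINETUNING_TYPES = ("sft", "dpo")


def build_hyperparameters(
    finetuning_type: str,
    num_train_epochs: int,
    per_device_train_batch_size: int,
    learning_rate: float,
    dataset_huggingface_workspace: str,
    is_dummy: bool,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments, then build the hyperparameter dict."""
    if finetuning_type not in ALLOWED_FINETUNING_TYPES:
        raise ValueError(
            f"finetuning_type must be one of {ALLOWED_FINETUNING_TYPES}, "
            f"got {finetuning_type!r}"
        )
    if num_train_epochs < 1:
        raise ValueError("num_train_epochs must be at least 1")
    if per_device_train_batch_size < 1:
        raise ValueError("per_device_train_batch_size must be at least 1")
    if learning_rate <= 0:
        raise ValueError("learning_rate must be positive")
    if not dataset_huggingface_workspace:
        raise ValueError("dataset_huggingface_workspace must not be empty")

    # Same keys as the hyperparameters dict passed to the estimator.
    return {
        "finetuning_type": finetuning_type,
        "num_train_epochs": num_train_epochs,
        "per_device_train_batch_size": per_device_train_batch_size,
        "learning_rate": learning_rate,
        "dataset_huggingface_workspace": dataset_huggingface_workspace,
        "is_dummy": is_dummy,
    }
```

Failing fast on the caller's machine avoids paying for instance provisioning only to have the job exit on a typo such as "stf".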

Returns

None. The function submits the job and, because .fit() waits for the job by default, blocks until training completes. SageMaker saves the model artifacts to S3.
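
Because the function returns None, a caller has no direct handle on the job name or the artifact location. The sketch below shows one way a wrapper could surface both. It assumes a hypothetical helper, fit_and_report, that receives an estimator object exposing fit(wait=...), latest_training_job.name, and model_data, as estimators in the SageMaker Python SDK do. It is not how the documented function is implemented.

```python
from typing import Any, Optional, Tuple


def fit_and_report(estimator: Any, wait: bool = True) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: launch the job and return (job name, artifact URI or None).

    `estimator` is assumed to behave like a SageMaker SDK estimator: it must
    expose fit(wait=...), latest_training_job.name, and model_data.
    """
    # wait=True blocks until the training job ends; wait=False returns
    # as soon as the job has been submitted.
    estimator.fit(wait=wait)
    job_name = estimator.latest_training_job.name

    # Artifacts exist only after the job has finished, so the lookup is
    # skipped for a non-blocking submission.
    artifact_uri = estimator.model_data if wait else None
    return job_name, artifact_uri
```

With wait=False the helper returns immediately with the job name, which can then be used to follow the job in the SageMaker console.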

Key Implementation Details

SageMaker Estimator Configuration

from pathlib import Path

from sagemaker.huggingface import HuggingFace

# Excerpt from the function body: `settings` is the project's settings object,
# and the hyperparameter values are the function's own arguments.
huggingface_estimator = HuggingFace(
    entry_point="finetune.py",  # script executed inside the training container
    source_dir=str(Path(__file__).resolve().parent),  # directory that holds finetune.py
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={
        "finetuning_type": finetuning_type,
        "num_train_epochs": num_train_epochs,
        "per_device_train_batch_size": per_device_train_batch_size,
        "learning_rate": learning_rate,
        "dataset_huggingface_workspace": dataset_huggingface_workspace,
        "is_dummy": is_dummy,
    },
    role=settings.AWS_ARN_ROLE,  # IAM role the training job runs under
    environment={
        "HUGGING_FACE_HUB_TOKEN": settings.HUGGINGFACE_ACCESS_TOKEN,
        "COMET_API_KEY": settings.COMET_API_KEY,
        "COMET_PROJECT": settings.COMET_PROJECT,
        "COMET_WORKSPACE": settings.COMET_WORKSPACE,
    },
)
# Submit the job; blocks until it finishes because fit() waits by default.
huggingface_estimator.fit()

Key Aspects

  • Instance type: ml.g5.2xlarge provides a single NVIDIA A10G GPU with 24 GB of VRAM.
  • Entry point: finetune.py is the script that runs inside the SageMaker container.
  • Source directory: The directory containing sagemaker.py (the finetuning/ directory) is packaged and uploaded so that its contents are available inside the container.
  • Environment variables: The Hugging Face access token and the Comet ML API key, project, and workspace are passed to the container as environment variables rather than hard-coded in the training script.
  • Hyperparameters: Passed as a dictionary; SageMaker injects them into the entry point as command-line arguments.
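
Since hyperparameters reach the entry point as command-line arguments, the script has to parse them back into typed values. The following is a minimal sketch of that receiving side, not the repository's actual finetune.py. The function names parse_hyperparameters and str_to_bool are hypothetical, the defaults mirror the parameter list on this page, and dataset_huggingface_workspace is assumed to be required because no default is listed for it.

```python
import argparse
from typing import List, Optional


def str_to_bool(value: str) -> bool:
    """Convert a CLI string such as "True" or "false" to a bool.

    Command-line values arrive as strings, and bool("False") is True in
    Python, so a plain type=bool would misread a disabled flag.
    """
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_hyperparameters(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Hypothetical parser for the hyperparameters listed on this page."""
    parser = argparse.ArgumentParser(description="Fine-tuning entry point (sketch)")
    parser.add_argument("--finetuning_type", choices=["sft", "dpo"], default="sft")
    parser.add_argument("--num_train_epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--learning_rate", type=float, default=3e-4)
    # Assumption: required, since the page lists no default for it.
    parser.add_argument("--dataset_huggingface_workspace", type=str, required=True)
    parser.add_argument("--is_dummy", type=str_to_bool, default=False)
    # argv=None makes argparse read sys.argv[1:], as it would in the container.
    return parser.parse_args(argv)
```

Passing an explicit list, for example parse_hyperparameters(["--finetuning_type", "dpo", "--dataset_huggingface_workspace", "my-hf-workspace", "--is_dummy", "False"]), makes the parser easy to exercise locally before any job is submitted.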

External Dependencies

Package | Purpose
--- | ---
sagemaker | AWS SageMaker Python SDK for job submission
huggingface_hub | Model/dataset access tokens
loguru | Structured logging

Usage Example

from llm_engineering.model.finetuning.sagemaker import run_finetuning_on_sagemaker

# Launch an SFT fine-tuning job on SageMaker
run_finetuning_on_sagemaker(
    finetuning_type="sft",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=3e-4,
    dataset_huggingface_workspace="my-hf-workspace",
    is_dummy=False,
)
