Implementation:Intel Ipex llm Prompter And Get Train Val Data
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tools for prompt formatting and dataset tokenization provided by the IPEX-LLM common finetuning utilities.
Description
The Prompter class loads JSON prompt templates (e.g., Alpaca format) and formats instruction-input-output triples into full prompts. The get_train_val_data function tokenizes a dataset using the Prompter, handles input masking for loss computation, appends EOS tokens, and splits into train/validation sets.
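As a rough illustration of the template-based formatting described above, the following minimal sketch mimics an Alpaca-style prompter. The class name `SketchPrompter`, the template text, and the JSON keys (`prompt_input`, `prompt_no_input`, `response_split`) are assumptions made for this example and may not match the template files shipped with IPEX-LLM.

```python
import json
from typing import Optional

# Assumed Alpaca-style template. The key names and wording are illustrative,
# not copied from the IPEX-LLM template files.
TEMPLATE_JSON = r"""
{
  "prompt_input": "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n",
  "prompt_no_input": "### Instruction:\n{instruction}\n\n### Response:\n",
  "response_split": "### Response:"
}
"""


class SketchPrompter:
    """Hypothetical stand-in for Prompter, for illustration only."""

    def __init__(self, template_json: str = TEMPLATE_JSON):
        self.template = json.loads(template_json)

    def generate_prompt(
        self,
        instruction: str,
        input: Optional[str] = None,
        label: Optional[str] = None,
    ) -> str:
        # Pick the template variant depending on whether an input is present.
        if input:
            prompt = self.template["prompt_input"].format(
                instruction=instruction, input=input
            )
        else:
            prompt = self.template["prompt_no_input"].format(instruction=instruction)
        # Appending the label (expected output) yields the full training text.
        if label:
            prompt += label
        return prompt

    def get_response(self, output: str) -> str:
        # Keep only the text after the response marker.
        return output.split(self.template["response_split"])[1].strip()


prompter = SketchPrompter()
full_prompt = prompter.generate_prompt("Add the numbers.", "2 and 3", "5")
print(full_prompt)
```

Calling `generate_prompt` without a label produces the prompt that is fed to the model at inference time, and `get_response` strips that prompt back off the generated text.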
Usage
Import these when preparing instruction-following datasets for LoRA or QLoRA fine-tuning. The Prompter handles template-based formatting, while get_train_val_data handles the full tokenization pipeline.
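The tokenization pipeline mentioned above (truncation, optional EOS, and input masking) can be sketched with a toy tokenizer. Everything here is an assumption for illustration: `toy_tokenize`, `build_example`, and `EOS_ID` are hypothetical names, and the real utility works with a HuggingFace tokenizer rather than whitespace splitting. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why it is commonly used to mask labels.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default
EOS_ID = 2           # hypothetical EOS id for the toy tokenizer below


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: one id per whitespace-separated word."""
    return [10 + len(word) for word in text.split()]


def build_example(
    prompt: str,
    response: str,
    *,
    train_on_inputs: bool,
    add_eos_token: bool,
    cutoff_len: int,
) -> Dict[str, List[int]]:
    prompt_ids = toy_tokenize(prompt)
    # Tokenize prompt + response and truncate to the maximum length.
    input_ids = (prompt_ids + toy_tokenize(response))[:cutoff_len]
    # Append EOS only if there is still room after truncation.
    if add_eos_token and len(input_ids) < cutoff_len:
        input_ids.append(EOS_ID)
    labels = list(input_ids)
    if not train_on_inputs:
        # Mask the prompt tokens so the loss covers only the response.
        n_prompt = min(len(prompt_ids), len(labels))
        labels = [IGNORE_INDEX] * n_prompt + labels[n_prompt:]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


example = build_example(
    "Translate to French: cat",
    "chat",
    train_on_inputs=False,
    add_eos_token=True,
    cutoff_len=16,
)
print(example["labels"])
```

With `train_on_inputs=True` the labels are simply a copy of `input_ids`, so the model is also trained to reproduce the prompt.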
Code Reference
Source Location
- Repository: IPEX-LLM
- File: python/llm/example/GPU/LLM-Finetuning/common/utils/prompter.py (Prompter, lines 40-83)
- File: python/llm/example/GPU/LLM-Finetuning/common/utils/util.py (get_train_val_data, lines 78-138)
Signature
class Prompter(object):
    def __init__(self, template_name: str = "", verbose: bool = False):
        """Load a prompt template from common/templates/{template_name}.json"""

    def generate_prompt(
        self,
        instruction: str,
        input: Union[None, str] = None,
        label: Union[None, str] = None,
    ) -> str:
        """Format instruction/input/label into a full prompt string."""

    def get_response(self, output: str) -> str:
        """Extract the response portion from model output."""


def get_train_val_data(
    data,
    tokenizer,
    prompter,
    train_on_inputs: bool,
    add_eos_token: bool,
    cutoff_len: int,
    val_set_size: int,
    seed: int = 42
) -> Tuple[Dataset, Dataset]:
    """Tokenize and split dataset into train and validation sets."""
Import
from common.utils import Prompter, get_train_val_data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| template_name | str | No | Prompt template name; the signature default is "", which is treated as "alpaca". The template is loaded from the templates/ directory |
| data | DatasetDict | Yes | HuggingFace dataset with "train" split containing instruction/input/output columns |
| tokenizer | PreTrainedTokenizer | Yes | Tokenizer for the target model |
| prompter | Prompter | Yes | Initialized Prompter instance for formatting |
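| train_on_inputs | bool | Yes | Whether prompt (instruction and input) tokens contribute to the loss; when False, they are masked out of the labels so that only the response is trained on |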
| add_eos_token | bool | Yes | Whether to append EOS token to tokenized sequences |
| cutoff_len | int | Yes | Maximum token length for truncation (e.g., 256) |
| val_set_size | int | Yes | Number of samples for validation split (0 for no validation) |
| seed | int | No | Random seed (default 42) for a reproducible train/validation split |
Outputs
| Name | Type | Description |
|---|---|---|
| train_data | Dataset | Tokenized training dataset with input_ids, attention_mask, labels |
| val_data | Dataset or None | Tokenized validation dataset, or None if val_set_size is 0 |
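The output contract (a training set plus either a validation set or None) can be sketched with plain Python lists. This is a simplified illustration under stated assumptions: `split_train_val` is a hypothetical helper, and the actual utility operates on HuggingFace `Dataset` objects rather than lists, so its shuffling details may differ.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_train_val(
    records: Sequence[dict],
    val_set_size: int,
    seed: int = 42,
) -> Tuple[List[dict], Optional[List[dict]]]:
    """Hold out val_set_size records for validation, or none if it is 0."""
    if val_set_size <= 0:
        # No validation split requested: everything goes to training.
        return list(records), None
    shuffled = list(records)
    # A seeded RNG makes the shuffle, and therefore the split, reproducible.
    random.Random(seed).shuffle(shuffled)
    return shuffled[val_set_size:], shuffled[:val_set_size]


records = [{"id": i} for i in range(10)]
train_data, val_data = split_train_val(records, val_set_size=3, seed=42)
print(len(train_data), len(val_data))
```

Callers should check `val_data is None` before passing it to an evaluation loop, since that is the documented result when `val_set_size` is 0.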
Usage Examples
from common.utils import Prompter, get_train_val_data
from datasets import load_dataset
from transformers import AutoTokenizer

# 1. Load tokenizer and dataset
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
data = load_dataset("yahma/alpaca-cleaned")

# 2. Initialize Prompter with Alpaca template
prompter = Prompter("alpaca")

# 3. Tokenize and split
train_data, val_data = get_train_val_data(
    data, tokenizer, prompter,
    train_on_inputs=True,
    add_eos_token=False,
    cutoff_len=256,
    val_set_size=2000,
    seed=42
)