Implementation:Intel Ipex llm Prompter And Get Train Val Data

From Leeroopedia


Knowledge Sources
Domains: NLP, Data_Processing
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete tools for prompt formatting and dataset tokenization provided by the IPEX-LLM common finetuning utilities.

Description

The Prompter class loads a JSON prompt template (e.g., the Alpaca format) and formats instruction/input/output triples into full prompt strings. The get_train_val_data function uses a Prompter to build and tokenize each example, optionally masks the prompt tokens out of the loss (train_on_inputs), optionally appends an EOS token (add_eos_token), truncates to cutoff_len, and splits the result into training and validation sets.
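The formatting and response-extraction steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the IPEX-LLM source: the TEMPLATE dictionary is a shortened, hypothetical Alpaca-style template (its keys and wording are assumed), and format_prompt and extract_response are illustrative names standing in for generate_prompt and get_response.

```python
# Hypothetical, shortened Alpaca-style template. The real JSON file under
# common/templates/ may use different keys and wording.
TEMPLATE = {
    "prompt_input": (
        "### Instruction:\n{instruction}\n\n"
        "### Input:\n{input}\n\n"
        "### Response:\n"
    ),
    "prompt_no_input": "### Instruction:\n{instruction}\n\n### Response:\n",
    "response_split": "### Response:",
}


def format_prompt(instruction, input_text=None, label=None):
    """Fill the template; append the label (expected output) when given."""
    if input_text:
        prompt = TEMPLATE["prompt_input"].format(
            instruction=instruction, input=input_text
        )
    else:
        prompt = TEMPLATE["prompt_no_input"].format(instruction=instruction)
    if label:
        prompt += label
    return prompt


def extract_response(output):
    """Return the text that follows the response marker."""
    return output.split(TEMPLATE["response_split"])[1].strip()


full_prompt = format_prompt("Add the numbers.", "2 and 3", "5")
print(full_prompt)
print(extract_response(full_prompt))
```

Calling format_prompt without a label yields the prompt alone, which is the form needed at inference time or when working out how many leading tokens belong to the prompt.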

Usage

Import these when preparing instruction-following datasets for LoRA or QLoRA fine-tuning. The Prompter handles template-based formatting, while get_train_val_data handles the full tokenization pipeline.

Code Reference

Source Location

  • Repository: IPEX-LLM
  • File: python/llm/example/GPU/LLM-Finetuning/common/utils/prompter.py (Prompter, lines 40-83)
  • File: python/llm/example/GPU/LLM-Finetuning/common/utils/util.py (get_train_val_data, lines 78-138)

Signature

from typing import Optional, Tuple, Union

from datasets import Dataset


class Prompter(object):
    def __init__(self, template_name: str = "", verbose: bool = False):
        """Load a prompt template from common/templates/{template_name}.json"""

    def generate_prompt(
        self,
        instruction: str,
        input: Union[None, str] = None,
        label: Union[None, str] = None,
    ) -> str:
        """Format instruction/input/label into a full prompt string."""

    def get_response(self, output: str) -> str:
        """Extract the response portion from model output."""


def get_train_val_data(
    data,
    tokenizer,
    prompter,
    train_on_inputs: bool,
    add_eos_token: bool,
    cutoff_len: int,
    val_set_size: int,
    seed: int = 42
) -> Tuple[Dataset, Optional[Dataset]]:
    """Tokenize and split dataset into train and validation sets.

    The second element is None when val_set_size is 0.
    """

Import

from common.utils import Prompter, get_train_val_data

I/O Contract

Inputs

Name | Type | Required | Description
template_name | str | No | Prompt template name; the signature default "" is treated as "alpaca". Loaded from common/templates/{template_name}.json
data | DatasetDict | Yes | HuggingFace dataset with a "train" split containing instruction/input/output columns
tokenizer | PreTrainedTokenizer | Yes | Tokenizer for the target model
prompter | Prompter | Yes | Initialized Prompter instance used for formatting
train_on_inputs | bool | Yes | Whether prompt (input) tokens contribute to the loss; if False, they are masked out of the labels
add_eos_token | bool | Yes | Whether to append the EOS token to tokenized sequences
cutoff_len | int | Yes | Maximum sequence length in tokens; longer sequences are truncated (e.g., 256)
val_set_size | int | Yes | Number of samples held out for validation (0 for no validation split)
seed | int | No | Random seed, default 42 (used for the train/validation split)
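How train_on_inputs, add_eos_token, and cutoff_len interact can be sketched as below. This is an illustration under stated assumptions rather than the library's code: toy_tokenize is a stand-in tokenizer, EOS_ID is a hypothetical token id, and build_example is an illustrative name. The value -100 is the label index that PyTorch's cross-entropy loss ignores by default, which is the usual way to exclude prompt tokens from the loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default
EOS_ID = 2           # hypothetical EOS token id


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): one id per whitespace-separated word."""
    return [10 + len(word) for word in text.split()]


def build_example(prompt, answer, train_on_inputs, add_eos_token, cutoff_len):
    """Tokenize prompt + answer, truncate, optionally add EOS and mask the prompt."""
    prompt_ids = toy_tokenize(prompt)
    input_ids = (prompt_ids + toy_tokenize(answer))[:cutoff_len]

    # Append EOS only if requested and the sequence still has room.
    if add_eos_token and len(input_ids) < cutoff_len:
        input_ids.append(EOS_ID)

    labels = list(input_ids)
    if not train_on_inputs:
        # Mask the prompt positions so only the answer contributes to the loss.
        n_prompt = min(len(prompt_ids), len(labels))
        labels[:n_prompt] = [IGNORE_INDEX] * n_prompt

    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


masked = build_example("Translate cat", "chat", False, True, cutoff_len=8)
unmasked = build_example("Translate cat", "chat", True, True, cutoff_len=3)
print(masked["labels"])
print(unmasked["labels"])
```

In the second call the sequence already fills cutoff_len, so no EOS token is appended even though add_eos_token is True.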

Outputs

Name | Type | Description
train_data | Dataset | Tokenized training dataset with input_ids, attention_mask, labels
val_data | Dataset or None | Tokenized validation dataset, or None if val_set_size is 0
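The output contract, including the None case for val_set_size of 0, can be sketched with plain Python lists. This is only an illustration: split_train_val is a hypothetical helper, and the actual function operates on HuggingFace Dataset objects rather than lists.

```python
import random


def split_train_val(rows, val_set_size, seed=42):
    """Shuffle deterministically, then hold out val_set_size rows for validation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    if val_set_size > 0:
        return rows[val_set_size:], rows[:val_set_size]
    # No validation split requested: everything goes to training.
    return rows, None


train_rows, val_rows = split_train_val(range(10), val_set_size=3)
all_rows, no_val = split_train_val(range(10), val_set_size=0)
print(len(train_rows), len(val_rows), no_val)
```

Callers should therefore check val_data for None before passing it on as an evaluation dataset.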

Usage Examples

from common.utils import Prompter, get_train_val_data
from datasets import load_dataset
from transformers import AutoTokenizer

# 1. Load tokenizer and dataset
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
data = load_dataset("yahma/alpaca-cleaned")

# 2. Initialize Prompter with Alpaca template
prompter = Prompter("alpaca")

# 3. Tokenize and split
train_data, val_data = get_train_val_data(
    data, tokenizer, prompter,
    train_on_inputs=True,
    add_eos_token=False,
    cutoff_len=256,
    val_set_size=2000,
    seed=42
)

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
