
Implementation:Neuml Txtai HFTrainer Call

From Leeroopedia


Knowledge Sources
Domains: Training, NLP
Last Updated: 2026-02-09 00:00 GMT

Overview

A concrete tool for end-to-end model fine-tuning provided by the txtai library. HFTrainer.__call__() orchestrates the complete training pipeline, from raw data to a fine-tuned (model, tokenizer) tuple, in a single invocation.

Description

HFTrainer.__call__() is the primary entry point for training transformer models in txtai. It coordinates every stage of the fine-tuning workflow, listed below and sketched in code after the list:

  1. Validates dependencies -- raises ImportError if quantization or LoRA is requested but the peft package is not installed.
  2. Parses training arguments -- calls self.parse(args) to merge user overrides with defaults.
  3. Sets seed -- calls set_seed(args.seed) for reproducibility.
  4. Loads configuration and tokenizer -- calls self.load(base, maxlength) to obtain (config, tokenizer, maxlength).
  5. Sets pad token -- defaults pad_token to eos_token if not already defined.
  6. Prepares data processing -- calls self.prepare() to select the correct tokenizer class (Labels, Questions, Sequences, or Texts) and data collator based on the task string.
  7. Tokenizes datasets -- invokes the processor on training and validation data with optional multiprocessing.
  8. Creates model -- calls self.model() to load the pretrained model with the correct architecture head and optional quantization.
  9. Applies LoRA -- calls self.peft() to optionally wrap the model with PEFT adapters.
  10. Builds HF Trainer -- assembles a HuggingFace Trainer instance with all components.
  11. Runs training -- executes trainer.train(), optionally resuming from a checkpoint.
  12. Evaluates -- runs trainer.evaluate() if validation data was provided.
  13. Saves -- writes model and trainer state if args.should_save is true.
  14. Returns -- puts the model in eval mode and returns (model, tokenizer).
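
Reduced to the text-classification case and written directly against the HuggingFace transformers and datasets APIs, the flow above looks roughly like the sketch below. This is an illustrative simplification, not the txtai source: quantization, LoRA, multiprocessing, checkpoint resume, and saving are omitted, and finetune_sketch is a hypothetical name.

from datasets import Dataset
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)

def finetune_sketch(base, train, validation=None, maxlength=None, **args):
    # Parse training arguments and set the seed (steps 2-3)
    args = TrainingArguments(output_dir=args.pop("output_dir", "./out"), **args)
    set_seed(args.seed)

    # Load configuration and tokenizer, default the pad token (steps 4-5)
    config = AutoConfig.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)
    if not tokenizer.pad_token:
        tokenizer.pad_token = tokenizer.eos_token
    maxlength = maxlength if maxlength else tokenizer.model_max_length

    # Tokenize datasets (steps 6-7)
    def tokenize(rows):
        return tokenizer(rows["text"], truncation=True, max_length=maxlength)

    train = Dataset.from_list(train).map(tokenize, batched=True)
    validation = Dataset.from_list(validation).map(tokenize, batched=True) if validation else None

    # Create the model with a sequence classification head (step 8)
    model = AutoModelForSequenceClassification.from_pretrained(base, config=config)

    # Build the HF Trainer with a padding collator, train, optionally evaluate (steps 10-12)
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train,
        eval_dataset=validation,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    trainer.train()
    if validation:
        trainer.evaluate()

    # Return the model in eval mode (step 14)
    model.eval()
    return model, tokenizer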

Usage

This is the standard way to fine-tune any supported model type in txtai. It accepts raw data in multiple formats (HuggingFace Dataset, pandas/Polars DataFrame, or an iterable of dicts) and handles all tokenization, model loading, and training internally.
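
For instance, the same training data can be supplied as an iterable of dicts or as a pandas DataFrame; the two calls below are interchangeable (the model and hyperparameters are only illustrative).

import pandas as pd

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

data = [
    {"text": "This is great!", "label": 1},
    {"text": "This is terrible.", "label": 0},
]

# Iterable of dicts
model, tokenizer = trainer("bert-base-uncased", data, num_train_epochs=1)

# Same data as a pandas DataFrame
model, tokenizer = trainer("bert-base-uncased", pd.DataFrame(data), num_train_epochs=1)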

Code Reference

Source Location

  • Repository: txtai
  • File: src/python/txtai/pipeline/train/hftrainer.py (Lines 45-144)

Signature

def __call__(
    self,
    base,
    train,
    validation=None,
    columns=None,
    maxlength=None,
    stride=128,
    task="text-classification",
    prefix=None,
    metrics=None,
    tokenizers=None,
    checkpoint=None,
    quantize=None,
    lora=None,
    **args
):
    """
    Builds a new model using arguments.

    Args:
        base: path to base model, accepts HF hub id, local path or (model, tokenizer) tuple
        train: training data
        validation: validation data
        columns: tuple of columns for text/label mapping
        maxlength: maximum sequence length, defaults to tokenizer.model_max_length
        stride: chunk size for splitting data for QA tasks
        task: model task, defaults to "text-classification"
        prefix: optional source prefix for seq2seq tasks
        metrics: optional function returning evaluation metrics dict
        tokenizers: number of concurrent tokenizers, defaults to None
        checkpoint: resume from checkpoint flag or path
        quantize: quantization configuration
        lora: LoRA configuration
        args: training arguments passed to TrainingArguments

    Returns:
        (model, tokenizer) tuple
    """

Import

from txtai.pipeline import HFTrainer

I/O Contract

Inputs

  • base (str or tuple, required) -- Path to pretrained model (HuggingFace hub ID or local path), or a (model, tokenizer) tuple from a prior training run.
  • train (Dataset / DataFrame / iterable, required) -- Training dataset. Supports HuggingFace Dataset, pandas/Polars DataFrame, or an iterable of dicts.
  • validation (Dataset / DataFrame / iterable, optional) -- Validation dataset in the same format as train.
  • columns (tuple, optional) -- Column name mapping. Defaults depend on task: ("text", None, "label") for classification, ("question", "context", "answers") for QA, ("source", "target") for seq2seq, ("text", None) for LM.
  • maxlength (int, optional) -- Maximum sequence length. Defaults to tokenizer.model_max_length.
  • stride (int, optional) -- Token overlap for QA chunking. Default: 128.
  • task (str, optional) -- Task type. Supported: "text-classification" (default), "question-answering", "sequence-sequence", "language-generation", "language-modeling", "token-detection".
  • prefix (str, optional) -- Source prefix for seq2seq tasks (e.g., "summarize: ").
  • metrics (callable, optional) -- Function f(eval_pred) -> dict for computing evaluation metrics.
  • tokenizers (bool / int, optional) -- Number of concurrent tokenizer workers. True uses os.cpu_count(). None disables multiprocessing.
  • checkpoint (bool / str, optional) -- Resume from checkpoint. True resumes from the latest checkpoint; a string specifies a checkpoint directory path.
  • quantize (bool / dict / BitsAndBytesConfig, optional) -- Quantization config. True enables 4-bit NF4 defaults. Requires the peft package.
  • lora (bool / dict / LoraConfig, optional) -- LoRA adapter config. True enables defaults (r=16, lora_alpha=8, target_modules="all-linear"). Requires the peft package.
  • **args (keyword arguments, optional) -- Any valid TrainingArguments field (e.g., num_train_epochs, learning_rate, per_device_train_batch_size).
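
As a brief sketch of the columns and metrics parameters, the call below maps a dataset whose columns are named sentence and sentiment (illustrative names, not txtai defaults) onto the classification (text, text_pair, label) layout and computes accuracy on the validation split:

from txtai.pipeline import HFTrainer

def accuracy(eval_pred):
    # eval_pred is the (predictions, labels) pair produced by the HF Trainer
    predictions, labels = eval_pred
    return {"accuracy": (predictions.argmax(axis=-1) == labels).mean()}

train = [
    {"sentence": "This is great!", "sentiment": 1},
    {"sentence": "This is terrible.", "sentiment": 0},
]
validation = [
    {"sentence": "Really nice.", "sentiment": 1},
]

trainer = HFTrainer()
model, tokenizer = trainer(
    "bert-base-uncased",
    train,
    validation=validation,
    columns=("sentence", None, "sentiment"),  # (text, text_pair, label) mapping
    metrics=accuracy,
    num_train_epochs=1,
)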

Outputs

  • result (tuple) -- A (model, tokenizer) tuple. The model is returned in eval mode (inference behavior; dropout and similar training-only layers are disabled, though weights are not frozen). The model class depends on the task parameter.
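
The returned pair can be used for inference directly, for example by handing it to a txtai pipeline that accepts a (model, tokenizer) tuple. A minimal sketch with the Labels pipeline, assuming a model fine-tuned for text-classification as in the examples below:

from txtai.pipeline import Labels

# Wrap the fine-tuned (model, tokenizer) tuple for classification inference
labels = Labels((model, tokenizer), dynamic=False)

print(labels("This is great!"))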

Usage Examples

Basic Example: Text Classification

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

train = [
    {"text": "This is great!", "label": 1},
    {"text": "This is terrible.", "label": 0},
]

model, tokenizer = trainer(
    "bert-base-uncased",
    train,
    task="text-classification",
    num_train_epochs=3,
    learning_rate=2e-5,
)

Question Answering

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

train = [
    {
        "question": "What is txtai?",
        "context": "txtai is an all-in-one embeddings database for semantic search.",
        "answers": {"text": ["all-in-one embeddings database"], "answer_start": [12]},
    }
]

model, tokenizer = trainer(
    "bert-base-uncased",
    train,
    task="question-answering",
    stride=128,
    num_train_epochs=2,
)
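
Sequence to Sequence

A hedged sketch for the "sequence-sequence" task, following the ("source", "target") column defaults and the optional prefix from the I/O contract (t5-small and the sample row are illustrative):

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

train = [
    {
        "source": "txtai is an all-in-one embeddings database for semantic search.",
        "target": "txtai is an embeddings database.",
    }
]

model, tokenizer = trainer(
    "t5-small",
    train,
    task="sequence-sequence",
    prefix="summarize: ",
    num_train_epochs=1,
)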

QLoRA Fine-Tuning

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

train = [
    {"text": "Explain quantum computing.", "label": 1},
]

model, tokenizer = trainer(
    "meta-llama/Llama-2-7b-hf",
    train,
    task="text-classification",
    quantize=True,
    lora=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
)
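
Both quantize and lora also accept explicit configuration instead of True. A sketch reusing trainer and train from above and passing dicts whose fields follow BitsAndBytesConfig and LoraConfig (the specific values are illustrative, not txtai defaults):

model, tokenizer = trainer(
    "meta-llama/Llama-2-7b-hf",
    train,
    task="text-classification",
    quantize={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
    lora={"r": 8, "lora_alpha": 16, "lora_dropout": 0.05, "target_modules": "all-linear"},
    num_train_epochs=1,
    per_device_train_batch_size=4,
)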

Saving to Disk

from txtai.pipeline import HFTrainer

trainer = HFTrainer()

train = [
    {"text": "This is great!", "label": 1},
    {"text": "This is terrible.", "label": 0},
]

model, tokenizer = trainer(
    "bert-base-uncased",
    train,
    task="text-classification",
    output_dir="./my-fine-tuned-model",
    save_strategy="epoch",
    num_train_epochs=5,
)
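
Once written to disk, the output directory can be reloaded like any HuggingFace checkpoint, or passed back to HFTrainer as base for further fine-tuning. A minimal sketch; the tokenizer returned by the call can be persisted alongside the model with save_pretrained:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Persist the tokenizer next to the saved model weights
tokenizer.save_pretrained("./my-fine-tuned-model")

# Reload both from the output directory
model = AutoModelForSequenceClassification.from_pretrained("./my-fine-tuned-model")
tokenizer = AutoTokenizer.from_pretrained("./my-fine-tuned-model")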

Related Pages

Implements Principle

Requires Environment
