Implementation: Neuml Txtai HFTrainer Call
| Knowledge Sources | |
|---|---|
| Domains | Training, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for end-to-end model fine-tuning provided by the txtai library. HFTrainer.__call__() orchestrates the complete training pipeline from raw data to a fine-tuned (model, tokenizer) tuple in a single invocation.
Description
HFTrainer.__call__() is the primary entry point for training transformer models in txtai. It coordinates every stage of the fine-tuning workflow:
- Validates dependencies -- raises `ImportError` if quantization or LoRA is requested but the `peft` package is not installed.
- Parses training arguments -- calls `self.parse(args)` to merge user overrides with defaults.
- Sets seed -- calls `set_seed(args.seed)` for reproducibility.
- Loads configuration and tokenizer -- calls `self.load(base, maxlength)` to obtain `(config, tokenizer, maxlength)`.
- Sets pad token -- defaults `pad_token` to `eos_token` if not already defined.
- Prepares data processing -- calls `self.prepare()` to select the correct tokenizer class (`Labels`, `Questions`, `Sequences`, or `Texts`) and data collator based on the task string.
- Tokenizes datasets -- invokes the processor on training and validation data with optional multiprocessing.
- Creates model -- calls `self.model()` to load the pretrained model with the correct architecture head and optional quantization.
- Applies LoRA -- calls `self.peft()` to optionally wrap the model with PEFT adapters.
- Builds HF Trainer -- assembles a HuggingFace `Trainer` instance with all components.
- Runs training -- executes `trainer.train()`, optionally resuming from a checkpoint.
- Evaluates -- runs `trainer.evaluate()` if validation data was provided.
- Saves -- writes model and trainer state if `args.should_save` is true.
- Returns -- puts the model in eval mode and returns `(model, tokenizer)`.
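The step sequence above can be sketched as plain-Python control flow. This is a dependency-free illustration of the documented stage ordering, with stub strings standing in for the real txtai and transformers calls, not the actual implementation:

```python
# Hedged sketch of the stage ordering in HFTrainer.__call__().
# Strings stand in for the real txtai/transformers calls.

def call_order(validation=None, quantize=None, lora=None, should_save=True):
    """Return the ordered list of pipeline stages that would run."""
    steps = []
    if quantize or lora:
        steps.append("validate peft dependency")   # ImportError if peft missing
    steps += [
        "parse training arguments",                # self.parse(args)
        "set seed",                                # set_seed(args.seed)
        "load config and tokenizer",               # self.load(base, maxlength)
        "set pad token",                           # pad_token -> eos_token fallback
        "prepare data processing",                 # self.prepare() picks the processor
        "tokenize datasets",
        "create model",                            # self.model()
        "apply LoRA",                              # self.peft()
        "build HF Trainer",
        "run training",                            # trainer.train()
    ]
    if validation is not None:
        steps.append("evaluate")                   # trainer.evaluate()
    if should_save:
        steps.append("save model and trainer state")
    steps.append("return (model, tokenizer) in eval mode")
    return steps
```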
Usage
This is the standard way to fine-tune any supported model type in txtai. It accepts raw data in multiple formats (HuggingFace Dataset, pandas/Polars DataFrame, or an iterable of dicts) and handles all tokenization, model loading, and training internally.
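To illustrate the accepted input formats, here is the same two-row sentiment dataset in each form. Only the iterable-of-dicts variant is constructed directly; the DataFrame and Dataset variants are shown as comments to keep the sketch dependency-free:

```python
# 1. Iterable of dicts (works directly)
train_dicts = [
    {"text": "This is great!", "label": 1},
    {"text": "This is terrible.", "label": 0},
]

# 2. pandas DataFrame (equivalent):
#    import pandas as pd
#    train_df = pd.DataFrame(train_dicts)

# 3. HuggingFace Dataset (equivalent):
#    from datasets import Dataset
#    train_ds = Dataset.from_list(train_dicts)

# All three carry the columns expected by the default
# text-classification mapping ("text", None, "label").
columns_present = set(train_dicts[0])
```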
Code Reference
Source Location
- Repository: txtai
- File: `src/python/txtai/pipeline/train/hftrainer.py` (Lines 45-144)
Signature
def __call__(
self,
base,
train,
validation=None,
columns=None,
maxlength=None,
stride=128,
task="text-classification",
prefix=None,
metrics=None,
tokenizers=None,
checkpoint=None,
quantize=None,
lora=None,
**args
):
"""
Builds a new model using arguments.
Args:
base: path to base model, accepts HF hub id, local path or (model, tokenizer) tuple
train: training data
validation: validation data
columns: tuple of columns for text/label mapping
maxlength: maximum sequence length, defaults to tokenizer.model_max_length
stride: chunk size for splitting data for QA tasks
task: model task, defaults to "text-classification"
prefix: optional source prefix for seq2seq tasks
metrics: optional function returning evaluation metrics dict
tokenizers: number of concurrent tokenizers, defaults to None
checkpoint: resume from checkpoint flag or path
quantize: quantization configuration
lora: LoRA configuration
args: training arguments passed to TrainingArguments
Returns:
(model, tokenizer) tuple
"""
Import
from txtai.pipeline import HFTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| base | str or tuple | Yes | Path to pretrained model (HuggingFace hub ID or local path), or a (model, tokenizer) tuple from a prior training run. |
| train | Dataset / DataFrame / iterable | Yes | Training dataset. Supports HuggingFace Dataset, pandas/Polars DataFrame, or an iterable of dicts. |
| validation | Dataset / DataFrame / iterable | No | Optional validation dataset in the same format as train. |
| columns | tuple | No | Column name mapping. Defaults depend on task: ("text", None, "label") for classification, ("question", "context", "answers") for QA, ("source", "target") for seq2seq, ("text", None) for LM. |
| maxlength | int | No | Maximum sequence length. Defaults to tokenizer.model_max_length. |
| stride | int | No | Token overlap for QA chunking. Default: 128. |
| task | str | No | Task type. Supported: "text-classification" (default), "question-answering", "sequence-sequence", "language-generation", "language-modeling", "token-detection". |
| prefix | str | No | Source prefix for seq2seq tasks (e.g., "summarize: "). |
| metrics | callable | No | Function f(eval_pred) -> dict for computing evaluation metrics. |
| tokenizers | bool / int | No | Number of concurrent tokenizer workers. True uses os.cpu_count(). None disables multiprocessing. |
| checkpoint | bool / str | No | Resume from checkpoint. True resumes from the latest checkpoint; a string specifies a checkpoint directory path. |
| quantize | bool / dict / BitsAndBytesConfig | No | Quantization config. True enables 4-bit NF4 defaults. Requires the peft package. |
| lora | bool / dict / LoraConfig | No | LoRA adapter config. True enables defaults (r=16, lora_alpha=8, target_modules="all-linear"). Requires the peft package. |
| **args | keyword args | No | Any valid TrainingArguments field (e.g., num_train_epochs, learning_rate, per_device_train_batch_size). |
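The task-dependent `columns` defaults from the table above can be captured in a small lookup. This is a sketch mirroring the documented defaults, not txtai's internal resolution code; `resolve_columns` is a hypothetical helper name:

```python
# Default column mappings per task, as documented in the Inputs table.
DEFAULT_COLUMNS = {
    "text-classification": ("text", None, "label"),
    "question-answering": ("question", "context", "answers"),
    "sequence-sequence": ("source", "target"),
    "language-generation": ("text", None),
    "language-modeling": ("text", None),
}

def resolve_columns(task, columns=None):
    """Return explicit columns if given, else the documented default for task."""
    if columns is not None:
        return columns
    return DEFAULT_COLUMNS.get(task, ("text", None, "label"))
```

Passing an explicit `columns` tuple overrides the default, which is how datasets with nonstandard column names (e.g., `("review", None, "stars")`) are mapped.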
Outputs
| Name | Type | Description |
|---|---|---|
| result | tuple | A (model, tokenizer) tuple. The model is returned in eval mode (training-time behavior such as dropout disabled). The model class depends on the task parameter. |
Usage Examples
Basic Example: Text Classification
from txtai.pipeline import HFTrainer
trainer = HFTrainer()
train = [
{"text": "This is great!", "label": 1},
{"text": "This is terrible.", "label": 0},
]
model, tokenizer = trainer(
"bert-base-uncased",
train,
task="text-classification",
num_train_epochs=3,
learning_rate=2e-5,
)
Question Answering
from txtai.pipeline import HFTrainer
trainer = HFTrainer()
train = [
{
"question": "What is txtai?",
"context": "txtai is an all-in-one embeddings database for semantic search.",
"answers": {"text": ["all-in-one embeddings database"], "answer_start": [12]},
}
]
model, tokenizer = trainer(
"bert-base-uncased",
train,
task="question-answering",
stride=128,
num_train_epochs=2,
)
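The `answer_start` values in SQuAD-style records must be character offsets into `context`. A small helper (hypothetical, not part of txtai) can compute them and catch mismatches before they silently corrupt QA training targets:

```python
def make_answers(context, answer_text):
    """Build a SQuAD-style answers dict by locating answer_text in context."""
    start = context.find(answer_text)
    if start == -1:
        raise ValueError(f"answer {answer_text!r} not found in context")
    return {"text": [answer_text], "answer_start": [start]}

# Rebuilding the record from the example above:
context = "txtai is an all-in-one embeddings database for semantic search."
record = {
    "question": "What is txtai?",
    "context": context,
    "answers": make_answers(context, "all-in-one embeddings database"),
}
```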
QLoRA Fine-Tuning
from txtai.pipeline import HFTrainer
trainer = HFTrainer()
train = [
{"text": "Explain quantum computing.", "label": 1},
]
model, tokenizer = trainer(
"meta-llama/Llama-2-7b-hf",
train,
task="text-classification",
quantize=True,
lora=True,
num_train_epochs=1,
per_device_train_batch_size=4,
)
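Passing `quantize=True` and `lora=True` applies the documented defaults. Roughly equivalent explicit dict configurations look like the sketch below; treat the exact key names as assumptions and consult `BitsAndBytesConfig` and `LoraConfig` for the authoritative fields:

```python
# Explicit dict equivalents of quantize=True / lora=True, based on the
# defaults documented in the Inputs table (4-bit NF4; r=16, lora_alpha=8,
# target_modules="all-linear"). Key names are assumptions.
quantize_config = {
    "load_in_4bit": True,          # 4-bit quantization
    "bnb_4bit_quant_type": "nf4",  # NF4 data type
}

lora_config = {
    "r": 16,
    "lora_alpha": 8,
    "target_modules": "all-linear",
}

# These dicts could then replace the booleans:
# model, tokenizer = trainer(..., quantize=quantize_config, lora=lora_config)
```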
Saving to Disk
from txtai.pipeline import HFTrainer
trainer = HFTrainer()
train = [
    {"text": "This is great!", "label": 1},
    {"text": "This is terrible.", "label": 0},
]
model, tokenizer = trainer(
    "bert-base-uncased",
    train,
    task="text-classification",
    output_dir="./my-fine-tuned-model",
    save_strategy="epoch",
    num_train_epochs=5,
)