Implementation: Neuml Txtai HFTrainer Call
| Knowledge Sources | |
|---|---|
| Domains | Training, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for end-to-end model fine-tuning provided by the txtai library. HFTrainer.__call__() orchestrates the complete training pipeline from raw data to a fine-tuned (model, tokenizer) tuple in a single invocation.
Description
HFTrainer.__call__() is the primary entry point for training transformer models in txtai. It coordinates every stage of the fine-tuning workflow:
- Validates dependencies -- raises `ImportError` if quantization or LoRA is requested but the `peft` package is not installed.
- Parses training arguments -- calls `self.parse(args)` to merge user overrides with defaults.
- Sets seed -- calls `set_seed(args.seed)` for reproducibility.
- Loads configuration and tokenizer -- calls `self.load(base, maxlength)` to obtain `(config, tokenizer, maxlength)`.
- Sets pad token -- defaults `pad_token` to `eos_token` if not already defined.
- Prepares data processing -- calls `self.prepare()` to select the correct tokenizer class (`Labels`, `Questions`, `Sequences`, or `Texts`) and data collator based on the task string.
- Tokenizes datasets -- invokes the processor on training and validation data with optional multiprocessing.
- Creates model -- calls `self.model()` to load the pretrained model with the correct architecture head and optional quantization.
- Applies LoRA -- calls `self.peft()` to optionally wrap the model with PEFT adapters.
- Builds HF Trainer -- assembles a HuggingFace `Trainer` instance with all components.
- Runs training -- executes `trainer.train()`, optionally resuming from a checkpoint.
- Evaluates -- runs `trainer.evaluate()` if validation data was provided.
- Saves -- writes model and trainer state if `args.should_save` is true.
- Returns -- puts the model in eval mode and returns `(model, tokenizer)`.
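The step sequence above can be sketched as plain-Python control flow. This is a dependency-free illustration of the documented stage ordering, with stub strings standing in for the real txtai and transformers calls, not the actual implementation:

```python
# Hedged sketch of the stage ordering in HFTrainer.__call__().
# Strings stand in for the real txtai/transformers calls.

def call_order(validation=None, quantize=None, lora=None, should_save=True):
    """Return the ordered list of pipeline stages that would run."""
    steps = []
    if quantize or lora:
        steps.append("validate peft dependency")   # ImportError if peft missing
    steps += [
        "parse training arguments",                # self.parse(args)
        "set seed",                                # set_seed(args.seed)
        "load config and tokenizer",               # self.load(base, maxlength)
        "set pad token",                           # pad_token -> eos_token fallback
        "prepare data processing",                 # self.prepare() picks the processor
        "tokenize datasets",
        "create model",                            # self.model()
        "apply LoRA",                              # self.peft()
        "build HF Trainer",
        "run training",                            # trainer.train()
    ]
    if validation is not None:
        steps.append("evaluate")                   # trainer.evaluate()
    if should_save:
        steps.append("save model and trainer state")
    steps.append("return (model, tokenizer) in eval mode")
    return steps
```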
Usage
This is the standard way to fine-tune any supported model type in txtai. It accepts raw data in multiple formats (HuggingFace Dataset, pandas/Polars DataFrame, or an iterable of dicts) and handles all tokenization, model loading, and training internally.
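To illustrate the accepted input formats, here is the same two-row sentiment dataset in each form. Only the iterable-of-dicts variant is constructed directly; the DataFrame and Dataset variants are shown as comments to keep the sketch dependency-free:

```python
# 1. Iterable of dicts (works directly)
train_dicts = [
    {"text": "This is great!", "label": 1},
    {"text": "This is terrible.", "label": 0},
]

# 2. pandas DataFrame (equivalent):
#    import pandas as pd
#    train_df = pd.DataFrame(train_dicts)

# 3. HuggingFace Dataset (equivalent):
#    from datasets import Dataset
#    train_ds = Dataset.from_list(train_dicts)

# All three carry the columns expected by the default
# text-classification mapping ("text", None, "label").
columns_present = set(train_dicts[0])
```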
Code Reference
Source Location
- Repository: txtai
- File: `src/python/txtai/pipeline/train/hftrainer.py` (Lines 45-144)
Signature
def __call__(
self,
base,
train,
validation=None,
columns=None,
maxlength=None,
stride=128,
task="text-classification",
prefix=None,
metrics=None,
tokenizers=None,
checkpoint=None,
quantize=None,
lora=None,
**args
):
"""
Builds a new model using arguments.
Args:
base: path to base model, accepts HF hub id, local path or (model, tokenizer) tuple
train: training data
validation: validation data
columns: tuple of columns for text/label mapping
maxlength: maximum sequence length, defaults to tokenizer.model_max_length
stride: chunk size for splitting data for QA tasks
task: model task, defaults to "text-classification"
prefix: optional source prefix for seq2seq tasks
metrics: optional function returning evaluation metrics dict
tokenizers: number of concurrent tokenizers, defaults to None
checkpoint: resume from checkpoint flag or path
quantize: quantization configuration
lora: LoRA configuration
args: training arguments passed to TrainingArguments
Returns:
(model, tokenizer) tuple
"""
Import
from txtai.pipeline import HFTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| base | str or tuple | Yes | Path to pretrained model (HuggingFace hub ID or local path), or a (model, tokenizer) tuple from a prior training run. |
| train | Dataset / DataFrame / iterable | Yes | Training dataset. Supports HuggingFace Dataset, pandas/Polars DataFrame, or an iterable of dicts. |
| validation | Dataset / DataFrame / iterable | No | Optional validation dataset in the same format as train. |
| columns | tuple | No | Column name mapping. Defaults depend on task: ("text", None, "label") for classification, ("question", "context", "answers") for QA, ("source", "target") for seq2seq, ("text", None) for LM. |
| maxlength | int | No | Maximum sequence length. Defaults to tokenizer.model_max_length. |
| stride | int | No | Token overlap for QA chunking. Default: 128. |
| task | str | No | Task type. Supported: "text-classification" (default), "question-answering", "sequence-sequence", "language-generation", "language-modeling", "token-detection". |
| prefix | str | No | Source prefix for seq2seq tasks (e.g., "summarize: "). |
| metrics | callable | No | Function f(eval_pred) -> dict for computing evaluation metrics. |
| tokenizers | bool / int | No | Number of concurrent tokenizer workers. True uses os.cpu_count(). None disables multiprocessing. |
| checkpoint | bool / str | No | Resume from checkpoint. True resumes from the latest checkpoint; a string specifies a checkpoint directory path. |
| quantize | bool / dict / BitsAndBytesConfig | No | Quantization config. True enables 4-bit NF4 defaults. Requires the peft package. |
| lora | bool / dict / LoraConfig | No | LoRA adapter config. True enables defaults (r=16, lora_alpha=8, target_modules="all-linear"). Requires the peft package. |
| **args | keyword args | No | Any valid TrainingArguments field (e.g., num_train_epochs, learning_rate, per_device_train_batch_size). |
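The task-dependent `columns` defaults from the table above can be captured in a small lookup. This is a sketch mirroring the documented defaults, not txtai's internal resolution code; `resolve_columns` is a hypothetical helper name:

```python
# Default column mappings per task, as documented in the Inputs table.
DEFAULT_COLUMNS = {
    "text-classification": ("text", None, "label"),
    "question-answering": ("question", "context", "answers"),
    "sequence-sequence": ("source", "target"),
    "language-generation": ("text", None),
    "language-modeling": ("text", None),
}

def resolve_columns(task, columns=None):
    """Return explicit columns if given, else the documented default for task."""
    if columns is not None:
        return columns
    return DEFAULT_COLUMNS.get(task, ("text", None, "label"))
```

Passing an explicit `columns` tuple overrides the default, which is how datasets with nonstandard column names (e.g., `("review", None, "stars")`) are mapped.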
Outputs
| Name | Type | Description |
|---|---|---|
| result | tuple | A (model, tokenizer) tuple. The model is returned in eval mode (training-time behavior such as dropout disabled). The model class depends on the task parameter. |
Usage Examples
Basic Example: Text Classification
from txtai.pipeline import HFTrainer
trainer = HFTrainer()
train = [
{"text": "This is great!", "label": 1},
{"text": "This is terrible.", "label": 0},
]
model, tokenizer = trainer(
"bert-base-uncased",
train,
task="text-classification",
num_train_epochs=3,
learning_rate=2e-5,
)
Question Answering
from txtai.pipeline import HFTrainer
trainer = HFTrainer()
train = [
{
"question": "What is txtai?",
"context": "txtai is an all-in-one embeddings database for semantic search.",
"answers": {"text": ["all-in-one embeddings database"], "answer_start": [12]},
}
]
model, tokenizer = trainer(
"bert-base-uncased",
train,
task="question-answering",
stride=128,
num_train_epochs=2,
)
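The `answer_start` values in SQuAD-style records must be character offsets into `context`. A small helper (hypothetical, not part of txtai) can compute them and catch mismatches before they silently corrupt QA training targets:

```python
def make_answers(context, answer_text):
    """Build a SQuAD-style answers dict by locating answer_text in context."""
    start = context.find(answer_text)
    if start == -1:
        raise ValueError(f"answer {answer_text!r} not found in context")
    return {"text": [answer_text], "answer_start": [start]}

# Rebuilding the record from the example above:
context = "txtai is an all-in-one embeddings database for semantic search."
record = {
    "question": "What is txtai?",
    "context": context,
    "answers": make_answers(context, "all-in-one embeddings database"),
}
```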
QLoRA Fine-Tuning
from txtai.pipeline import HFTrainer
trainer = HFTrainer()
train = [
{"text": "Explain quantum computing.", "label": 1},
]
model, tokenizer = trainer(
"meta-llama/Llama-2-7b-hf",
train,
task="text-classification",
quantize=True,
lora=True,
num_train_epochs=1,
per_device_train_batch_size=4,
)
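Passing `quantize=True` and `lora=True` applies the documented defaults. Roughly equivalent explicit dict configurations look like the sketch below; treat the exact key names as assumptions and consult `BitsAndBytesConfig` and `LoraConfig` for the authoritative fields:

```python
# Explicit dict equivalents of quantize=True / lora=True, based on the
# defaults documented in the Inputs table (4-bit NF4; r=16, lora_alpha=8,
# target_modules="all-linear"). Key names are assumptions.
quantize_config = {
    "load_in_4bit": True,          # 4-bit quantization
    "bnb_4bit_quant_type": "nf4",  # NF4 data type
}

lora_config = {
    "r": 16,
    "lora_alpha": 8,
    "target_modules": "all-linear",
}

# These dicts could then replace the booleans:
# model, tokenizer = trainer(..., quantize=quantize_config, lora=lora_config)
```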
Saving to Disk
from txtai.pipeline import HFTrainer
trainer = HFTrainer()
train = [
    {"text": "This is great!", "label": 1},
    {"text": "This is terrible.", "label": 0},
]
model, tokenizer = trainer(
    "bert-base-uncased",
    train,
    task="text-classification",
    output_dir="./my-fine-tuned-model",
    save_strategy="epoch",
    num_train_epochs=5,
)