Principle: NeuML txtai Base Model Configuration
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Training, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Base model configuration is the process of selecting, loading, and optionally modifying a pretrained transformer model so that it is ready for fine-tuning on a downstream task. This includes choosing the correct model architecture for the task, applying quantization for memory-efficient training, and injecting parameter-efficient adapters such as LoRA.
Description
Fine-tuning begins with a pretrained checkpoint. The practitioner must make several configuration decisions:
- Architecture selection -- different tasks require different model heads. A text-classification task needs a sequence classification head, while a question-answering task needs a span-extraction head. The mapping from task string to AutoModel class determines which architecture is instantiated.
- Label configuration -- for classification tasks, the model configuration must know the number of output labels so the classification head has the correct dimensionality.
- Quantization -- large models can be loaded in reduced precision (e.g., 4-bit via BitsAndBytes) to fit within GPU memory constraints. This trades a small amount of accuracy for a dramatic reduction in VRAM usage.
- LoRA adapter injection -- instead of updating all model parameters, a LoRA (Low-Rank Adaptation) adapter freezes the base weights and injects small trainable rank-decomposition matrices. This reduces the number of trainable parameters by orders of magnitude while preserving most of the model's capacity.
These decisions are interdependent: quantization is typically combined with LoRA (QLoRA pattern), and the choice of task determines both the model class and the LoRA task type.
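The QLoRA pattern can be sketched with the Hugging Face transformers and peft libraries. This is a minimal sketch, assuming those packages are installed; the checkpoint name, rank, and target module names below are illustrative placeholders, not prescribed values:

```python
# Sketch of the QLoRA pattern: frozen 4-bit base weights + trainable LoRA adapters.
# Checkpoint name and LoRA hyperparameters are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder checkpoint
    quantization_config=quantization_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                                    # rank of the update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # attention projections; model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",                  # must match the chosen task head
)
model = get_peft_model(model, lora_config)  # freezes base weights, injects adapters
```

Note how the task choice appears twice: once in the AutoModel class and once in the LoRA `task_type`, reflecting the interdependence described above.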
Usage
Base model configuration is needed whenever a practitioner wants to:
- Fine-tune a pretrained model on a new classification, QA, generation, or seq2seq task.
- Train a very large model on limited GPU hardware using 4-bit quantization with LoRA adapters.
- Resume training from an existing (model, tokenizer) tuple produced by a prior training run.
- Switch between different task heads (e.g., moving from text classification to causal language modeling) while reusing the same pretrained backbone.
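For the first scenario, label configuration and architecture selection come together when the head is instantiated. A minimal sketch, assuming the transformers library; the checkpoint name and label count are placeholders:

```python
# Sketch: set num_labels on the configuration before instantiating the
# classification head. Checkpoint name and label count are placeholders.
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("bert-base-uncased", num_labels=3)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", config=config
)  # the classification head is freshly initialized with 3 output units
```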
Theoretical Basis
The principle builds on transfer learning theory: a model pretrained on a large general corpus captures universal language representations that can be specialized for downstream tasks by replacing or adding a task-specific output head and fine-tuning on task-specific data.
Task-to-architecture mapping (pseudocode):
FUNCTION select_model(task, base_path, config, num_labels, quantization_config):
    IF num_labels IS NOT None:
        config.num_labels = num_labels
    SWITCH task:
        CASE "language-generation":
            RETURN AutoModelForCausalLM.from_pretrained(base_path, config, quantization_config)
        CASE "language-modeling":
            RETURN AutoModelForMaskedLM.from_pretrained(base_path, config, quantization_config)
        CASE "question-answering":
            RETURN AutoModelForQuestionAnswering.from_pretrained(base_path, config, quantization_config)
        CASE "sequence-sequence":
            RETURN AutoModelForSeq2SeqLM.from_pretrained(base_path, config, quantization_config)
        CASE "token-detection":
            RETURN TokenDetection(MaskedLM(base_path), PreTraining(base_path), tokenizer)
        DEFAULT:  # "text-classification"
            RETURN AutoModelForSequenceClassification.from_pretrained(base_path, config, quantization_config)
Quantization theory: Quantization maps 32-bit or 16-bit floating-point weights to lower-precision representations (8-bit, 4-bit). The NF4 (Normal Float 4-bit) data type, combined with double quantization, provides near-lossless compression for normally distributed transformer weights. When paired with LoRA, only the small adapter matrices are trained in full precision, while the frozen base weights remain in 4-bit form.
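The memory impact of reduced precision is simple arithmetic over bits per parameter. A sketch, with a 7B-parameter model as an illustrative size:

```python
def model_memory_gib(num_params, bits_per_param):
    """Approximate weight-storage footprint in GiB.

    Counts only the stored weights; activations, gradients, and optimizer
    state are ignored.
    """
    return num_params * bits_per_param / 8 / 1024**3

params = 7_000_000_000                 # a 7B-parameter model, for illustration
fp16_gib = model_memory_gib(params, 16)  # roughly 13 GiB
nf4_gib = model_memory_gib(params, 4)    # roughly 3.3 GiB, a 4x reduction
```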
LoRA theory: For a pretrained weight matrix W of dimension d x k, LoRA decomposes the update as W + delta_W = W + B * A where B is d x r and A is r x k with r << min(d, k). This reduces the trainable parameter count from d * k to r * (d + k) while preserving the model's expressive power.
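The parameter savings can be checked numerically. A sketch using a 4096 x 4096 projection, an illustrative size typical of attention layers in 7B-scale models:

```python
def lora_param_counts(d, k, r):
    """Full vs. LoRA trainable parameter counts for a d x k weight matrix."""
    full = d * k        # fine-tuning W directly
    lora = r * (d + k)  # training B (d x r) and A (r x k) instead
    return full, lora

# For a 4096 x 4096 matrix with rank r = 8, LoRA trains ~0.4% of the
# parameters: 65,536 adapter weights versus 16,777,216 full weights.
full, lora = lora_param_counts(4096, 4096, 8)
```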