Principle: NeuML txtai Base Model Configuration
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Training, NLP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Base model configuration is the process of selecting, loading, and optionally modifying a pretrained transformer model so that it is ready for fine-tuning on a downstream task. This includes choosing the correct model architecture for the task, applying quantization for memory-efficient training, and injecting parameter-efficient adapters such as LoRA.
Description
Fine-tuning begins with a pretrained checkpoint. The practitioner must make several configuration decisions:
- Architecture selection -- different tasks require different model heads. A text-classification task needs a sequence classification head, while a question-answering task needs a span-extraction head. The mapping from task string to AutoModel class determines which architecture is instantiated.
- Label configuration -- for classification tasks, the model configuration must know the number of output labels so the classification head has the correct dimensionality.
- Quantization -- large models can be loaded in reduced precision (e.g., 4-bit via BitsAndBytes) to fit within GPU memory constraints. This trades a small amount of accuracy for a dramatic reduction in VRAM usage.
- LoRA adapter injection -- instead of updating all model parameters, a LoRA (Low-Rank Adaptation) adapter freezes the base weights and injects small trainable rank-decomposition matrices. This reduces the number of trainable parameters by orders of magnitude while preserving most of the model's capacity.
These decisions are interdependent: quantization is typically combined with LoRA (QLoRA pattern), and the choice of task determines both the model class and the LoRA task type.
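The QLoRA pattern can be sketched with the Hugging Face transformers and peft libraries. This is a minimal sketch, assuming those packages are installed; the checkpoint name, rank, and target module names below are illustrative placeholders, not prescribed values:

```python
# Sketch of the QLoRA pattern: frozen 4-bit base weights + trainable LoRA adapters.
# Checkpoint name and LoRA hyperparameters are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder checkpoint
    quantization_config=quantization_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                                    # rank of the update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # attention projections; model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",                  # must match the chosen task head
)
model = get_peft_model(model, lora_config)  # freezes base weights, injects adapters
```

Note how the task choice appears twice: once in the AutoModel class and once in the LoRA `task_type`, reflecting the interdependence described above.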
Usage
Base model configuration is needed whenever a practitioner wants to:
- Fine-tune a pretrained model on a new classification, QA, generation, or seq2seq task.
- Train a very large model on limited GPU hardware using 4-bit quantization with LoRA adapters.
- Resume training from an existing (model, tokenizer) tuple produced by a prior training run.
- Switch between different task heads (e.g., moving from text classification to causal language modeling) while reusing the same pretrained backbone.
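For the first scenario, label configuration and architecture selection come together when the head is instantiated. A minimal sketch, assuming the transformers library; the checkpoint name and label count are placeholders:

```python
# Sketch: set num_labels on the configuration before instantiating the
# classification head. Checkpoint name and label count are placeholders.
from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("bert-base-uncased", num_labels=3)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", config=config
)  # the classification head is freshly initialized with 3 output units
```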
Theoretical Basis
The principle builds on transfer learning theory: a model pretrained on a large general corpus captures universal language representations that can be specialized for downstream tasks by replacing or adding a task-specific output head and fine-tuning on task-specific data.
Task-to-architecture mapping (pseudocode):
FUNCTION select_model(task, base_path, config, num_labels, quantization_config):
    IF num_labels IS NOT None:
        config.num_labels = num_labels
    SWITCH task:
        CASE "language-generation":
            RETURN AutoModelForCausalLM.from_pretrained(base_path, config, quantization_config)
        CASE "language-modeling":
            RETURN AutoModelForMaskedLM.from_pretrained(base_path, config, quantization_config)
        CASE "question-answering":
            RETURN AutoModelForQuestionAnswering.from_pretrained(base_path, config, quantization_config)
        CASE "sequence-sequence":
            RETURN AutoModelForSeq2SeqLM.from_pretrained(base_path, config, quantization_config)
        CASE "token-detection":
            RETURN TokenDetection(MaskedLM(base_path), PreTraining(base_path), tokenizer)
        DEFAULT:  # "text-classification"
            RETURN AutoModelForSequenceClassification.from_pretrained(base_path, config, quantization_config)
Quantization theory: Quantization maps 32-bit or 16-bit floating-point weights to lower-precision representations (8-bit, 4-bit). The NF4 (Normal Float 4-bit) data type, combined with double quantization, provides near-lossless compression for normally distributed transformer weights. When paired with LoRA, only the small adapter matrices are trained in full precision, while the frozen base weights remain in 4-bit form.
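The memory impact of reduced precision is simple arithmetic over bits per parameter. A sketch, with a 7B-parameter model as an illustrative size:

```python
def model_memory_gib(num_params, bits_per_param):
    """Approximate weight-storage footprint in GiB.

    Counts only the stored weights; activations, gradients, and optimizer
    state are ignored.
    """
    return num_params * bits_per_param / 8 / 1024**3

params = 7_000_000_000                 # a 7B-parameter model, for illustration
fp16_gib = model_memory_gib(params, 16)  # roughly 13 GiB
nf4_gib = model_memory_gib(params, 4)    # roughly 3.3 GiB, a 4x reduction
```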
LoRA theory: For a pretrained weight matrix W of dimension d x k, LoRA decomposes the update as W + delta_W = W + B * A where B is d x r and A is r x k with r << min(d, k). This reduces the trainable parameter count from d * k to r * (d + k) while preserving the model's expressive power.
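The parameter savings can be checked numerically. A sketch using a 4096 x 4096 projection, an illustrative size typical of attention layers in 7B-scale models:

```python
def lora_param_counts(d, k, r):
    """Full vs. LoRA trainable parameter counts for a d x k weight matrix."""
    full = d * k        # fine-tuning W directly
    lora = r * (d + k)  # training B (d x r) and A (r x k) instead
    return full, lora

# For a 4096 x 4096 matrix with rank r = 8, LoRA trains ~0.4% of the
# parameters: 65,536 adapter weights versus 16,777,216 full weights.
full, lora = lora_param_counts(4096, 4096, 8)
```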