Principle:FMInference FlexLLMGen Model Configuration Resolution
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen, Doc: OPT Paper |
| Domains | Model_Architecture, Configuration |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A configuration resolution mechanism that maps model name strings to architecture-specific hyperparameters (layer count, hidden size, attention heads) for the OPT model family.
Description
Large language models in the OPT family (125M to 175B parameters) share a common architecture but differ in dimensions. The configuration resolution system takes a model name string and returns a frozen dataclass with all architectural parameters needed for memory allocation, weight loading, and inference.
The resolution process handles several practical concerns:
- Name normalization -- Strips organization prefixes (e.g., "facebook/") to extract the base model name.
- Variant handling -- Recognizes OPT-IML (instruction-tuned) variants and maps them to the corresponding base architecture.
- Cross-family support -- Supports Galactica-30B in addition to the OPT model series.
- Override mechanism -- Allows callers to override individual config fields via keyword arguments.
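The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: `SketchConfig`, `_KNOWN`, and `resolve_config` are hypothetical names, the IML substrings (`-iml-max`, `-iml`) are assumptions about how variant names are spelled, and the dimensions are the ones quoted elsewhere in this document.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for OptConfig; field names follow this document."""
    name: str
    num_hidden_layers: int
    hidden_size: int
    n_head: int
    ffn_embed_dim: int


# Dimensions copied from the values listed in this document.
_KNOWN = {
    "opt-125m": dict(num_hidden_layers=12, hidden_size=768,
                     n_head=12, ffn_embed_dim=3072),
    "opt-30b": dict(num_hidden_layers=48, hidden_size=7168,
                    n_head=56, ffn_embed_dim=28672),
}


def resolve_config(name: str, **overrides) -> SketchConfig:
    # Name normalization: drop any organization prefix such as "facebook/".
    base = name.split("/")[-1].lower()
    # Variant handling (assumed spelling): "opt-iml-max-30b" -> "opt-30b".
    arch = base.replace("-iml-max", "").replace("-iml", "")
    if arch not in _KNOWN:
        raise ValueError(f"unknown model name: {name!r}")
    config = SketchConfig(name=base, **_KNOWN[arch])
    # Override mechanism: a frozen dataclass is copied, never mutated.
    return dataclasses.replace(config, **overrides)
```

Calling `resolve_config("facebook/opt-iml-max-30b", num_hidden_layers=2)` in this sketch would return the 30B dimensions with only the layer count replaced, which is a convenient pattern for small smoke tests.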
The key architectural parameters that vary across OPT model sizes include:
- num_hidden_layers -- Number of Transformer decoder layers (12 for 125M, 96 for 175B).
- hidden_size -- Dimensionality of hidden representations (768 for 125M, 12288 for 175B).
- n_head -- Number of attention heads (12 for 125M, 96 for 175B).
- ffn_embed_dim -- Feed-forward network intermediate dimension (3072 for 125M, 49152 for 175B).
Usage
Use get_opt_config() to resolve a model name (e.g., "facebook/opt-30b") into an OptConfig dataclass before initializing OptLM. The resolved configuration is essential for:
- Computing total memory requirements via the model_bytes, cache_bytes, and hidden_bytes utility methods.
- Allocating weight tensors with correct shapes.
- Configuring the number of layers in the inference loop.
- Setting up attention heads and hidden dimensions.
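To make the link between configuration fields and tensor shapes concrete, the sketch below derives per-layer weight shapes from three of the fields. It assumes the standard decoder layout (four square attention projections plus a two-layer feed-forward block) and the `(out_features, in_features)` shape convention; biases and layer norms are omitted. The function names are hypothetical and do not come from the repository.

```python
from typing import Dict, Tuple


def head_dim(hidden_size: int, n_head: int) -> int:
    """Per-head dimension; the hidden size must split evenly across heads."""
    if hidden_size % n_head != 0:
        raise ValueError("hidden_size must be divisible by n_head")
    return hidden_size // n_head


def decoder_layer_shapes(hidden_size: int,
                         ffn_embed_dim: int) -> Dict[str, Tuple[int, int]]:
    """Weight-matrix shapes for one decoder layer (assumed standard layout)."""
    h, f = hidden_size, ffn_embed_dim
    return {
        "q_proj": (h, h),
        "k_proj": (h, h),
        "v_proj": (h, h),
        "out_proj": (h, h),
        "fc1": (f, h),   # expands hidden states to the FFN dimension
        "fc2": (h, f),   # projects back to the hidden size
    }


# Example with the OPT-30B dimensions quoted in this document.
shapes_30b = decoder_layer_shapes(hidden_size=7168, ffn_embed_dim=28672)
head_dim_30b = head_dim(hidden_size=7168, n_head=56)
```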
Theoretical Basis
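OPT models follow a standard decoder-only Transformer architecture. Layer count, hidden size, and head count all grow with model size. In every configuration listed below, the feed-forward dimension is four times the hidden size, and the hidden size divides evenly across the attention heads (64 dimensions per head for OPT-125M and OPT-1.3B, 128 for the larger models):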
| Model | Parameters | Layers | Hidden Size | Heads | FFN Dim |
|---|---|---|---|---|---|
| OPT-125M | 125M | 12 | 768 | 12 | 3072 |
| OPT-1.3B | 1.3B | 24 | 2048 | 32 | 8192 |
| OPT-6.7B | 6.7B | 32 | 4096 | 32 | 16384 |
| OPT-30B | 30B | 48 | 7168 | 56 | 28672 |
| OPT-175B | 175B | 96 | 12288 | 96 | 49152 |
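The table rows can be sanity-checked with a little arithmetic. The sketch below verifies the two regularities visible in the table (FFN dimension equal to four times the hidden size, hidden size divisible by head count) and estimates decoder weight counts as `layers * (4 * hidden^2 + 2 * hidden * ffn)`. That estimate is a common back-of-the-envelope approximation that ignores embeddings, biases, and layer norms, so it undercounts the small models noticeably. All names here are illustrative.

```python
# (layers, hidden_size, n_head, ffn_embed_dim), copied from the table above.
TABLE = {
    "OPT-125M": (12, 768, 12, 3072),
    "OPT-1.3B": (24, 2048, 32, 8192),
    "OPT-6.7B": (32, 4096, 32, 16384),
    "OPT-30B": (48, 7168, 56, 28672),
    "OPT-175B": (96, 12288, 96, 49152),
}


def approx_decoder_params(layers: int, hidden: int, ffn: int) -> int:
    """Rough weight count: attention (4*h*h) plus FFN (2*h*ffn) per layer."""
    return layers * (4 * hidden * hidden + 2 * hidden * ffn)


for model, (layers, hidden, heads, ffn) in TABLE.items():
    assert ffn == 4 * hidden, model        # FFN is 4x the hidden size
    assert hidden % heads == 0, model      # heads split the hidden size evenly
    params = approx_decoder_params(layers, hidden, ffn)
    print(f"{model}: head_dim={hidden // heads}, ~{params / 1e9:.2f}B decoder weights")
```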
The config also provides utility methods for computing memory requirements:
- model_bytes() -- Total bytes for all model weight tensors.
- cache_bytes() -- Total bytes for the KV cache given a batch size and sequence length.
- hidden_bytes() -- Total bytes for hidden state activations given a batch size and sequence length.
These methods enable the cost model to determine optimal offloading policies via linear programming.
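As a rough illustration of what such estimates involve, the sketch below computes KV cache and hidden state sizes under stated assumptions: 2 bytes per element (fp16), one key tensor and one value tensor per layer, and hidden states sized for a single layer's activations. The function names are hypothetical and the formulas are approximations, not necessarily the exact expressions used by the repository's methods.

```python
BYTES_PER_ELEMENT = 2  # assumption: fp16 storage


def approx_cache_bytes(num_hidden_layers: int, hidden_size: int,
                       batch_size: int, seq_len: int) -> int:
    """KV cache size: keys and values for every layer and every token."""
    kv_tensors = 2  # one key tensor and one value tensor
    return (kv_tensors * num_hidden_layers * batch_size * seq_len
            * hidden_size * BYTES_PER_ELEMENT)


def approx_hidden_bytes(hidden_size: int, batch_size: int, seq_len: int) -> int:
    """Hidden-state size for one layer's activations."""
    return batch_size * seq_len * hidden_size * BYTES_PER_ELEMENT


# Example: OPT-30B dimensions from the table, batch of 4, 1024 tokens.
cache = approx_cache_bytes(48, 7168, batch_size=4, seq_len=1024)
hidden = approx_hidden_bytes(7168, batch_size=4, seq_len=1024)
print(f"KV cache: {cache / 2**30:.2f} GiB, hidden states: {hidden / 2**20:.2f} MiB")
```

Because the cache grows linearly with both batch size and sequence length while the weights stay fixed, estimates like these are what let a cost model compare placing each tensor class on GPU, CPU, or disk.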