Principle:FMInference FlexLLMGen Model Configuration Resolution

Field	Value
Sources	Repo: FlexLLMGen, Doc: OPT Paper
Domains	Model_Architecture, Configuration
Last Updated	2026-02-09 00:00 GMT

Overview

A configuration resolution mechanism that maps model name strings to architecture-specific hyperparameters (layer count, hidden size, attention heads) for the OPT model family.

Description

Large language models in the OPT family (125M to 175B parameters) share a common architecture but differ in dimensions. The configuration resolution system takes a model name string and returns a frozen dataclass with all architectural parameters needed for memory allocation, weight loading, and inference.

The resolution process handles several practical concerns:

Name normalization -- Strips organization prefixes (e.g., "facebook/") to extract the base model name.
Variant handling -- Recognizes OPT-IML (instruction-tuned) variants and maps them to the architecture of the corresponding base OPT model.
Cross-family support -- Supports Galactica-30B in addition to the OPT model series.
Override mechanism -- Allows callers to override individual config fields via keyword arguments.
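
The resolution steps listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the repository's code: SketchConfig, _ARCHS, and resolve_config are hypothetical names, only two architectures are included, and the "-iml-max" and "-iml" name markers are assumptions about how OPT-IML checkpoints are named.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the real config dataclass."""
    name: str
    num_hidden_layers: int
    hidden_size: int
    n_head: int
    ffn_embed_dim: int


# Illustrative subset of architectures; values follow the OPT size table.
_ARCHS = {
    "opt-125m": dict(num_hidden_layers=12, hidden_size=768,
                     n_head=12, ffn_embed_dim=3072),
    "opt-30b": dict(num_hidden_layers=48, hidden_size=7168,
                    n_head=56, ffn_embed_dim=28672),
}


def resolve_config(name: str, **overrides) -> SketchConfig:
    # Name normalization: drop an organization prefix such as "facebook/".
    base = name.split("/")[-1]

    # Variant handling: map instruction-tuned names onto the base architecture.
    # The longer marker is checked first so "-iml-max" is removed as a whole.
    arch = base
    for marker in ("-iml-max", "-iml"):
        if marker in arch:
            arch = arch.replace(marker, "")
            break

    if arch not in _ARCHS:
        raise ValueError(f"Unknown model name: {name}")

    config = SketchConfig(name=base, **_ARCHS[arch])

    # Override mechanism: a frozen dataclass cannot be mutated, so build a
    # new instance with the requested fields replaced.
    return dataclasses.replace(config, **overrides)
```

Under these assumptions, resolve_config("facebook/opt-iml-max-30b") yields the OPT-30B dimensions while keeping the variant name, and resolve_config("opt-125m", num_hidden_layers=2) returns a shortened model that is convenient for quick tests.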

The key architectural parameters that vary across OPT model sizes include:

num_hidden_layers -- Number of Transformer decoder layers (12 for 125M, 96 for 175B).
hidden_size -- Dimensionality of hidden representations (768 for 125M, 12288 for 175B).
n_head -- Number of attention heads (12 for 125M, 96 for 175B).
ffn_embed_dim -- Feed-forward network intermediate dimension (3072 for 125M, 49152 for 175B).

Usage

Use get_opt_config() to resolve a model name (e.g., "facebook/opt-30b") into an OptConfig dataclass before initializing OptLM. The resolved configuration is essential for:

Computing total memory requirements (model_bytes, cache_bytes, hidden_bytes utility methods).
Allocating weight tensors with correct shapes.
Configuring the number of layers in the inference loop.
Setting up attention heads and hidden dimensions.
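
To illustrate how a resolved configuration drives allocation, the sketch below derives per-layer weight shapes from the config fields. It assumes the standard decoder layout (four attention projections and a two-matrix feed-forward block, stored as (out_features, in_features)) and omits biases and layer norms. layer_weight_shapes is a hypothetical helper, and a SimpleNamespace stands in for the resolved OptConfig.

```python
from types import SimpleNamespace


def layer_weight_shapes(config):
    """Return (out_features, in_features) shapes for one decoder layer.

    Assumption: standard decoder layout; biases and layer norms are omitted.
    """
    h, ffn = config.hidden_size, config.ffn_embed_dim
    return {
        "q_proj": (h, h),
        "k_proj": (h, h),
        "v_proj": (h, h),
        "out_proj": (h, h),
        "fc1": (ffn, h),  # expands hidden states to the FFN dimension
        "fc2": (h, ffn),  # projects back to the hidden size
    }


# Stand-in for a resolved config; values are the OPT-30B dimensions.
opt_30b = SimpleNamespace(num_hidden_layers=48, hidden_size=7168,
                          n_head=56, ffn_embed_dim=28672)

# One shape dictionary per decoder layer; its length also bounds the layer loop.
all_layers = [layer_weight_shapes(opt_30b)
              for _ in range(opt_30b.num_hidden_layers)]
```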

Theoretical Basis

OPT models follow a standard decoder-only Transformer architecture. Layer count, hidden size, and head count all grow with model size, while the FFN dimension stays at four times the hidden size in every configuration listed here:

Model	Parameters	Layers	Hidden Size	Heads	FFN Dim
OPT-125M	125M	12	768	12	3072
OPT-1.3B	1.3B	24	2048	32	8192
OPT-6.7B	6.7B	32	4096	32	16384
OPT-30B	30B	48	7168	56	28672
OPT-175B	175B	96	12288	96	49152
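
The ratios implied by the table can be checked directly. The sketch below copies the rows above and derives the FFN-to-hidden ratio and the per-head dimension (hidden size divided by head count) for each model. OPT_SIZES and derived_dims are hypothetical names used only for this illustration.

```python
# (model, layers, hidden_size, heads, ffn_dim), copied from the table above.
OPT_SIZES = [
    ("OPT-125M", 12, 768, 12, 3072),
    ("OPT-1.3B", 24, 2048, 32, 8192),
    ("OPT-6.7B", 32, 4096, 32, 16384),
    ("OPT-30B", 48, 7168, 56, 28672),
    ("OPT-175B", 96, 12288, 96, 49152),
]


def derived_dims(hidden_size, heads, ffn_dim):
    """Derive quantities that are implied by, but not listed in, the table."""
    if hidden_size % heads != 0:
        raise ValueError("hidden_size must be divisible by the head count")
    return {
        "ffn_ratio": ffn_dim / hidden_size,
        "head_dim": hidden_size // heads,
    }


summary = {name: derived_dims(h, n, f) for name, _, h, n, f in OPT_SIZES}

for name, dims in summary.items():
    print(f"{name}: ffn_ratio={dims['ffn_ratio']}, head_dim={dims['head_dim']}")
```

For these rows the ratio is 4 throughout, and the per-head dimension is 64 for the two smallest models and 128 for the rest.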

The config also provides utility methods for computing memory requirements:

model_bytes() -- Total bytes for all model weight tensors.
cache_bytes() -- Total bytes for the KV cache given a batch size and sequence length.
hidden_bytes() -- Total bytes for hidden state activations given a batch size and sequence length.

These methods enable the cost model to determine optimal offloading policies via linear programming.
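
As a rough illustration of what such estimates involve, the sketch below computes KV cache, hidden state, and approximate weight sizes from the architectural parameters. It assumes 2-byte (fp16) elements and counts only the large weight matrices. The function names are hypothetical, and the exact formulas in the repository's model_bytes(), cache_bytes(), and hidden_bytes() may differ in detail.

```python
def kv_cache_bytes(num_layers, hidden_size, batch_size, seq_len,
                   bytes_per_elem=2):
    # Each layer stores one key and one value tensor of shape
    # (batch_size, seq_len, hidden_size); hence the leading factor of 2.
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem


def hidden_state_bytes(hidden_size, batch_size, seq_len, bytes_per_elem=2):
    # A single activation tensor of shape (batch_size, seq_len, hidden_size).
    return batch_size * seq_len * hidden_size * bytes_per_elem


def approx_weight_bytes(num_layers, hidden_size, ffn_dim, bytes_per_elem=2):
    # Matrices only: four attention projections (h x h) plus two FFN
    # matrices (h x ffn). Biases, layer norms, and embeddings are ignored.
    per_layer = 4 * hidden_size * hidden_size + 2 * hidden_size * ffn_dim
    return num_layers * per_layer * bytes_per_elem


GIB = 2 ** 30

# OPT-30B dimensions with an assumed batch size of 4 and sequence length 1024.
cache = kv_cache_bytes(48, 7168, batch_size=4, seq_len=1024)
hidden = hidden_state_bytes(7168, batch_size=4, seq_len=1024)
weights = approx_weight_bytes(48, 7168, 28672)

print(f"KV cache: {cache / GIB:.2f} GiB")
print(f"hidden:   {hidden / 2 ** 20:.0f} MiB")
print(f"weights:  {weights / GIB:.1f} GiB (matrices only)")
```

With these inputs the KV cache comes to 5.25 GiB and the hidden states to 56 MiB. These are the kinds of quantities a cost model compares against GPU, CPU, and disk capacity when choosing an offloading policy.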

Related Pages

Implementation:FMInference_FlexLLMGen_Get_Opt_Config
