Principle:FMInference FlexLLMGen Model Configuration Resolution

From Leeroopedia


Field | Value
Sources | Repo: FlexLLMGen, Doc: OPT Paper
Domains | Model_Architecture, Configuration
Last Updated | 2026-02-09 00:00 GMT

Overview

A configuration resolution mechanism that maps model name strings to architecture-specific hyperparameters (layer count, hidden size, attention heads) for the OPT model family.

Description

Large language models in the OPT family (125M to 175B parameters) share a common architecture but differ in dimensions. The configuration resolution system takes a model name string and returns a frozen dataclass with all architectural parameters needed for memory allocation, weight loading, and inference.

The resolution process handles several practical concerns:

  • Name normalization -- Strips organization prefixes (e.g., "facebook/") to extract the base model name.
  • Variant handling -- Recognizes IML (instruction-tuned) variants and maps them to the correct base architecture.
  • Cross-family support -- Supports Galactica-30B in addition to the OPT model series.
  • Override mechanism -- Allows callers to override individual config fields via keyword arguments.
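The resolution steps above can be sketched in a few lines. This is an illustrative sketch, not the FlexLLMGen source: SketchConfig, resolve_config, and the _ARCHS table are hypothetical names, the "-iml" and "-iml-max" suffixes are an assumed spelling of the IML variants, and only two architectures are listed (dimensions taken from the table under Theoretical Basis).

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the frozen config dataclass."""
    name: str
    num_hidden_layers: int
    hidden_size: int
    n_head: int
    ffn_embed_dim: int


# Assumed lookup table: (layers, hidden size, heads, FFN dim) per base architecture.
_ARCHS = {
    "opt-125m": (12, 768, 12, 3072),
    "opt-30b": (48, 7168, 56, 28672),
}


def resolve_config(name, **overrides):
    # Name normalization: drop an organization prefix such as "facebook/".
    base = name.split("/")[-1].lower()

    # Variant handling: map IML variants onto the base architecture.
    # The longer suffix is removed first so "-iml-max" is not split in two.
    arch = base
    for suffix in ("-iml-max", "-iml"):
        arch = arch.replace(suffix, "")

    if arch not in _ARCHS:
        raise ValueError(f"Invalid model name: {name}")

    layers, hidden, heads, ffn = _ARCHS[arch]
    config = SketchConfig(base, layers, hidden, heads, ffn)

    # Override mechanism: a frozen dataclass cannot be mutated, so build
    # a modified copy with the caller's keyword arguments applied.
    return dataclasses.replace(config, **overrides)


cfg = resolve_config("facebook/opt-iml-max-30b", num_hidden_layers=2)
print(cfg)
```

Returning a copy through dataclasses.replace keeps the config immutable while still letting a caller shrink a model, for example to two layers for a quick smoke test.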

The key architectural parameters that vary across OPT model sizes include:

  • num_hidden_layers -- Number of Transformer decoder layers (12 for 125M, 96 for 175B).
  • hidden_size -- Dimensionality of hidden representations (768 for 125M, 12288 for 175B).
  • n_head -- Number of attention heads (12 for 125M, 96 for 175B).
  • ffn_embed_dim -- Feed-forward network intermediate dimension (3072 for 125M, 49152 for 175B).

Usage

Use get_opt_config() to resolve a model name (e.g., "facebook/opt-30b") into an OptConfig dataclass before initializing OptLM. The resolved configuration is essential for:

  • Computing total memory requirements via the model_bytes, cache_bytes, and hidden_bytes utility methods.
  • Allocating weight tensors with correct shapes.
  • Configuring the number of layers in the inference loop.
  • Setting up attention heads and hidden dimensions.
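As a rough illustration of how a resolved configuration drives tensor allocation, the sketch below derives per-layer weight shapes and the per-head dimension from the OPT-30B values. The helper layer_weight_shapes, the projection names, and the (out_features, in_features) shape convention are assumptions made for this example, not FlexLLMGen's actual layout; a SimpleNamespace stands in for the real OptConfig.

```python
from types import SimpleNamespace

# Stand-in for a resolved config; values are the OPT-30B row of the table.
cfg = SimpleNamespace(
    num_hidden_layers=48,
    hidden_size=7168,
    n_head=56,
    ffn_embed_dim=28672,
)


def layer_weight_shapes(cfg):
    """Weight-matrix shapes for one decoder layer (biases omitted)."""
    h, f = cfg.hidden_size, cfg.ffn_embed_dim
    return {
        "q_proj": (h, h),
        "k_proj": (h, h),
        "v_proj": (h, h),
        "out_proj": (h, h),
        "fc1": (f, h),  # hidden -> FFN
        "fc2": (h, f),  # FFN -> hidden
    }


shapes = layer_weight_shapes(cfg)

# Each attention head works on an equal slice of the hidden dimension.
head_dim = cfg.hidden_size // cfg.n_head

# Matrix parameters in one layer, then across the whole layer stack.
per_layer_params = sum(rows * cols for rows, cols in shapes.values())
total_matrix_params = per_layer_params * cfg.num_hidden_layers

print(head_dim, per_layer_params, total_matrix_params)
```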

Theoretical Basis

OPT models follow a standard decoder-only Transformer architecture. Layer count and hidden size grow with parameter count, and in every configuration listed below the FFN dimension is four times the hidden size:

Model | Parameters | Layers | Hidden Size | Heads | FFN Dim
OPT-125M | 125M | 12 | 768 | 12 | 3072
OPT-1.3B | 1.3B | 24 | 2048 | 32 | 8192
OPT-6.7B | 6.7B | 32 | 4096 | 32 | 16384
OPT-30B | 30B | 48 | 7168 | 56 | 28672
OPT-175B | 175B | 96 | 12288 | 96 | 49152
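The relationships among the table columns can be checked with a short script. The rows are copied from the table; the names ROWS, ffn_ratios, and head_dims are introduced only for this example.

```python
# (model, layers, hidden_size, heads, ffn_dim), copied from the table above.
ROWS = [
    ("OPT-125M", 12, 768, 12, 3072),
    ("OPT-1.3B", 24, 2048, 32, 8192),
    ("OPT-6.7B", 32, 4096, 32, 16384),
    ("OPT-30B", 48, 7168, 56, 28672),
    ("OPT-175B", 96, 12288, 96, 49152),
]

# Ratio of FFN width to hidden width for each model.
ffn_ratios = {model: ffn / hidden for model, _, hidden, _, ffn in ROWS}

# Per-head dimension: the hidden size split evenly across attention heads.
head_dims = {model: hidden // heads for model, _, hidden, heads, _ in ROWS}

for model, _, hidden, heads, _ in ROWS:
    # The split must be exact, otherwise heads would have unequal widths.
    assert hidden % heads == 0
    print(f"{model}: ffn/hidden = {ffn_ratios[model]}, head_dim = {head_dims[model]}")
```

With these rows, the per-head dimension comes out to 64 for the two smallest models and 128 for the other three.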

The config also provides utility methods for computing memory requirements:

  • model_bytes() -- Total bytes for all model weight tensors.
  • cache_bytes() -- Total bytes for the KV cache given a batch size and sequence length.
  • hidden_bytes() -- Total bytes for hidden state activations given a batch size and sequence length.

These methods enable the cost model to determine optimal offloading policies via linear programming.
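To give a sense of the magnitudes involved, the sketch below estimates the three quantities under stated assumptions. The function names mirror the methods above, but the bodies are not taken from FlexLLMGen: they assume fp16 storage (2 bytes per element), a KV cache holding one key and one value vector of size hidden_size per token per layer, and a standard decoder-layer parameter count. The vocabulary size (50272) and position-table size (2050) are assumptions based on the public OPT checkpoints.

```python
BYTES_FP16 = 2  # assumption: weights, cache, and activations stored in fp16


def cache_bytes(num_layers, hidden_size, batch_size, seq_len):
    # Factor 2: one key and one value vector per token per layer.
    return 2 * num_layers * batch_size * seq_len * hidden_size * BYTES_FP16


def hidden_bytes(hidden_size, batch_size, seq_len):
    # One hidden-state vector per token.
    return batch_size * seq_len * hidden_size * BYTES_FP16


def model_bytes(num_layers, hidden_size, ffn_dim,
                vocab_size=50272, max_positions=2050):
    h, f = hidden_size, ffn_dim
    attn = 4 * (h * h + h)           # q, k, v, out projections with biases
    mlp = h * f + f + f * h + h      # two linear layers with biases
    norms = 2 * 2 * h                # two layer norms (scale and shift)
    embed = vocab_size * h + max_positions * h
    final_norm = 2 * h
    params = num_layers * (attn + mlp + norms) + embed + final_norm
    return BYTES_FP16 * params


# OPT-30B dimensions from the table, batch size 1, sequence length 2048.
kv = cache_bytes(48, 7168, batch_size=1, seq_len=2048)
hid = hidden_bytes(7168, batch_size=1, seq_len=2048)
weights = model_bytes(48, 7168, 28672)

print(f"KV cache: {kv / 2**30:.3f} GiB")      # 2.625 GiB
print(f"hidden:   {hid / 2**20:.1f} MiB")     # 28.0 MiB
print(f"weights:  {weights / 1e9:.1f} GB")
```

Under these assumptions the weights dominate at batch size 1, but the KV cache grows linearly with both batch size and sequence length, which is the trade-off an offloading cost model has to weigh.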
