Implementation: Turboderp org Exllamav2 Model Init
| Knowledge Sources | |
|---|---|
| Domains | CLI, Configuration, Utilities |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete utility in exllamav2 for CLI-based model loading that encapsulates the full initialization sequence behind argparse arguments.
Description
model_init is a utility module that provides two key functions:
- add_args(parser): Registers all model-related command-line arguments with an argparse parser. This includes the model directory path, context length, GPU split configuration, RoPE scaling, cache type, flash attention toggle, and other model loading options.
- init(args): Takes parsed argparse arguments and executes the complete model loading pipeline: config preparation, model construction, cache allocation (with appropriate quantization), weight loading (with auto-split or explicit GPU mapping), and tokenizer initialization.
The module acts as a facade over the individual loading steps (ExLlamaV2Config, ExLlamaV2, ExLlamaV2Cache, load_autosplit, ExLlamaV2Tokenizer), providing a standardized and concise interface for CLI tools.
Usage
Use model_init in any CLI script that needs to load a model:
- Create an argparse parser
- Call model_init.add_args(parser) to register model arguments
- Parse command-line arguments
- Call model_init.init(args) to load the model
- Use the returned model and tokenizer for inference
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/model_init.py
- Lines: L11-29 (add_args), L82-156 (init)
Signature
```python
def add_args(parser: argparse.ArgumentParser):
    """
    Add model-related arguments to an argparse parser.

    Adds arguments including:
        -m,  --model_dir   : Path to model directory
        -l,  --length      : Context length override
        -gs, --gpu_split   : VRAM per GPU (comma-separated)
        -rs, --rope_scale  : RoPE scaling factor
        -ra, --rope_alpha  : RoPE NTK alpha
        -nfa               : Disable flash attention
        -ct, --cache_type  : Cache quantization type
        ...and more
    """
    ...

def init(
    args: argparse.Namespace,
    quiet: bool = False,
    allow_auto_split: bool = False,
    skip_load: bool = False,
    benchmark: bool = False,
    max_batch_size: int | None = None,
    max_input_len: int | None = None,
    max_output_len: int | None = None,
    progress: bool = False,
) -> tuple[ExLlamaV2, ExLlamaV2Tokenizer]:
    ...
```
Import
```python
from exllamav2 import model_init
```
I/O Contract
add_args Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| parser | argparse.ArgumentParser | Yes | Argument parser to add model arguments to |
add_args Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Model-related arguments are registered on the parser |
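To make the registration concrete, here is a plain-argparse sketch that mirrors the flags named in the signature's docstring. This is a hypothetical stand-in, not the module's actual definitions: the helper name `add_model_args`, the types, the defaults, and the `--no_flash_attn` long form are all assumptions for illustration.

```python
import argparse

# Hypothetical mirror of model_init.add_args; flag names come from the
# docstring above, but types and defaults here are illustrative guesses.
def add_model_args(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("-m", "--model_dir", type=str, help="Path to model directory")
    parser.add_argument("-l", "--length", type=int, default=None, help="Context length override")
    parser.add_argument("-gs", "--gpu_split", type=str, default=None, help="VRAM per GPU (comma-separated)")
    parser.add_argument("-rs", "--rope_scale", type=float, default=None, help="RoPE scaling factor")
    parser.add_argument("-ra", "--rope_alpha", type=float, default=None, help="RoPE NTK alpha")
    parser.add_argument("-nfa", "--no_flash_attn", action="store_true", help="Disable flash attention")
    parser.add_argument("-ct", "--cache_type", type=str, default="FP16", help="Cache quantization type")

parser = argparse.ArgumentParser()
add_model_args(parser)
args = parser.parse_args(["-m", "/models/llama3", "-l", "4096", "-ct", "q4"])
print(args.model_dir, args.length, args.cache_type)  # → /models/llama3 4096 q4
```

Because all flags are registered on the caller's parser, a script can freely mix its own arguments (as the usage examples below do with `-p/--prompt`) with the model-loading ones.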
init Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | argparse.Namespace | Yes | Parsed command-line arguments (must include model_dir at minimum) |
| quiet | bool | No (default False) | Suppress loading messages |
| allow_auto_split | bool | No (default False) | Enable automatic multi-GPU distribution |
| skip_load | bool | No (default False) | Initialize config and model objects but do not load weights |
| benchmark | bool | No (default False) | Enable benchmark mode with timing instrumentation |
| max_batch_size | int or None | No (default None) | Override maximum batch size |
| max_input_len | int or None | No (default None) | Override maximum input length |
| max_output_len | int or None | No (default None) | Override maximum output length |
| progress | bool | No (default False) | Show progress bar during model loading |
init Outputs
| Name | Type | Description |
|---|---|---|
| model | ExLlamaV2 | Fully loaded model instance with weights on GPU(s) |
| tokenizer | ExLlamaV2Tokenizer | Initialized tokenizer ready for encoding/decoding |
Usage Examples
Minimal CLI Script
```python
import argparse

from exllamav2 import model_init, ExLlamaV2Cache
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler

# Set up argument parser
parser = argparse.ArgumentParser(description="Simple text generator")
model_init.add_args(parser)
parser.add_argument("-p", "--prompt", type=str, required=True, help="Input prompt")
args = parser.parse_args()

# Load model and tokenizer in one call
model, tokenizer = model_init.init(args, allow_auto_split=True)

# Allocate the cache; with allow_auto_split the model may come back
# unloaded, in which case a lazy cache drives the auto-split load
if not model.loaded:
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)
else:
    cache = ExLlamaV2Cache(model)

generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    tokenizer=tokenizer,
)

output = generator.generate(
    prompt=args.prompt,
    max_new_tokens=200,
    gen_settings=ExLlamaV2Sampler.Settings(),
    add_bos=True,
)

print(output)
```
Command-Line Usage
```bash
# Basic usage:
python my_script.py -m /path/to/model -p "Once upon a time"

# With context length override:
python my_script.py -m /path/to/model -l 4096 -p "Hello"

# With GPU split (16GB on GPU0, 24GB on GPU1):
python my_script.py -m /path/to/model -gs 16,24 -p "Hello"

# With Q4 cache for memory savings:
python my_script.py -m /path/to/model -ct q4 -p "Hello"

# Disable flash attention:
python my_script.py -m /path/to/model -nfa -p "Hello"
```
Chat Application Template
```python
import argparse

from exllamav2 import model_init, ExLlamaV2Cache
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

parser = argparse.ArgumentParser(description="Chat with a model")
model_init.add_args(parser)
args = parser.parse_args()

model, tokenizer = model_init.init(args, allow_auto_split=True, progress=True)

# Allocate the cache, finishing a deferred auto-split load if necessary
if not model.loaded:
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)
else:
    cache = ExLlamaV2Cache(model)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

while True:
    user_input = input("User: ")
    if user_input.lower() in ("quit", "exit"):
        break
    prompt = f"User: {user_input}\nAssistant:"
    input_ids = tokenizer.encode(prompt, add_bos=True)
    generator.begin_stream_ex(input_ids, settings)
    print("Assistant: ", end="", flush=True)
    while True:
        result = generator.stream_ex()
        print(result["chunk"], end="", flush=True)
        if result["eos"]:
            break
    print()
```