Implementation: AutoModelForCausalLM.from_pretrained with bitsandbytes (LLMBook-zh)
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Compression, Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for loading models with bitsandbytes 8-bit or 4-bit quantization via HuggingFace Transformers.
Description
Calling AutoModelForCausalLM.from_pretrained with load_in_8bit=True or load_in_4bit=True loads the model weights with bitsandbytes quantization, roughly halving or quartering the memory footprint relative to fp16. The device_map="auto" argument distributes layers across available GPUs (spilling to CPU if necessary).
This is a Wrapper Doc documenting how the LLMBook repository uses bitsandbytes quantization.
Usage
Use this to load models that exceed available GPU memory at full precision.
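To see why this matters, here is a rough weights-only estimate (a back-of-the-envelope sketch; real usage also needs memory for activations, the KV cache, and quantization constants) for a 13B-parameter model such as YuLan-Chat-2-13b:

```python
def weights_gb(n_params: float, bits_per_param: float) -> float:
    """Rough weights-only memory footprint in GB (ignores activations,
    KV cache, and per-block quantization overhead)."""
    return n_params * bits_per_param / 8 / 1e9

n = 13e9  # ~13B parameters
print(f"fp16:  {weights_gb(n, 16):.1f} GB")  # 26.0 GB
print(f"int8:  {weights_gb(n, 8):.1f} GB")   # 13.0 GB
print(f"4-bit: {weights_gb(n, 4):.1f} GB")   # 6.5 GB
```

At fp16 the weights alone exceed a single 24 GB consumer GPU; 8-bit fits comfortably and 4-bit leaves headroom for inference.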
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/9.3 bitsandbytes实践.py
- Lines: 1-12
Signature
# 8-bit quantization
model_8bit = AutoModelForCausalLM.from_pretrained(
name: str,
device_map: str = "auto",
load_in_8bit: bool = True
)
# 4-bit quantization
model_4bit = AutoModelForCausalLM.from_pretrained(
name: str,
device_map: str = "auto",
load_in_4bit: bool = True
)
Import
from transformers import AutoModelForCausalLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Model ID (e.g., "yulan-team/YuLan-Chat-2-13b-fp16") |
| device_map | str | No | Device placement ("auto") |
| load_in_8bit | bool | No | Enable 8-bit quantization (mutually exclusive with load_in_4bit) |
| load_in_4bit | bool | No | Enable 4-bit quantization (mutually exclusive with load_in_8bit) |
Outputs
| Name | Type | Description |
|---|---|---|
| return | PreTrainedModel | Quantized model loaded on GPU with reduced memory |
Usage Examples
import torch
from transformers import AutoModelForCausalLM
name = "yulan-team/YuLan-Chat-2-13b-fp16"
# 8-bit loading
model_8bit = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)
print(f"8-bit memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

# Free the 8-bit model first; otherwise the next reading reports the
# cumulative allocation of both models, not the 4-bit footprint alone.
del model_8bit
torch.cuda.empty_cache()

# 4-bit loading
model_4bit = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_4bit=True)
print(f"4-bit memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")
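Note that recent transformers releases deprecate the bare load_in_8bit/load_in_4bit flags in favor of a BitsAndBytesConfig passed as quantization_config. A hedged sketch of the equivalent 4-bit call (the quant-type and compute-dtype values below are illustrative choices, not taken from the repository):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit settings; adjust to your hardware and model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # dtype used for matmuls
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "yulan-team/YuLan-Chat-2-13b-fp16",
    device_map="auto",
    quantization_config=bnb_config,
)
```

The config object makes the quantization settings explicit and keeps them in one place, which is why the flag-based form was deprecated.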
Related Pages
Implements Principle
Requires Environment