
Implementation:LLMBook-zh (llmbook-zh.github.io): AutoModelForCausalLM.from_pretrained with bitsandbytes

From Leeroopedia


Knowledge Sources
Domains: Deep_Learning, Model_Compression, Inference
Last Updated: 2026-02-08 00:00 GMT

Overview

Concrete tool for loading models with bitsandbytes 8-bit or 4-bit quantization via HuggingFace Transformers.

Description

Calling AutoModelForCausalLM.from_pretrained with load_in_8bit=True or load_in_4bit=True loads the model with bitsandbytes quantization (the bitsandbytes package must be installed). The device_map="auto" flag lets Accelerate distribute layers across the available GPUs, offloading to CPU if they do not fit.
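Note that recent Transformers releases deprecate the bare load_in_8bit / load_in_4bit flags in favor of passing a BitsAndBytesConfig through quantization_config. A minimal sketch of the equivalent 4-bit call; the NF4 quantization type and fp16 compute dtype are common assumed defaults, not settings taken from the LLMBook script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Equivalent 4-bit setup via the newer quantization_config API.
# NF4 quant type and fp16 compute dtype are illustrative choices.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "yulan-team/YuLan-Chat-2-13b-fp16",
    device_map="auto",
    quantization_config=bnb_config,
)
```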

This is a Wrapper Doc documenting how the LLMBook repository uses bitsandbytes quantization.

Usage

Use this to load models that exceed available GPU memory at full precision.
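As a rough back-of-envelope check of why this helps (weights only, ignoring activations, the KV cache, and quantization overhead), the footprint of a 13B-parameter model shrinks with the bits per parameter:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-only memory footprint in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

N = 13e9  # parameter count of a 13B model
print(weight_memory_gb(N, 16))  # fp16 baseline: 26.0 GB
print(weight_memory_gb(N, 8))   # 8-bit:        13.0 GB
print(weight_memory_gb(N, 4))   # 4-bit:         6.5 GB
```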

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/9.3 bitsandbytes实践.py
  • Lines: 1-12

Signature

# 8-bit quantization
model_8bit = AutoModelForCausalLM.from_pretrained(
    name,                # str: model ID or local path
    device_map="auto",   # str, optional: automatic device placement
    load_in_8bit=True,   # bool: enable 8-bit quantization
)

# 4-bit quantization
model_4bit = AutoModelForCausalLM.from_pretrained(
    name,                # str: model ID or local path
    device_map="auto",   # str, optional: automatic device placement
    load_in_4bit=True,   # bool: enable 4-bit quantization
)

Import

from transformers import AutoModelForCausalLM

I/O Contract

Inputs

Name Type Required Description
name str Yes Model ID (e.g., "yulan-team/YuLan-Chat-2-13b-fp16")
device_map str No Device placement ("auto")
load_in_8bit bool No Enable 8-bit quantization
load_in_4bit bool No Enable 4-bit quantization

Outputs

Name Type Description
return PreTrainedModel Quantized model loaded on GPU with reduced memory

Usage Examples

import torch
from transformers import AutoModelForCausalLM

name = "yulan-team/YuLan-Chat-2-13b-fp16"

# 8-bit loading
model_8bit = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)
# Note: memory_allocated() reports only the current CUDA device; with
# device_map="auto" on multiple GPUs, sum over devices for the full picture.
print(f"8-bit memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

# Free the 8-bit model first so the 4-bit measurement is not inflated
# by the 8-bit model's allocations still resident in the same process.
del model_8bit
torch.cuda.empty_cache()

# 4-bit loading
model_4bit = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_4bit=True)
print(f"4-bit memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")
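The torch.cuda.memory_allocated() probe above is device-local. For the model's own size, Transformers exposes PreTrainedModel.get_memory_footprint(), which returns the parameter-plus-buffer size in bytes regardless of which GPUs hold the layers. A small formatting helper for such printouts (fmt_gb is a hypothetical name introduced here for illustration):

```python
def fmt_gb(n_bytes: int) -> str:
    """Render a byte count as gigabytes for quick comparisons."""
    return f"{n_bytes / 1e9:.2f} GB"

# With a loaded model, get_memory_footprint() covers all devices:
#     print(fmt_gb(model_4bit.get_memory_footprint()))
print(fmt_gb(6_500_000_000))  # roughly what a 4-bit 13B model weighs in at
```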

Related Pages

Implements Principle

Requires Environment
