
Implementation:LLMBook-zh (llmbook-zh.github.io): AutoModelForCausalLM.from_pretrained with bitsandbytes

From Leeroopedia


Knowledge Sources
Domains: Deep_Learning, Model_Compression, Inference
Last Updated: 2026-02-08 00:00 GMT

Overview

Concrete tool for loading models with bitsandbytes 8-bit or 4-bit quantization via HuggingFace Transformers.

Description

Calling AutoModelForCausalLM.from_pretrained with load_in_8bit=True or load_in_4bit=True loads the model with bitsandbytes quantization (the bitsandbytes package must be installed). The device_map="auto" flag lets Accelerate distribute layers across the available GPUs, offloading to CPU if they do not fit.
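Note that recent Transformers releases deprecate the bare load_in_8bit / load_in_4bit flags in favor of passing a BitsAndBytesConfig through quantization_config. A minimal sketch of the equivalent 4-bit call; the NF4 quantization type and fp16 compute dtype are common assumed defaults, not settings taken from the LLMBook script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Equivalent 4-bit setup via the newer quantization_config API.
# NF4 quant type and fp16 compute dtype are illustrative choices.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "yulan-team/YuLan-Chat-2-13b-fp16",
    device_map="auto",
    quantization_config=bnb_config,
)
```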

This is a Wrapper Doc documenting how the LLMBook repository uses bitsandbytes quantization.

Usage

Use this to load models that exceed available GPU memory at full precision.
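As a rough back-of-envelope check of why this helps (weights only, ignoring activations, the KV cache, and quantization overhead), the footprint of a 13B-parameter model shrinks with the bits per parameter:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-only memory footprint in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

N = 13e9  # parameter count of a 13B model
print(weight_memory_gb(N, 16))  # fp16 baseline: 26.0 GB
print(weight_memory_gb(N, 8))   # 8-bit:        13.0 GB
print(weight_memory_gb(N, 4))   # 4-bit:         6.5 GB
```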

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/9.3 bitsandbytes实践.py
  • Lines: 1-12

Signature

# 8-bit quantization
model_8bit = AutoModelForCausalLM.from_pretrained(
    name,                # str: model ID or local path
    device_map="auto",   # str, optional: automatic device placement
    load_in_8bit=True,   # bool: enable 8-bit quantization
)

# 4-bit quantization
model_4bit = AutoModelForCausalLM.from_pretrained(
    name,                # str: model ID or local path
    device_map="auto",   # str, optional: automatic device placement
    load_in_4bit=True,   # bool: enable 4-bit quantization
)

Import

from transformers import AutoModelForCausalLM

I/O Contract

Inputs

Name Type Required Description
name str Yes Model ID (e.g., "yulan-team/YuLan-Chat-2-13b-fp16")
device_map str No Device placement ("auto")
load_in_8bit bool No Enable 8-bit quantization
load_in_4bit bool No Enable 4-bit quantization

Outputs

Name Type Description
return PreTrainedModel Quantized model loaded on GPU with reduced memory

Usage Examples

import torch
from transformers import AutoModelForCausalLM

name = "yulan-team/YuLan-Chat-2-13b-fp16"

# 8-bit loading
model_8bit = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)
# Note: memory_allocated() reports only the current CUDA device; with
# device_map="auto" on multiple GPUs, sum over devices for the full picture.
print(f"8-bit memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

# Free the 8-bit model first so the 4-bit measurement is not inflated
# by the 8-bit model's allocations still resident in the same process.
del model_8bit
torch.cuda.empty_cache()

# 4-bit loading
model_4bit = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_4bit=True)
print(f"4-bit memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")
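The torch.cuda.memory_allocated() probe above is device-local. For the model's own size, Transformers exposes PreTrainedModel.get_memory_footprint(), which returns the parameter-plus-buffer size in bytes regardless of which GPUs hold the layers. A small formatting helper for such printouts (fmt_gb is a hypothetical name introduced here for illustration):

```python
def fmt_gb(n_bytes: int) -> str:
    """Render a byte count as gigabytes for quick comparisons."""
    return f"{n_bytes / 1e9:.2f} GB"

# With a loaded model, get_memory_footprint() covers all devices:
#     print(fmt_gb(model_4bit.get_memory_footprint()))
print(fmt_gb(6_500_000_000))  # roughly what a 4-bit 13B model weighs in at
```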

Related Pages

Implements Principle

Requires Environment
