Environment: VainF Torch-Pruning LLM Pruning Dependencies
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Compression |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
HuggingFace Transformers and Datasets environment required for LLM structural pruning and perplexity evaluation.
Description
This environment extends the core PyTorch environment with the HuggingFace ecosystem packages needed for the LLM structural pruning workflow. The examples/LLMs/prune_llm.py script uses AutoModelForCausalLM and AutoTokenizer from the transformers library to load models like Llama, Phi, and Qwen. The datasets library is used to load WikiText-2 and C4 evaluation datasets for perplexity measurement.
Models are loaded in float16 with device_map="auto" for automatic GPU memory management, and the tokenizer uses the slow Python implementation (use_fast=False) for compatibility.
Usage
Use this environment for the LLM Structural Pruning workflow, specifically when:
- Loading and pruning large language models (Llama-2, Llama-3, Phi, Qwen, DeepSeek)
- Evaluating perplexity on WikiText-2 or C4 datasets
- Saving pruned models in HuggingFace-compatible format via save_pretrained
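Saving via save_pretrained rewrites config.json so its dimensions match the pruned architecture. A minimal sketch of that effect using a plain dict instead of a real model (field names follow the Llama config convention; the values and the 0.5 ratio are illustrative, not the library's implementation):

```python
import json
import os
import tempfile

# Hypothetical pre-pruning Llama-style config fields (illustrative values).
config = {"hidden_size": 4096, "intermediate_size": 11008,
          "num_attention_heads": 32, "num_key_value_heads": 32}

pruning_ratio = 0.5  # fraction of each pruned dimension to remove

# After structural pruning, the saved config carries the reduced sizes.
pruned = dict(config)
for key in ("hidden_size", "intermediate_size"):
    pruned[key] = int(config[key] * (1 - pruning_ratio))

with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, "config.json")
    with open(path, "w") as f:
        json.dump(pruned, f, indent=2)
    with open(path) as f:
        reloaded = json.load(f)

print(reloaded["hidden_size"])  # 2048
```

A checkpoint saved this way reloads with from_pretrained without any extra surgery, because the config and the weight shapes agree.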
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended) | device_map="auto" works best on Linux |
| Hardware | NVIDIA GPU with sufficient VRAM | 7B models require ~16GB VRAM in float16 |
| Disk | Varies by model | Llama-2-7B requires ~14GB for weights download |
| Network | Internet access for model download | Or pre-downloaded model cache |
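The VRAM and disk figures above follow directly from parameter count times bytes per element. A quick back-of-envelope helper (weights only; activations and the KV cache add further overhead on top of this):

```python
def model_weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (float16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

# A 7B-parameter model in float16: ~14 GB for the weights alone.
print(model_weight_gb(7e9))  # 14.0
```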
Dependencies
Python Packages
- torch >= 2.0
- numpy
- transformers (HuggingFace Transformers)
- datasets (HuggingFace Datasets)
- torch-pruning
Credentials
The following environment variables may be needed for gated models:
- HF_TOKEN or HUGGING_FACE_HUB_TOKEN: HuggingFace API token for accessing gated models (e.g., Llama-2 requires an access request)
Quick Install
pip install "torch>=2.0" numpy torch-pruning transformers datasets
Code Evidence
Model loading with float16 and auto device mapping from examples/LLMs/prune_llm.py:250-260:
import torch
from transformers import AutoModelForCausalLM

def get_llm(model_name, max_seq_len=None):
    # Load weights in float16 and shard across available GPUs automatically.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map="auto",
    )
    return model
Tokenizer loading with slow backend from examples/LLMs/prune_llm.py:282:
tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False)
WikiText-2 dataset loading from examples/LLMs/prune_llm.py:44-52:
from datasets import load_dataset

testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')
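Perplexity over the tokenized test set is the exponential of the mean per-token negative log-likelihood. A model-free sketch of that final reduction (the per-token NLLs would come from the model's loss over windows of testenc; the values here are illustrative):

```python
import math

def perplexity(nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nlls) / len(nlls))

# Illustrative per-token NLLs; a real run aggregates the model loss
# over non-overlapping windows of the tokenized corpus.
print(round(perplexity([2.0, 2.0, 2.0]), 3))  # 7.389
```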
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| OSError: You are trying to access a gated repo | No HF token for gated model | Set the HF_TOKEN environment variable with an authorized token |
| OutOfMemoryError during model loading | Insufficient VRAM | Use a smaller model or ensure device_map="auto" is set for multi-GPU sharding |
| ValueError: Tokenizer class ... is not supported | Tokenizer compatibility issue | Try use_fast=True or install sentencepiece |
Compatibility Notes
- Llama-2: Requires accepting the license on HuggingFace and setting an API token.
- GQA Models (Qwen, DeepSeek): Pruning ratio must be a multiple of num_key_value_heads / num_attention_heads for HuggingFace compatibility. See the GQA_Head_Pruning_Constraints heuristic.
- Model Saving: Uses model.save_pretrained(), which saves an updated config.json with pruned dimensions.
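The GQA constraint can be checked before launching a pruning run. A minimal sketch implementing the divisibility rule exactly as stated above (the helper name and the head counts in the example are hypothetical):

```python
def is_valid_gqa_ratio(pruning_ratio: float,
                       num_key_value_heads: int,
                       num_attention_heads: int,
                       tol: float = 1e-9) -> bool:
    """Check the ratio is an integer multiple of kv_heads / attn_heads."""
    step = num_key_value_heads / num_attention_heads
    multiple = pruning_ratio / step
    return abs(multiple - round(multiple)) < tol

# Example GQA layout: 28 attention heads sharing 4 KV heads -> step = 1/7.
print(is_valid_gqa_ratio(2 / 7, 4, 28))  # True
print(is_valid_gqa_ratio(0.25, 4, 28))   # False
```

Rejecting an invalid ratio up front avoids producing a checkpoint whose head counts no longer satisfy the HuggingFace GQA layout.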