
Environment:VainF Torch Pruning LLM Pruning Dependencies

From Leeroopedia


Knowledge Sources
Domains LLMs, Model_Compression
Last Updated 2026-02-08 12:00 GMT

Overview

The HuggingFace Transformers and Datasets environment required for LLM structural pruning and perplexity evaluation.

Description

This environment extends the core PyTorch environment with the HuggingFace ecosystem packages needed for the LLM structural pruning workflow. The examples/LLMs/prune_llm.py script uses AutoModelForCausalLM and AutoTokenizer from the transformers library to load models like Llama, Phi, and Qwen. The datasets library is used to load WikiText-2 and C4 evaluation datasets for perplexity measurement.

Models are loaded in float16 with device_map="auto" for automatic GPU memory management, and the tokenizer uses the slow Python implementation (use_fast=False) for compatibility.

Usage

Use this environment for the LLM Structural Pruning workflow, specifically when:

  • Loading and pruning large language models (Llama-2, Llama-3, Phi, Qwen, DeepSeek)
  • Evaluating perplexity on WikiText-2 or C4 datasets
  • Saving pruned models in HuggingFace-compatible format via save_pretrained

System Requirements

Category | Requirement | Notes
OS | Linux (recommended) | device_map="auto" works best on Linux
Hardware | NVIDIA GPU with sufficient VRAM | 7B models require ~16GB VRAM in float16
Disk | Varies by model | Llama-2-7B requires ~14GB for weights download
Network | Internet access for model download | Or use a pre-downloaded model cache
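The ~16GB VRAM figure for a 7B model follows from simple arithmetic: float16 stores each parameter in 2 bytes, so the weights alone occupy roughly 13 GiB (~14 GB), and the remaining headroom goes to activations and the KV cache during evaluation. A minimal sketch of the weight-memory estimate (the function name is ours, for illustration):

```python
def fp16_weight_gib(num_params: float) -> float:
    """Memory for model weights alone, in GiB, at 2 bytes per parameter (float16)."""
    return num_params * 2 / 2**30

# Llama-2-7B: ~7e9 parameters -> roughly 13 GiB of weights,
# which is why ~16GB of VRAM is needed once activations are added.
weights = fp16_weight_gib(7e9)
print(f"{weights:.1f} GiB")
```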

Dependencies

Python Packages

  • torch >= 2.0
  • numpy
  • transformers (HuggingFace Transformers)
  • datasets (HuggingFace Datasets)
  • torch-pruning

Credentials

The following environment variables may be needed for gated models:

  • HF_TOKEN or HUGGING_FACE_HUB_TOKEN: HuggingFace API token for accessing gated models (e.g., Llama-2 requires an access request)
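A typical way to supply the token before running the pruning script is via an environment variable (the token value below is a placeholder, not a real credential):

```shell
# Export the HuggingFace token so transformers can access gated repos.
# "hf_XXXX" is a placeholder; use your own token from huggingface.co/settings/tokens
export HF_TOKEN="hf_XXXX"

# Verify that it is set before launching the pruning script
test -n "$HF_TOKEN" && echo "HF_TOKEN is set"
```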

Quick Install

pip install "torch>=2.0" numpy torch-pruning transformers datasets
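After installing, the environment can be checked with only the standard library; the distribution names below match the pip install line (note that torch-pruning is the distribution name, while the module is imported as torch_pruning):

```python
from importlib import metadata

def missing_packages(required):
    """Return the subset of distribution names that are not installed."""
    missing = []
    for name in required:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing

required = ["torch", "numpy", "transformers", "datasets", "torch-pruning"]
# An empty result means the environment is ready.
print(missing_packages(required))
```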

Code Evidence

Model loading with float16 and auto device mapping from examples/LLMs/prune_llm.py:250-260:

def get_llm(model_name, max_seq_len=None):
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map="auto",
    )
    return model

Tokenizer loading with slow backend from examples/LLMs/prune_llm.py:282:

tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False)

WikiText-2 dataset loading from examples/LLMs/prune_llm.py:44-52:

testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')
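Perplexity over the concatenated test text is the exponential of the mean per-token negative log-likelihood. A minimal sketch of that aggregation step, independent of any model (the function name is ours, not from prune_llm.py):

```python
import math

def perplexity(total_nll: float, num_tokens: int) -> float:
    """exp(mean negative log-likelihood) over the evaluation tokens."""
    return math.exp(total_nll / num_tokens)

# If the model assigns probability 1/2 to every token, the mean NLL
# is log(2) per token, so perplexity is exactly 2.
ppl = perplexity(total_nll=4 * math.log(2), num_tokens=4)
```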

Common Errors

Error Message | Cause | Solution
OSError: You are trying to access a gated repo | No HF token for gated model | Set the HF_TOKEN environment variable with an authorized token
OutOfMemoryError during model loading | Insufficient VRAM | Use a smaller model or ensure device_map="auto" is set for multi-GPU sharding
ValueError: Tokenizer class ... is not supported | Tokenizer compatibility issue | Try use_fast=True or install sentencepiece

Compatibility Notes

  • Llama-2: Requires accepting the license on HuggingFace and setting an API token.
  • GQA Models (Qwen, DeepSeek): Pruning ratio must be a multiple of num_key_value_heads / num_attention_heads for HuggingFace compatibility. See the GQA_Head_Pruning_Constraints heuristic.
  • Model Saving: Uses model.save_pretrained() which saves updated config.json with pruned dimensions.
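The GQA constraint above can be checked numerically: valid pruning ratios are the multiples of num_key_value_heads / num_attention_heads that fall strictly between 0 and 1. A small sketch (the head counts in the example are illustrative, not tied to a specific model):

```python
from fractions import Fraction

def valid_pruning_ratios(num_attention_heads: int, num_key_value_heads: int):
    """Ratios that are exact multiples of num_key_value_heads / num_attention_heads."""
    step = Fraction(num_key_value_heads, num_attention_heads)
    ratios, r = [], step
    while r < 1:
        ratios.append(float(r))
        r += step
    return ratios

# Illustrative GQA config: 32 attention heads sharing 8 KV heads
# -> step of 8/32 = 1/4, so the valid ratios are 0.25, 0.5, 0.75
print(valid_pruning_ratios(32, 8))
```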
