Environment: VainF Torch-Pruning LLM Pruning Dependencies
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Compression |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
HuggingFace Transformers and Datasets environment required for LLM structural pruning and perplexity evaluation.
Description
This environment extends the core PyTorch environment with the HuggingFace ecosystem packages needed for the LLM structural pruning workflow. The examples/LLMs/prune_llm.py script uses AutoModelForCausalLM and AutoTokenizer from the transformers library to load models like Llama, Phi, and Qwen. The datasets library is used to load WikiText-2 and C4 evaluation datasets for perplexity measurement.
Models are loaded in float16 with device_map="auto" for automatic GPU memory management, and the tokenizer uses the slow Python implementation (use_fast=False) for compatibility.
Usage
Use this environment for the LLM Structural Pruning workflow, specifically when:
- Loading and pruning large language models (Llama-2, Llama-3, Phi, Qwen, DeepSeek)
- Evaluating perplexity on WikiText-2 or C4 datasets
- Saving pruned models in HuggingFace-compatible format via save_pretrained
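Saving via save_pretrained rewrites config.json so its dimensions match the pruned architecture. A minimal sketch of that effect using a plain dict instead of a real model (field names follow the Llama config convention; the values and the 0.5 ratio are illustrative, not the library's implementation):

```python
import json
import os
import tempfile

# Hypothetical pre-pruning Llama-style config fields (illustrative values).
config = {"hidden_size": 4096, "intermediate_size": 11008,
          "num_attention_heads": 32, "num_key_value_heads": 32}

pruning_ratio = 0.5  # fraction of each pruned dimension to remove

# After structural pruning, the saved config carries the reduced sizes.
pruned = dict(config)
for key in ("hidden_size", "intermediate_size"):
    pruned[key] = int(config[key] * (1 - pruning_ratio))

with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, "config.json")
    with open(path, "w") as f:
        json.dump(pruned, f, indent=2)
    with open(path) as f:
        reloaded = json.load(f)

print(reloaded["hidden_size"])  # 2048
```

A checkpoint saved this way reloads with from_pretrained without any extra surgery, because the config and the weight shapes agree.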
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended) | device_map="auto" works best on Linux |
| Hardware | NVIDIA GPU with sufficient VRAM | 7B models require ~16GB VRAM in float16 |
| Disk | Varies by model | Llama-2-7B requires ~14GB for weights download |
| Network | Internet access for model download | Or pre-downloaded model cache |
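The VRAM and disk figures above follow directly from parameter count times bytes per element. A quick back-of-envelope helper (weights only; activations and the KV cache add further overhead on top of this):

```python
def model_weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (float16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

# A 7B-parameter model in float16: ~14 GB for the weights alone.
print(model_weight_gb(7e9))  # 14.0
```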
Dependencies
Python Packages
- torch >= 2.0
- numpy
- transformers (HuggingFace Transformers)
- datasets (HuggingFace Datasets)
- torch-pruning
Credentials
The following environment variables may be needed for gated models:
- HF_TOKEN or HUGGING_FACE_HUB_TOKEN: HuggingFace API token for accessing gated models (e.g., Llama-2 requires an access request)
Quick Install
pip install "torch>=2.0" numpy torch-pruning transformers datasets
Code Evidence
Model loading with float16 and auto device mapping from examples/LLMs/prune_llm.py:250-260:
import torch
from transformers import AutoModelForCausalLM

def get_llm(model_name, max_seq_len=None):
    # Load weights in float16 and shard across available GPUs automatically.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map="auto",
    )
    return model
Tokenizer loading with slow backend from examples/LLMs/prune_llm.py:282:
tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False)
WikiText-2 dataset loading from examples/LLMs/prune_llm.py:44-52:
from datasets import load_dataset

testdata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
testenc = tokenizer("\n\n".join(testdata['text']), return_tensors='pt')
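Perplexity over the tokenized test set is the exponential of the mean per-token negative log-likelihood. A model-free sketch of that final reduction (the per-token NLLs would come from the model's loss over windows of testenc; the values here are illustrative):

```python
import math

def perplexity(nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nlls) / len(nlls))

# Illustrative per-token NLLs; a real run aggregates the model loss
# over non-overlapping windows of the tokenized corpus.
print(round(perplexity([2.0, 2.0, 2.0]), 3))  # 7.389
```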
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| OSError: You are trying to access a gated repo | No HF token for gated model | Set the HF_TOKEN environment variable with an authorized token |
| OutOfMemoryError during model loading | Insufficient VRAM | Use a smaller model or ensure device_map="auto" is set for multi-GPU sharding |
| ValueError: Tokenizer class ... is not supported | Tokenizer compatibility issue | Try use_fast=True or install sentencepiece |
Compatibility Notes
- Llama-2: Requires accepting the license on HuggingFace and setting an API token.
- GQA Models (Qwen, DeepSeek): Pruning ratio must be a multiple of num_key_value_heads / num_attention_heads for HuggingFace compatibility. See the GQA_Head_Pruning_Constraints heuristic.
- Model Saving: Uses model.save_pretrained(), which saves an updated config.json with pruned dimensions.
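The GQA constraint can be checked before launching a pruning run. A minimal sketch implementing the divisibility rule exactly as stated above (the helper name and the head counts in the example are hypothetical):

```python
def is_valid_gqa_ratio(pruning_ratio: float,
                       num_key_value_heads: int,
                       num_attention_heads: int,
                       tol: float = 1e-9) -> bool:
    """Check the ratio is an integer multiple of kv_heads / attn_heads."""
    step = num_key_value_heads / num_attention_heads
    multiple = pruning_ratio / step
    return abs(multiple - round(multiple)) < tol

# Example GQA layout: 28 attention heads sharing 4 KV heads -> step = 1/7.
print(is_valid_gqa_ratio(2 / 7, 4, 28))  # True
print(is_valid_gqa_ratio(0.25, 4, 28))   # False
```

Rejecting an invalid ratio up front avoids producing a checkpoint whose head counts no longer satisfy the HuggingFace GQA layout.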