
Implementation:Hpcaitech ColossalAI HuggingFaceModel Inference

From Leeroopedia


Knowledge Sources
Domains Evaluation, Distributed_Computing
Last Updated 2026-02-09 00:00 GMT

Overview

A model wrapper from ColossalEval for distributed, tensor-parallel inference with HuggingFace models across evaluation benchmarks.

Description

HuggingFaceModel wraps a HuggingFace model with ColossalAI's ShardFormer for tensor-parallel inference. The inference() method processes benchmark datasets in batches, computing logits, losses, and generated outputs based on the task type.
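To make the batch-processing behavior concrete, here is a minimal, simplified sketch of how such an inference loop can dispatch on its settings. This is an illustration only, not the library's actual implementation: the `run_batches` function and the `calculate_loss` flag are stand-ins assumed for this example.

```python
# Illustrative sketch of a batched inference loop that attaches per-sample
# results, in the spirit of HuggingFaceModel.inference(). All names here
# are hypothetical stand-ins, not ColossalEval's internal helpers.
from typing import Any, Dict, List


def run_batches(
    batches: List[List[Dict[str, Any]]],
    inference_kwargs: Dict[str, Any],
) -> List[Dict[str, Any]]:
    calculate_loss = inference_kwargs.get("calculate_loss", False)
    results = []
    for batch in batches:
        for sample in batch:
            record = dict(sample)  # keep the original sample fields
            if calculate_loss:
                # Placeholder for a per-sample loss over the reference answer.
                record["loss"] = 0.0
            # Placeholder for the model's generated text.
            record["output"] = ""
            results.append(record)
    return results
```

The real method additionally computes logits for multiple-choice tasks and runs sharded generation; the sketch only shows the shape of the per-sample result records.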

Usage

Create with a model path and shard config, then call inference() on each dataset's data loader.

Code Reference

Source Location

  • Repository: ColossalAI
  • File: applications/ColossalEval/colossal_eval/models/huggingface.py
  • Lines: 39-621

Signature

class HuggingFaceModel:
    def __init__(
        self,
        path: str,
        model_max_length: int = 2048,
        tokenizer_path: Optional[str] = None,
        tokenizer_kwargs: dict = {},
        peft_path: Optional[str] = None,
        model_kwargs: Dict = None,
        prompt_template: Conversation = None,
        batch_size: int = 1,
        logger: DistributedLogger = None,
        shard_config: ShardConfig = None,
    ):
        """
        Args:
            path: HuggingFace model path
            model_max_length: Maximum model context length
            shard_config: ShardConfig for tensor-parallel inference
        """

    def inference(
        self,
        data_loader: DataLoader,
        inference_kwargs: Dict[str, Any],
        debug: bool = False,
    ) -> List[Dict]:
        """Run inference on a dataset, returning results with outputs."""

Import

from colossal_eval.models import HuggingFaceModel

I/O Contract

Inputs

Name              Type         Required  Description
path              str          Yes       HuggingFace model path
shard_config      ShardConfig  No        Tensor-parallel sharding config
data_loader       DataLoader   Yes       Benchmark dataset batches
inference_kwargs  Dict         Yes       max_new_tokens, temperature, etc.
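For illustration, a plausible `inference_kwargs` dictionary might look as follows. Only `max_new_tokens` and `temperature` are named in this page; any further keys accepted by the method should be checked against the source file listed above.

```python
# Illustrative inference_kwargs; values chosen for deterministic,
# short-answer benchmark evaluation (an assumption, not a prescribed default).
inference_kwargs = {
    "max_new_tokens": 32,   # cap on generated tokens per sample
    "temperature": 0.0,     # greedy decoding for reproducible scores
}
```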

Outputs

Name     Type        Description
results  List[Dict]  Inference results with logits, loss, and generated output per sample

Usage Examples

from colossal_eval.models import HuggingFaceModel
from colossalai.shardformer import ShardConfig

model = HuggingFaceModel(
    path="meta-llama/Llama-2-7b-hf",
    model_max_length=4096,
    batch_size=8,
    shard_config=ShardConfig(tensor_parallel_size=2),
)

results = model.inference(
    data_loader=mmlu_dataloader,
    inference_kwargs={"max_new_tokens": 5},
)
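The returned List[Dict] can then be scored in ordinary Python. The snippet below is a hedged sketch: the `dataset`, `output`, and `target` keys are assumed for illustration, since the exact result fields depend on the task type.

```python
# Hypothetical post-processing of inference() results. Key names are
# assumptions for this example; real records vary by benchmark task.
results = [
    {"dataset": "mmlu", "output": " A", "target": "A"},
    {"dataset": "mmlu", "output": " B", "target": "A"},
]

correct = sum(
    1 for r in results
    if r.get("output", "").strip().startswith(r.get("target", ""))
)
accuracy = correct / len(results)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.50
```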

Related Pages

Implements Principle

Requires Environment
