Principle: Hpcaitech ColossalAI Distributed Model Inference
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Distributed_Computing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A distributed inference pattern using tensor parallelism and data parallelism to efficiently evaluate large language models across multiple benchmarks.
Description
Distributed Model Inference splits the evaluation workload across multiple GPUs using a combination of tensor parallelism (for models too large for a single GPU) and data parallelism (to process different data samples concurrently). ColossalEval uses ShardFormer for tensor-parallel model sharding and ProcessGroupMesh for managing the 2D parallel topology.
Usage
Use this for evaluating large models on standard benchmarks (MMLU, GSM8K, etc.) when a single GPU cannot hold the full model.
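The data-parallel half of the pattern amounts to partitioning the benchmark's samples across the dp_size groups so each group evaluates a disjoint subset. A minimal sketch of such a partition (a hypothetical helper using a strided split, not ColossalEval's actual loader):

```python
def shard_samples(samples, dp_rank, dp_size):
    """Strided split: data-parallel group dp_rank takes every dp_size-th sample.

    Hypothetical illustration; the real framework may batch or balance
    samples differently.
    """
    return samples[dp_rank::dp_size]

# With dp_size = 4, ten questions are spread so no two groups share a sample
questions = [f"q{i}" for i in range(10)]
shards = [shard_samples(questions, r, 4) for r in range(4)]
```

Each group runs inference only on its own shard; results are gathered afterwards to score the full benchmark.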
Theoretical Basis
The 2D parallelism topology:
- TP (Tensor Parallel): each model layer's weights are sharded across tp_size GPUs, reducing per-GPU memory so that models too large for one device can be loaded
- DP (Data Parallel): evaluation samples are partitioned across dp_size independent model replicas, increasing throughput
- Total GPUs = tp_size * dp_size
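The topology above can be illustrated by mapping global ranks onto the 2D mesh. This sketch assumes a row-major layout with TP as the fastest-varying axis (a common convention for 2D process-group meshes; the actual placement ColossalAI uses may differ):

```python
def mesh_coords(rank: int, tp_size: int) -> tuple:
    """Map a global rank to (dp_index, tp_index) in a row-major 2D mesh."""
    return rank // tp_size, rank % tp_size

def tp_group(rank: int, tp_size: int) -> list:
    """Ranks that together hold one sharded model replica (same DP row)."""
    dp_idx = rank // tp_size
    return [dp_idx * tp_size + t for t in range(tp_size)]

def dp_group(rank: int, tp_size: int, dp_size: int) -> list:
    """Ranks holding the same model shard across data-parallel replicas."""
    tp_idx = rank % tp_size
    return [d * tp_size + tp_idx for d in range(dp_size)]

# Example: 8 GPUs arranged as dp_size=4 rows of tp_size=2
coords = mesh_coords(5, 2)          # rank 5 sits in DP row 2, TP column 1
replica = tp_group(5, 2)            # ranks [4, 5] share one model replica
shard_peers = dp_group(5, 2, 4)     # ranks [1, 3, 5, 7] hold identical shards
```

TP groups communicate on every forward pass (activation all-reduces), while DP groups need no communication during inference, which is why DP scales throughput almost linearly here.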