Workflow:Hpcaitech ColossalAI Model Evaluation

Knowledge Sources	ColossalAI ColossalEval README
Domains	LLMs, Evaluation, Benchmarking
Last Updated	2026-02-09 03:00 GMT

Overview

End-to-end process for evaluating language models across multiple benchmarks using ColossalEval's distributed inference and metric computation framework.

Description

This workflow implements a model evaluation pipeline that supports multiple benchmarks (MMLU, C-Eval, CMMLU, AGIEval, GSM8K, LongBench, MT-Bench, SafetyBench, and more) and multiple model backends (HuggingFace, vLLM, ChatGLM). The pipeline separates inference from evaluation: distributed inference with tensor parallelism runs first, and metric computation then runs on the saved outputs. Evaluation metrics include accuracy, F1 score, perplexity, BLEU, ROUGE, and GPT-based judging for open-ended questions.

Usage

Execute this workflow when you need to benchmark a trained or fine-tuned language model across standard evaluation datasets. This is typically used after supervised fine-tuning or alignment training to measure model quality before deployment.

Execution Steps

Step 1: Configuration Setup

Create JSON configuration files specifying which models to evaluate, which datasets to use, inference settings (batch size, max tokens), and which metrics to compute for each benchmark.

Key considerations:

Inference config defines model paths, model types (HuggingFace/vLLM/ChatGLM), and dataset paths
Evaluation config specifies metrics per benchmark (accuracy for MMLU, F1 for extractive tasks, etc.)
Each dataset class has specific loading and preprocessing requirements
Multi-turn benchmarks (MT-Bench) require iterative inference passes
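
A minimal sketch of the two configuration files, written as Python dictionaries and serialized to JSON. Every key name here (model_type, max_new_tokens, few_shot, metrics, gpt_judge) and every path is an illustrative assumption, not ColossalEval's confirmed schema; consult the ColossalEval README for the real field names. The validate helper is likewise hypothetical and only shows a cross-check worth doing before a long inference run.

```python
import json

# NOTE: all key names and paths below are assumptions made for illustration.
inference_config = {
    "model": [
        {
            "name": "my-finetuned-model",   # hypothetical label used in reports
            "model_type": "HuggingFace",    # HuggingFace, vLLM, or ChatGLM
            "path": "/path/to/model",       # placeholder path
            "batch_size": 8,
            "max_new_tokens": 32,
        }
    ],
    "dataset": [
        {"name": "mmlu", "path": "/path/to/mmlu", "few_shot": True},
        {"name": "mtbench", "path": "/path/to/mtbench", "few_shot": False},
    ],
}

evaluation_config = {
    "model": ["my-finetuned-model"],
    "dataset": [
        {"name": "mmlu", "metrics": ["accuracy"]},
        {"name": "mtbench", "metrics": ["gpt_judge"]},
    ],
}


def validate(inference, evaluation):
    """Return models and datasets that are evaluated but never run through inference."""
    known_models = {m["name"] for m in inference["model"]}
    known_datasets = {d["name"] for d in inference["dataset"]}
    missing_models = [m for m in evaluation["model"] if m not in known_models]
    missing_datasets = [
        d["name"] for d in evaluation["dataset"] if d["name"] not in known_datasets
    ]
    return missing_models, missing_datasets


# Serialize both configs; in practice each string would be written to its own file.
inference_json = json.dumps(inference_config, indent=2)
evaluation_json = json.dumps(evaluation_config, indent=2)
```

Keeping the two files separate mirrors the split between inference and evaluation: metrics can be changed and recomputed without rerunning the model.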

Step 2: Dataset Loading and Preprocessing

Load and preprocess evaluation datasets using dataset-specific loaders. Each benchmark has a dedicated loader class that handles data format conversion, few-shot prompt construction, and answer extraction.

What happens:

Dynamically instantiate dataset class based on configuration
Load raw data from specified paths
Format prompts with few-shot examples where applicable
Organize data by category for per-category evaluation
Cache preprocessed datasets for reuse
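
The loading steps above can be sketched roughly as follows. MultipleChoiceDataset, DATASET_CLASSES, and load_dataset are hypothetical names standing in for a benchmark-specific loader and its registry; the record layout (question, options, answer, category) is also assumed. Caching is left out to keep the sketch short.

```python
from collections import defaultdict


class MultipleChoiceDataset:
    """Hypothetical loader; stands in for a benchmark-specific dataset class."""

    def __init__(self, records, few_shot_examples=()):
        self.records = list(records)
        self.few_shot_examples = list(few_shot_examples)

    @staticmethod
    def format_example(example, with_answer):
        options = "\n".join(
            f"{key}. {text}" for key, text in sorted(example["options"].items())
        )
        answer = f" {example['answer']}" if with_answer else ""
        return f"Question: {example['question']}\n{options}\nAnswer:{answer}"

    def build_prompt(self, record):
        # Few-shot examples include their answers; the target question does not.
        shots = [self.format_example(e, with_answer=True) for e in self.few_shot_examples]
        return "\n\n".join(shots + [self.format_example(record, with_answer=False)])

    def by_category(self):
        # Group prompts by category so metrics can be reported per category.
        grouped = defaultdict(list)
        for record in self.records:
            grouped[record["category"]].append(
                {"prompt": self.build_prompt(record), "target": record["answer"]}
            )
        return dict(grouped)


# Registry that maps the class name found in the config to the class itself.
DATASET_CLASSES = {"MultipleChoiceDataset": MultipleChoiceDataset}


def load_dataset(class_name, **kwargs):
    return DATASET_CLASSES[class_name](**kwargs)


shots = [
    {"question": "2 + 2 = ?", "options": {"A": "3", "B": "4"}, "answer": "B", "category": "math"}
]
records = [
    {"question": "3 + 3 = ?", "options": {"A": "6", "B": "5"}, "answer": "A", "category": "math"},
    {"question": "Capital of France?", "options": {"A": "Rome", "B": "Paris"}, "answer": "B", "category": "geography"},
]
dataset = load_dataset("MultipleChoiceDataset", records=records, few_shot_examples=shots)
grouped = dataset.by_category()
```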

Step 3: Distributed Inference

Run model inference across all evaluation datasets using distributed execution with tensor parallelism and data parallelism. Results are saved per-rank and then merged.

What happens:

Initialize ColossalAI distributed environment
Configure ProcessGroupMesh for tensor and data parallelism
Apply ShardConfig for tensor-parallel model sharding
Instantiate model wrapper (HuggingFace, vLLM, or ChatGLM)
For each dataset category: create distributed dataloader and run inference
Save per-rank results as JSON files
Merge per-rank results into unified output using rm_and_merge()
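
The save-then-merge pattern at the end of this step can be illustrated without any distributed runtime. The sketch below is not ColossalEval's rm_and_merge(); it is a simplified stand-in under stated assumptions: file names follow a made-up `<category>_rank<N>.json` pattern, every sample carries an id field, ranks are simulated in a loop rather than launched as separate processes, and uppercasing the prompt stands in for model inference.

```python
import json
import tempfile
from pathlib import Path


def shard_for_rank(samples, rank, world_size):
    """Round-robin split so each data-parallel rank gets a disjoint slice."""
    return samples[rank::world_size]


def save_rank_results(out_dir, category, rank, results):
    path = Path(out_dir) / f"{category}_rank{rank}.json"  # assumed naming scheme
    path.write_text(json.dumps(results))
    return path


def merge_rank_results(out_dir, category, world_size):
    """Combine per-rank files into one list, restore input order, delete the parts."""
    merged = []
    for rank in range(world_size):
        path = Path(out_dir) / f"{category}_rank{rank}.json"
        merged.extend(json.loads(path.read_text()))
        path.unlink()  # remove the partial file once it has been read
    merged.sort(key=lambda item: item["id"])
    (Path(out_dir) / f"{category}.json").write_text(json.dumps(merged))
    return merged


samples = [{"id": i, "prompt": f"question {i}"} for i in range(5)]
world_size = 2

with tempfile.TemporaryDirectory() as out_dir:
    for rank in range(world_size):  # simulated ranks
        shard = shard_for_rank(samples, rank, world_size)
        # Stand-in for model inference on this rank's shard.
        results = [{"id": s["id"], "output": s["prompt"].upper()} for s in shard]
        save_rank_results(out_dir, "math", rank, results)
    merged = merge_rank_results(out_dir, "math", world_size)
    remaining_files = sorted(p.name for p in Path(out_dir).iterdir())
```

Sorting by id after merging matters because round-robin sharding interleaves samples across ranks; without it the merged file would no longer line up with the original dataset order.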

Step 4: Metric Computation

Compute evaluation metrics on the merged inference results using the DatasetEvaluator. Different benchmarks use different metrics based on their evaluation requirements.

Available metrics:

Accuracy: Multiple-choice benchmarks (MMLU, C-Eval, CMMLU, AGIEval)
F1 Score: Extractive tasks
Perplexity: Language modeling quality
BLEU/ROUGE: Text generation quality (LongBench)
GPT-based judging: Open-ended evaluation (MT-Bench, CValues)
Combined metrics: SafetyBench uses safety-specific scoring
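
To make two of the metrics above concrete, here is a minimal sketch of exact-match accuracy and token-overlap F1, dispatched by metric name. It is not the DatasetEvaluator implementation: the METRICS registry, the evaluate helper, and the output/target field names are assumptions, and real benchmarks usually apply answer extraction and text normalization before scoring.

```python
from collections import Counter


def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, references):
    return sum(token_f1(p, r) for p, r in zip(predictions, references)) / len(references)


# Hypothetical registry: metric names from the evaluation config map to functions.
METRICS = {"accuracy": accuracy, "f1": mean_f1}


def evaluate(results, metric_names):
    predictions = [r["output"] for r in results]
    references = [r["target"] for r in results]
    return {name: METRICS[name](predictions, references) for name in metric_names}


mc_results = [{"output": "A", "target": "A"}, {"output": "C", "target": "B"}]
qa_results = [{"output": "the red fox", "target": "red fox"}]

mc_scores = evaluate(mc_results, ["accuracy"])  # one of two answers is correct
qa_scores = evaluate(qa_results, ["f1"])        # precision 2/3, recall 1
```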

Step 5: Results Aggregation and Reporting

Aggregate per-benchmark, per-model results into comparison tables and save formatted output for analysis.

What happens:

Load evaluation results for each model-dataset combination
Build comparison table across models and benchmarks
Format output using tabulate for readable display
Save results as both JSON and formatted text files
Display summary table to console
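
The aggregation steps above amount to pivoting per-model scores into one row per benchmark. The workflow itself formats output with tabulate; the sketch below uses plain string formatting so it runs with the standard library alone. The function names are hypothetical, and the scores are dummy placeholder values, not measured results.

```python
def build_table(results, models, benchmarks):
    """Pivot {model: {benchmark: score}} into a header plus one row per benchmark."""
    header = ["Benchmark"] + models
    rows = []
    for benchmark in benchmarks:
        row = [benchmark]
        for model in models:
            score = results.get(model, {}).get(benchmark)
            row.append("-" if score is None else f"{score:.2f}")
        rows.append(row)
    return header, rows


def format_table(header, rows):
    """Left-align every column to the width of its longest cell."""
    widths = [max(len(cell) for cell in column) for column in zip(header, *rows)]

    def fmt(row):
        return " | ".join(cell.ljust(width) for cell, width in zip(row, widths))

    separator = "-+-".join("-" * width for width in widths)
    return "\n".join([fmt(header), separator] + [fmt(row) for row in rows])


# Dummy placeholder scores for illustration only.
results = {
    "model-a": {"mmlu": 45.5, "cmmlu": 40.25},
    "model-b": {"mmlu": 50.0},
}

header, rows = build_table(results, ["model-a", "model-b"], ["mmlu", "cmmlu"])
table_text = format_table(header, rows)
print(table_text)
```

Missing model-benchmark combinations are rendered as "-" rather than 0 so that an absent run is not mistaken for a failed one.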
