Workflow:Hpcaitech ColossalAI Model Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Evaluation, Benchmarking |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
End-to-end process for evaluating language models across multiple benchmarks using ColossalEval's distributed inference and metric computation framework.
Description
This workflow implements a model evaluation pipeline that supports multiple benchmarks (MMLU, C-Eval, CMMLU, AGIEval, GSM8K, LongBench, MT-Bench, SafetyBench, and more) and multiple model backends (HuggingFace, vLLM, ChatGLM). The pipeline separates inference from evaluation: distributed inference with tensor parallelism runs first and writes its outputs to disk, and metric computation then runs on those saved outputs. Evaluation metrics include accuracy, F1 score, perplexity, BLEU, ROUGE, and GPT-based judging for open-ended questions.
Usage
Execute this workflow when you need to benchmark a trained or fine-tuned language model across standard evaluation datasets. This is typically used after supervised fine-tuning or alignment training to measure model quality before deployment.
Execution Steps
Step 1: Configuration Setup
Create JSON configuration files specifying which models to evaluate, which datasets to use, inference settings (batch size, max tokens), and which metrics to compute for each benchmark.
Key considerations:
- Inference config defines model paths, model types (HuggingFace/vLLM/ChatGLM), and dataset paths
- Evaluation config specifies metrics per benchmark (for example, accuracy for multiple-choice benchmarks such as MMLU, F1 for extractive tasks)
- Each dataset class has specific loading and preprocessing requirements
- Multi-turn benchmarks (MT-Bench) require iterative inference passes
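The two configuration files described above can be sketched as Python dictionaries that are serialized to JSON. This is a minimal illustration only: every key name, class name, and path below is an assumption made for the example and is not taken from ColossalEval's actual schema, so check the project's own sample configs before reusing any of it.

```python
import json

# Hypothetical inference config: all keys, class names, and paths are placeholders.
inference_config = {
    "model": [
        {
            "name": "my-finetuned-model",
            "model_class": "HuggingFaceCausalLM",  # assumed backend identifier
            "parameters": {
                "path": "/path/to/model",
                "batch_size": 8,
                "max_new_tokens": 32,
            },
        }
    ],
    "dataset": [
        {"name": "mmlu", "dataset_class": "MMLUDataset", "path": "/path/to/mmlu"},
    ],
}

# Hypothetical evaluation config: which metrics to compute for each benchmark.
evaluation_config = {
    "model": [{"name": "my-finetuned-model"}],
    "dataset": [{"name": "mmlu", "metrics": ["accuracy"]}],
}


def validate_configs(inference, evaluation):
    """Return a list of problems found when cross-checking the two configs."""
    known_models = {m["name"] for m in inference["model"]}
    known_datasets = {d["name"] for d in inference["dataset"]}
    problems = []
    for model in evaluation["model"]:
        if model["name"] not in known_models:
            problems.append(f"model {model['name']!r} has no inference entry")
    for dataset in evaluation["dataset"]:
        if dataset["name"] not in known_datasets:
            problems.append(f"dataset {dataset['name']!r} has no inference entry")
        if not dataset.get("metrics"):
            problems.append(f"dataset {dataset['name']!r} lists no metrics")
    return problems


# Serialize both configs the way they would be written to disk.
inference_json = json.dumps(inference_config, indent=2)
evaluation_json = json.dumps(evaluation_config, indent=2)
problems = validate_configs(inference_config, evaluation_config)
```

Cross-checking the two files before launching inference is a cheap way to catch a model or dataset name that appears in one config but not the other.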
Step 2: Dataset Loading and Preprocessing
Load and preprocess evaluation datasets using dataset-specific loaders. Each benchmark has a dedicated loader class that handles data format conversion, few-shot prompt construction, and answer extraction.
What happens:
- Dynamically instantiate the dataset class named in the configuration
- Load raw data from specified paths
- Format prompts with few-shot examples where applicable
- Organize data by category for per-category evaluation
- Cache preprocessed datasets for reuse
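The loading steps above (looking up a class by name, building few-shot prompts, and grouping by category) can be sketched as follows. The class `MultipleChoiceDataset`, the registry, the record fields, and the prompt layout are all hypothetical and chosen for illustration; the real loaders define their own formats per benchmark.

```python
from collections import defaultdict


def format_question(record, with_answer):
    """Render one multiple-choice record as prompt text."""
    lines = [record["question"]]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", record["choices"])]
    lines.append("Answer:" + (f" {record['answer']}" if with_answer else ""))
    return "\n".join(lines)


def build_by_category(records, num_shots):
    """Group records by category and prepend few-shot examples to each target."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["category"]].append(record)

    by_category = {}
    for category, items in grouped.items():
        # The first num_shots records serve as solved examples, the rest as targets.
        shots, targets = items[:num_shots], items[num_shots:]
        prefix = "\n\n".join(format_question(s, with_answer=True) for s in shots)
        by_category[category] = [
            {
                "input": (prefix + "\n\n" if prefix else "")
                + format_question(t, with_answer=False),
                "target": t["answer"],
            }
            for t in targets
        ]
    return by_category


class MultipleChoiceDataset:
    """Hypothetical loader for a multiple-choice benchmark."""

    def __init__(self, records, num_shots=0):
        self.by_category = build_by_category(records, num_shots)


# Registry that maps the class name used in the config to the class itself.
DATASET_CLASSES = {"MultipleChoiceDataset": MultipleChoiceDataset}


def load_dataset(config, records):
    """Instantiate the dataset class named in the config."""
    dataset_cls = DATASET_CLASSES[config["dataset_class"]]
    return dataset_cls(records, num_shots=config.get("num_shots", 0))


RECORDS = [
    {"category": "math", "question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"category": "math", "question": "3 * 3 = ?", "choices": ["6", "7", "8", "9"], "answer": "D"},
]
dataset = load_dataset({"dataset_class": "MultipleChoiceDataset", "num_shots": 1}, RECORDS)
```

Here the single "math" target carries one solved example in front of it, and its prompt ends with an open `Answer:` for the model to complete.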
Step 3: Distributed Inference
Run model inference across all evaluation datasets using distributed execution with tensor parallelism and data parallelism. Results are saved per-rank and then merged.
What happens:
- Initialize ColossalAI distributed environment
- Configure ProcessGroupMesh for tensor and data parallelism
- Apply ShardConfig for tensor-parallel model sharding
- Instantiate model wrapper (HuggingFace, vLLM, or ChatGLM)
- For each dataset category: create distributed dataloader and run inference
- Save per-rank results as JSON files
- Merge per-rank results into unified output using rm_and_merge()
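The save-per-rank and merge steps can be illustrated without any distributed runtime. The sketch below simulates two data-parallel ranks in a single process; the function names, file naming scheme, and record fields are assumptions for the example and do not reproduce the actual `rm_and_merge()` implementation.

```python
import json
import tempfile
from pathlib import Path


def shard_for_rank(samples, rank, world_size):
    """Give each data-parallel rank every world_size-th sample."""
    return samples[rank::world_size]


def save_rank_results(directory, prefix, rank, results):
    """Write one rank's inference results to its own JSON file."""
    path = Path(directory) / f"{prefix}_rank{rank}.json"
    path.write_text(json.dumps(results, ensure_ascii=False), encoding="utf-8")
    return path


def merge_rank_files(directory, prefix, world_size):
    """Merge per-rank files into one output file, then delete the per-rank files."""
    directory = Path(directory)
    rank_paths = [directory / f"{prefix}_rank{r}.json" for r in range(world_size)]
    merged = []
    for path in rank_paths:
        merged.extend(json.loads(path.read_text(encoding="utf-8")))
    # Restore the original sample order, which sharding interleaved across ranks.
    merged.sort(key=lambda item: item["id"])
    out_path = directory / f"{prefix}.json"
    out_path.write_text(json.dumps(merged, ensure_ascii=False, indent=2), encoding="utf-8")
    for path in rank_paths:
        path.unlink()
    return out_path


# Simulate two ranks; the "output" strings stand in for real model generations.
samples = [{"id": i, "input": f"question {i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    for rank in range(2):
        shard = shard_for_rank(samples, rank, 2)
        results = [{**s, "output": f"answer {s['id']}"} for s in shard]
        save_rank_results(tmp, "mmlu_math", rank, results)
    merged_path = merge_rank_files(tmp, "mmlu_math", 2)
    merged = json.loads(merged_path.read_text(encoding="utf-8"))
    leftover = sorted(p.name for p in Path(tmp).iterdir())
```

Writing one file per rank avoids concurrent writes to a shared file, and sorting on a stable sample id during the merge makes the final output independent of how samples were distributed.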
Step 4: Metric Computation
Compute evaluation metrics on the merged inference results using the DatasetEvaluator. Each benchmark is scored with the metrics specified for it in the evaluation config.
Available metrics:
- Accuracy: Multiple-choice benchmarks (MMLU, C-Eval, CMMLU, AGIEval)
- F1 Score: Extractive tasks
- Perplexity: Language modeling quality
- BLEU/ROUGE: Text generation quality (LongBench)
- GPT-based judging: Open-ended evaluation (MT-Bench, CValues)
- Combined metrics: SafetyBench uses safety-specific scoring
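Three of the metrics listed above have compact textbook definitions, sketched here for reference. These are generic formulations (exact-match accuracy, token-overlap F1, and perplexity from per-token log-probabilities) under simple whitespace tokenization; they are not the DatasetEvaluator's own code, which may normalize answers differently.

```python
import math
from collections import Counter


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs):
    """Exponential of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


acc = accuracy(["A", "B", "C"], ["A", "B", "D"])
f1 = token_f1("the cat sat", "the cat")
ppl = perplexity([math.log(0.5)] * 4)
```

In the F1 example the prediction shares two of its three tokens with a two-token reference, giving precision 2/3 and recall 1; in the perplexity example a model that assigns probability 0.5 to every token has a perplexity of about 2.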
Step 5: Results Aggregation and Reporting
Aggregate per-benchmark, per-model results into comparison tables and save formatted output for analysis.
What happens:
- Load evaluation results for each model-dataset combination
- Build comparison table across models and benchmarks
- Format output using tabulate for readable display
- Save results as both JSON and formatted text files
- Display summary table to console
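The aggregation steps above amount to pivoting per-model, per-benchmark scores into one table. The sketch below uses only the standard library to stay dependency-free; the workflow itself formats output with tabulate. The model names and scores are made-up placeholders, not measured results, and the function names are hypothetical.

```python
import json


def build_table(results):
    """Pivot {model: {benchmark: score}} into headers and rows."""
    benchmarks = sorted({b for scores in results.values() for b in scores})
    headers = ["model"] + benchmarks
    rows = [
        [model] + [scores.get(b) for b in benchmarks]
        for model, scores in sorted(results.items())
    ]
    return headers, rows


def format_cell(value):
    """Render a missing score as '-' and floats with two decimals."""
    if value is None:
        return "-"
    if isinstance(value, float):
        return f"{value:.2f}"
    return str(value)


def render(headers, rows):
    """Render the table as aligned plain text."""
    cells = [headers] + [[format_cell(v) for v in row] for row in rows]
    widths = [max(len(line[i]) for line in cells) for i in range(len(headers))]
    return "\n".join(
        "  ".join(c.ljust(w) for c, w in zip(line, widths)).rstrip() for line in cells
    )


# Placeholder scores for illustration only.
results = {
    "model-a": {"mmlu": 0.61, "cmmlu": 0.55},
    "model-b": {"mmlu": 0.58},
}
headers, rows = build_table(results)
table_text = render(headers, rows)      # content for the formatted text file
table_json = json.dumps(results, indent=2)  # content for the JSON file
print(table_text)
```

Keeping a missing score as an explicit placeholder, rather than dropping the column, makes it obvious which model-benchmark combinations were never evaluated.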