Workflow:Hpcaitech ColossalAI Model Evaluation

From Leeroopedia


Knowledge Sources
Domains: LLMs, Evaluation, Benchmarking
Last Updated: 2026-02-09 03:00 GMT

Overview

End-to-end process for evaluating language models across multiple benchmarks using ColossalEval's distributed inference and metric computation framework.

Description

This workflow implements a model evaluation pipeline that supports multiple benchmarks (MMLU, C-Eval, CMMLU, AGIEval, GSM8K, LongBench, MT-Bench, SafetyBench, and more) and multiple model backends (HuggingFace, vLLM, ChatGLM). The pipeline separates inference from evaluation: inference runs first, optionally distributed with tensor parallelism, and metrics are then computed on the saved outputs. Evaluation metrics include accuracy, F1 score, perplexity, BLEU, ROUGE, and GPT-based judging for open-ended questions.

Usage

Execute this workflow when you need to benchmark a trained or fine-tuned language model across standard evaluation datasets. This is typically used after supervised fine-tuning or alignment training to measure model quality before deployment.

Execution Steps

Step 1: Configuration Setup

Create JSON configuration files specifying which models to evaluate, which datasets to use, inference settings (batch size, max tokens), and which metrics to compute for each benchmark.

Key considerations:

  • Inference config defines model paths, model types (HuggingFace/vLLM/ChatGLM), and dataset paths
  • Evaluation config specifies metrics per benchmark (for example, accuracy for MMLU and F1 for extractive tasks)
  • Each dataset class has specific loading and preprocessing requirements
  • Multi-turn benchmarks (MT-Bench) require iterative inference passes
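
As a rough sketch of what the two configuration files might contain, the snippet below builds them as Python dictionaries, checks that they agree, and serializes them to JSON. All field names (`model_class`, `batch_size`, `max_new_tokens`, `few_shot`, `metrics`), the model name, and the paths are illustrative assumptions, not ColossalEval's documented schema; the example configs in the repository are the authority on the real keys.

```python
import json

# Hypothetical inference config: every key and path here is an assumption.
inference_config = {
    "model": [
        {
            "name": "my-finetuned-model",
            "model_class": "HuggingFaceCausalLM",  # assumed identifier for a HuggingFace wrapper
            "path": "/path/to/model",
            "batch_size": 8,
            "max_new_tokens": 32,
        }
    ],
    "dataset": [
        {"name": "mmlu", "path": "/path/to/mmlu", "few_shot": True},
    ],
}

# Hypothetical evaluation config: which metrics to compute per benchmark.
evaluation_config = {
    "model": [{"name": "my-finetuned-model"}],
    "dataset": [{"name": "mmlu", "metrics": ["accuracy"]}],
}


def validate(inference, evaluation):
    """Check that every dataset to be scored was also listed for inference."""
    inferred = {d["name"] for d in inference["dataset"]}
    missing = [d["name"] for d in evaluation["dataset"] if d["name"] not in inferred]
    if missing:
        raise ValueError(f"datasets missing from inference config: {missing}")


validate(inference_config, evaluation_config)

# Serialized forms, ready to be written to two separate .json files.
inference_json = json.dumps(inference_config, indent=2)
evaluation_json = json.dumps(evaluation_config, indent=2)
```

Keeping the two files separate mirrors the split between inference and evaluation: metrics can be changed and recomputed without rerunning the model.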

Step 2: Dataset Loading and Preprocessing

Load and preprocess evaluation datasets using dataset-specific loaders. Each benchmark has a dedicated loader class that handles data format conversion, few-shot prompt construction, and answer extraction.

What happens:

  • Dynamically instantiate dataset class based on configuration
  • Load raw data from specified paths
  • Format prompts with few-shot examples where applicable
  • Organize data by category for per-category evaluation
  • Cache preprocessed datasets for reuse
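
The loading steps above can be sketched as follows. This is a minimal illustration, not ColossalEval's loader interface: the class `ToyMultipleChoiceDataset`, the `DATASET_REGISTRY` lookup, the record fields, and the prompt template are all hypothetical names chosen for the example.

```python
from collections import defaultdict


class ToyMultipleChoiceDataset:
    """Hypothetical loader; real benchmark loaders define their own interfaces."""

    def __init__(self, records, few_shot_examples=()):
        self.records = records
        self.few_shot_examples = list(few_shot_examples)

    @staticmethod
    def format_question(rec, with_answer):
        # Render one question; the answer is appended only for few-shot examples.
        options = "\n".join(f"{label}. {text}" for label, text in sorted(rec["options"].items()))
        answer = f" {rec['answer']}" if with_answer else ""
        return f"Question: {rec['question']}\n{options}\nAnswer:{answer}"

    def build_prompt(self, rec):
        # Few-shot examples come first, then the unanswered target question.
        shots = [self.format_question(ex, with_answer=True) for ex in self.few_shot_examples]
        return "\n\n".join(shots + [self.format_question(rec, with_answer=False)])

    def by_category(self):
        # Group prompts by category so metrics can later be reported per category.
        grouped = defaultdict(list)
        for rec in self.records:
            grouped[rec["category"]].append(
                {"input": self.build_prompt(rec), "target": rec["answer"]}
            )
        return dict(grouped)


# Name-to-class lookup standing in for dynamic instantiation from the config.
DATASET_REGISTRY = {"ToyMultipleChoiceDataset": ToyMultipleChoiceDataset}


def load_dataset(config, records, few_shot_examples=()):
    cls = DATASET_REGISTRY[config["dataset_class"]]
    return cls(records, few_shot_examples)


shot = {"question": "What is 1 + 2?", "options": {"A": "3", "B": "4"},
        "answer": "A", "category": "math"}
records = [
    {"question": "What is 2 + 2?", "options": {"A": "3", "B": "4"},
     "answer": "B", "category": "math"},
    {"question": "Which city is the capital of France?", "options": {"A": "Paris", "B": "Rome"},
     "answer": "A", "category": "geography"},
]
dataset = load_dataset({"dataset_class": "ToyMultipleChoiceDataset"}, records, [shot])
grouped = dataset.by_category()
```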

Step 3: Distributed Inference

Run model inference across all evaluation datasets using distributed execution with tensor parallelism and data parallelism. Results are saved per-rank and then merged.

What happens:

  • Initialize ColossalAI distributed environment
  • Configure ProcessGroupMesh for tensor and data parallelism
  • Apply ShardConfig for tensor-parallel model sharding
  • Instantiate model wrapper (HuggingFace, vLLM, or ChatGLM)
  • For each dataset category: create distributed dataloader and run inference
  • Save per-rank results as JSON files
  • Merge per-rank results into unified output using rm_and_merge()
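
To make the per-rank save and merge concrete, here is a small standard-library sketch. It does not use ColossalAI and is not the implementation of rm_and_merge(); the helper names (`rank_coords`, `save_rank_results`, `merge_rank_results`), the file naming scheme, the `id` field, and the assumption that tensor-parallel ranks vary fastest in the grid are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def rank_coords(rank, world_size, tp_size):
    """Map a global rank to (dp_rank, tp_rank) on a dp x tp grid, tp varying fastest."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    return rank // tp_size, rank % tp_size


def save_rank_results(out_dir, category, dp_rank, results):
    # One file per data-parallel rank; ranks in the same tensor-parallel group
    # produce identical outputs, so only one of them needs to write.
    path = Path(out_dir) / f"{category}_rank{dp_rank}.json"
    path.write_text(json.dumps(results))
    return path


def merge_rank_results(out_dir, category, dp_size):
    """Concatenate per-rank files, delete them, and write one merged file."""
    merged = []
    for dp_rank in range(dp_size):
        path = Path(out_dir) / f"{category}_rank{dp_rank}.json"
        merged.extend(json.loads(path.read_text()))
        path.unlink()
    merged.sort(key=lambda r: r["id"])  # restore the original sample order
    (Path(out_dir) / f"{category}.json").write_text(json.dumps(merged))
    return merged


world_size, tp_size = 4, 2
dp_size = world_size // tp_size
samples = [{"id": i, "output": f"answer-{i}"} for i in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    for dp_rank in range(dp_size):
        shard = samples[dp_rank::dp_size]  # strided split across data-parallel ranks
        save_rank_results(tmp, "math", dp_rank, shard)
    merged = merge_rank_results(tmp, "math", dp_size)
    leftover = sorted(p.name for p in Path(tmp).iterdir())
```

Sorting by `id` after concatenation matters because a strided split interleaves samples across ranks; without it the merged file would not match the dataset order.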

Step 4: Metric Computation

Compute evaluation metrics on the merged inference results using the DatasetEvaluator. Different benchmarks use different metrics based on their evaluation requirements.

Available metrics:

  • Accuracy: Multiple-choice benchmarks (MMLU, C-Eval, CMMLU, AGIEval)
  • F1 Score: Extractive tasks
  • Perplexity: Language modeling quality
  • BLEU/ROUGE: Text generation quality (LongBench)
  • GPT-based judging: Open-ended evaluation (MT-Bench, CValues)
  • Combined metrics: SafetyBench uses safety-specific scoring
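
Two of the metrics listed above are simple enough to sketch directly. The functions below are generic textbook versions (exact-match accuracy and token-overlap F1), written for illustration; they are not the DatasetEvaluator's own implementations, and the `METRICS` lookup table is a hypothetical stand-in for metric selection from the evaluation config.

```python
from collections import Counter


def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall for one example."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, references):
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical name-to-function table for choosing metrics per benchmark.
METRICS = {"accuracy": accuracy, "f1_score": mean_f1}

acc = METRICS["accuracy"](["A", "B", "C"], ["A", "B", "D"])
f1 = METRICS["f1_score"](["the cat sat"], ["the cat"])
```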

Step 5: Results Aggregation and Reporting

Aggregate per-benchmark, per-model results into comparison tables and save formatted output for analysis.

What happens:

  • Load evaluation results for each model-dataset combination
  • Build comparison table across models and benchmarks
  • Format output using tabulate for readable display
  • Save results as both JSON and formatted text files
  • Display summary table to console
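
The aggregation steps above can be sketched as below. The result layout (`{model: {benchmark: score}}`), the function names, and the scores are placeholders invented for the example, not real evaluation results. The sketch uses `tabulate` when it is installed and falls back to a plain tab-separated rendering otherwise.

```python
import json


def build_table(results):
    """Turn {model: {benchmark: score}} into (headers, rows) for display."""
    benchmarks = sorted({b for scores in results.values() for b in scores})
    headers = ["model"] + benchmarks
    rows = [
        [model] + [scores.get(b, "-") for b in benchmarks]  # "-" marks a missing run
        for model, scores in sorted(results.items())
    ]
    return headers, rows


def render(headers, rows):
    try:
        from tabulate import tabulate  # third-party; optional in this sketch
        return tabulate(rows, headers=headers, floatfmt=".3f")
    except ImportError:
        lines = ["\t".join(headers)]
        for row in rows:
            lines.append("\t".join(f"{c:.3f}" if isinstance(c, float) else str(c) for c in row))
        return "\n".join(lines)


# Placeholder scores for illustration only.
results = {
    "model-a": {"mmlu": 0.612, "gsm8k": 0.354},
    "model-b": {"mmlu": 0.587},
}

headers, rows = build_table(results)
table_text = render(headers, rows)        # formatted text, e.g. for a .txt report
results_json = json.dumps(results, indent=2)  # machine-readable copy
print(table_text)
```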
