Implementation:InternLM Lmdeploy Eval Chat Config
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Configuration, Benchmarking |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
An OpenCompass evaluation configuration file that defines model configurations for benchmarking chat models from several LLM families (InternLM, Llama, Qwen, Gemma, Baichuan, Mixtral) on both the TurboMind and PyTorch backends, with and without quantization.
Description
The eval_chat_config.py file is a Python-based configuration consumed by the OpenCompass evaluation framework. It defines:
- Dataset imports: Imports evaluation datasets including BBH, C-Eval, CMMLU, CrowS-Pairs, GaokaoBench, GPQA, GSM8K, HellaSwag, HumanEval, IFEval, MATH, MBPP, MMLU, MMLU-Pro, NQ, RACE, TheoremQA, TriviaQA, and Winogrande.
- Model configurations: Creates deepcopy-based configurations for each model variant:
  - TurboMind backend: base, 4-bit AWQ quantization, KV-cache INT4, KV-cache INT8
  - PyTorch backend: base, W8A8 quantization
  - Models covered include InternLM2/2.5/3, Qwen 1.5/2/2.5/3 (including MoE variants up to 235B), Llama 2/3/3.1, Gemma 2, Baichuan 2, and Mixtral
- Configuration updates: Programmatic loops that set backend types, quantization formats, batch sizes, tensor parallelism, and abbreviation strings based on naming conventions (e.g., _4bits, _kvint4, _w8a8).
- Summarizer: A comprehensive summarizer configuration listing all dataset abbreviations and summary groups for result aggregation.
Key constants: MAX_SESSION_LEN = 2048, MAX_NEW_TOKENS = 1024.
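The suffix-driven updates described above can be sketched as follows. This is a minimal illustration, not the file's actual code: the helper `configure`, the stand-in `base_cfg`, and the keys `backend`, `model_format`, `quant_policy`, and `quantization` are assumptions chosen for the example, and the real file applies its updates in loops over its own variables rather than through a helper function.

```python
from copy import deepcopy

MAX_SESSION_LEN = 2048
MAX_NEW_TOKENS = 1024

# Hypothetical stand-in for a model config imported from OpenCompass.
base_cfg = dict(
    abbr='internlm2-chat-7b',
    path='internlm/internlm2-chat-7b',
    engine_config=dict(session_len=MAX_SESSION_LEN, max_batch_size=16, tp=1),
    gen_config=dict(max_new_tokens=MAX_NEW_TOKENS),
)


def configure(name, cfg):
    """Return a copy of cfg adjusted according to the prefix and suffix of name."""
    cfg = deepcopy(cfg)  # never mutate the shared base config
    engine = cfg['engine_config']
    # The prefix selects the backend (key name is an assumption).
    cfg['backend'] = 'pytorch' if name.startswith('pytorch_') else 'turbomind'
    # The suffix selects the quantization setting (key names are assumptions).
    if name.endswith('_4bits'):
        engine['model_format'] = 'awq'
    elif name.endswith('_kvint4'):
        engine['quant_policy'] = 4
    elif name.endswith('_kvint8'):
        engine['quant_policy'] = 8
    elif name.endswith('_w8a8'):
        cfg['quantization'] = 'w8a8'
    # The abbreviation carries the variant name so results stay distinguishable.
    cfg['abbr'] = name
    return cfg


names = [
    'turbomind_internlm2_chat_7b',
    'turbomind_internlm2_chat_7b_4bits',
    'turbomind_internlm2_chat_7b_kvint4',
    'turbomind_internlm2_chat_7b_kvint8',
    'pytorch_internlm2_chat_7b',
    'pytorch_internlm2_chat_7b_w8a8',
]
variants = {name: configure(name, base_cfg) for name in names}
```

Deriving each variant from a deep copy keeps the nested `engine_config` dictionaries independent, so a quantization setting applied to one variant cannot leak into another.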
Usage
This configuration is consumed by the evaluate function in action_tools.py, which copies it, appends dataset and model selections, and passes it to the opencompass CLI.
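The copy-and-append flow might look roughly like the sketch below. The helpers `prepare_eval_config` and `build_command` are hypothetical, and the exact text that action_tools.py appends is an assumption; only the `opencompass <config> -w <work_dir>` form is taken from this page.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_eval_config(src, work_dir, models, datasets):
    """Copy the base config into work_dir and append the selections.

    Hypothetical helper: the appended lines are illustrative only.
    """
    dst = Path(work_dir) / Path(src).name
    shutil.copy(src, dst)
    with dst.open('a') as f:
        # The names refer to variables already defined in the copied config.
        f.write('\nmodels = [' + ', '.join(models) + ']\n')
        f.write('datasets = sum([' + ', '.join(datasets) + '], [])\n')
    return dst


def build_command(config_path, work_dir):
    """Build the opencompass command line without running it."""
    return ['opencompass', str(config_path), '-w', str(work_dir)]


# Demonstration against a throwaway file instead of the real config.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / 'eval_chat_config.py'
    base.write_text('MAX_SESSION_LEN = 2048\n')
    work = Path(tmp) / 'work'
    work.mkdir()
    cfg = prepare_eval_config(
        base, work, ['turbomind_internlm2_chat_7b'], ['mmlu_datasets'])
    appended = cfg.read_text()
    command = build_command(cfg, work)
```

Appending to a copy leaves the checked-in configuration untouched, so each evaluation run can select its own models and datasets.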
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: .github/scripts/eval_chat_config.py
- Lines: 1-455
Signature
# Configuration variables (not functions)
MAX_SESSION_LEN = 2048
MAX_NEW_TOKENS = 1024
# Example model config pattern:
turbomind_internlm2_chat_7b = deepcopy(*lmdeploy_internlm2_chat_7b)
pytorch_internlm2_chat_7b = deepcopy(*lmdeploy_internlm2_chat_7b)
# Base model for Qwen3 family:
base_model = dict(
    type=TurboMindModelwithChatTemplate,
    engine_config=dict(session_len=32768, max_batch_size=256),
    gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=32768),
    ...
)
# Summarizer configuration
summarizer = dict(dataset_abbrs=[...], summary_groups=[...])
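The `deepcopy(*...)` pattern in the signature relies on the imported name being a list that holds exactly one config dict, so that `*` unpacks it into the single argument `deepcopy` copies. The sketch below uses a stand-in list with made-up values; that the imported OpenCompass configs have this one-element shape is an assumption here.

```python
from copy import deepcopy

# Stand-in for an imported OpenCompass model list (values are illustrative).
lmdeploy_internlm2_chat_7b = [
    dict(
        abbr='internlm2-chat-7b-lmdeploy',
        engine_config=dict(max_batch_size=16),
    )
]

# '*' unpacks the one-element list, so each call copies the inner dict.
turbomind_internlm2_chat_7b = deepcopy(*lmdeploy_internlm2_chat_7b)
pytorch_internlm2_chat_7b = deepcopy(*lmdeploy_internlm2_chat_7b)

# Changing one copy leaves the other copy and the imported original untouched.
pytorch_internlm2_chat_7b['engine_config']['max_batch_size'] = 64
```

A shallow copy would share the nested `engine_config` dict between variants, which is why a deep copy suits this pattern.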
Import
from copy import deepcopy
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| OpenCompass base configs | Python modules | Yes | Imported dataset and model configs from opencompass.configs |
| Model path conventions | str | Yes | Model paths following HuggingFace naming (e.g., "Qwen/Qwen3-32B") |
Outputs
| Name | Type | Description |
|---|---|---|
| Model config dicts | dict | Per-model configuration dictionaries consumed by OpenCompass |
| summarizer | dict | Summarizer configuration for aggregating evaluation results |
| datasets | list | Appended at runtime by the evaluate function in action_tools.py |
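To illustrate how the summarizer output relates to per-dataset scores, here is a small sketch. It assumes each summary group is a dict with `name` and `subsets` keys and that a group score is the unweighted mean of its subsets; the dataset names, the scores, and the `aggregate` helper are made up for the example and do not reproduce OpenCompass's own aggregation logic.

```python
from statistics import mean

# Illustrative summarizer in the shape summarizer = dict(dataset_abbrs=..., summary_groups=...).
summarizer = dict(
    dataset_abbrs=['mmlu', 'gsm8k'],
    summary_groups=[
        dict(name='mmlu', subsets=['mmlu_anatomy', 'mmlu_college_biology']),
    ],
)

# Made-up per-dataset scores.
scores = {
    'mmlu_anatomy': 60.0,
    'mmlu_college_biology': 70.0,
    'gsm8k': 50.0,
}


def aggregate(scores, summarizer):
    """Add one averaged score per summary group, then keep the listed abbreviations."""
    results = dict(scores)
    for group in summarizer['summary_groups']:
        results[group['name']] = mean(scores[s] for s in group['subsets'])
    return {abbr: results[abbr] for abbr in summarizer['dataset_abbrs']}


report = aggregate(scores, summarizer)
```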
Usage Examples
# This file is used indirectly via action_tools.py:
# python .github/scripts/action_tools.py evaluate \
# --models '["turbomind_internlm2_chat_7b"]' \
# --datasets '["mmlu_datasets"]' \
# --workspace /tmp/eval \
# --evaluate_type chat
# Or directly with opencompass:
# opencompass .github/scripts/eval_chat_config.py -w /tmp/work_dir