Implementation: ContextualAI HALOs LM Eval CLI
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Concrete tool for running standardized NLP benchmarks using the EleutherAI LM Evaluation Harness CLI.
Description
The lm_eval CLI (from EleutherAI's lm-evaluation-harness, cloned and installed during environment setup) provides a standardized way to evaluate HuggingFace models on multiple benchmarks. The HALOs framework uses it with the --model hf backend and a curated task list.
The evaluation runs locally on GPU, does not require external API access, and produces detailed results tables with per-task metrics and standard errors.
Usage
Run after training to evaluate model capabilities. Typically used alongside AlpacaEval for a complete evaluation picture.
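A minimal post-training wrapper can derive the log filename from the checkpoint path and chain straight into the harness. This is a sketch, not part of the HALOs repo: the model path, task subset, and naming convention below are illustrative.

```shell
#!/bin/sh
# Hypothetical post-training wrapper (paths and task subset are illustrative).
MODEL_PATH="/models/llama3-8B-kto/FINAL"

# Derive a log name like eval_llama3-8B-kto.log from the checkpoint path.
RUN_NAME=$(basename "$(dirname "$MODEL_PATH")")
LOG="eval_${RUN_NAME}.log"

lm_eval \
  --model hf \
  --model_args pretrained="$MODEL_PATH" \
  --tasks winogrande,mmlu \
  --batch_size auto \
  2>&1 | tee "$LOG"
```

The `tee` keeps the results table on screen while preserving the full log for later parsing.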
Code Reference
Source Location
- Repository: External (EleutherAI/lm-evaluation-harness, cloned by install.sh)
- File: CLI tool (`lm_eval` console entry point)
Signature
lm_eval \
--model hf \
--model_args pretrained=<model_path> \
--tasks <comma_separated_task_list> \
--batch_size auto
Import
# Installed by install.sh:
# git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
# cd lm-evaluation-harness && pip install -e .
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model | str | Yes | Model backend ('hf' for HuggingFace) |
| --model_args | str | Yes | pretrained=<path> for model location |
| --tasks | str | Yes | Comma-separated task list |
| --batch_size | str | No | 'auto' for automatic batch sizing |
Outputs
| Name | Type | Description |
|---|---|---|
| Results table | stdout/log | Per-task metrics: task name, version, n-shot, metric name, value, stderr |
| Log file | .log | Full evaluation output for later parsing by summarize_metrics.py |
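For a quick look at the per-task rows without waiting for summarize_metrics.py, the pipe-delimited results table in the log can be sliced with standard tools. The field positions below (task, metric, value in awk fields 2, 6, and 7) are an assumption about the harness's table layout, which varies across versions; adjust the field numbers if your log prints extra columns.

```shell
# Pull task/metric/value rows from an lm_eval log (sketch: the table
# layout is version-dependent). The grep pattern matches lowercase task
# rows and skips the capitalized header row (|Tasks|Version|...).
grep -E '^\|[a-z]' eval_llama3-8B-kto.log \
  | awk -F'|' '{gsub(/ /,"",$2); gsub(/ /,"",$6); gsub(/ /,"",$7); print $2, $6, $7}'
```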
Usage Examples
Standard HALOs Evaluation Suite
lm_eval \
--model hf \
--model_args pretrained=/models/llama3-8B-kto/FINAL \
--tasks winogrande,mmlu,gsm8k_cot,bbh_cot_fewshot,arc_easy,arc_challenge,hellaswag,ifeval \
--batch_size auto \
2>&1 | tee eval_llama3-8B-kto.log
Single Task Evaluation
lm_eval \
--model hf \
--model_args pretrained=/models/llama3-8B-dpo/FINAL \
--tasks mmlu \
--batch_size auto
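To compare several aligned checkpoints on the same benchmark, the single-task invocation extends naturally to a loop. The run names and `/models/...` paths below are placeholders; substitute your own training output directories.

```shell
# Evaluate multiple checkpoints on one benchmark (run names and paths
# are placeholders, not from the HALOs repo).
for run in llama3-8B-kto llama3-8B-dpo; do
  lm_eval \
    --model hf \
    --model_args pretrained="/models/${run}/FINAL" \
    --tasks mmlu \
    --batch_size auto \
    2>&1 | tee "eval_${run}_mmlu.log"
done
```

Each run gets its own log, so the per-model results remain separable for later summarization.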