Implementation: ContextualAI HALOs LM Eval CLI
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
Concrete tool for running standardized NLP benchmarks using the EleutherAI LM Evaluation Harness CLI.
Description
The lm_eval CLI (from EleutherAI's lm-evaluation-harness, cloned and installed during environment setup) provides a standardized way to evaluate HuggingFace models on multiple benchmarks. The HALOs framework uses it with the --model hf backend and a curated task list.
The evaluation runs locally on GPU, does not require external API access, and produces detailed results tables with per-task metrics and standard errors.
Usage
Run after training to evaluate model capabilities. Typically used alongside AlpacaEval for a complete evaluation picture.
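A minimal post-training wrapper can derive the log filename from the checkpoint path and chain straight into the harness. This is a sketch, not part of the HALOs repo: the model path, task subset, and naming convention below are illustrative.

```shell
#!/bin/sh
# Hypothetical post-training wrapper (paths and task subset are illustrative).
MODEL_PATH="/models/llama3-8B-kto/FINAL"

# Derive a log name like eval_llama3-8B-kto.log from the checkpoint path.
RUN_NAME=$(basename "$(dirname "$MODEL_PATH")")
LOG="eval_${RUN_NAME}.log"

lm_eval \
  --model hf \
  --model_args pretrained="$MODEL_PATH" \
  --tasks winogrande,mmlu \
  --batch_size auto \
  2>&1 | tee "$LOG"
```

The `tee` keeps the results table on screen while preserving the full log for later parsing.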
Code Reference
Source Location
- Repository: External (EleutherAI/lm-evaluation-harness, cloned by install.sh)
- File: CLI tool (`lm_eval` console entry point)
Signature
lm_eval \
--model hf \
--model_args pretrained=<model_path> \
--tasks <comma_separated_task_list> \
--batch_size auto
Import
# Installed by install.sh:
# git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
# cd lm-evaluation-harness && pip install -e .
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model | str | Yes | Model backend ('hf' for HuggingFace) |
| --model_args | str | Yes | pretrained=<path> for model location |
| --tasks | str | Yes | Comma-separated task list |
| --batch_size | str | No | 'auto' for automatic batch sizing |
Outputs
| Name | Type | Description |
|---|---|---|
| Results table | stdout/log | Per-task metrics: task name, version, n-shot, metric name, value, stderr |
| Log file | .log | Full evaluation output for later parsing by summarize_metrics.py |
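For a quick look at the per-task rows without waiting for summarize_metrics.py, the pipe-delimited results table in the log can be sliced with standard tools. The field positions below (task, metric, value in awk fields 2, 6, and 7) are an assumption about the harness's table layout, which varies across versions; adjust the field numbers if your log prints extra columns.

```shell
# Pull task/metric/value rows from an lm_eval log (sketch: the table
# layout is version-dependent). The grep pattern matches lowercase task
# rows and skips the capitalized header row (|Tasks|Version|...).
grep -E '^\|[a-z]' eval_llama3-8B-kto.log \
  | awk -F'|' '{gsub(/ /,"",$2); gsub(/ /,"",$6); gsub(/ /,"",$7); print $2, $6, $7}'
```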
Usage Examples
Standard HALOs Evaluation Suite
lm_eval \
--model hf \
--model_args pretrained=/models/llama3-8B-kto/FINAL \
--tasks winogrande,mmlu,gsm8k_cot,bbh_cot_fewshot,arc_easy,arc_challenge,hellaswag,ifeval \
--batch_size auto \
2>&1 | tee eval_llama3-8B-kto.log
Single Task Evaluation
lm_eval \
--model hf \
--model_args pretrained=/models/llama3-8B-dpo/FINAL \
--tasks mmlu \
--batch_size auto
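To compare several aligned checkpoints on the same benchmark, the single-task invocation extends naturally to a loop. The run names and `/models/...` paths below are placeholders; substitute your own training output directories.

```shell
# Evaluate multiple checkpoints on one benchmark (run names and paths
# are placeholders, not from the HALOs repo).
for run in llama3-8B-kto llama3-8B-dpo; do
  lm_eval \
    --model hf \
    --model_args pretrained="/models/${run}/FINAL" \
    --tasks mmlu \
    --batch_size auto \
    2>&1 | tee "eval_${run}_mmlu.log"
done
```

Each run gets its own log, so the per-model results remain separable for later summarization.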