
Implementation:ContextualAI HALOs LM Eval CLI

From Leeroopedia


Knowledge Sources
Domains: NLP, Evaluation
Last Updated: 2026-02-08 03:00 GMT

Overview

Concrete tool for running standardized NLP benchmarks using the EleutherAI LM Evaluation Harness CLI.

Description

The lm_eval CLI (from EleutherAI's lm-evaluation-harness, cloned and installed during environment setup) provides a standardized way to evaluate HuggingFace models on multiple benchmarks. The HALOs framework uses it with the --model hf backend and a curated task list.

The evaluation runs locally on GPU, does not require external API access, and produces detailed results tables with per-task metrics and standard errors.

Usage

Run after training to evaluate model capabilities. Typically used alongside AlpacaEval for a complete evaluation picture.

Code Reference

Source Location

  • Repository: External (EleutherAI/lm-evaluation-harness, cloned by install.sh)
  • File: CLI tool lm_eval

Signature

lm_eval \
    --model hf \
    --model_args pretrained=<model_path> \
    --tasks <comma_separated_task_list> \
    --batch_size auto

Import

# Installed by install.sh:
# git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
# cd lm-evaluation-harness && pip install -e .

I/O Contract

Inputs

Name          Type  Required  Description
--model       str   Yes       Model backend ('hf' for HuggingFace)
--model_args  str   Yes       pretrained=<path> for model location
--tasks       str   Yes       Comma-separated task list
--batch_size  str   No        'auto' for automatic batch sizing

Outputs

Name           Type        Description
Results table  stdout/log  Per-task metrics: task name, version, n-shot, metric name, value, stderr
Log file       .log        Full evaluation output for later parsing by summarize_metrics.py
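The harness prints its results as a pipe-delimited table, which summarize_metrics.py later parses from the saved log. As a rough illustration, per-task values can also be pulled out with standard shell tools. Note the table layout below is an assumption modelled on typical harness output (it varies by version), and sample_eval.log with its numbers is a fabricated stand-in, not real results.

```shell
# Hedged sketch: extract per-task metric rows from a saved lm_eval log.
# The pipe-delimited layout and the numbers below are ASSUMED placeholder
# data for illustration only.
cat > sample_eval.log <<'EOF'
|  Tasks   |Version|Filter|n-shot|Metric|Value|Stderr|
|----------|------:|------|-----:|------|----:|-----:|
|winogrande|      1|none  |     5|acc   |0.754|0.0121|
|hellaswag |      1|none  |     0|acc   |0.591|0.0049|
EOF

# Print "task value stderr" for each data row
# (fields 2, 7, and 8 between the pipe delimiters).
awk -F'|' '$6 ~ /acc/ {gsub(/ /,"",$2); print $2, $7, $8}' sample_eval.log
# → winogrande 0.754 0.0121
# → hellaswag 0.591 0.0049
```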

Usage Examples

Standard HALOs Evaluation Suite

lm_eval \
    --model hf \
    --model_args pretrained=/models/llama3-8B-kto/FINAL \
    --tasks winogrande,mmlu,gsm8k_cot,bbh_cot_fewshot,arc_easy,arc_challenge,hellaswag,ifeval \
    --batch_size auto \
    2>&1 | tee eval_llama3-8B-kto.log

Single Task Evaluation

lm_eval \
    --model hf \
    --model_args pretrained=/models/llama3-8B-dpo/FINAL \
    --tasks mmlu \
    --batch_size auto
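When several aligned checkpoints (for example, the KTO and DPO variants) need the same suite, the invocation can be wrapped in a loop. The sketch below only echoes the commands it would run (a dry run), so the structure is visible without launching an evaluation; the model names and paths are illustrative placeholders, not paths mandated by the HALOs repo.

```shell
# Dry-run sketch: print one lm_eval command per checkpoint.
# Model names and /models/... paths are hypothetical placeholders;
# drop the leading "echo" to actually launch the evaluations.
TASKS="winogrande,mmlu,gsm8k_cot,arc_easy"
for model in llama3-8B-kto llama3-8B-dpo; do
    echo lm_eval \
        --model hf \
        --model_args pretrained=/models/${model}/FINAL \
        --tasks "$TASKS" \
        --batch_size auto
done
```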

Related Pages

Implements Principle

Requires Environment
