
Implementation:Turboderp org Exllamav2 HumanEval Benchmark

From Leeroopedia
Revision as of 14:01, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Turboderp_org_Exllamav2_HumanEval_Benchmark.md)
Knowledge Sources
Domains: Evaluation, Benchmarking
Last Updated: 2026-02-15 00:00 GMT

Overview

CLI evaluation script that runs the HumanEval code-generation benchmark against an ExLlamaV2-quantized model, producing JSONL output and optionally computing pass@k via evaluate_functional_correctness.

Description

humaneval.py is a self-contained command-line tool that loads an EXL2 model through model_init, creates an ExLlamaV2DynamicGenerator with max_batch_size=256, and generates code completions for every HumanEval problem.

Key components:

  • Argument parser -- Accepts -o/--output (required JSONL path), -cs/--cache_size, -spt/--samples_per_task (default 200), cache quantisation flags (-cq4, -cq6, -cq8), --max_tokens (default 768), -pf/--prompt_format, -temp/--temperature (default 0.6), -topk/--top_k (default 50), -topp/--top_p (default 0.6), -repp/--repetition_penalty (default 1.0), -v/--verbose, and -e/--eval.
  • Prompt format templates -- A dictionary of five chat/completion templates: raw, granite, llama, llama3, and gemma. Each template inserts the problem text via a {{problem}} placeholder and defines a prefix string for indentation.
  • Job creation -- For each HumanEval problem, num_samples_per_task ExLlamaV2DynamicJob instances are enqueued with token_healing=True, min_new_tokens=6, and a (problem_id, sample_index) identifier tuple.
  • Completion collection -- The generator loop checks for EOS or a non-indented trailing line to detect the end of a function body. Completed samples are stored as dictionaries with task_id and completion keys.
  • Evaluation -- When -e/--eval is passed, the script invokes evaluate_functional_correctness as a subprocess on the output JSONL file.
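The template substitution, job identifiers, and end-of-function check described above can be sketched as follows. This is an illustration under stated assumptions, not the script's own code: EXAMPLE_TEMPLATE, fill_template, job_identifiers, and is_function_finished are hypothetical names, the template text is made up, and the stopping rule is one plausible reading of "a non-indented trailing line".

```python
# Hypothetical template: only the {{problem}} placeholder comes from the
# description above; the surrounding text is invented for illustration.
EXAMPLE_TEMPLATE = "Complete this function:\n\n{{problem}}"


def fill_template(template: str, problem: str) -> str:
    """Insert the problem text at the {{problem}} placeholder."""
    return template.replace("{{problem}}", problem)


def job_identifiers(problem_ids, samples_per_task):
    """Yield one (problem_id, sample_index) tuple per requested sample."""
    for problem_id in problem_ids:
        for sample_index in range(samples_per_task):
            yield (problem_id, sample_index)


def is_function_finished(completion: str) -> bool:
    """Return True once the last line is non-empty and starts at column 0.

    Assumption: a function body is fully indented, so a non-indented line
    after it means the model has moved past the function.
    """
    lines = completion.split("\n")
    if len(lines) < 2:
        return False
    last = lines[-1]
    return last != "" and not last[0].isspace()
```

A check of this kind lets a job stop as soon as the function body is complete, without waiting for EOS or the --max_tokens limit.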

Usage

Use this script to measure pass@k accuracy of any EXL2-quantized model on the OpenAI HumanEval benchmark. It is suitable for batch evaluation of base and instruction-tuned models across multiple prompt formats.
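For reference, pass@k is usually reported with the unbiased estimator introduced alongside HumanEval: with n samples per task of which c pass the tests, pass@k = 1 - C(n - c, k) / C(n, k), averaged over tasks. The sketch below computes that per-task value; the function name pass_at_k is chosen here for illustration, and this page does not state how evaluate_functional_correctness computes it internally.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-task pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the task (e.g. the -spt value)
    c: samples that passed the unit tests
    k: sample budget being estimated; must satisfy k <= n
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Because k cannot exceed n, a run with -spt 1 can only yield pass@1, while larger k values need at least that many samples per task.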

Code Reference

Source Location

Signature

# CLI entry point -- no importable class; executed directly
# Model arguments such as -m/--model_dir come from model_init and are not shown here
parser = argparse.ArgumentParser(description="Run HumanEval evaluation on EXL2 model")
parser.add_argument("-o", "--output", type=str, required=True)
parser.add_argument("-cs", "--cache_size", type=int, default=None)
parser.add_argument("-spt", "--samples_per_task", type=int, default=200)
parser.add_argument("-cq4", "--cache_q4", action="store_true")
parser.add_argument("-cq6", "--cache_q6", action="store_true")
parser.add_argument("-cq8", "--cache_q8", action="store_true")
parser.add_argument("--max_tokens", type=int, default=768)
parser.add_argument("-pf", "--prompt_format", type=str, default="raw")
parser.add_argument("-temp", "--temperature", type=float, default=0.6)
parser.add_argument("-topk", "--top_k", type=int, default=50)
parser.add_argument("-topp", "--top_p", type=float, default=0.6)
parser.add_argument("-repp", "--repetition_penalty", type=float, default=1.0)
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("-e", "--eval", action="store_true")

Import

# Script executed directly via CLI
python eval/humaneval.py -o results.jsonl -m /path/to/model -pf llama3 -e

I/O Contract

Inputs

Name | Type | Required | Description
-o / --output | str | Yes | Path to output JSONL file for generated samples
-m / --model_dir | str | Yes | Path to EXL2/HuggingFace model directory (via model_init)
-spt / --samples_per_task | int | No (default 200) | Number of completion samples to generate per HumanEval problem
-cq4 / -cq6 / -cq8 | flag | No | Use Q4, Q6, or Q8 quantised KV cache respectively
--max_tokens | int | No (default 768) | Maximum number of tokens per completion
-pf / --prompt_format | str | No (default raw completion) | Prompt template: raw, granite, llama, llama3, or gemma
-temp / --temperature | float | No (default 0.6) | Sampling temperature (0 for greedy)
-topk / --top_k | int | No (default 50) | Top-k sampling cutoff
-topp / --top_p | float | No (default 0.6) | Nucleus (top-p) sampling threshold
-repp / --repetition_penalty | float | No (default 1.0) | Token repetition penalty
-v / --verbose | flag | No | Print each completion to the console
-e / --eval | flag | No | Run evaluate_functional_correctness on the output after sampling

Outputs

Name | Type | Description
JSONL file | file | One JSON object per sample with task_id (str) and completion (str) fields, written to the path given by --output
Console summary | text | Progress bars during generation; optional verbose output of each completion
Evaluation results | text | If -e is passed, pass@k metrics printed by evaluate_functional_correctness
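The JSONL output can be inspected with a few lines of standard-library Python, for example to confirm that every task received the expected number of samples. This is a minimal sketch that assumes only the task_id and completion fields listed above; load_samples and samples_per_task are hypothetical helper names, not part of the script.

```python
import json
from collections import Counter


def load_samples(path: str) -> list[dict]:
    """Read one JSON object per non-empty line from a JSONL file."""
    samples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples


def samples_per_task(samples: list[dict]) -> Counter:
    """Count how many completions were recorded for each task_id."""
    return Counter(sample["task_id"] for sample in samples)
```

Comparing the counts against the -spt value is a quick way to spot a truncated or interrupted run before evaluating.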

Usage Examples

Basic Greedy Evaluation

# Run HumanEval with greedy decoding on a Llama-3 instruct model
python eval/humaneval.py \
    -m /models/llama3-8b-exl2 \
    -o humaneval_llama3.jsonl \
    -pf llama3 \
    -temp 0 \
    -spt 1 \
    -e

Sampling with Q4 Cache

# Generate 200 samples per task using Q4 cache for memory savings
python eval/humaneval.py \
    -m /models/codellama-34b-exl2 \
    -o humaneval_codellama.jsonl \
    -pf raw \
    -cq4 \
    -spt 200 \
    --max_tokens 512 \
    -temp 0.6 \
    -topk 50 \
    -topp 0.6
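The command above omits -e, so it writes samples without printing pass@k. Assuming the evaluate_functional_correctness command is installed and on PATH, the saved file can be scored afterwards. The sketch below shows one way to do that from Python; build_eval_command and run_eval are hypothetical helpers, and the exact subprocess call the script itself makes is not shown on this page.

```python
import shutil
import subprocess


def build_eval_command(samples_path: str) -> list[str]:
    """Build the argument list for scoring a samples file."""
    return ["evaluate_functional_correctness", samples_path]


def run_eval(samples_path: str) -> int:
    """Run the evaluator on samples_path and return its exit code.

    Raises FileNotFoundError if the evaluator command is not on PATH.
    """
    cmd = build_eval_command(samples_path)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} was not found on PATH")
    # Output is not captured, so any metrics go straight to the console.
    return subprocess.run(cmd, check=False).returncode
```

For the example above this would be called as run_eval("humaneval_codellama.jsonl"). Passing the arguments as a list avoids shell quoting problems with unusual file names.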

Related Pages

Implements Principle

Requires Environment

Depends On
