Implementation:Turboderp org Exllamav2 HumanEval Benchmark
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmarking |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
CLI evaluation script that runs the HumanEval code-generation benchmark against an ExLlamaV2-quantized model, producing JSONL output and optionally computing pass@k via evaluate_functional_correctness.
Description
humaneval.py is a self-contained command-line tool that loads an EXL2 model through model_init, creates an ExLlamaV2DynamicGenerator with max_batch_size=256, and generates code completions for every HumanEval problem.
Key components:
- Argument parser -- Accepts -o/--output (required JSONL path), -cs/--cache_size, -spt/--samples_per_task (default 200), cache quantisation flags (-cq4, -cq6, -cq8), --max_tokens (default 768), -pf/--prompt_format, -temp/--temperature (default 0.6), -topk/--top_k (default 50), -topp/--top_p (default 0.6), -repp/--repetition_penalty (default 1.0), -v/--verbose, and -e/--eval.
- Prompt format templates -- A dictionary of five chat/completion templates: raw, granite, llama, llama3, and gemma. Each template inserts the problem text via a {{problem}} placeholder and defines a prefix string for indentation.
- Job creation -- For each HumanEval problem, num_samples_per_task ExLlamaV2DynamicJob instances are enqueued with token_healing=True, min_new_tokens=6, and a (problem_id, sample_index) identifier tuple.
- Completion collection -- The generator loop checks for EOS or a non-indented trailing line to detect the end of a function body. Completed samples are stored as dictionaries with task_id and completion keys.
- Evaluation -- When -e/--eval is passed, the script invokes evaluate_functional_correctness as a subprocess on the output JSONL file.
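The end-of-function check in the completion-collection step can be illustrated in isolation. The sketch below is an assumption about how such a check might look, not the script's actual code: the helper name trim_at_dedent is hypothetical, and it only inspects the text that follows the last newline.

```python
def trim_at_dedent(completion: str) -> tuple[str, bool]:
    """Return (text, finished) for a partially generated function body.

    Hypothetical helper: if the text after the last newline starts with a
    non-whitespace character, the model has left the indented function body,
    so that trailing line is dropped and the sample is marked finished.
    """
    last_newline = completion.rfind("\n")
    if last_newline < 0:
        # Still on the first line; nothing to decide yet.
        return completion, False

    last_line = completion[last_newline + 1:]
    if last_line and not last_line[0].isspace():
        # Non-indented trailing line: cut it off and stop this sample.
        return completion[:last_newline], True

    # Trailing line is empty or still indented: keep generating.
    return completion, False


# Example: the model finished the body and started unrelated top-level code.
text, finished = trim_at_dedent("    return a + b\nprint(add(1, 2))")
```

In a generation loop, a check like this would be combined with the generator's EOS signal, so a sample ends on whichever condition is met first.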
Usage
Use this script to measure pass@k accuracy of any EXL2-quantized model on the OpenAI HumanEval benchmark. It is suitable for batch evaluation of base and instruction-tuned models across multiple prompt formats.
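For reference, pass@k on HumanEval is usually reported with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples for a task and c is the number that pass, averaged over all tasks. The sketch below computes it with the standard library. The function names are illustrative, and it is assumed rather than verified here that evaluate_functional_correctness reports the same quantity.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Because the estimator needs k to be at most n, a run with -spt 1 can only yield pass@1, while larger k values require a correspondingly larger -spt.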
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: eval/humaneval.py
- Lines: L1-221
Signature
# CLI entry point -- no importable class; executed directly
# Model arguments such as -m/--model_dir are not listed here; they are handled via model_init
parser = argparse.ArgumentParser(description="Run HumanEval evaluation on EXL2 model")
parser.add_argument("-o", "--output", type=str, required=True)
parser.add_argument("-cs", "--cache_size", type=int, default=None)
parser.add_argument("-spt", "--samples_per_task", type=int, default=200)
parser.add_argument("-cq4", "--cache_q4", action="store_true")
parser.add_argument("-cq6", "--cache_q6", action="store_true")
parser.add_argument("-cq8", "--cache_q8", action="store_true")
parser.add_argument("--max_tokens", type=int, default=768)
parser.add_argument("-pf", "--prompt_format", type=str)
parser.add_argument("-temp", "--temperature", type=float, default=0.6)
parser.add_argument("-topk", "--top_k", type=int, default=50)
parser.add_argument("-topp", "--top_p", type=float, default=0.6)
parser.add_argument("-repp", "--repetition_penalty", type=float, default=1.0)
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("-e", "--eval", action="store_true")
Import
# Script executed directly via CLI
python eval/humaneval.py -o results.jsonl -m /path/to/model -pf llama3 -e
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -o / --output | str | Yes | Path to output JSONL file for generated samples |
| -m / --model_dir | str | Yes | Path to EXL2/HuggingFace model directory (via model_init) |
| -cs / --cache_size | int | No (default None) | Cache size; left unset (None) when omitted |
| -spt / --samples_per_task | int | No (default 200) | Number of completion samples to generate per HumanEval problem |
| -cq4 / -cq6 / -cq8 | flag | No | Use Q4, Q6, or Q8 quantised KV cache respectively |
| --max_tokens | int | No (default 768) | Maximum number of tokens per completion |
| -pf / --prompt_format | str | No (default raw completion) | Prompt template: raw, granite, llama, llama3, or gemma |
| -temp / --temperature | float | No (default 0.6) | Sampling temperature (0 for greedy) |
| -topk / --top_k | int | No (default 50) | Top-k sampling cutoff |
| -topp / --top_p | float | No (default 0.6) | Nucleus (top-p) sampling threshold |
| -repp / --repetition_penalty | float | No (default 1.0) | Token repetition penalty |
| -v / --verbose | flag | No | Print each completion to the console |
| -e / --eval | flag | No | Run evaluate_functional_correctness on the output after sampling |
Outputs
| Name | Type | Description |
|---|---|---|
| JSONL file | file | One JSON object per sample with task_id (str) and completion (str) fields, written to the path given by --output |
| Console summary | text | Progress bars during generation; optional verbose output of each completion |
| Evaluation results | text | If -e is passed, pass@k metrics printed by evaluate_functional_correctness |
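The JSONL layout described in the Outputs table is simple enough to produce and inspect with the standard library. The following is a minimal sketch that assumes only the two documented fields, task_id and completion. The helper names and the sample data are made up for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def write_samples(path, samples):
    """Write one JSON object per line with task_id and completion fields."""
    with Path(path).open("w", encoding="utf-8") as f:
        for sample in samples:
            record = {
                "task_id": sample["task_id"],
                "completion": sample["completion"],
            }
            f.write(json.dumps(record) + "\n")


def count_samples_per_task(path):
    """Count how many completions the file holds for each task_id."""
    counts = Counter()
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["task_id"]] += 1
    return counts
```

Counting samples per task before running the evaluation is a cheap way to confirm that the file holds the expected -spt completions for every problem.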
Usage Examples
Basic Greedy Evaluation
# Run HumanEval with greedy decoding on a Llama-3 instruct model
python eval/humaneval.py \
-m /models/llama3-8b-exl2 \
-o humaneval_llama3.jsonl \
-pf llama3 \
-temp 0 \
-spt 1 \
-e
Sampling with Q4 Cache
# Generate 200 samples per task using Q4 cache for memory savings
python eval/humaneval.py \
-m /models/codellama-34b-exl2 \
-o humaneval_codellama.jsonl \
-pf raw \
-cq4 \
-spt 200 \
--max_tokens 512 \
-temp 0.6 \
-topk 50 \
-topp 0.6
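Repeating a Run Across Prompt Formats
To repeat a run like the ones above across several prompt formats, the command lines can be assembled programmatically. This is a sketch only: build_command is a hypothetical helper, the paths are placeholders, and it uses just the flags documented on this page.

```python
import shlex


def build_command(model_dir, output, prompt_format=None, samples_per_task=200,
                  temperature=0.6, cache_q4=False, evaluate=False):
    """Assemble an argv list for eval/humaneval.py (hypothetical helper)."""
    cmd = [
        "python", "eval/humaneval.py",
        "-m", model_dir,
        "-o", output,
        "-spt", str(samples_per_task),
        "-temp", str(temperature),
    ]
    if prompt_format is not None:
        # Omitting -pf leaves the script on its default raw completion mode.
        cmd += ["-pf", prompt_format]
    if cache_q4:
        cmd.append("-cq4")
    if evaluate:
        cmd.append("-e")
    return cmd


# One greedy, single-sample run per prompt format, each with its own output file.
commands = [
    build_command("/models/llama3-8b-exl2", f"humaneval_{pf}.jsonl",
                  prompt_format=pf, samples_per_task=1, temperature=0,
                  evaluate=True)
    for pf in ("raw", "llama3")
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) to launch the run. Writing a separate output file per format keeps the JSONL results from overwriting one another.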