Implementation:Turboderp org Exllamav2 HumanEval Benchmark

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Evaluation, Benchmarking
Last Updated	2026-02-15 00:00 GMT

Overview

CLI evaluation script that runs the HumanEval code-generation benchmark against an ExLlamaV2-quantized model, producing JSONL output and optionally computing pass@k via evaluate_functional_correctness.

Description

humaneval.py is a self-contained command-line tool that loads an EXL2 model through model_init, creates an ExLlamaV2DynamicGenerator with max_batch_size=256, and generates code completions for every HumanEval problem.

Key components:

Argument parser -- Accepts -o/--output (required JSONL path), -cs/--cache_size, -spt/--samples_per_task (default 200), cache quantisation flags (-cq4, -cq6, -cq8), --max_tokens (default 768), -pf/--prompt_format, -temp/--temperature (default 0.6), -topk/--top_k (default 50), -topp/--top_p (default 0.6), -repp/--repetition_penalty (default 1.0), -v/--verbose, and -e/--eval.
Prompt format templates -- A dictionary of five chat/completion templates: raw, granite, llama, llama3, and gemma. Each template inserts the problem text via a {{problem}} placeholder and defines a prefix string for indentation.
Job creation -- For each HumanEval problem, num_samples_per_task ExLlamaV2DynamicJob instances are enqueued with token_healing=True, min_new_tokens=6, and a (problem_id, sample_index) identifier tuple.
Completion collection -- The generator loop checks for EOS or a non-indented trailing line to detect the end of a function body. Completed samples are stored as dictionaries with task_id and completion keys.
Evaluation -- When -e/--eval is passed, the script invokes the evaluate_functional_correctness command (installed by OpenAI's human-eval package) as a subprocess on the output JSONL file.
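
The template substitution, job identifiers, and end-of-completion check described above can be sketched in plain Python. This is an illustrative approximation rather than the script's actual code: the "demo" template text, the helper names render_prompt, job_identifiers, and is_complete, and the exact stop rule are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical template table: (prompt template, prefix for the completion).
# The real script defines five formats; this single entry is made up.
PROMPT_FORMATS = {
    "demo": ("Complete the following Python function:\n\n{{problem}}", "    "),
}


def render_prompt(format_name: str, problem: str) -> Tuple[str, str]:
    """Insert the problem text into the chosen template; return (prompt, prefix)."""
    template, prefix = PROMPT_FORMATS[format_name]
    return template.replace("{{problem}}", problem), prefix


def job_identifiers(problem_ids: List[str], samples_per_task: int) -> List[Tuple[str, int]]:
    """One (problem_id, sample_index) tuple per job to enqueue."""
    return [(pid, i) for pid in problem_ids for i in range(samples_per_task)]


def is_complete(text: str, eos: bool) -> bool:
    """Assumed heuristic: a sample is finished on EOS, or once a finished,
    non-blank line starts without indentation (the function body has ended)."""
    if eos:
        return True
    lines = text.split("\n")
    if len(lines) < 2:
        # No newline yet, so no line has been fully generated.
        return False
    last_full = lines[-2]  # last line that was terminated by a newline
    return bool(last_full.strip()) and not last_full[0].isspace()
```

Checking only newline-terminated lines avoids cutting a sample off while an indented line is still being generated token by token.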

Usage

Use this script to measure pass@k accuracy of any EXL2-quantized model on the OpenAI HumanEval benchmark. It is suitable for batch evaluation of base and instruction-tuned models across multiple prompt formats.
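
For reference, pass@k on HumanEval is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per task and c is the number that pass the unit tests; the per-task values are then averaged. The sketch below illustrates that formula only. The function names are hypothetical, and it is not the code of evaluate_functional_correctness.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-task estimates over all tasks."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

Because k cannot exceed n, the number of samples per task (-spt) bounds the largest k that can be reported; a greedy run with -spt 1 yields pass@1 only.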

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: eval/humaneval.py
Lines: L1-221

Signature

# CLI entry point -- no importable class; executed directly.
# Model flags such as -m/--model_dir are not listed here; they are registered through model_init.
parser = argparse.ArgumentParser(description="Run HumanEval evaluation on EXL2 model")
parser.add_argument("-o", "--output", type=str, required=True)
parser.add_argument("-cs", "--cache_size", type=int, default=None)
parser.add_argument("-spt", "--samples_per_task", type=int, default=200)
parser.add_argument("-cq4", "--cache_q4", action="store_true")
parser.add_argument("-cq6", "--cache_q6", action="store_true")
parser.add_argument("-cq8", "--cache_q8", action="store_true")
parser.add_argument("--max_tokens", type=int, default=768)
parser.add_argument("-pf", "--prompt_format", type=str)
parser.add_argument("-temp", "--temperature", type=float, default=0.6)
parser.add_argument("-topk", "--top_k", type=int, default=50)
parser.add_argument("-topp", "--top_p", type=float, default=0.6)
parser.add_argument("-repp", "--repetition_penalty", type=float, default=1.0)
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("-e", "--eval", action="store_true")

Import

# Script executed directly via CLI
python eval/humaneval.py -o results.jsonl -m /path/to/model -pf llama3 -e

I/O Contract

Inputs

Name	Type	Required	Description
-o / --output	str	Yes	Path to output JSONL file for generated samples
-m / --model_dir	str	Yes	Path to EXL2/HuggingFace model directory (via model_init)
-spt / --samples_per_task	int	No (default 200)	Number of completion samples to generate per HumanEval problem
-cq4 / -cq6 / -cq8	flag	No	Use Q4, Q6, or Q8 quantised KV cache respectively
--max_tokens	int	No (default 768)	Maximum number of tokens per completion
-pf / --prompt_format	str	No (default raw completion)	Prompt template: raw, granite, llama, llama3, or gemma
-temp / --temperature	float	No (default 0.6)	Sampling temperature (0 for greedy)
-topk / --top_k	int	No (default 50)	Top-k sampling cutoff
-topp / --top_p	float	No (default 0.6)	Nucleus (top-p) sampling threshold
-repp / --repetition_penalty	float	No (default 1.0)	Token repetition penalty
-v / --verbose	flag	No	Print each completion to the console
-e / --eval	flag	No	Run evaluate_functional_correctness on the output after sampling

Outputs

Name	Type	Description
JSONL file	file	One JSON object per sample with task_id (str) and completion (str) fields, written to the path given by --output
Console summary	text	Progress bars during generation; optional verbose output of each completion
Evaluation results	text	If -e is passed, pass@k metrics printed by evaluate_functional_correctness
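
The JSONL layout in the table above can be written and read back with the standard library alone. The following is a minimal sketch under that assumption; the helper names and the example task IDs are illustrative and not taken from the script.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def write_samples(path: str, samples: Iterable[Dict[str, str]]) -> None:
    """Write one JSON object per line with task_id and completion fields."""
    with open(path, "w", encoding="utf-8") as f:
        for s in samples:
            record = {"task_id": s["task_id"], "completion": s["completion"]}
            f.write(json.dumps(record) + "\n")


def read_samples(path: str) -> List[Dict[str, str]]:
    """Load every non-empty line of a JSONL file as a dictionary."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def count_per_task(samples: Iterable[Dict[str, str]]) -> Counter:
    """Count samples per task_id, e.g. to confirm -spt samples were produced."""
    return Counter(s["task_id"] for s in samples)
```

Counting samples per task before evaluation is a cheap way to detect a truncated or interrupted run.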

Usage Examples

Basic Greedy Evaluation

# Run HumanEval with greedy decoding on a Llama-3 instruct model
python eval/humaneval.py \
    -m /models/llama3-8b-exl2 \
    -o humaneval_llama3.jsonl \
    -pf llama3 \
    -temp 0 \
    -spt 1 \
    -e

Sampling with Q4 Cache

# Generate 200 samples per task using Q4 cache for memory savings
python eval/humaneval.py \
    -m /models/codellama-34b-exl2 \
    -o humaneval_codellama.jsonl \
    -pf raw \
    -cq4 \
    -spt 200 \
    --max_tokens 512 \
    -temp 0.6 \
    -topk 50 \
    -topp 0.6
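
Sweeping Prompt Formats

To compare one model across several prompt formats, the command line can be assembled programmatically. This is a hedged sketch that uses only the flags documented on this page; the helper names, the output file naming scheme, and the dry_run switch are assumptions introduced for the example.

```python
import subprocess
from typing import List, Sequence

# Format names as listed on this page.
PROMPT_FORMATS = ["raw", "granite", "llama", "llama3", "gemma"]


def build_command(model_dir: str, prompt_format: str, samples_per_task: int = 1,
                  temperature: float = 0.0, run_eval: bool = True) -> List[str]:
    """Assemble the argument list for one humaneval.py run."""
    cmd = [
        "python", "eval/humaneval.py",
        "-m", model_dir,
        "-o", f"humaneval_{prompt_format}.jsonl",  # hypothetical naming scheme
        "-pf", prompt_format,
        "-spt", str(samples_per_task),
        "-temp", str(temperature),
    ]
    if run_eval:
        cmd.append("-e")
    return cmd


def run_sweep(model_dir: str, formats: Sequence[str] = PROMPT_FORMATS,
              dry_run: bool = True) -> List[List[str]]:
    """Build one command per prompt format; execute them unless dry_run is set."""
    commands = [build_command(model_dir, pf) for pf in formats]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop the sweep on the first failure
    return commands
```

With dry_run=True the function only returns the commands, which makes it easy to inspect them before committing GPU time.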

Related Pages

Implements Principle

Principle:Turboderp_org_Exllamav2_Benchmark_Evaluation

Requires Environment

Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime

Depends On
