LLM_Evaluation Utilities
Overview
The Legacy Utils module provides utility functions for the Phoenix Evals subsystem, covering dataset handling, output parsing, media format detection, and progress tracking. These utilities are consumed throughout the classification and generation pipelines.
The module includes download_benchmark_dataset() for fetching Arize evaluation benchmark datasets from cloud storage, snap_to_rail() for normalizing raw LLM output strings to the nearest valid classification label, parse_openai_function_call() and openai_function_call_kwargs() for handling OpenAI function calling output format, get_audio_format_from_base64() and get_image_format_from_base64() for detecting media formats from base64-encoded data via file signature inspection, and get_tqdm_progress_bar_formatter() for consistent progress bar formatting.
The module also defines the NOT_PARSABLE sentinel string used when LLM output cannot be mapped to any valid rail, and constants for SUPPORTED_AUDIO_FORMATS and SUPPORTED_IMAGE_FORMATS.
Code Reference
| Attribute |
Details
|
| Source File |
packages/phoenix-evals/src/phoenix/evals/legacy/utils.py
|
| Repository |
Arize-ai/phoenix
|
| Lines |
330
|
| Module |
phoenix.evals.legacy.utils
|
| Key Symbols |
download_benchmark_dataset(), snap_to_rail(), parse_openai_function_call(), openai_function_call_kwargs(), get_audio_format_from_base64(), get_image_format_from_base64(), get_tqdm_progress_bar_formatter(), printif(), emoji_guard(), NOT_PARSABLE
|
| Dependencies |
pandas, tqdm, base64, json
|
I/O Contract
download_benchmark_dataset()
| Parameter |
Type |
Description
|
task |
str |
Evaluation task name (used in the storage URL path).
|
dataset_name |
str |
Name of the benchmark dataset.
|
| Returns |
pd.DataFrame |
DataFrame loaded from a JSONL file within a ZIP archive fetched from Google Cloud Storage.
|
| Raises |
ValueError |
If the dataset does not exist at the expected URL.
|
snap_to_rail()
| Parameter |
Type |
Description
|
raw_string |
Optional[str] |
The raw LLM output to be matched against rails.
|
rails |
List[str] |
Valid output labels to snap to.
|
verbose |
bool |
If True, prints debug information about snapping.
|
| Returns |
str |
The matched rail string, or "NOT_PARSABLE" if exactly one rail is not found.
|
The function performs case-insensitive matching and checks that exactly one rail is found in the input string. Rails are sorted by length (longest first) to avoid substring conflicts.
parse_openai_function_call()
| Parameter |
Type |
Description
|
raw_output |
str |
Raw JSON output from an OpenAI function call.
|
| Returns |
Tuple[str, Optional[str]] |
Tuple of (unrailed_label, optional_explanation). Falls back to (raw_output, None) on JSON parse failure.
|
openai_function_call_kwargs()
| Parameter |
Type |
Description
|
rails |
List[str] |
Valid classification labels.
|
provide_explanation |
bool |
Whether to include an explanation field in the function schema.
|
| Returns |
Dict[str, Any] |
Dictionary with functions and function_call keys for OpenAI API invocation.
|
The generated function schema is named "record_response" and constrains the response field to the provided rails via an enum.
get_audio_format_from_base64()
| Parameter |
Type |
Description
|
enc_str |
str |
Base64-encoded audio data.
|
| Returns |
Literal["mp3", "wav", "ogg", "flac", "m4a", "aac"] |
Detected audio format based on file signature (magic bytes).
|
| Raises |
ValueError |
If the format cannot be determined or data is too short.
|
get_image_format_from_base64()
| Parameter |
Type |
Description
|
enc_str |
str |
Base64-encoded image data.
|
| Returns |
Literal["png", "jpeg", "jpg", "webp", "heic", "heif", "bmp", "gif", "tiff", "ico"] |
Detected image format based on file signature (magic bytes).
|
| Raises |
ValueError |
If the format cannot be determined or data is too short.
|
Helper Functions
| Function |
Description
|
get_tqdm_progress_bar_formatter(title) |
Returns a tqdm bar_format string with the given title prefix, elapsed/remaining time, and rate display.
|
printif(condition, *args, **kwargs) |
Conditionally prints via tqdm.write() only when condition is True.
|
emoji_guard(emoji, fallback) |
Returns the emoji string on non-Windows systems, or the fallback on Windows (to avoid encoding issues).
|
Constants
| Constant |
Value |
Description
|
NOT_PARSABLE |
"NOT_PARSABLE" |
Sentinel returned when LLM output cannot be snapped to any rail.
|
SUPPORTED_AUDIO_FORMATS |
{"mp3", "wav"} |
Set of supported audio format identifiers.
|
SUPPORTED_IMAGE_FORMATS |
{"png", "jpeg", "jpg", "webp", "heic", "heif", "bmp", "gif", "tiff", "ico"} |
Set of supported image format identifiers.
|
Usage Examples
from phoenix.evals.legacy.utils import (
download_benchmark_dataset,
snap_to_rail,
parse_openai_function_call,
openai_function_call_kwargs,
get_audio_format_from_base64,
)
# Download a benchmark dataset
df = download_benchmark_dataset(
task="relevance",
dataset_name="wiki_qa-train",
)
# Snap raw LLM output to a rail
label = snap_to_rail(
"The answer is clearly relevant to the question.",
rails=["relevant", "unrelated"],
)
# label = "relevant"
# Handle unparsable output
label = snap_to_rail("I'm not sure", rails=["relevant", "unrelated"])
# label = "NOT_PARSABLE"
# Parse OpenAI function call output
import json
raw = json.dumps({"response": "factual", "explanation": "The facts match."})
label, explanation = parse_openai_function_call(raw)
# label = "factual", explanation = "The facts match."
# Generate function calling kwargs
kwargs = openai_function_call_kwargs(
rails=["factual", "hallucinated"],
provide_explanation=True,
)
# kwargs = {"functions": [...], "function_call": {"name": "record_response"}}
Related Pages