Environment: EvolvingLMMs-Lab lmms-eval API Credentials
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, API_Integration |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
API credentials and environment variables required for external service integrations including HuggingFace Hub, OpenAI GPT judges, W&B logging, and cloud model providers.
Description
This environment defines all API keys, tokens, and configuration environment variables used by the lmms-eval framework. These are needed for: (1) accessing gated models and datasets on HuggingFace Hub, (2) using OpenAI or Azure GPT models as evaluation judges, (3) logging evaluation results to Weights & Biases, and (4) running evaluations against cloud-hosted model APIs (Gemini, Reka, xAI Grok, etc.). The framework uses python-dotenv to load credentials from .env files.
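The loading flow described above can be sketched as follows (a minimal illustration of the python-dotenv pattern; the variable names match those documented below, and the fallback when python-dotenv is absent is an assumption for self-containment):

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package

    # Merge key=value pairs from ./.env into os.environ
    # (existing environment variables are not overwritten by default).
    load_dotenv()
except ImportError:
    pass  # python-dotenv not installed; rely on exported variables only

# Credentials are then read from the environment on demand.
hf_token = os.environ.get("HF_TOKEN")
openai_key = os.environ.get("OPENAI_API_KEY")

if hf_token is None:
    print("HF_TOKEN not set; gated HuggingFace models/datasets will be inaccessible.")
```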
Usage
Use this environment when running evaluations that require external API access. This includes tasks using GPT-based judges (e.g., ActivityNetQA, VDC, WildVisionBench), logging to W&B, pushing results to HuggingFace Hub, or evaluating cloud-hosted models.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Network | Internet access | Required for API calls to external services |
| Credentials | API keys in environment or .env file | Never commit actual keys to source control |
Dependencies
Python Packages
- python-dotenv — loads .env files automatically
- openai — OpenAI API client for GPT judges
- wandb >= 0.16.0 — Weights & Biases logging
- httpx >= 0.23.3 — HTTP client for API providers
Credentials
NEVER store actual secret values in code or documentation. The following environment variables must be set:
HuggingFace:
HF_TOKEN: HuggingFace API token for accessing gated models/datasets and pushing results to Hub.
OpenAI / Azure:
OPENAI_API_KEY: OpenAI API key for GPT-based evaluation judges and direct model evaluation.
AZURE_API_KEY: Azure OpenAI API key (alternative to direct OpenAI).
Logging:
WANDB_API_KEY: Weights & Biases API key for experiment logging.
Task-Specific:
VIESCORE_API_KEY: VIEScore API key for GEdit-Bench image editing evaluation.
Framework Configuration:
LMMS_EVAL_HOME: Cache directory path (default: ~/.cache/lmms-eval).
LMMS_EVAL_USE_CACHE: Enable response caching (default: "False").
LMMS_EVAL_PLUGINS: Comma-separated list of plugin package paths.
LMMS_EVAL_SHUFFLE_DOCS: Shuffle dataset documents (used by some tasks).
VERBOSITY: Logging level (DEBUG, INFO, WARNING, ERROR).
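These configuration variables are read with defaults, as sketched below (the first two lines mirror the pattern shown in Code Evidence; the plugin parsing and the INFO default for VERBOSITY are illustrative assumptions):

```python
import os

# Cache location and caching toggle, with the documented defaults.
LMMS_EVAL_HOME = os.path.expanduser(os.getenv("LMMS_EVAL_HOME", "~/.cache/lmms-eval"))
LMMS_EVAL_USE_CACHE = os.getenv("LMMS_EVAL_USE_CACHE", "False")

# Plugins are given as a comma-separated list of importable package names.
plugins = [p for p in os.getenv("LMMS_EVAL_PLUGINS", "").split(",") if p]

# Logging level; the INFO fallback here is an assumption.
verbosity = os.getenv("VERBOSITY", "INFO")
```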
Quick Install
# Create a .env file (never commit this!)
cat > .env << 'EOF'
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx
WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxx
EOF
# Or export directly
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
export OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx
Code Evidence
HF_TOKEN usage from lmms_eval/__main__.py:557-558:
if os.environ.get("HF_TOKEN", None):
args.hf_hub_log_args += f",token={os.environ.get('HF_TOKEN')}"
Cache directory configuration from lmms_eval/api/model.py:22-23:
LMMS_EVAL_HOME = os.path.expanduser(os.getenv("LMMS_EVAL_HOME", "~/.cache/lmms-eval"))
LMMS_EVAL_USE_CACHE = os.getenv("LMMS_EVAL_USE_CACHE", "False")
Plugin loading from lmms_eval/__main__.py:595-599:
if os.environ.get("LMMS_EVAL_PLUGINS", None):
args.include_path = [args.include_path] if args.include_path else []
for plugin in os.environ["LMMS_EVAL_PLUGINS"].split(","):
package_tasks_location = importlib.util.find_spec(f"{plugin}.tasks").submodule_search_locations[0]
args.include_path.append(package_tasks_location)
Dotenv loading in model providers from lmms_eval/models/chat/async_openai.py:20,27:
from dotenv import load_dotenv
load_dotenv(verbose=True)
LLM judge retry configuration from lmms_eval/llm_judge/protocol.py:4-19:
DEFAULT_NUM_RETRIES = 5
DEFAULT_RETRY_DELAY = 10 # seconds
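A retry loop built on these constants might look like the following (a hypothetical sketch, not the framework's actual implementation; call_judge stands in for the real judge request function):

```python
import time

DEFAULT_NUM_RETRIES = 5
DEFAULT_RETRY_DELAY = 10  # seconds


def with_retries(call_judge, num_retries=DEFAULT_NUM_RETRIES, delay=DEFAULT_RETRY_DELAY):
    """Call call_judge(), retrying on failure with a fixed delay between attempts."""
    last_error = None
    for attempt in range(num_retries):
        try:
            return call_judge()
        except Exception as exc:  # e.g. rate limits or transient network errors
            last_error = exc
            if attempt < num_retries - 1:
                time.sleep(delay)
    raise last_error
```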
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| 401 Unauthorized from HuggingFace | Missing or invalid HF_TOKEN | Set HF_TOKEN with a valid token from huggingface.co/settings/tokens |
| openai.AuthenticationError | Missing OPENAI_API_KEY | Set OPENAI_API_KEY environment variable |
| GPT judge returns empty response | API rate limiting | Framework retries 5 times with 10s delay; increase API tier if persistent |
| wandb: ERROR login failure | Missing WANDB_API_KEY | Run wandb login or set WANDB_API_KEY |
Compatibility Notes
- dotenv files: The framework loads .env files automatically via python-dotenv. Place the file in the working directory.
- GPT judges: Many task benchmarks (ActivityNetQA, VDC, WildVisionBench, etc.) require OPENAI_API_KEY for GPT-based evaluation. Without it, these tasks will fail.
- W&B logging: Optional. Only required when using the --wandb_args CLI flag.
- HF Hub push: Only required when using push_results_to_hub or push_samples_to_hub in evaluation tracker args.
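Since different runs need different subsets of these variables, a small pre-flight check can confirm that the required ones are set before evaluation starts (an illustrative helper, not part of lmms-eval):

```python
import os


def check_credentials(required):
    """Return the names of required environment variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]


# Example: a GPT-judged task that pushes results to the Hub needs both of these.
missing = check_credentials(["HF_TOKEN", "OPENAI_API_KEY"])
if missing:
    print("Missing credentials:", ", ".join(missing))
```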