Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Environment:Haotian liu LLaVA OpenAI API Evaluation Environment

From Leeroopedia
Knowledge Sources
Domains Evaluation, NLP
Last Updated 2026-02-13 23:00 GMT

Overview

Python environment with OpenAI API access and Ray for parallel GPT-4-based evaluation scoring of LLaVA model outputs.

Description

This environment provides the additional dependencies required for GPT-4-based evaluation workflows. Beyond the core LLaVA training environment, it requires the `openai` Python client for ChatCompletion API calls and `ray` for parallelizing evaluation requests across multiple CPUs. The evaluation scripts use GPT-4 as a judge to score model-generated answers against reference answers.

Usage

Use this environment when running the Benchmark Evaluation workflow steps that involve GPT-4 review scoring: `eval_gpt_review.py`, `eval_gpt_review_bench.py`, `eval_gpt_review_visual.py`, and `eval_science_qa_gpt4.py`. These scripts call the OpenAI API and require a valid API key with GPT-4 access.

System Requirements

Category Requirement Notes
OS Any (Linux, macOS, Windows) CPU-only; no GPU required for evaluation scoring
Network Internet access Required for OpenAI API calls
API Access OpenAI API with GPT-4 model access Rate limits apply; scripts include retry with sleep

Dependencies

Python Packages

  • `openai` (ChatCompletion API client)
  • `ray` (distributed task execution, used with `@ray.remote(num_cpus=4)`)
  • `tqdm` (progress bars)

Credentials

The following environment variables must be set:

  • `OPENAI_API_KEY`: OpenAI API key with access to the `gpt-4` model. Required for all GPT-4 evaluation scripts.

Quick Install

pip install openai ray tqdm
export OPENAI_API_KEY="your-api-key-here"

Code Evidence

OpenAI API usage from `eval_gpt_review.py:12-36`:

@ray.remote(num_cpus=4)
def get_eval(content: str, max_tokens: int):
    while True:
        try:
            response = openai.ChatCompletion.create(
                model='gpt-4',
                messages=[{
                    'role': 'system',
                    'content': 'You are a helpful and precise assistant for checking the quality of the answer.'
                }, {
                    'role': 'user',
                    'content': content,
                }],
                temperature=0.2,
                max_tokens=max_tokens,
            )
            break
        except openai.error.RateLimitError:
            pass
        except Exception as e:
            print(e)
        time.sleep(NUM_SECONDS_TO_SLEEP)

Rate limit handling from `eval_gpt_review.py:10`:

NUM_SECONDS_TO_SLEEP = 3

Common Errors

Error Message Cause Solution
`openai.error.AuthenticationError` Invalid or missing `OPENAI_API_KEY` Set valid API key: `export OPENAI_API_KEY="sk-..."`
`openai.error.RateLimitError` Too many API requests Script handles this automatically with 3-second retry sleep
`ray.exceptions.RaySystemError` Ray not initialized Ensure `ray.init()` is called (done in script `__main__`)
Model `gpt-4` not available API key lacks GPT-4 access Upgrade OpenAI API plan or request GPT-4 access

Compatibility Notes

  • API Version: The evaluation scripts use the legacy `openai.ChatCompletion.create()` API (pre-v1.0 openai package). If using `openai>=1.0`, the scripts need modification.
  • Cost: Each evaluation run makes one GPT-4 API call per question. Large evaluation sets can incur significant costs.
  • Parallelism: Ray is configured with `num_cpus=4` per evaluation task. Adjust based on available CPU cores and API rate limits.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment