Environment:Haotian liu LLaVA OpenAI API Evaluation Environment
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Python environment with OpenAI API access and Ray for parallel GPT-4-based evaluation scoring of LLaVA model outputs.
Description
This environment provides the additional dependencies required for GPT-4-based evaluation workflows. Beyond the core LLaVA training environment, it requires the `openai` Python client for ChatCompletion API calls and `ray` for parallelizing evaluation requests across multiple CPUs. The evaluation scripts use GPT-4 as a judge to score model-generated answers against reference answers.
Usage
Use this environment when running the Benchmark Evaluation workflow steps that involve GPT-4 review scoring: `eval_gpt_review.py`, `eval_gpt_review_bench.py`, `eval_gpt_review_visual.py`, and `eval_science_qa_gpt4.py`. These scripts call the OpenAI API and require a valid API key with GPT-4 access.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Any (Linux, macOS, Windows) | CPU-only; no GPU required for evaluation scoring |
| Network | Internet access | Required for OpenAI API calls |
| API Access | OpenAI API with GPT-4 model access | Rate limits apply; scripts include retry with sleep |
Dependencies
Python Packages
- `openai` (legacy pre-1.0 client, provides `openai.ChatCompletion`)
- `ray` (distributed task execution, used with `@ray.remote(num_cpus=4)`)
- `tqdm` (progress bars)
Credentials
The following environment variables must be set:
- `OPENAI_API_KEY`: OpenAI API key with access to the `gpt-4` model. Required for all GPT-4 evaluation scripts.
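A missing key is easy to catch before any evaluation task is launched. The following is a minimal pre-flight sketch, not part of the LLaVA scripts; the helper name `require_openai_key` is hypothetical, and it only checks that the variable is present, not that the key actually has GPT-4 access.

```python
import os


def require_openai_key(env=None):
    """Return the API key from `env`, or raise if it is missing or blank."""
    # `env` defaults to the real process environment; a plain dict can be
    # passed instead, which makes the check easy to test.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running "
            "the GPT-4 evaluation scripts."
        )
    return key
```

Calling `require_openai_key()` at the top of a wrapper script would fail fast with a clear message instead of surfacing the problem on the first API request.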
Quick Install
pip install "openai<1.0" ray tqdm
export OPENAI_API_KEY="your-api-key-here"
Code Evidence
OpenAI API usage from `eval_gpt_review.py:12-36`:
@ray.remote(num_cpus=4)
def get_eval(content: str, max_tokens: int):
    while True:
        try:
            response = openai.ChatCompletion.create(
                model='gpt-4',
                messages=[{
                    'role': 'system',
                    'content': 'You are a helpful and precise assistant for checking the quality of the answer.'
                }, {
                    'role': 'user',
                    'content': content,
                }],
                temperature=0.2,
                max_tokens=max_tokens,
            )
            break
        except openai.error.RateLimitError:
            pass
        except Exception as e:
            print(e)
        time.sleep(NUM_SECONDS_TO_SLEEP)
Rate limit handling from `eval_gpt_review.py:10`:
NUM_SECONDS_TO_SLEEP = 3
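The loop shown above retries forever and treats every exception the same way. As a point of comparison, here is a sketch of a bounded variant that retries only on a chosen set of errors. It is not what the script does; `call_with_retry`, `retryable`, and `max_attempts` are hypothetical names introduced for illustration.

```python
import time

NUM_SECONDS_TO_SLEEP = 3  # same value as in eval_gpt_review.py


def call_with_retry(fn, retryable=(Exception,), max_attempts=5,
                    sleep_seconds=NUM_SECONDS_TO_SLEEP):
    """Call fn() until it succeeds, retrying only on `retryable` errors.

    Unlike the original loop, this gives up after `max_attempts` and lets
    any exception outside `retryable` propagate immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable as e:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            print(f"attempt {attempt} failed: {e}")
            time.sleep(sleep_seconds)
```

With the legacy client, passing `retryable=(openai.error.RateLimitError,)` would keep the sleep-and-retry behavior for rate limits while letting an authentication failure stop the run on the first attempt.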
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
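| `openai.error.AuthenticationError` | Invalid or missing `OPENAI_API_KEY` | Set a valid API key: `export OPENAI_API_KEY="sk-..."`. The retry loop catches every exception, so this error is printed and retried indefinitely rather than stopping the script |
| `openai.error.RateLimitError` | Too many API requests | No action needed: the script sleeps 3 seconds (`NUM_SECONDS_TO_SLEEP`) and retries, with no upper bound on attempts |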
| `ray.exceptions.RaySystemError` | Ray not initialized | Ensure `ray.init()` is called (done in script `__main__`) |
| Model `gpt-4` not available | API key lacks GPT-4 access | Upgrade OpenAI API plan or request GPT-4 access |
Compatibility Notes
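- API Version: The evaluation scripts use the legacy `openai.ChatCompletion.create()` API (pre-v1.0 openai package). `openai>=1.0` removed `openai.ChatCompletion` and moved the exception classes out of `openai.error` (for example, `openai.RateLimitError`), so either pin `openai<1.0` or port the scripts to the client-based `client.chat.completions.create()` interface.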
- Cost: Each evaluation run makes one GPT-4 API call per question. Large evaluation sets can incur significant costs.
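- Parallelism: Ray is configured with `num_cpus=4` per evaluation task, which reserves four logical CPUs for each task. Ray therefore runs at most (available CPUs ÷ 4, rounded down) tasks concurrently, for example two on an 8-core machine. Adjust the value based on available CPU cores and API rate limits.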
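For the API Version note above, the following sketch shows roughly what the same request could look like against `openai>=1.0`. It is an illustration, not a patch from the LLaVA repository: `get_eval_v1`, `make_client`, and `SYSTEM_PROMPT` are hypothetical names, and returning the message text is an assumption, since the excerpt in Code Evidence stops before the function's return.

```python
SYSTEM_PROMPT = ('You are a helpful and precise assistant for checking '
                 'the quality of the answer.')


def make_client():
    """Build an openai>=1.0 client (it reads OPENAI_API_KEY from the environment)."""
    from openai import OpenAI  # lazy import: only needed when a real client is built
    return OpenAI()


def get_eval_v1(client, content: str, max_tokens: int) -> str:
    """Send one review request through the openai>=1.0 client interface."""
    response = client.chat.completions.create(
        model='gpt-4',
        messages=[
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': content},
        ],
        temperature=0.2,
        max_tokens=max_tokens,
    )
    # v1 responses are objects, so fields are attributes rather than dict keys.
    return response.choices[0].message.content
```

A ported script would call `get_eval_v1(make_client(), content, max_tokens)` inside its retry loop and catch `openai.RateLimitError` in place of `openai.error.RateLimitError`. Passing the client in as a parameter also makes the function testable with a stub, without network access.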