# Data-Juicer LLM API Credentials Environment
| Metadata | Value |
|---|---|
| Domains | LLMs, API_Integration, Data_Generation |
| Last Updated | 2026-02-14 17:00 GMT |
## Overview
API credential environment for LLM-based operators requiring OpenAI-compatible endpoints, DashScope, or HuggingFace model access for data generation, calibration, and optimization workflows.
## Description
This environment manages the API credentials and endpoint configuration for Data-Juicer operators that use external LLM services. It supports OpenAI-compatible APIs (including self-hosted vLLM endpoints), Alibaba DashScope, and HuggingFace model hub access. The system uses an OpenAI client abstraction with configurable base URLs, allowing operators to point to any OpenAI-compatible server (vLLM, ollama, etc.) without code changes.
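The base-URL override can be sketched with a small stdlib-only helper. `build_client_args` is a hypothetical name for illustration; the real initialization lives in `model_utils.py`:

```python
import os

def build_client_args():
    """Assemble kwargs for an OpenAI-compatible client from the environment.

    Illustrative sketch only, not Data-Juicer's actual API.
    """
    args = {"api_key": os.environ.get("OPENAI_API_KEY")}
    base_url = os.environ.get("OPENAI_BASE_URL")
    if base_url:
        # Point the client at any OpenAI-compatible server (vLLM, ollama, ...)
        args["base_url"] = base_url
    return args

os.environ["OPENAI_API_KEY"] = "sk-test"
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"
client_args = build_client_args()
```

A dict like this would then be splatted into `openai.OpenAI(**client_args)`, matching the initialization shown under Code Evidence.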
## Usage
Use this environment when running LLM-powered data generation workflows including `GenerateQAFromTextMapper`, `CalibrateQAMapper`, `OptimizeQAMapper`, `SentenceAugmentationMapper`, and any operator using the `LLMInferenceWithRayVLLMPipeline`. Also required for LLM-based filters (`llm_quality_score_filter`, `llm_difficulty_score_filter`, `llm_perplexity_filter`) and dialog analysis mappers.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Network | Internet access or local LLM endpoint | API calls require HTTP/HTTPS connectivity |
| RAM | Varies by deployment | Local vLLM requires 16GB+ VRAM; API-only needs minimal resources |
## Dependencies

### Python Packages (`ai_services` extra)
- `openai`
- `dashscope`
- `tiktoken` (for tokenization/processor support)
### Optional Packages
- `vllm` == 0.11.0 (for self-hosted LLM endpoints)
- `label-studio` == 1.17.0 (for annotation workflows)
## Credentials
NEVER store actual API keys in configuration files or code.
The following environment variables must be set:
- `OPENAI_API_KEY`: API key for OpenAI-compatible endpoints (required for LLM operators)
- `OPENAI_BASE_URL`: Base URL for the API endpoint (default: OpenAI servers; override for self-hosted vLLM)
- `DASHSCOPE_API_KEY`: API key for Alibaba DashScope services
- `HF_TOKEN`: HuggingFace API token for accessing gated models
- `SMTP_USER`: SMTP username for email notification operators
- `SMTP_PASSWORD`: SMTP password for email notification operators
- `SMTP_CERT_FILE`: Path to SMTP client certificate
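A small preflight check can catch missing variables before a pipeline run fails midway. This is an illustrative stdlib sketch, not a Data-Juicer utility:

```python
import os

REQUIRED = ("OPENAI_API_KEY",)  # needed by the LLM operators above
OPTIONAL = ("OPENAI_BASE_URL", "DASHSCOPE_API_KEY", "HF_TOKEN",
            "SMTP_USER", "SMTP_PASSWORD", "SMTP_CERT_FILE")

def check_credentials(env=None):
    """Raise early if required credentials are absent."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise EnvironmentError(f"missing required credentials: {missing}")
    # Report which optional integrations are configured
    return {name: bool(env.get(name)) for name in OPTIONAL}
```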
## Quick Install

```shell
# Install AI service dependencies
pip install "py-data-juicer[ai_services]"

# Set API credentials
export OPENAI_API_KEY="your-key-here"
export OPENAI_BASE_URL="https://api.openai.com/v1"  # or your vLLM endpoint

# For a self-hosted vLLM endpoint
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="not-needed"  # vLLM doesn't require a real key
```
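It can also help to sanity-check the endpoint URL itself before launching a run. `validate_base_url` below is a hypothetical helper; it relies only on the convention that OpenAI-compatible servers serve under a `/v1` path:

```python
from urllib.parse import urlparse

def validate_base_url(url):
    """Lightweight sanity check for an OPENAI_BASE_URL value."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        raise ValueError("URL has no host")
    if not parsed.path.rstrip("/").endswith("/v1"):
        raise ValueError("OpenAI-compatible endpoints usually end in /v1")
    return True
```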
## Code Evidence

OpenAI client initialization from `model_utils.py:209`:

```python
self._client = openai.OpenAI(**client_args)
```
vLLM pipeline API credentials from `llm_inference_with_ray_vllm_pipeline.py:94-100`:

```python
openai_api_base = os.environ.get("OPENAI_BASE_URL", None)
openai_api_key = os.environ.get("OPENAI_API_KEY", None)
```
Processor initialization error with credential hints from `model_utils.py:411-415`:

```python
raise ValueError(
    "Failed to initialize the processor. Please check the following:\n"
    "- For OpenAI models: Install 'tiktoken' via `pip install tiktoken`.\n"
    "- For DashScope models: Install both 'dashscope' and 'tiktoken'.\n"
    "- For custom models: Use the 'processor_config' parameter."
)
```
Fallback when no model is specified, from `model_utils.py:211-214`:

```python
logger.warning("No model specified. Using the first available model from the server.")
models = self._client.models.list()
if not models.data:
    raise ValueError("No models available on the server.")
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Failed to initialize the processor` | Missing tiktoken or dashscope package | `pip install tiktoken dashscope` |
| `No model specified. Using the first available model` | Model name not configured | Set `model` parameter in operator config |
| `No models available on the server` | API endpoint has no accessible models | Check `OPENAI_BASE_URL` and server status |
| `Embedding API error` | API call failed during embedding generation | Check API key validity and network connectivity |
| `Responses API error` | LLM response generation failed | Check rate limits, API key permissions, and model availability |
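Transient failures such as the `Embedding API error` and `Responses API error` rows above are often recoverable by retrying. The generic sketch below assumes the caller wraps the API call in a zero-argument function; Data-Juicer's own retry behavior, if any, may differ:

```python
import random
import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the original error
            # 1s, 2s, 4s, ... plus a little jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```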
## Compatibility Notes
- OpenAI-Compatible: Any server implementing the OpenAI API spec works (vLLM, ollama, LiteLLM, etc.). Set `OPENAI_BASE_URL` to point to your endpoint.
- DashScope: Alibaba Cloud's AI service; requires separate `dashscope` package and API key.
- Self-Hosted vLLM: When running vLLM locally, `OPENAI_API_KEY` can be set to any non-empty string as authentication is typically disabled.
- Model Auto-Detection: If no model name is specified, the system queries the endpoint and uses the first available model.
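The auto-detection fallback shown under Code Evidence can be mimicked against a stub client to observe the behavior without a live endpoint. `pick_model`, the stub, and the model names in it are illustrative only:

```python
class StubModels:
    """Stand-in for client.models; .list() returns an object with a .data list."""
    def __init__(self, ids):
        self._ids = ids

    def list(self):
        page = type("Page", (), {})()
        page.data = [type("Model", (), {"id": i})() for i in self._ids]
        return page

def pick_model(models, requested=None):
    """Use the requested model, else fall back to the server's first model."""
    if requested:
        return requested
    page = models.list()
    if not page.data:
        raise ValueError("No models available on the server.")
    return page.data[0].id
```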