# Data-Juicer LLM API Credentials Environment
| Metadata | Value |
|---|---|
| Domains | LLMs, API_Integration, Data_Generation |
| Last Updated | 2026-02-14 17:00 GMT |
## Overview
API credential environment for LLM-based operators requiring OpenAI-compatible endpoints, DashScope, or HuggingFace model access for data generation, calibration, and optimization workflows.
## Description
This environment manages the API credentials and endpoint configuration for Data-Juicer operators that use external LLM services. It supports OpenAI-compatible APIs (including self-hosted vLLM endpoints), Alibaba DashScope, and HuggingFace model hub access. The system uses an OpenAI client abstraction with configurable base URLs, allowing operators to point to any OpenAI-compatible server (vLLM, ollama, etc.) without code changes.
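The base-URL override can be sketched with a small stdlib-only helper. `build_client_args` is a hypothetical name for illustration; the real initialization lives in `model_utils.py`:

```python
import os

def build_client_args():
    """Assemble kwargs for an OpenAI-compatible client from the environment.

    Illustrative sketch only, not Data-Juicer's actual API.
    """
    args = {"api_key": os.environ.get("OPENAI_API_KEY")}
    base_url = os.environ.get("OPENAI_BASE_URL")
    if base_url:
        # Point the client at any OpenAI-compatible server (vLLM, ollama, ...)
        args["base_url"] = base_url
    return args

os.environ["OPENAI_API_KEY"] = "sk-test"
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"
client_args = build_client_args()
```

A dict like this would then be splatted into `openai.OpenAI(**client_args)`, matching the initialization shown under Code Evidence.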
## Usage
Use this environment when running LLM-powered data generation workflows including `GenerateQAFromTextMapper`, `CalibrateQAMapper`, `OptimizeQAMapper`, `SentenceAugmentationMapper`, and any operator using the `LLMInferenceWithRayVLLMPipeline`. Also required for LLM-based filters (`llm_quality_score_filter`, `llm_difficulty_score_filter`, `llm_perplexity_filter`) and dialog analysis mappers.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Network | Internet access or local LLM endpoint | API calls require HTTP/HTTPS connectivity |
| RAM | Varies by deployment | Local vLLM requires 16GB+ VRAM; API-only needs minimal resources |
## Dependencies

### Python Packages (`ai_services` extra)
- `openai`
- `dashscope`
- `tiktoken` (for tokenization/processor support)
### Optional Packages
- `vllm` == 0.11.0 (for self-hosted LLM endpoints)
- `label-studio` == 1.17.0 (for annotation workflows)
## Credentials
NEVER store actual API keys in configuration files or code.
The following environment variables must be set:
- `OPENAI_API_KEY`: API key for OpenAI-compatible endpoints (required for LLM operators)
- `OPENAI_BASE_URL`: Base URL for the API endpoint (default: OpenAI servers; override for self-hosted vLLM)
- `DASHSCOPE_API_KEY`: API key for Alibaba DashScope services
- `HF_TOKEN`: HuggingFace API token for accessing gated models
- `SMTP_USER`: SMTP username for email notification operators
- `SMTP_PASSWORD`: SMTP password for email notification operators
- `SMTP_CERT_FILE`: Path to SMTP client certificate
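A small preflight check can catch missing variables before a pipeline run fails midway. This is an illustrative stdlib sketch, not a Data-Juicer utility:

```python
import os

REQUIRED = ("OPENAI_API_KEY",)  # needed by the LLM operators above
OPTIONAL = ("OPENAI_BASE_URL", "DASHSCOPE_API_KEY", "HF_TOKEN",
            "SMTP_USER", "SMTP_PASSWORD", "SMTP_CERT_FILE")

def check_credentials(env=None):
    """Raise early if required credentials are absent."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise EnvironmentError(f"missing required credentials: {missing}")
    # Report which optional integrations are configured
    return {name: bool(env.get(name)) for name in OPTIONAL}
```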
## Quick Install

```shell
# Install AI service dependencies
pip install "py-data-juicer[ai_services]"

# Set API credentials
export OPENAI_API_KEY="your-key-here"
export OPENAI_BASE_URL="https://api.openai.com/v1"  # or your vLLM endpoint

# For a self-hosted vLLM endpoint
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="not-needed"  # vLLM doesn't require a real key
```
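It can also help to sanity-check the endpoint URL itself before launching a run. `validate_base_url` below is a hypothetical helper; it relies only on the convention that OpenAI-compatible servers serve under a `/v1` path:

```python
from urllib.parse import urlparse

def validate_base_url(url):
    """Lightweight sanity check for an OPENAI_BASE_URL value."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        raise ValueError("URL has no host")
    if not parsed.path.rstrip("/").endswith("/v1"):
        raise ValueError("OpenAI-compatible endpoints usually end in /v1")
    return True
```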
## Code Evidence

OpenAI client initialization from `model_utils.py:209`:

```python
self._client = openai.OpenAI(**client_args)
```
vLLM pipeline API credentials from `llm_inference_with_ray_vllm_pipeline.py:94-100`:

```python
openai_api_base = os.environ.get("OPENAI_BASE_URL", None)
openai_api_key = os.environ.get("OPENAI_API_KEY", None)
```
Processor initialization error with credential hints from `model_utils.py:411-415`:

```python
raise ValueError(
    "Failed to initialize the processor. Please check the following:\n"
    "- For OpenAI models: Install 'tiktoken' via `pip install tiktoken`.\n"
    "- For DashScope models: Install both 'dashscope' and 'tiktoken'.\n"
    "- For custom models: Use the 'processor_config' parameter."
)
```
Fallback when no model is specified, from `model_utils.py:211-214`:

```python
logger.warning("No model specified. Using the first available model from the server.")
models = self._client.models.list()
if not models.data:
    raise ValueError("No models available on the server.")
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Failed to initialize the processor` | Missing tiktoken or dashscope package | `pip install tiktoken dashscope` |
| `No model specified. Using the first available model` | Model name not configured | Set `model` parameter in operator config |
| `No models available on the server` | API endpoint has no accessible models | Check `OPENAI_BASE_URL` and server status |
| `Embedding API error` | API call failed during embedding generation | Check API key validity and network connectivity |
| `Responses API error` | LLM response generation failed | Check rate limits, API key permissions, and model availability |
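Transient failures such as the `Embedding API error` and `Responses API error` rows above are often recoverable by retrying. The generic sketch below assumes the caller wraps the API call in a zero-argument function; Data-Juicer's own retry behavior, if any, may differ:

```python
import random
import time

def with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the original error
            # 1s, 2s, 4s, ... plus a little jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```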
## Compatibility Notes
- OpenAI-Compatible: Any server implementing the OpenAI API spec works (vLLM, ollama, LiteLLM, etc.). Set `OPENAI_BASE_URL` to point to your endpoint.
- DashScope: Alibaba Cloud's AI service; requires separate `dashscope` package and API key.
- Self-Hosted vLLM: When running vLLM locally, `OPENAI_API_KEY` can be set to any non-empty string as authentication is typically disabled.
- Model Auto-Detection: If no model name is specified, the system queries the endpoint and uses the first available model.
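The auto-detection fallback shown under Code Evidence can be mimicked against a stub client to observe the behavior without a live endpoint. `pick_model`, the stub, and the model names in it are illustrative only:

```python
class StubModels:
    """Stand-in for client.models; .list() returns an object with a .data list."""
    def __init__(self, ids):
        self._ids = ids

    def list(self):
        page = type("Page", (), {})()
        page.data = [type("Model", (), {"id": i})() for i in self._ids]
        return page

def pick_model(models, requested=None):
    """Use the requested model, else fall back to the server's first model."""
    if requested:
        return requested
    page = models.list()
    if not page.data:
        raise ValueError("No models available on the server.")
    return page.data[0].id
```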