
Environment: Data-Juicer LLM API Credentials

From Leeroopedia
Knowledge Sources
Domains: LLMs, API_Integration, Data_Generation
Last Updated: 2026-02-14 17:00 GMT

Overview

This environment provides API credentials for LLM-based operators that require OpenAI-compatible endpoints, Alibaba DashScope, or HuggingFace model access, covering data generation, calibration, and optimization workflows.

Description

This environment manages the API credentials and endpoint configuration for Data-Juicer operators that use external LLM services. It supports OpenAI-compatible APIs (including self-hosted vLLM endpoints), Alibaba DashScope, and HuggingFace model hub access. The system uses an OpenAI client abstraction with configurable base URLs, allowing operators to point to any OpenAI-compatible server (vLLM, ollama, etc.) without code changes.
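A minimal sketch of that abstraction, assuming only the environment variables documented below; the helper name and the fallback URL are illustrative, not part of Data-Juicer:

```python
import os

def resolve_llm_credentials():
    """Resolve endpoint settings the way the OpenAI client abstraction
    expects them: the base URL and key come from the environment, so any
    OpenAI-compatible server (vLLM, ollama, ...) works without code changes."""
    base_url = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    api_key = os.environ.get("OPENAI_API_KEY")
    if api_key is None:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # The resolved values would then be passed to
    # openai.OpenAI(base_url=base_url, api_key=api_key).
    return base_url, api_key
```

Because the endpoint is read from the environment, switching from the hosted OpenAI API to a self-hosted server is a matter of exporting a different `OPENAI_BASE_URL`.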

Usage

Use this environment when running LLM-powered data generation workflows including `GenerateQAFromTextMapper`, `CalibrateQAMapper`, `OptimizeQAMapper`, `SentenceAugmentationMapper`, and any operator using the `LLMInferenceWithRayVLLMPipeline`. Also required for LLM-based filters (`llm_quality_score_filter`, `llm_difficulty_score_filter`, `llm_perplexity_filter`) and dialog analysis mappers.

System Requirements

  • Network: Internet access or a local LLM endpoint. API calls require HTTP/HTTPS connectivity.
  • Hardware: Varies by deployment. A local vLLM server needs 16 GB+ of VRAM; API-only usage needs minimal resources.

Dependencies

Python Packages (ai_services extra)

  • `openai`
  • `dashscope`
  • `tiktoken` (for tokenization/processor support)

Optional Packages

  • `vllm` == 0.11.0 (for self-hosted LLM endpoints)
  • `label-studio` == 1.17.0 (for annotation workflows)
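A quick way to confirm the packages above are importable before launching a workflow; the helper below is an illustrative sketch, not part of Data-Juicer:

```python
import importlib.util

def missing_packages(names=("openai", "dashscope", "tiktoken")):
    """Return the subset of top-level packages that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty result means all of the ai_services dependencies are present.
```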

Credentials

NEVER store actual API keys in configuration files or code.

The following environment variables must be set:

  • `OPENAI_API_KEY`: API key for OpenAI-compatible endpoints (required for LLM operators)
  • `OPENAI_BASE_URL`: Base URL for the API endpoint (default: OpenAI servers; override for self-hosted vLLM)
  • `DASHSCOPE_API_KEY`: API key for Alibaba DashScope services
  • `HF_TOKEN`: HuggingFace API token for accessing gated models
  • `SMTP_USER`: SMTP username for email notification operators
  • `SMTP_PASSWORD`: SMTP password for email notification operators
  • `SMTP_CERT_FILE`: Path to SMTP client certificate
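A fail-fast check over this list can catch misconfiguration before a long job starts. Which variables are actually required depends on the operators you run; the split below, treating only `OPENAI_API_KEY` as mandatory, is an assumption:

```python
import os

# Assumption: only OPENAI_API_KEY is mandatory for the LLM operators;
# the remaining variables depend on your workflow.
OPTIONAL_VARS = ("OPENAI_BASE_URL", "DASHSCOPE_API_KEY", "HF_TOKEN",
                 "SMTP_USER", "SMTP_PASSWORD", "SMTP_CERT_FILE")

def check_credentials(required=("OPENAI_API_KEY",)):
    """Raise if any required environment variable is unset or empty."""
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise EnvironmentError("Missing environment variables: " + ", ".join(missing))
```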

Quick Install

# Install AI service dependencies
pip install "py-data-juicer[ai_services]"

# Set API credentials
export OPENAI_API_KEY="your-key-here"
export OPENAI_BASE_URL="https://api.openai.com/v1"  # or your vLLM endpoint

# For self-hosted vLLM endpoint
export OPENAI_BASE_URL="http://localhost:8000/v1"
export OPENAI_API_KEY="not-needed"  # vLLM doesn't require a real key

Code Evidence

OpenAI client initialization from `model_utils.py:209`:

self._client = openai.OpenAI(**client_args)

vLLM pipeline API credentials from `llm_inference_with_ray_vllm_pipeline.py:94-100`:

openai_api_base = os.environ.get("OPENAI_BASE_URL", None)
openai_api_key = os.environ.get("OPENAI_API_KEY", None)

Processor initialization error with credential hints from `model_utils.py:411-415`:

raise ValueError(
    "Failed to initialize the processor. Please check the following:\n"
    "- For OpenAI models: Install 'tiktoken' via `pip install tiktoken`.\n"
    "- For DashScope models: Install both 'dashscope' and 'tiktoken'.\n"
    "- For custom models: Use the 'processor_config' parameter."
)

No model specified fallback from `model_utils.py:211-214`:

logger.warning("No model specified. Using the first available model from the server.")
models = self._client.models.list()
if not models.data:
    raise ValueError("No models available on the server.")
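The fallback above can be reproduced offline against a serialized `/v1/models` response; the helper name is illustrative:

```python
import json

def first_available_model(models_json: str) -> str:
    """Pick the first model the server reports, mirroring the fallback in
    model_utils, and fail the same way when the list is empty."""
    data = json.loads(models_json).get("data", [])
    if not data:
        raise ValueError("No models available on the server.")
    return data[0]["id"]
```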

Common Errors

  • `Failed to initialize the processor`: missing `tiktoken` or `dashscope` package. Solution: `pip install tiktoken dashscope`.
  • `No model specified. Using the first available model`: no model name configured. Solution: set the `model` parameter in the operator config.
  • `No models available on the server`: the API endpoint exposes no accessible models. Solution: check `OPENAI_BASE_URL` and server status.
  • `Embedding API error`: an API call failed during embedding generation. Solution: check API key validity and network connectivity.
  • `Responses API error`: LLM response generation failed. Solution: check rate limits, API key permissions, and model availability.

Compatibility Notes

  • OpenAI-Compatible: Any server implementing the OpenAI API spec works (vLLM, ollama, LiteLLM, etc.). Set `OPENAI_BASE_URL` to point to your endpoint.
  • DashScope: Alibaba Cloud's AI service; requires separate `dashscope` package and API key.
  • Self-Hosted vLLM: When running vLLM locally, `OPENAI_API_KEY` can be set to any non-empty string as authentication is typically disabled.
  • Model Auto-Detection: If no model name is specified, the system queries the endpoint and uses the first available model.
