Implementation:Datajuicer Data juicer Model Utils

Knowledge Sources	Datajuicer_Data_juicer
Domains	Model Management, Machine Learning, API Integration
Last Updated	2026-02-14 16:00 GMT

Overview

Model management utility that loads, downloads, caches, and prepares the dozens of model types used by Data-Juicer operators, including API models, HuggingFace transformers, fastText, KenLM, NLTK, YOLO, and vLLM.

Description

The model_utils module is the second-largest file in the utils package and a critical dependency for all model-based operators. It centralizes all model lifecycle management with these components:

Model Cache and Download:

MODEL_ZOO -- Global dictionary caching loaded model instances to avoid redundant initialization.
MODEL_LINKS / BACKUP_MODEL_LINKS -- Primary and backup download URLs for various models, using pattern matching (fnmatch) for URL resolution.
check_model -- Checks if a model exists in the cache directory (DATA_JUICER_MODELS_CACHE) or external model home (DATA_JUICER_EXTERNAL_MODELS_HOME), downloading from primary or backup URLs if not found.
check_model_home -- Resolves model paths through the external models home directory.
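
The pattern-based URL resolution described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the link tables, the example.com URLs, and the resolve_urls helper are all hypothetical, and only the use of fnmatch comes from the description.

```python
import fnmatch

# Hypothetical link tables. The real MODEL_LINKS / BACKUP_MODEL_LINKS
# entries are not reproduced here.
EXAMPLE_LINKS = {
    "lid*.bin": "https://example.com/primary/",
    "*.sp.model": "https://example.com/primary/sp/",
}
EXAMPLE_BACKUP_LINKS = {
    "lid*.bin": "https://example.org/backup/",
}


def resolve_urls(model_name, links=EXAMPLE_LINKS, backups=EXAMPLE_BACKUP_LINKS):
    """Return candidate download URLs: primary match first, then backup."""
    urls = []
    for table in (links, backups):
        for pattern, base_url in table.items():
            # The first pattern that matches the file name wins in each table.
            if fnmatch.fnmatch(model_name, pattern):
                urls.append(base_url + model_name)
                break
    return urls


print(resolve_urls("lid.176.bin"))
print(resolve_urls("unknown.pt"))
```

A caller would try the returned URLs in order and stop at the first successful download, which mirrors the primary-then-backup behavior attributed to check_model.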

API Model Wrappers:

ChatAPIModel -- OpenAI-compatible chat API wrapper with configurable endpoint, response path extraction, and error handling.
EmbeddingAPIModel -- Embedding API wrapper for vector generation endpoints.
ResponsesAPIModel -- OpenAI Responses API wrapper.
prepare_api_model -- Factory function that selects the appropriate API class based on the endpoint path, with optional processor (tokenizer) initialization.
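
Two ideas from the list above, response path extraction and endpoint-based class selection, can be illustrated with a short sketch. The helper names extract_by_path and pick_api_kind, the dot-separated path syntax, and the substring matching rules are assumptions made for illustration. The module's actual logic may differ.

```python
def extract_by_path(data, path):
    """Walk nested dicts and lists using a dot-separated path (assumed syntax)."""
    current = data
    for part in path.split("."):
        if isinstance(current, list):
            # Numeric path segments index into lists.
            current = current[int(part)]
        else:
            current = current[part]
    return current


def pick_api_kind(endpoint):
    """Pick a wrapper kind from the endpoint path (hypothetical rule)."""
    path = endpoint.lower()
    if "embedding" in path:
        return "embedding"
    if "responses" in path:
        return "responses"
    if "chat" in path:
        return "chat"
    raise ValueError(f"Unsupported endpoint: {endpoint}")


# A payload shaped like an OpenAI-compatible chat completion response.
response = {"choices": [{"message": {"role": "assistant", "content": "Hi"}}]}
print(extract_by_path(response, "choices.0.message.content"))
print(pick_api_kind("/chat/completions"))
```

Keeping the response path configurable lets one wrapper serve providers whose payloads nest the generated text differently.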

Model Preparation Functions:

The module provides a prepare_* function for each supported model type:

prepare_fasttext_model -- FastText language identification models.
prepare_huggingface_model -- HuggingFace transformers with automatic model class detection.
prepare_vllm_model -- vLLM serving with tensor parallelism support.
prepare_diffusion_model -- Diffusion pipelines (image2image, text2image, inpainting).
prepare_kenlm_model -- KenLM language models.
prepare_nltk_model / prepare_nltk_pos_tagger -- NLTK punkt tokenizers and POS taggers.
prepare_sentencepiece_model -- SentencePiece tokenizers.
prepare_simple_aesthetics_model -- CLIP-based aesthetics predictors.
prepare_recognizeAnything_model -- RAM (Recognize Anything Model).
prepare_spacy_model -- SpaCy language models with compressed archive support.
prepare_video_blip_model -- Video-BLIP models with custom video vision model.
prepare_yolo_model / prepare_fastsam_model -- YOLO and FastSAM detection models.
prepare_dwpose_model -- DWPose detection with ONNX models.
prepare_wilor_model / prepare_hawor_model -- Hand reconstruction models.
prepare_embedding_model -- Transformer-based embedding models with pooling strategies.
prepare_mmlab_model / MMLabModel -- MMDeploy-based models.
Others, including deepcalib, moge, vggt, video_depth_anything, sam_3d_body, and sdxl.

Model Lifecycle:

MODEL_FUNCTION_MAPPING -- Registry mapping model type strings to their prepare_* functions.
prepare_model -- Entry point that creates a functools.partial model key, pre-initializing models that need file locks.
get_model -- Retrieves or initializes models from MODEL_ZOO with device placement and thread configuration.
free_models -- Releases model memory and clears CUDA cache.
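
The lifecycle above (registry lookup, partial as model key, lazy initialization into a cache, release) can be sketched in a few lines. Everything suffixed with _sketch or prefixed with EXAMPLE_ is a hypothetical stand-in, and the dummy prepare function returns a plain dict instead of a real model. The real functions also handle file locks, thread configuration, and CUDA cache clearing, which are omitted here.

```python
from functools import partial

EXAMPLE_ZOO = {}  # stands in for MODEL_ZOO


def prepare_dummy_model(name, device="cpu"):
    """Hypothetical prepare_* function that returns a dict, not a model."""
    return {"name": name, "device": device}


# Stands in for MODEL_FUNCTION_MAPPING.
EXAMPLE_MAPPING = {"dummy": prepare_dummy_model}


def prepare_model_sketch(model_type, **model_kwargs):
    """Bind the arguments now; defer the actual loading until first use."""
    if model_type not in EXAMPLE_MAPPING:
        raise ValueError(f"Unsupported model type: {model_type}")
    return partial(EXAMPLE_MAPPING[model_type], **model_kwargs)


def get_model_sketch(model_key, rank=None, use_cuda=False):
    """Initialize the model on first access, then serve it from the cache."""
    if model_key not in EXAMPLE_ZOO:
        device = f"cuda:{rank or 0}" if use_cuda else "cpu"
        EXAMPLE_ZOO[model_key] = model_key(device=device)
    return EXAMPLE_ZOO[model_key]


def free_models_sketch():
    """Drop all cached instances so they can be garbage collected."""
    EXAMPLE_ZOO.clear()


key = prepare_model_sketch("dummy", name="demo")
first = get_model_sketch(key)
second = get_model_sketch(key)
print(first is second)
```

A functools.partial object hashes by identity, so it can serve as a dictionary key. Each worker process that receives the key can then populate its own cache on first access.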

Usage

Use this module to load any model required by an operator. Models are typically prepared once via prepare_model and then retrieved per-worker via get_model, with automatic GPU placement and caching.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/utils/model_utils.py

Signature

class ChatAPIModel:
    def __init__(self, model=None, endpoint=None,
                 response_path=None, **kwargs): ...
    def __call__(self, messages, **kwargs) -> str: ...

class EmbeddingAPIModel:
    def __init__(self, model=None, endpoint=None,
                 response_path=None, **kwargs): ...
    def __call__(self, input, **kwargs) -> list: ...

def check_model(model_name, force=False) -> str: ...
def prepare_model(model_type, **model_kwargs) -> partial: ...
def get_model(model_key=None, rank=None, use_cuda=False): ...
def free_models(clear_model_zoo=True): ...

# 25+ prepare_* functions for different model types
def prepare_api_model(model, *, endpoint=None, ...): ...
def prepare_huggingface_model(pretrained_model_name_or_path, ...): ...
def prepare_vllm_model(pretrained_model_name_or_path, ...): ...
# ... etc.

Import

from data_juicer.utils.model_utils import (
    prepare_model, get_model, free_models,
    ChatAPIModel, check_model
)

I/O Contract

Inputs

Name	Type	Required	Description
model_type	str	Yes	Type key from MODEL_FUNCTION_MAPPING (e.g., "api", "huggingface", "vllm", "fasttext").
model_name	str	Yes	Model name or path for downloading/loading.
model_key	functools.partial	No	Model key returned by prepare_model, used by get_model.
rank	int	No	GPU rank for device placement.
use_cuda	bool	No	Whether to use CUDA for model inference.
force	bool	No	Force re-download of model files.

Outputs

Name	Type	Description
model_key	functools.partial	Callable partial that initializes the model when invoked with a device parameter.
model	varies	Loaded model instance (type depends on model_type).
processor	varies	Optional processor/tokenizer returned alongside some models.

Usage Examples

from data_juicer.utils.model_utils import (
    prepare_model, get_model, free_models, ChatAPIModel
)

# Prepare and get a fasttext model
model_key = prepare_model("fasttext", model_name="lid.176.bin")
model = get_model(model_key, use_cuda=False)
predictions = model.predict("Hello world")

# Use an API model
model_key = prepare_model("api", model="gpt-4",
                          base_url="https://api.openai.com/v1")
api_model = get_model(model_key)
response = api_model([{"role": "user", "content": "Hello"}])

# Prepare a HuggingFace model with GPU
model_key = prepare_model("huggingface",
                          pretrained_model_name_or_path="bert-base-uncased")
model, processor = get_model(model_key, use_cuda=True)

# Clean up
free_models()

Related Pages

Datajuicer_Data_juicer_NLTK_Utils -- NLTK-specific utilities used by prepare_nltk_model
Datajuicer_Data_juicer_Multimodal_Utils -- Multimodal data loading for model inputs
Datajuicer_Data_juicer_Process_Utils -- Resource allocation for model-based operators
