Implementation:Datajuicer Data juicer Model Utils
| Knowledge Sources | |
|---|---|
| Domains | Model Management, Machine Learning, API Integration |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A large model management utility that loads, downloads, caches, and prepares the dozens of model types used by Data-Juicer operators, including API models, HuggingFace transformers, fastText, KenLM, NLTK, YOLO, vLLM, and many more.
Description
The model_utils module is the second-largest file in the utils package and a critical dependency for all model-based operators. It centralizes all model lifecycle management with these components:
Model Cache and Download:
- MODEL_ZOO -- Global dictionary caching loaded model instances to avoid redundant initialization.
- MODEL_LINKS / BACKUP_MODEL_LINKS -- Primary and backup download URLs for various models, using pattern matching (fnmatch) for URL resolution.
- check_model -- Checks if a model exists in the cache directory (DATA_JUICER_MODELS_CACHE) or external model home (DATA_JUICER_EXTERNAL_MODELS_HOME), downloading from primary or backup URLs if not found.
- check_model_home -- Resolves model paths through the external models home directory.
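The pattern-based URL lookup with a backup fallback can be sketched as follows. This is a minimal illustration, not the module's actual code: the tables `PRIMARY_LINKS` and `BACKUP_LINKS`, the example.com URLs, and the helper `resolve_urls` are all hypothetical, and joining the base URL with the model name is an assumption about how links are formed.

```python
import fnmatch

# Hypothetical link tables; the real MODEL_LINKS / BACKUP_MODEL_LINKS
# entries are not reproduced on this page.
PRIMARY_LINKS = {
    "*.bin": "https://primary.example.com/models/",
    "*.model": "https://primary.example.com/tokenizers/",
}
BACKUP_LINKS = {
    "lid.176.bin": "https://backup.example.com/fasttext/",
}


def resolve_urls(model_name, primary=PRIMARY_LINKS, backup=BACKUP_LINKS):
    """Return candidate download URLs for model_name, primary table first."""
    urls = []
    for table in (primary, backup):
        for pattern, base_url in table.items():
            # fnmatchcase applies shell-style wildcards without
            # platform-dependent case folding.
            if fnmatch.fnmatchcase(model_name, pattern):
                urls.append(base_url + model_name)
                break  # use only the first matching pattern per table
    return urls


candidates = resolve_urls("lid.176.bin")
```

A downloader built on this sketch would try each URL in `candidates` in order and stop at the first one that succeeds.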
API Model Wrappers:
- ChatAPIModel -- OpenAI-compatible chat API wrapper with configurable endpoint, response path extraction, and error handling.
- EmbeddingAPIModel -- Embedding API wrapper for vector generation endpoints.
- ResponsesAPIModel -- OpenAI Responses API wrapper.
- prepare_api_model -- Factory function that selects the appropriate API class based on the endpoint path, with optional processor (tokenizer) initialization.
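Selecting a wrapper class from the endpoint path might look like the sketch below. The classes `ChatClient`, `EmbeddingClient`, and `ResponsesClient` are hypothetical stand-ins for the real wrappers, and the keyword rules and the default endpoint are assumptions made for illustration.

```python
class _BaseClient:
    """Hypothetical stand-in for the API wrapper classes described above."""

    def __init__(self, model, endpoint):
        self.model = model
        self.endpoint = endpoint


class ChatClient(_BaseClient):
    pass


class EmbeddingClient(_BaseClient):
    pass


class ResponsesClient(_BaseClient):
    pass


# Assumed keyword rules, checked in order against the endpoint path.
_ENDPOINT_RULES = (
    ("embeddings", EmbeddingClient),
    ("responses", ResponsesClient),
    ("chat", ChatClient),
)


def make_api_client(model, endpoint="/chat/completions"):
    """Pick a client class by looking for a keyword in the endpoint path."""
    path = endpoint.lower()
    for keyword, client_cls in _ENDPOINT_RULES:
        if keyword in path:
            return client_cls(model, endpoint)
    raise ValueError(f"Unsupported endpoint: {endpoint}")


client = make_api_client("demo-model", endpoint="/embeddings")
```

Keeping the rules in a single ordered table means a new endpoint type only needs one extra entry, and an unknown endpoint fails loudly instead of silently falling back to chat.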
Model Preparation Functions:
The module provides prepare_* functions for each supported model type:
- prepare_fasttext_model -- FastText language identification models.
- prepare_huggingface_model -- HuggingFace transformers with automatic model class detection.
- prepare_vllm_model -- vLLM serving with tensor parallelism support.
- prepare_diffusion_model -- Diffusion pipelines (image2image, text2image, inpainting).
- prepare_kenlm_model -- KenLM language models.
- prepare_nltk_model / prepare_nltk_pos_tagger -- NLTK punkt tokenizers and POS taggers.
- prepare_sentencepiece_model -- SentencePiece tokenizers.
- prepare_simple_aesthetics_model -- CLIP-based aesthetics predictors.
- prepare_recognizeAnything_model -- RAM (Recognize Anything Model).
- prepare_spacy_model -- SpaCy language models with compressed archive support.
- prepare_video_blip_model -- Video-BLIP models with custom video vision model.
- prepare_yolo_model / prepare_fastsam_model -- YOLO and FastSAM detection models.
- prepare_dwpose_model -- DWPose detection with ONNX models.
- prepare_wilor_model / prepare_hawor_model -- Hand reconstruction models.
- prepare_embedding_model -- Transformer-based embedding models with pooling strategies.
- prepare_mmlab_model / MMLabModel -- MMDeploy-based models.
- And many more (deepcalib, moge, vggt, video_depth_anything, sam_3d_body, sdxl).
Model Lifecycle:
- MODEL_FUNCTION_MAPPING -- Registry mapping model type strings to their prepare_* functions.
- prepare_model -- Entry point that creates a functools.partial model key, pre-initializing models that need file locks.
- get_model -- Retrieves or initializes models from MODEL_ZOO with device placement and thread configuration.
- free_models -- Releases model memory and clears CUDA cache.
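The prepare, get, and free cycle can be illustrated with a toy registry. Everything here is a simplified sketch under stated assumptions: `ZOO`, `FUNCTION_MAPPING`, `prepare`, `get`, `free`, and `prepare_toy_model` are hypothetical names, the device string logic is a guess at one reasonable placement rule, and file locks, thread configuration, and CUDA cache handling are left out.

```python
from functools import partial

ZOO = {}  # toy stand-in for MODEL_ZOO: model key -> loaded instance


def prepare_toy_model(name, device="cpu"):
    """Hypothetical prepare_* function; returns a dict instead of a model."""
    return {"name": name, "device": device}


# Toy stand-in for MODEL_FUNCTION_MAPPING.
FUNCTION_MAPPING = {"toy": prepare_toy_model}


def prepare(model_type, **model_kwargs):
    """Build a deferred initializer; the partial object doubles as cache key."""
    return partial(FUNCTION_MAPPING[model_type], **model_kwargs)


def get(model_key, rank=None, use_cuda=False):
    """Initialize the model on first use, then serve it from the cache."""
    if model_key not in ZOO:
        # Assumed placement rule: one device index per worker rank.
        device = f"cuda:{rank or 0}" if use_cuda else "cpu"
        ZOO[model_key] = model_key(device=device)
    return ZOO[model_key]


def free(clear_zoo=True):
    """Drop cached instances so their memory can be reclaimed."""
    if clear_zoo:
        ZOO.clear()


key = prepare("toy", name="demo")
first = get(key)
second = get(key)  # served from ZOO, no second initialization
```

Because a `functools.partial` object hashes by identity, the same key object has to be passed to every `get` call for the cache lookup to hit; calling `prepare` again with identical arguments yields a distinct key in this sketch.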
Usage
Use this module to load any model required by an operator. Models are typically prepared once via prepare_model and then retrieved per-worker via get_model, with automatic GPU placement and caching.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/utils/model_utils.py
Signature
class ChatAPIModel:
    def __init__(self, model=None, endpoint=None,
                 response_path=None, **kwargs): ...
    def __call__(self, messages, **kwargs) -> str: ...

class EmbeddingAPIModel:
    def __init__(self, model=None, endpoint=None,
                 response_path=None, **kwargs): ...
    def __call__(self, input, **kwargs) -> list: ...

def check_model(model_name, force=False) -> str: ...
def prepare_model(model_type, **model_kwargs) -> partial: ...
def get_model(model_key=None, rank=None, use_cuda=False): ...
def free_models(clear_model_zoo=True): ...

# 25+ prepare_* functions for different model types
def prepare_api_model(model, *, endpoint=None, ...): ...
def prepare_huggingface_model(pretrained_model_name_or_path, ...): ...
def prepare_vllm_model(pretrained_model_name_or_path, ...): ...
# ... etc.
Import
from data_juicer.utils.model_utils import (
    prepare_model, get_model, free_models,
    ChatAPIModel, check_model
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_type | str | Yes | Type key from MODEL_FUNCTION_MAPPING (e.g., "api", "huggingface", "vllm", "fasttext"). |
| model_name | str | Yes | Model name or path for downloading/loading. |
| model_key | partial | No | Model key returned by prepare_model, used by get_model. |
| rank | int | No | GPU rank for device placement. |
| use_cuda | bool | No | Whether to use CUDA for model inference. |
| force | bool | No | Force re-download of model files. |
Outputs
| Name | Type | Description |
|---|---|---|
| model_key | functools.partial | Callable partial that initializes the model when invoked with a device parameter. |
| model | varies | Loaded model instance (type depends on model_type). |
| processor | varies | Optional processor/tokenizer returned alongside some models. |
Usage Examples
from data_juicer.utils.model_utils import (
    prepare_model, get_model, free_models, ChatAPIModel
)

# Prepare and get a fasttext model
model_key = prepare_model("fasttext", model_name="lid.176.bin")
model = get_model(model_key, use_cuda=False)
predictions = model.predict("Hello world")

# Use an API model
model_key = prepare_model("api", model="gpt-4",
                          base_url="https://api.openai.com/v1")
api_model = get_model(model_key)
response = api_model([{"role": "user", "content": "Hello"}])

# Prepare a HuggingFace model with GPU
model_key = prepare_model("huggingface",
                          pretrained_model_name_or_path="bert-base-uncased")
model, processor = get_model(model_key, use_cuda=True)

# Clean up
free_models()
Related Pages
- Datajuicer_Data_juicer_NLTK_Utils -- NLTK-specific utilities used by prepare_nltk_model
- Datajuicer_Data_juicer_Multimodal_Utils -- Multimodal data loading for model inputs
- Datajuicer_Data_juicer_Process_Utils -- Resource allocation for model-based operators