Implementation:Datajuicer Data juicer Model Utils

From Leeroopedia
Knowledge Sources
Domains: Model Management, Machine Learning, API Integration
Last Updated: 2026-02-14 16:00 GMT

Overview

A large model-management utility that loads, downloads, caches, and prepares the dozens of model types used by Data-Juicer operators, including API models, HuggingFace transformers, fastText, KenLM, NLTK, YOLO, and vLLM.

Description

The model_utils module is the second-largest file in the utils package and a critical dependency for all model-based operators. It centralizes all model lifecycle management with these components:

Model Cache and Download:

  • MODEL_ZOO -- Global dictionary caching loaded model instances to avoid redundant initialization.
  • MODEL_LINKS / BACKUP_MODEL_LINKS -- Primary and backup download URLs for various models, using pattern matching (fnmatch) for URL resolution.
  • check_model -- Checks if a model exists in the cache directory (DATA_JUICER_MODELS_CACHE) or external model home (DATA_JUICER_EXTERNAL_MODELS_HOME), downloading from primary or backup URLs if not found.
  • check_model_home -- Resolves model paths through the external models home directory.
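
The lookup order described above (search the known directories first, then fall back to primary and backup URLs chosen by fnmatch patterns) can be sketched as follows. This is a minimal illustration, not the module's code: the link tables, URLs, and the helper names resolve_urls and find_cached are hypothetical, and no download is performed.

```python
import fnmatch
import os

# Hypothetical link tables; the real MODEL_LINKS / BACKUP_MODEL_LINKS
# entries are not reproduced here.
PRIMARY_LINKS = {
    "lid.*.bin": "https://primary.example.com/fasttext/",
    "*.arpa.bin": "https://primary.example.com/kenlm/",
}
BACKUP_LINKS = {
    "lid.*.bin": "https://backup.example.com/fasttext/",
}


def resolve_urls(model_name, primary=PRIMARY_LINKS, backup=BACKUP_LINKS):
    """Return candidate download URLs: primary match first, then backup."""
    urls = []
    for table in (primary, backup):
        for pattern, base_url in table.items():
            if fnmatch.fnmatch(model_name, pattern):
                urls.append(base_url + model_name)
                break  # the first matching pattern in each table wins
    return urls


def find_cached(model_name, search_dirs):
    """Return the first existing path for model_name, or None if absent."""
    for directory in search_dirs:
        candidate = os.path.join(directory, model_name)
        if os.path.exists(candidate):
            return candidate
    return None


print(resolve_urls("lid.176.bin"))
```

A caller would try find_cached over the cache and external-home directories and only iterate over resolve_urls when it returns None.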

API Model Wrappers:

  • ChatAPIModel -- OpenAI-compatible chat API wrapper with configurable endpoint, response path extraction, and error handling.
  • EmbeddingAPIModel -- Embedding API wrapper for vector generation endpoints.
  • ResponsesAPIModel -- OpenAI Responses API wrapper.
  • prepare_api_model -- Factory function that selects the appropriate API class based on the endpoint path, with optional processor (tokenizer) initialization.
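
To make the two ideas in this list concrete, the sketch below shows one way a factory could pick a wrapper from the endpoint path and how a dot-separated response path could be walked through a JSON-like payload. The stub classes, the suffix table, and the helpers select_wrapper and extract_by_path are assumptions for illustration; the module's actual selection and extraction logic may differ. The two default paths follow the shape of OpenAI chat and embedding responses.

```python
from functools import reduce


def extract_by_path(payload, path):
    """Follow a dot-separated path; numeric segments index into lists."""
    def step(node, key):
        return node[int(key)] if isinstance(node, list) else node[key]
    return reduce(step, path.split("."), payload)


class StubChat:
    """Stand-in for a chat wrapper (hypothetical)."""
    default_response_path = "choices.0.message.content"


class StubEmbedding:
    """Stand-in for an embedding wrapper (hypothetical)."""
    default_response_path = "data.0.embedding"


# Hypothetical suffix table; the real selection rules are not shown in this page.
WRAPPERS_BY_SUFFIX = {
    "/chat/completions": StubChat,
    "/embeddings": StubEmbedding,
}


def select_wrapper(endpoint=None):
    """Choose a wrapper class from the endpoint path, defaulting to chat."""
    endpoint = (endpoint or "/chat/completions").rstrip("/")
    for suffix, wrapper in WRAPPERS_BY_SUFFIX.items():
        if endpoint.endswith(suffix):
            return wrapper
    raise ValueError(f"Unsupported endpoint: {endpoint}")


chat_payload = {"choices": [{"message": {"content": "Hello!"}}]}
wrapper = select_wrapper("/v1/chat/completions")
print(extract_by_path(chat_payload, wrapper.default_response_path))
```

Keeping the response path as data rather than code is what lets a single wrapper serve providers whose payloads nest the answer differently.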

Model Preparation Functions: The module provides prepare_* functions for each supported model type:

  • prepare_fasttext_model -- FastText language identification models.
  • prepare_huggingface_model -- HuggingFace transformers with automatic model class detection.
  • prepare_vllm_model -- vLLM serving with tensor parallelism support.
  • prepare_diffusion_model -- Diffusion pipelines (image2image, text2image, inpainting).
  • prepare_kenlm_model -- KenLM language models.
  • prepare_nltk_model / prepare_nltk_pos_tagger -- NLTK punkt tokenizers and POS taggers.
  • prepare_sentencepiece_model -- SentencePiece tokenizers.
  • prepare_simple_aesthetics_model -- CLIP-based aesthetics predictors.
  • prepare_recognizeAnything_model -- RAM (Recognize Anything Model).
  • prepare_spacy_model -- SpaCy language models with compressed archive support.
  • prepare_video_blip_model -- Video-BLIP models with custom video vision model.
  • prepare_yolo_model / prepare_fastsam_model -- YOLO and FastSAM detection models.
  • prepare_dwpose_model -- DWPose detection with ONNX models.
  • prepare_wilor_model / prepare_hawor_model -- Hand reconstruction models.
  • prepare_embedding_model -- Transformer-based embedding models with pooling strategies.
  • prepare_mmlab_model / MMLabModel -- MMDeploy-based models.
  • Further prepare_* functions cover deepcalib, moge, vggt, video_depth_anything, sam_3d_body, and sdxl.

Model Lifecycle:

  • MODEL_FUNCTION_MAPPING -- Registry mapping model type strings to their prepare_* functions.
  • prepare_model -- Entry point that creates a functools.partial model key, pre-initializing models that need file locks.
  • get_model -- Retrieves or initializes models from MODEL_ZOO with device placement and thread configuration.
  • free_models -- Releases model memory and clears CUDA cache.
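
The registry, partial-as-key, and cache pattern described above can be sketched in a few lines. This is a toy model under stated assumptions, not the module's implementation: REGISTRY, ZOO, prepare, get, free, and prepare_toy are hypothetical stand-ins for MODEL_FUNCTION_MAPPING, MODEL_ZOO, prepare_model, get_model, free_models, and a prepare_* function, and the "model" is a plain dict.

```python
from functools import partial

ZOO = {}  # stand-in for MODEL_ZOO: model key -> loaded instance


def prepare_toy(name="toy", device="cpu"):
    """Toy prepare_* function: 'loads' a model as a plain dict."""
    return {"name": name, "device": device}


# Stand-in for MODEL_FUNCTION_MAPPING.
REGISTRY = {"toy": prepare_toy}


def prepare(model_type, **model_kwargs):
    """Bind the arguments now; defer the actual loading until get()."""
    if model_type not in REGISTRY:
        raise ValueError(f"Unknown model type: {model_type}")
    return partial(REGISTRY[model_type], **model_kwargs)


def get(model_key, use_cuda=False):
    """Load on first use, then serve the cached instance."""
    if model_key not in ZOO:
        device = "cuda" if use_cuda else "cpu"
        ZOO[model_key] = model_key(device=device)
    return ZOO[model_key]


def free(clear_zoo=True):
    """Drop cached instances so their memory can be reclaimed."""
    if clear_zoo:
        ZOO.clear()


key = prepare("toy", name="demo")
first = get(key)
second = get(key)
print(first is second)  # the second call is served from the cache
free()
```

Because a functools.partial object hashes by identity, the same key object must be reused to hit the cache; preparing the same model twice yields two distinct keys.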

Usage

Use this module to load any model required by an operator. Models are typically prepared once via prepare_model and then retrieved per-worker via get_model, with automatic GPU placement and caching.
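
As an illustration of per-worker GPU placement, the sketch below maps a worker rank to a device string. It assumes a round-robin policy (rank modulo the number of visible GPUs) and takes the GPU count as an explicit argument so that it runs without a GPU library; pick_device and gpu_count are hypothetical names, and the page does not confirm that get_model uses this exact policy.

```python
def pick_device(rank=None, use_cuda=False, gpu_count=0):
    """Map a worker rank to a device string (assumed round-robin policy)."""
    if not use_cuda or gpu_count <= 0:
        return "cpu"
    rank = 0 if rank is None else rank
    return f"cuda:{rank % gpu_count}"


# Four workers sharing two GPUs alternate between the devices.
for worker_rank in range(4):
    print(worker_rank, pick_device(worker_rank, use_cuda=True, gpu_count=2))
```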

Code Reference

Source Location

Signature

class ChatAPIModel:
    def __init__(self, model=None, endpoint=None,
                 response_path=None, **kwargs): ...
    def __call__(self, messages, **kwargs) -> str: ...

class EmbeddingAPIModel:
    def __init__(self, model=None, endpoint=None,
                 response_path=None, **kwargs): ...
    def __call__(self, input, **kwargs) -> list: ...

def check_model(model_name, force=False) -> str: ...
def prepare_model(model_type, **model_kwargs) -> partial: ...
def get_model(model_key=None, rank=None, use_cuda=False): ...
def free_models(clear_model_zoo=True): ...

# 25+ prepare_* functions for different model types
def prepare_api_model(model, *, endpoint=None, ...): ...
def prepare_huggingface_model(pretrained_model_name_or_path, ...): ...
def prepare_vllm_model(pretrained_model_name_or_path, ...): ...
# ... etc.

Import

from data_juicer.utils.model_utils import (
    prepare_model, get_model, free_models,
    ChatAPIModel, check_model
)

I/O Contract

Inputs

Name | Type | Required | Description
model_type | str | Yes | Type key from MODEL_FUNCTION_MAPPING (e.g., "api", "huggingface", "vllm", "fasttext").
model_name | str | Yes | Model name or path for downloading/loading.
model_key | partial | No | Model key returned by prepare_model, used by get_model.
rank | int | No | GPU rank for device placement.
use_cuda | bool | No | Whether to use CUDA for model inference.
force | bool | No | Force re-download of model files.

Outputs

Name | Type | Description
model_key | functools.partial | Callable partial that initializes the model when invoked with a device parameter.
model | varies | Loaded model instance (type depends on model_type).
processor | varies | Optional processor/tokenizer returned alongside some models.

Usage Examples

from data_juicer.utils.model_utils import (
    prepare_model, get_model, free_models
)

# Prepare and get a fasttext model
model_key = prepare_model("fasttext", model_name="lid.176.bin")
model = get_model(model_key, use_cuda=False)
predictions = model.predict("Hello world")

# Use an API model
model_key = prepare_model("api", model="gpt-4",
                          base_url="https://api.openai.com/v1")
api_model = get_model(model_key)
response = api_model([{"role": "user", "content": "Hello"}])

# Prepare a HuggingFace model with GPU
model_key = prepare_model("huggingface",
                          pretrained_model_name_or_path="bert-base-uncased")
model, processor = get_model(model_key, use_cuda=True)

# Clean up
free_models()
