Implementation:Datajuicer Data juicer DJ MCP Recipe Flow

From Leeroopedia
Knowledge Sources
Domains: Tooling
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for exposing recipe-based data processing workflows as MCP tools.

Description

DJ_MCP_Recipe_Flow creates an MCP server in recipe-flow mode that exposes two high-level tools: get_data_processing_ops for discovering available operators (filterable by type and tags like modality, resource, and model requirements) and run_data_recipe for executing a multi-step data processing pipeline. The get_data_processing_ops function uses OPSearcher to query operators by type and tags, returning their descriptions and signatures. The run_data_recipe function accepts a dataset path and a list of operator configurations (as dictionaries), builds a Data-Juicer config, and delegates execution to execute_op.
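
The type-and-tag filtering described above can be illustrated with a small, self-contained sketch. It does not use OPSearcher or any Data-Juicer code: the `ToyOp` record, the `TOY_REGISTRY` entries, and the `search_ops` helper are all hypothetical names invented here, and the real matching logic may differ. The sketch only shows one plausible reading of `match_all` (every requested tag must be present) versus match-any (at least one tag is present).

```python
from typing import Dict, List, NamedTuple, Optional, Set


class ToyOp(NamedTuple):
    """Hypothetical stand-in for an operator record (not a Data-Juicer class)."""
    op_type: str
    tags: Set[str]


# Illustrative registry; operator names and tags are made up for this sketch.
TOY_REGISTRY: Dict[str, ToyOp] = {
    "toy_text_filter": ToyOp("filter", {"text", "cpu"}),
    "toy_image_filter": ToyOp("filter", {"image", "gpu"}),
    "toy_text_gpu_mapper": ToyOp("mapper", {"text", "gpu"}),
}


def search_ops(
    registry: Dict[str, ToyOp],
    op_type: Optional[str] = None,
    tags: Optional[List[str]] = None,
    match_all: bool = True,
) -> List[str]:
    """Return the sorted names of operators matching the type and tag filters."""
    wanted = set(tags or [])
    matches = []
    for name, op in registry.items():
        # A type filter of None means "any operator type".
        if op_type is not None and op.op_type != op_type:
            continue
        if wanted:
            # match_all: every requested tag is present; otherwise any overlap suffices.
            hit = wanted <= op.tags if match_all else bool(wanted & op.tags)
            if not hit:
                continue
        matches.append(name)
    return sorted(matches)
```

With this toy registry, `search_ops(TOY_REGISTRY, tags=["text", "gpu"])` returns only `toy_text_gpu_mapper`, while the same call with `match_all=False` returns all three operators.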

Usage

Use this when an AI agent needs to discover available operators and then compose them into a multi-operator data processing recipe as part of a conversational workflow.

Code Reference

Source Location

Signature

def get_data_processing_ops(
    op_type: Optional[str] = None,
    tags: Optional[List[str]] = None,
    match_all: bool = True,
) -> dict:

def run_data_recipe(
    dataset_path: str,
    process: list[Dict],
    export_path: Optional[str] = None,
    np: int = 1,
) -> str:

def create_mcp_server(port: str = "8000"):

Import

from data_juicer.tools.DJ_mcp_recipe_flow import create_mcp_server, get_data_processing_ops, run_data_recipe

I/O Contract

Inputs

Name | Type | Required | Description
op_type | str | No | Operator type filter: aggregator, deduplicator, filter, grouper, mapper, or selector. Default: None
tags | List[str] | No | Tags to filter operators by (e.g. "text", "image", "gpu", "api"). Default: None
match_all | bool | No | If True, match all tags; if False, match any. Default: True
dataset_path | str | Yes (for run_data_recipe) | Path to the dataset to be processed
process | list[Dict] | Yes (for run_data_recipe) | List of operator configurations to execute sequentially
export_path | str | No | Path to export the processed dataset. Default: './outputs'
np | int | No | Number of processes. Default: 1
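
As a rough illustration of how the `run_data_recipe` inputs could be gathered into a single configuration mapping, here is a minimal sketch. The helper `build_recipe_config` is hypothetical and is not part of Data-Juicer; the key names simply mirror the parameter names above, and the way the real function builds its config before handing off to `execute_op` may differ.

```python
from typing import Any, Dict, List, Optional


def build_recipe_config(
    dataset_path: str,
    process: List[Dict[str, Any]],
    export_path: Optional[str] = None,
    np: int = 1,
) -> Dict[str, Any]:
    """Hypothetical helper: collect recipe inputs into one config dict."""
    config: Dict[str, Any] = {
        "dataset_path": dataset_path,
        # Shallow-copy each step so the caller's list is not shared with the config.
        "process": [dict(step) for step in process],
        "np": np,
    }
    # Assumption: an unset export path is left out so downstream defaults can apply.
    if export_path is not None:
        config["export_path"] = export_path
    return config


config = build_recipe_config(
    dataset_path="/path/to/dataset.jsonl",
    process=[{"text_length_filter": {"min_len": 10, "max_len": 50}}],
    np=4,
)
```

Keeping `process` as an ordered list preserves the sequential execution order stated in the table.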

Outputs

Name | Type | Description
ops_dict | dict | Dictionary mapping operator names to their descriptions, parameter docs, and signatures
result | str | Result message from executing the data recipe

Usage Examples

from data_juicer.tools.DJ_mcp_recipe_flow import get_data_processing_ops, run_data_recipe

# Discover available filter operators for text data
ops = get_data_processing_ops(op_type="filter", tags=["text"])

# Run a multi-step data recipe
result = run_data_recipe(
    dataset_path="/path/to/dataset.jsonl",
    process=[
        {"text_length_filter": {"min_len": 10, "max_len": 50}},
        {"language_id_score_filter": {"lang": "en", "min_score": 0.8}}
    ],
    export_path="/path/to/output/",
    np=4
)
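
Because an agent assembles the `process` list dynamically, it can help to check its shape before calling `run_data_recipe`. The sketch below is a hypothetical pre-flight check, not something the source says Data-Juicer performs. It assumes each step is a single-key dictionary mapping an operator name to a dictionary of parameters, as in the example above.

```python
from typing import Any, Dict, List


def validate_process(process: List[Dict[str, Any]]) -> List[str]:
    """Hypothetical check: return operator names in order, or raise ValueError."""
    names: List[str] = []
    for index, step in enumerate(process):
        # Assumed shape: exactly one {operator_name: params} pair per step.
        if not isinstance(step, dict) or len(step) != 1:
            raise ValueError(f"step {index} must be a single-key dict, got {step!r}")
        (name, params), = step.items()
        if not isinstance(name, str):
            raise ValueError(f"step {index} has a non-string operator name: {name!r}")
        if not isinstance(params, dict):
            raise ValueError(f"step {index} parameters must be a dict, got {params!r}")
        names.append(name)
    return names


op_names = validate_process(
    [
        {"text_length_filter": {"min_len": 10, "max_len": 50}},
        {"language_id_score_filter": {"lang": "en", "min_score": 0.8}},
    ]
)
```

Returning the operator names in order also gives the agent a compact summary of the recipe it is about to run.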
