Implementation:Datajuicer Data juicer DJ MCP Recipe Flow
| Knowledge Sources | |
|---|---|
| Domains | Tooling |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, that exposes recipe-based data processing workflows as MCP tools.
Description
DJ_MCP_Recipe_Flow creates an MCP server in recipe-flow mode that exposes two high-level tools. get_data_processing_ops discovers available operators: it uses OPSearcher to query operators by type and by tags (such as modality, resource, and model requirements) and returns their descriptions and signatures. run_data_recipe executes a multi-step data processing pipeline: it accepts a dataset path and a list of operator configurations (as dictionaries), builds a Data-Juicer config, and delegates execution to execute_op.
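The difference between match_all=True and match_all=False can be illustrated with a small, self-contained sketch. It does not use OPSearcher; the registry, the tag assignments, and the search_ops helper are hypothetical and exist only to show "all tags" versus "any tag" filtering.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory registry. The real lookup goes through OPSearcher;
# the tag assignments below are invented for illustration.
TOY_REGISTRY: Dict[str, dict] = {
    "text_length_filter": {"type": "filter", "tags": {"text", "cpu"}},
    "image_aspect_ratio_filter": {"type": "filter", "tags": {"image", "cpu"}},
    "image_captioning_mapper": {"type": "mapper", "tags": {"image", "text", "gpu"}},
}


def search_ops(
    op_type: Optional[str] = None,
    tags: Optional[List[str]] = None,
    match_all: bool = True,
) -> List[str]:
    """Return operator names that pass the type filter and the tag filter."""
    wanted = set(tags or [])
    hits = []
    for name, info in TOY_REGISTRY.items():
        if op_type is not None and info["type"] != op_type:
            continue
        if wanted:
            if match_all and not wanted <= info["tags"]:
                continue  # at least one requested tag is missing
            if not match_all and not (wanted & info["tags"]):
                continue  # none of the requested tags is present
        hits.append(name)
    return sorted(hits)


both = search_ops(tags=["image", "text"])                     # every tag required
either = search_ops(tags=["image", "text"], match_all=False)  # any tag suffices
text_filters = search_ops(op_type="filter", tags=["text"])
```

In this toy registry, requiring both tags narrows the result to a single operator, while match_all=False returns every operator that carries at least one of them.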
Usage
Use this mode when an AI agent needs to discover operators and compose them into a multi-operator data processing recipe over the course of a conversation.
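From the agent's side, the workflow is typically two tool calls: one to discover operators, one to run the recipe. The sketch below builds the JSON-RPC 2.0 messages that an MCP client sends for a tools/call request. In practice an MCP client library constructs these messages; the build_tool_call helper and the request ids are hypothetical and shown only to make the two-step flow concrete.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(payload, indent=2)


# Step 1: ask the server which text filters exist.
discover_request = build_tool_call(
    1,
    "get_data_processing_ops",
    {"op_type": "filter", "tags": ["text"], "match_all": True},
)

# Step 2: run a recipe assembled from the operators found in step 1.
run_request = build_tool_call(
    2,
    "run_data_recipe",
    {
        "dataset_path": "/path/to/dataset.jsonl",
        "process": [{"text_length_filter": {"min_len": 10, "max_len": 50}}],
        "np": 4,
    },
)

print(discover_request)
```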
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/tools/DJ_mcp_recipe_flow.py
Signature
def get_data_processing_ops(
    op_type: Optional[str] = None,
    tags: Optional[List[str]] = None,
    match_all: bool = True,
) -> dict:

def run_data_recipe(
    dataset_path: str,
    process: list[Dict],
    export_path: Optional[str] = None,
    np: int = 1,
) -> str:

def create_mcp_server(port: str = "8000"):
Import
from data_juicer.tools.DJ_mcp_recipe_flow import create_mcp_server, get_data_processing_ops, run_data_recipe
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| op_type | str | No | Operator type filter: aggregator, deduplicator, filter, grouper, mapper, or selector. Default: None |
| tags | List[str] | No | Tags to filter operators by (e.g. "text", "image", "gpu", "api"). Default: None |
| match_all | bool | No | If True, match all tags; if False, match any. Default: True |
| dataset_path | str | Yes (for run_data_recipe) | Path to the dataset to be processed |
| process | list[Dict] | Yes (for run_data_recipe) | List of operator configurations to execute sequentially |
| export_path | str | No | Path to export the processed dataset. Default: None, which falls back to './outputs' |
| np | int | No | Number of processes. Default: 1 |
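How the run_data_recipe inputs might be assembled into a config can be sketched as follows. This is an illustration, not the module's code: build_recipe_config is a hypothetical helper, the single-key shape of each process entry is inferred from the usage example, and the './outputs' fallback reflects the default listed in the table.

```python
from typing import Any, Dict, List, Optional

DEFAULT_EXPORT_PATH = "./outputs"  # default listed in the Inputs table


def build_recipe_config(
    dataset_path: str,
    process: List[Dict[str, Any]],
    export_path: Optional[str] = None,
    np: int = 1,
) -> Dict[str, Any]:
    """Validate recipe inputs and return a plain config dictionary (hypothetical helper)."""
    if not dataset_path:
        raise ValueError("dataset_path is required")
    if np < 1:
        raise ValueError("np must be a positive integer")
    for index, step in enumerate(process):
        # Assumption: each step is {operator_name: {param: value, ...}}.
        if not isinstance(step, dict) or len(step) != 1:
            raise ValueError(f"step {index} must map exactly one operator name to its parameters")
        params = next(iter(step.values()))
        if params is not None and not isinstance(params, dict):
            raise ValueError(f"step {index} parameters must be a dict or None")
    return {
        "dataset_path": dataset_path,
        "export_path": export_path or DEFAULT_EXPORT_PATH,
        "process": list(process),  # order is preserved: steps run sequentially
        "np": np,
    }


config = build_recipe_config(
    "/path/to/dataset.jsonl",
    [{"text_length_filter": {"min_len": 10, "max_len": 50}}],
    np=4,
)
```

Validating the recipe shape before execution gives an agent a short, specific error message to react to when a step is malformed.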
Outputs
| Name | Type | Description |
|---|---|---|
| ops_dict | dict | Dictionary mapping operator names to their descriptions, parameter docs, and signatures |
| result | str | Result message from executing the data recipe |
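To give a feel for what an ops_dict entry could contain, the sketch below derives a description, parameter docs, and a signature from a toy function using the standard inspect module. The key names ("description", "params", "signature"), the describe_op helper, and toy_length_filter are assumptions for illustration; the actual structure returned by get_data_processing_ops may differ.

```python
import inspect
from typing import Callable, Dict


def toy_length_filter(min_len: int = 10, max_len: int = 50):
    """Keep samples whose text length falls within [min_len, max_len].

    :param min_len: minimum text length
    :param max_len: maximum text length
    """


def describe_op(name: str, fn: Callable) -> Dict[str, dict]:
    """Build one ops_dict-style entry from a callable's docstring and signature."""
    doc = inspect.getdoc(fn) or ""
    # First paragraph is the summary; the remainder holds the parameter docs.
    summary, _, rest = doc.partition("\n\n")
    return {
        name: {
            "description": summary,
            "params": rest.strip(),
            "signature": str(inspect.signature(fn)),
        }
    }


entry = describe_op("toy_length_filter", toy_length_filter)
print(entry["toy_length_filter"]["signature"])
```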
Usage Examples
from data_juicer.tools.DJ_mcp_recipe_flow import get_data_processing_ops, run_data_recipe

# Discover available filter operators for text data
ops = get_data_processing_ops(op_type="filter", tags=["text"])

# Run a multi-step data recipe
result = run_data_recipe(
    dataset_path="/path/to/dataset.jsonl",
    process=[
        {"text_length_filter": {"min_len": 10, "max_len": 50}},
        {"language_id_score_filter": {"lang": "en", "min_score": 0.8}},
    ],
    export_path="/path/to/output/",
    np=4,
)