Implementation:Datajuicer Data juicer Service API

From Leeroopedia
Revision as of 12:23, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Datajuicer_Data_juicer_Service_API.md)
Knowledge Sources
Domains Data_Processing, REST_API
Last Updated 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for exposing Data-Juicer operators and functions as HTTP REST API endpoints.

Description

service.py is a FastAPI-based REST service that dynamically registers Data-Juicer classes and functions as HTTP endpoints. It traverses the specified directories for __init__.py files and registers every object listed in __all__: classes are exposed as POST endpoints at /{module_path}/{ClassName}/{method_name}, and functions as GET endpoints at /{module_path}/{function_name}.

For each request, the service parses the JSON body into class initialization arguments, parses the query parameters into method or function arguments, decodes parameter values that carry the <json_dumps> prefix as JSON, loads the dataset via DatasetBuilder when a dataset parameter is given, and exports the result via Exporter.

For class endpoints, only methods in the allowed_methods whitelist are registered: run, process, compute_stats, compute_hash, analyze, compute, process_single, process_batched, compute_stats_single, and compute_stats_batched.
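The routing convention described above can be sketched with the standard library alone. This is an illustration, not the code in service.py: `class_routes`, `function_route`, and the toy `TextLengthFilter` class are hypothetical, and the snippet assumes that the module path is the dotted module name with dots replaced by slashes.

```python
import inspect

# Whitelist copied from the description above.
ALLOWED_METHODS = {
    "run", "process", "compute_stats", "compute_hash", "analyze", "compute",
    "process_single", "process_batched",
    "compute_stats_single", "compute_stats_batched",
}


def class_routes(module_name, cls, allowed=ALLOWED_METHODS):
    """Return the POST paths a class would receive (hypothetical helper)."""
    base = "/" + module_name.replace(".", "/")  # assumed path mapping
    routes = []
    # getmembers yields (name, member) pairs sorted by name.
    for name, _member in inspect.getmembers(cls, predicate=inspect.isfunction):
        if name.startswith("_") or name not in allowed:
            continue  # private and non-whitelisted methods are skipped
        routes.append(f"{base}/{cls.__name__}/{name}")
    return routes


def function_route(module_name, func):
    """Return the GET path a function would receive (hypothetical helper)."""
    base = "/" + module_name.replace(".", "/")
    return f"{base}/{func.__name__}"


class TextLengthFilter:  # toy stand-in, not the real operator
    def __init__(self, min_len=10, max_len=10000):
        self.min_len, self.max_len = min_len, max_len

    def compute_stats(self, sample):
        return sample

    def process(self, sample):
        return True

    def helper(self):  # public, but not in the whitelist
        return None


routes = class_routes("data_juicer.ops.filter", TextLengthFilter)
for route in routes:
    print("POST", route)
```

In this sketch `helper` gets no route even though it is public, which mirrors the effect of the whitelist: only the named entry points become reachable over HTTP.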

Usage

Use when you need programmatic HTTP access to Data-Juicer operators from external systems, web UIs, or microservice architectures without importing Data-Juicer directly.

Code Reference

Source Location

Signature

from typing import Dict

from fastapi import FastAPI

app = FastAPI()

def register_objects_from_init(directory: str):
    """
    Traverse the specified directory for __init__.py files and
    register objects defined in __all__.
    """

def register_class(module, cls):
    """Register class and its methods as POST endpoints."""

def register_function(module, func):
    """Register a function as a GET endpoint."""

def _invoke(callable, request):
    """Parse query params, load datasets, invoke callable, export results."""

def _parse_json_dumps(params: Dict, prefix="<json_dumps>"):
    """Parse parameters with special JSON dump prefix."""

def _setup_cfg(params: Dict):
    """Convert string 'cfg' parameter to jsonargparse Namespace."""

def _setup_dataset(params: Dict):
    """Setup dataset loading and result exporting from 'dataset' parameter."""

def _get_public_methods(cls, allowed=None):
    """Get public methods of a class filtered by allowed set."""

Import

from service import app, register_objects_from_init, register_class, register_function

I/O Contract

Inputs

Name | Type | Required | Description
directory | str | Yes (register) | Directory path to scan for __init__.py files with __all__ exports
request body (POST) | JSON | Yes (class endpoints) | JSON object with class __init__ keyword arguments
query params (GET/POST) | URL query string | No | Method or function arguments as query parameters
dataset | str (query param) | No | Dataset path; triggers automatic loading via DatasetBuilder and result export
skip_return | bool (query param) | No | If true, return an empty string instead of the result. Default: False

Outputs

Name | Type | Description
JSON response | {"status": "success", "result": ...} | Success response with the callable's return value
HTTP 500 | {"detail": str} | Error response with the exception message
exported files | Files on disk | When the dataset param is provided, processed results are exported to outputs/{timestamp}/processed_data.jsonl
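A client can fold the two response shapes listed above into a single helper. The sketch below is one possible approach; `ServiceError` and `unwrap_response` are hypothetical names, and treating any payload without `"status": "success"` as a failure is an assumption rather than documented behavior.

```python
from typing import Any, Dict


class ServiceError(RuntimeError):
    """Raised when the service reports a failure (hypothetical name)."""


def unwrap_response(status_code: int, payload: Dict[str, Any]) -> Any:
    """Return the callable's result or raise ServiceError."""
    if status_code == 500:
        # Error shape from the table above: {"detail": str}
        raise ServiceError(payload.get("detail", "unknown error"))
    if payload.get("status") != "success":
        # Assumption: anything else is unexpected and treated as a failure.
        raise ServiceError(f"unexpected payload: {payload!r}")
    return payload.get("result")


# Success shape from the table above: {"status": "success", "result": ...}
ok = unwrap_response(200, {"status": "success", "result": "out.jsonl"})
print(ok)
```

With the requests library, the arguments would come from `response.status_code` and `response.json()`.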

Usage Examples

Starting the Service

# Start the service with uvicorn (service.py must be importable from the working directory)
uvicorn service:app --host 0.0.0.0 --port 8000

Calling an Operator via POST

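import requests

# Initialize TextLengthFilter from the JSON body and call its process method.
# The dataset query parameter triggers automatic loading and result export.
response = requests.post(
    "http://localhost:8000/data_juicer/ops/filter/TextLengthFilter/process",
    json={"min_len": 10, "max_len": 10000},
    params={"dataset": "./data/input.jsonl"},
)
response.raise_for_status()  # failures are returned as HTTP 500 with a "detail" message
result = response.json()
print(result["status"])   # "success"
print(result["result"])   # path to exported results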

Calling a Function via GET

import requests

# Call a registered function
response = requests.get(
    "http://localhost:8000/data_juicer/core/function_name",
    params={"arg1": "value1"}
)
print(response.json())
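Query strings carry only text, so a list or dict argument needs an encoding before it can be sent to a GET endpoint. The sketch below builds such a URL on the client side using the `<json_dumps>` prefix. `build_url` is a hypothetical helper, `function_name` is the same placeholder as in the example above, and the `options` argument is invented for illustration; whether a given function accepts prefixed values depends on the service parsing them, which is assumed here.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

JSON_PREFIX = "<json_dumps>"


def build_url(base: str, path: str, **params) -> str:
    """Build a request URL, JSON-encoding structured values (hypothetical)."""
    encoded = {}
    for key, value in params.items():
        if isinstance(value, (dict, list)):
            # Structured values are serialized and marked with the prefix.
            encoded[key] = JSON_PREFIX + json.dumps(value)
        else:
            encoded[key] = value
    return f"{base}{path}?{urlencode(encoded)}"


url = build_url(
    "http://localhost:8000",
    "/data_juicer/core/function_name",  # placeholder path
    arg1="value1",
    options={"a": [1, 2]},              # hypothetical structured argument
)
print(url)

# Round-trip check: the query string decodes back to the prefixed value.
query = parse_qs(urlsplit(url).query)
print(query["options"][0])
```

`urlencode` percent-escapes the angle brackets, braces, and quotes, so the resulting URL is safe to pass to any HTTP client.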

Related Pages

Requires Environment
