Implementation:Datajuicer Data juicer Service API
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, REST_API |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for exposing its operators and functions as HTTP REST API endpoints.
Description
service.py is a FastAPI-based REST service that dynamically registers Data-Juicer classes and functions as HTTP endpoints. It traverses the specified directories for __init__.py files and registers the objects listed in each file's __all__. Class methods are exposed as POST endpoints at /{module_path}/{ClassName}/{method_name}, and functions as GET endpoints at /{module_path}/{function_name}. For each request, the service parses the JSON body as class initialization arguments and the query parameters as method or function arguments, and decodes parameter values that carry the special JSON prefix (<json_dumps> by default). When a dataset parameter is given, it also loads the dataset via DatasetBuilder and exports the results via Exporter. Only methods in the allowed_methods whitelist (run, process, compute_stats, compute_hash, analyze, compute, process_single, process_batched, compute_stats_single, compute_stats_batched) are registered for class endpoints.
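The routing convention described above can be illustrated with a minimal, framework-free sketch. It assumes the URL prefix is the dotted module path with dots replaced by slashes, which matches the example URLs on this page but is not confirmed against service.py. The names `ALLOWED_METHODS`, `public_methods`, `endpoint_paths`, and `DemoFilter` are hypothetical and exist only for illustration.

```python
import inspect

# Whitelist as quoted in the description above.
ALLOWED_METHODS = {
    "run", "process", "compute_stats", "compute_hash", "analyze", "compute",
    "process_single", "process_batched",
    "compute_stats_single", "compute_stats_batched",
}


def public_methods(cls, allowed=ALLOWED_METHODS):
    """Return sorted names of public callables on cls that are whitelisted."""
    names = []
    for name, _member in inspect.getmembers(cls, predicate=callable):
        if name.startswith("_"):
            continue  # skip private and dunder members
        if allowed is not None and name not in allowed:
            continue  # skip public methods outside the whitelist
        names.append(name)
    return sorted(names)


def endpoint_paths(module_name, cls):
    """Build /{module_path}/{ClassName}/{method_name} for each allowed method."""
    # Assumption: dots in the module name map to slashes in the URL.
    module_path = module_name.replace(".", "/")
    return [f"/{module_path}/{cls.__name__}/{m}" for m in public_methods(cls)]


class DemoFilter:  # hypothetical stand-in for an operator class
    def process(self, sample):
        return sample

    def compute_stats(self, sample):
        return sample

    def helper(self):  # public but not whitelisted, so no endpoint
        return None

    def _private(self):
        return None


paths = endpoint_paths("data_juicer.ops.filter", DemoFilter)
print(paths)
```

In this sketch only `process` and `compute_stats` yield paths; `helper` is dropped by the whitelist and `_private` by the underscore check.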
Usage
Use when you need programmatic HTTP access to Data-Juicer operators from external systems, web UIs, or microservice architectures without importing Data-Juicer directly.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: service.py
Signature
app = FastAPI()

def register_objects_from_init(directory: str):
    """
    Traverse the specified directory for __init__.py files and
    register objects defined in __all__.
    """

def register_class(module, cls):
    """Register class and its methods as POST endpoints."""

def register_function(module, func):
    """Register a function as a GET endpoint."""

def _invoke(callable, request):
    """Parse query params, load datasets, invoke callable, export results."""

def _parse_json_dumps(params: Dict, prefix="<json_dumps>"):
    """Parse parameters with special JSON dump prefix."""

def _setup_cfg(params: Dict):
    """Convert string 'cfg' parameter to jsonargparse Namespace."""

def _setup_dataset(params: Dict):
    """Setup dataset loading and result exporting from 'dataset' parameter."""

def _get_public_methods(cls, allowed=None):
    """Get public methods of a class filtered by allowed set."""
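To illustrate the prefix handling that `_parse_json_dumps` refers to, here is a minimal sketch. It assumes that a string value starting with the prefix is decoded with `json.loads` once the prefix is stripped, and that all other values pass through unchanged; the actual behavior in service.py may differ. `encode_param`, `parse_json_dumps`, `text_keys`, and `lang` are hypothetical names used only for this example.

```python
import json
from typing import Any, Dict

JSON_PREFIX = "<json_dumps>"  # default prefix taken from the signature above


def encode_param(value: Any, prefix: str = JSON_PREFIX) -> str:
    """Client side: mark a structured value so it can travel as a query string."""
    return prefix + json.dumps(value)


def parse_json_dumps(params: Dict[str, Any], prefix: str = JSON_PREFIX) -> Dict[str, Any]:
    """Server side (assumed behavior): decode prefixed values, keep the rest."""
    parsed = {}
    for key, value in params.items():
        if isinstance(value, str) and value.startswith(prefix):
            parsed[key] = json.loads(value[len(prefix):])
        else:
            parsed[key] = value
    return parsed


# Hypothetical parameters: a list-valued argument and a plain string.
query = {"text_keys": encode_param(["text", "title"]), "lang": "en"}
decoded = parse_json_dumps(query)
print(decoded)
```

A prefix like this lets lists and dictionaries pass through a query string, which otherwise carries only flat string values.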
Import
from service import app, register_objects_from_init, register_class, register_function
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| directory | str | Yes (register) | Directory path to scan for __init__.py files with __all__ exports |
| request body (POST) | JSON | Yes (class endpoints) | JSON object with class __init__ keyword arguments |
| query params (GET/POST) | URL query string | No | Method or function arguments as query parameters |
| dataset | str (query param) | No | Dataset path; triggers automatic loading via DatasetBuilder and result export |
| skip_return | bool (query param) | No | If true, return empty string instead of result. Default: False |
Outputs
| Name | Type | Description |
|---|---|---|
| JSON response | {"status": "success", "result": ...} | Success response with the callable's return value |
| HTTP 500 | {"detail": str} | Error response with exception message |
| exported files | Files on disk | When dataset param is provided, processed results exported to outputs/{timestamp}/processed_data.jsonl |
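A client can fold the two response shapes in the table above into a single helper. The sketch below assumes that a success arrives with HTTP 200 and `"status": "success"`, and that any other response carries a `detail` field; `ServiceError` and `unwrap_response` are hypothetical names, not part of service.py.

```python
from typing import Any, Dict


class ServiceError(RuntimeError):
    """Raised when the service reports a failure (hypothetical helper)."""


def unwrap_response(status_code: int, payload: Dict[str, Any]) -> Any:
    """Return the callable's result on success, raise ServiceError otherwise."""
    if status_code == 200 and payload.get("status") == "success":
        return payload.get("result")
    # Error responses are documented as {"detail": str}.
    detail = payload.get("detail", "unknown error")
    raise ServiceError(f"HTTP {status_code}: {detail}")


ok = unwrap_response(200, {"status": "success", "result": [1, 2, 3]})

try:
    unwrap_response(500, {"detail": "boom"})
    error_message = None
except ServiceError as exc:
    error_message = str(exc)
```

With the `requests` library, this would typically be called as `unwrap_response(response.status_code, response.json())`.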
Usage Examples
Starting the Service
# Start with uvicorn
uvicorn service:app --host 0.0.0.0 --port 8000
Calling an Operator via POST
import requests
# Call a filter operator's process method
response = requests.post(
    "http://localhost:8000/data_juicer/ops/filter/TextLengthFilter/process",
    json={"min_len": 10, "max_len": 10000},
    params={"dataset": "./data/input.jsonl"},
)
result = response.json()
print(result["status"]) # "success"
print(result["result"]) # path to exported results
Calling a Function via GET
import requests
# Call a registered function
response = requests.get(
    "http://localhost:8000/data_juicer/core/function_name",
    params={"arg1": "value1"},
)
print(response.json())