Implementation:Datajuicer Data juicer DataJuicerTestCaseBase Pattern

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Testing
Last Updated 2026-02-14 17:00 GMT

Overview

Pattern documentation for writing unit tests for Data-Juicer operators using the framework's test utilities.

Description

Data-Juicer provides DataJuicerTestCaseBase in data_juicer/utils/unittest_utils.py as a base class for operator tests. A typical test builds a small in-memory dataset with Dataset.from_dict(), instantiates the operator with test parameters, applies it (per sample through methods such as op.process_single, preceded by op.compute_stats_single for filters, or through the dataset with dataset.process(op)), and asserts on the output. Test files are grouped by operator type under tests/ops/, for example tests/ops/mapper/ and tests/ops/filter/.

Usage

Create a test file in the tests/ops/ subdirectory that matches the operator type (for example, tests/ops/filter/ for a filter). Extend DataJuicerTestCaseBase, or use standard unittest/pytest patterns.

Code Reference

Source Location

  • Repository: data-juicer
  • File: tests/ops/ directory, data_juicer/utils/unittest_utils.py
  • Lines: Various (test pattern)

Interface Specification

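import unittest

from datasets import Dataset

from data_juicer.ops.filter.text_length_filter import TextLengthFilter
from data_juicer.utils.constant import Fields
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase


class TestMyFilter(DataJuicerTestCaseBase):

    def test_basic_filtering(self):
        # 1. Build a small synthetic dataset in memory
        ds = Dataset.from_dict({
            'text': ['This is a long enough sample text for testing.',
                     'Short.'],
        })

        # 2. Instantiate the operator with test parameters
        op = TextLengthFilter(min_len=10, max_len=1000)

        # 3. Compute stats and apply the filter.
        #    Assumption: the filter stores its statistics under Fields.stats,
        #    so that column must exist before compute_stats_single runs.
        if Fields.stats not in ds.features:
            ds = ds.add_column(name=Fields.stats, column=[{}] * ds.num_rows)
        ds = ds.map(op.compute_stats_single)
        ds = ds.filter(op.process_single)

        # 4. Assert that only the long sample survives
        self.assertEqual(len(ds), 1)
        self.assertIn('long enough', ds[0]['text'])


if __name__ == '__main__':
    unittest.main()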
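
The filter test above runs in two phases: statistics are computed for every sample first, and the keep/drop decision is made afterwards from those statistics. The following dependency-free sketch illustrates that flow on plain dictionaries. StandInLengthFilter, STATS_KEY, and run_filter are hypothetical names invented for this illustration; they are not part of Data-Juicer and only mimic the method names used above.

```python
import unittest

STATS_KEY = "__stats__"  # hypothetical stand-in for the framework's stats field


class StandInLengthFilter:
    """Hypothetical filter mimicking the compute-stats-then-decide flow."""

    def __init__(self, min_len=10, max_len=1000):
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats_single(self, sample):
        # Phase 1: record the statistic on the sample; nothing is dropped here.
        sample[STATS_KEY]["text_len"] = len(sample["text"])
        return sample

    def process_single(self, sample):
        # Phase 2: return True to keep the sample, False to drop it.
        return self.min_len <= sample[STATS_KEY]["text_len"] <= self.max_len


def run_filter(op, samples):
    """Apply both phases to a list of dict samples and return the kept ones."""
    # Copy each sample and give it its own empty stats dict.
    prepared = [dict(s, **{STATS_KEY: {}}) for s in samples]
    with_stats = [op.compute_stats_single(s) for s in prepared]
    return [s for s in with_stats if op.process_single(s)]


class TestStandInLengthFilter(unittest.TestCase):

    def test_short_sample_is_dropped(self):
        samples = [
            {"text": "This is a long enough sample text for testing."},
            {"text": "Short."},
        ]
        kept = run_filter(StandInLengthFilter(min_len=10, max_len=1000), samples)
        self.assertEqual(len(kept), 1)
        self.assertIn("long enough", kept[0]["text"])
```

Separating the two phases lets a test assert on the computed statistic and on the final decision independently, which helps when a filter fails because of a wrong threshold rather than a wrong measurement.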

Import

from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

I/O Contract

Inputs

Name          Type     Required  Description
test dataset  Dataset  Yes       Synthetic dataset built with Dataset.from_dict()
operator      OP       Yes       Operator instance to test

Outputs

Name         Type  Description
test result  bool  Pass/fail based on assertions
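
Since the outcome of a test is decided entirely by its assertions, it can help to compare the whole output against an expected list rather than checking single fields. The sketch below shows one way to do that with unittest's subTest. The helper apply_per_sample and the transform strip_text are hypothetical stand-ins for an operator's per-sample method and work on plain dictionaries, so the block runs without Data-Juicer installed.

```python
import unittest


def apply_per_sample(fn, samples):
    """Apply a per-sample function to copies of the inputs."""
    return [fn(dict(s)) for s in samples]


def strip_text(sample):
    # Hypothetical per-sample transform standing in for an operator method.
    sample["text"] = sample["text"].strip()
    return sample


class TestWholeOutput(unittest.TestCase):

    def test_outputs_match_expected(self):
        # Each case: (label, input samples, expected output samples)
        cases = [
            ("padded", [{"text": "  hi  "}], [{"text": "hi"}]),
            ("already clean", [{"text": "ok"}], [{"text": "ok"}]),
            ("empty", [{"text": ""}], [{"text": ""}]),
        ]
        for name, samples, expected in cases:
            # subTest reports each failing case separately instead of
            # stopping at the first one.
            with self.subTest(case=name):
                self.assertEqual(apply_per_sample(strip_text, samples), expected)
```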

Usage Examples

Testing a Custom Mapper

import unittest

from datasets import Dataset

# Custom operator under test (the module path is specific to your project).
# Importing at module level makes it available to every test method.
from my_ops.lowercase_mapper import LowercaseMapper


class TestMyMapper(unittest.TestCase):

    def test_lowercase(self):
        ds = Dataset.from_dict({'text': ['Hello World', 'FOO BAR']})
        op = LowercaseMapper()

        # Apply the mapper to each sample
        result = ds.map(op.process_single)
        self.assertEqual(result[0]['text'], 'hello world')
        self.assertEqual(result[1]['text'], 'foo bar')

    def test_empty_input(self):
        # An empty string should pass through unchanged
        ds = Dataset.from_dict({'text': ['']})
        op = LowercaseMapper()
        result = ds.map(op.process_single)
        self.assertEqual(result[0]['text'], '')


if __name__ == '__main__':
    unittest.main()
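
The test above imports LowercaseMapper from my_ops.lowercase_mapper, a module this page does not show. A minimal stand-in that would satisfy both test methods might look like the sketch below. It is an assumption made for illustration: the text_key parameter is hypothetical, and a real operator would normally build on Data-Juicer's own mapper base class rather than a bare Python class.

```python
class LowercaseMapper:
    """Hypothetical stand-in for the custom mapper used in the test above."""

    def __init__(self, text_key="text"):
        # Name of the field to transform (assumed parameter for this sketch).
        self.text_key = text_key

    def process_single(self, sample):
        # Mappers return the (modified) sample rather than a keep/drop flag.
        sample[self.text_key] = sample[self.text_key].lower()
        return sample


# Quick check on plain dictionaries, without any dataset library.
op = LowercaseMapper()
lowered = [op.process_single({"text": t}) for t in ["Hello World", "FOO BAR", ""]]
```

Because process_single takes and returns a single sample dictionary, the same method can be passed to a dataset's map call, as in the test above, or exercised directly on a dict for a faster unit check.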

Run Tests

# Run all operator tests
pytest tests/ops/ -v

# Run specific test file
pytest tests/ops/filter/test_text_length_filter.py -v

Related Pages

Implements Principle
