Implementation:Datajuicer Data juicer DataJuicerTestCaseBase Pattern
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Testing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Pattern documentation for writing unit tests for Data-Juicer operators using the framework's test utilities.
Description
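Data-Juicer provides DataJuicerTestCaseBase in data_juicer/utils/unittest_utils.py as a base class for operator tests. A test typically builds a small in-memory dataset with Dataset.from_dict(), instantiates the operator with test parameters, applies it to the dataset, and asserts on the output. In the examples on this page, mappers are applied per sample with ds.map(op.process_single), while filters first compute statistics with ds.map(op.compute_stats_single) and then select samples with ds.filter(op.process_single). Tests are organized by operator type under tests/ops/mapper/, tests/ops/filter/, and so on.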
Usage
Create a test file in the appropriate tests/ops/ subdirectory. Extend DataJuicerTestCaseBase or use standard pytest/unittest patterns.
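The directory layout can be expressed as a small path helper. This is only a sketch: expected_test_path is a hypothetical helper, not part of Data-Juicer, and the test_<operator_name>.py naming is an assumption inferred from the path tests/ops/filter/test_text_length_filter.py used later on this page.

```python
from pathlib import PurePosixPath


def expected_test_path(op_type, op_name):
    """Return the assumed test file location for an operator.

    op_type is the tests/ops/ subdirectory (for example 'mapper' or 'filter');
    op_name is the operator's module name (for example 'text_length_filter').
    """
    return PurePosixPath('tests/ops') / op_type / f'test_{op_name}.py'


print(expected_test_path('filter', 'text_length_filter'))
```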
Code Reference
Source Location
- Repository: data-juicer
- File: tests/ops/ directory, data_juicer/utils/unittest_utils.py
- Lines: Various (test pattern)
Interface Specification
import unittest

from datasets import Dataset

from data_juicer.ops.filter.text_length_filter import TextLengthFilter
from data_juicer.utils.constant import Fields
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase


class TestMyFilter(DataJuicerTestCaseBase):

    def test_basic_filtering(self):
        # 1. Build synthetic dataset
        ds = Dataset.from_dict({
            'text': ['This is a long enough sample text for testing.',
                     'Short.'],
        })
        # 2. Instantiate operator
        op = TextLengthFilter(min_len=10, max_len=1000)
        # 3. Compute stats and apply filter. Filters keep their statistics
        #    in the Fields.stats column, so add an empty one if it is missing.
        if Fields.stats not in ds.features:
            ds = ds.add_column(name=Fields.stats, column=[{}] * ds.num_rows)
        ds = ds.map(op.compute_stats_single)
        ds = ds.filter(op.process_single)
        # 4. Assert results
        self.assertEqual(len(ds), 1)
        self.assertIn('long enough', ds[0]['text'])
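The two-phase flow in steps 3 and 4 (record a statistic, then keep or drop based on it) can be tried out without Data-Juicer installed. The sketch below is an illustration under stated assumptions: ToyLengthFilter is a hypothetical stand-in class, the 'stats' key is an invented name, and plain dictionaries take the place of a Dataset. It mirrors the shape of the pattern, not the library's actual implementation.

```python
import unittest


class ToyLengthFilter:
    """Hypothetical stand-in for a filter operator (not part of Data-Juicer)."""

    def __init__(self, min_len=10, max_len=1000):
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats_single(self, sample):
        # Phase 1: record the statistic on the sample; nothing is dropped yet.
        stats = dict(sample.get('stats', {}))
        stats['text_len'] = len(sample['text'])
        return {**sample, 'stats': stats}

    def process_single(self, sample):
        # Phase 2: decide keep/drop from the precomputed statistic.
        return self.min_len <= sample['stats']['text_len'] <= self.max_len


class TestToyLengthFilter(unittest.TestCase):

    def test_keeps_only_long_enough_text(self):
        samples = [
            {'text': 'This is a long enough sample text for testing.'},
            {'text': 'Short.'},
        ]
        op = ToyLengthFilter(min_len=10, max_len=1000)
        with_stats = [op.compute_stats_single(s) for s in samples]
        kept = [s for s in with_stats if op.process_single(s)]
        self.assertEqual(len(kept), 1)
        self.assertIn('long enough', kept[0]['text'])
```

Separating the statistic from the decision lets a test assert on the computed value and on the keep/drop outcome independently.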
Import
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| test dataset | Dataset | Yes | Synthetic dataset built with Dataset.from_dict() |
| operator | OP | Yes | Operator instance to test |
Outputs
| Name | Type | Description |
|---|---|---|
| test result | bool | Pass/fail based on assertions |
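The pass/fail outcome is determined by the test runner from the assertions; a test method does not return it directly. As a minimal sketch using only the standard library, the following loads a test case and reads the aggregate result. TestExample and run_case are hypothetical names used for illustration.

```python
import io
import unittest


class TestExample(unittest.TestCase):
    """Hypothetical placeholder for an operator test class."""

    def test_passes(self):
        self.assertEqual('Hello World'.lower(), 'hello world')


def run_case(case_cls):
    """Run every test method in case_cls and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case_cls)
    # Send runner output to a buffer so the caller decides what to print.
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


result = run_case(TestExample)
print(result.wasSuccessful(), result.testsRun, len(result.failures))
```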
Usage Examples
Testing a Custom Mapper
import unittest

from datasets import Dataset

# my_ops is a placeholder package for a user-defined operator.
# Importing at module level makes LowercaseMapper visible to every test.
from my_ops.lowercase_mapper import LowercaseMapper


class TestMyMapper(unittest.TestCase):

    def test_lowercase(self):
        ds = Dataset.from_dict({'text': ['Hello World', 'FOO BAR']})
        op = LowercaseMapper()
        # Apply mapper to each sample
        result = ds.map(op.process_single)
        self.assertEqual(result[0]['text'], 'hello world')
        self.assertEqual(result[1]['text'], 'foo bar')

    def test_empty_input(self):
        ds = Dataset.from_dict({'text': ['']})
        op = LowercaseMapper()
        result = ds.map(op.process_single)
        self.assertEqual(result[0]['text'], '')


if __name__ == '__main__':
    unittest.main()
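When a mapper has to be checked against several inputs, unittest's subTest reports each failing case separately instead of stopping at the first one. The sketch below is an illustration only: ToyLowercaseMapper is a hypothetical stand-in for the LowercaseMapper above and works on plain dictionaries, so it runs without Data-Juicer or the datasets package.

```python
import unittest


class ToyLowercaseMapper:
    """Hypothetical stand-in mapper (not part of Data-Juicer)."""

    def process_single(self, sample):
        # Return a new sample rather than mutating the input in place.
        return {**sample, 'text': sample['text'].lower()}


class TestToyLowercaseMapper(unittest.TestCase):
    # (input text, expected output) pairs, including the empty-string edge case.
    CASES = [
        ('Hello World', 'hello world'),
        ('FOO BAR', 'foo bar'),
        ('', ''),
    ]

    def test_lowercase_cases(self):
        op = ToyLowercaseMapper()
        for source, expected in self.CASES:
            with self.subTest(source=source):
                result = op.process_single({'text': source})
                self.assertEqual(result['text'], expected)
```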
Run Tests
# Run all operator tests
pytest tests/ops/ -v
# Run specific test file
pytest tests/ops/filter/test_text_length_filter.py -v