Implementation: Allenai Open Instruct SFTDatasetProcessor
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete processor provided by the Open Instruct library for tokenizing and filtering SFT (supervised fine-tuning) datasets.
Description
The SFTDatasetProcessor class extends the base DatasetProcessor and provides two key methods: tokenize() and filter(). tokenize() applies the tokenizer's chat template to convert each conversation into token IDs, constructs the attention mask, and creates labels with all prompt tokens masked to -100. filter() removes examples that exceed the configured length limits or that contain no valid training labels (i.e., every label is -100). Both methods run through parallelized dataset.map() and dataset.filter() calls, with the worker count scaled automatically to the available CPU count.
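To make the prompt-masking concrete, below is a minimal sketch of the per-example logic behind tokenize(), assuming the standard HuggingFace chat-template API; the function name tokenize_one and the exact template calls are illustrative, not the library's verbatim internals.

```python
# Illustrative sketch of tokenize()'s per-example logic (names are
# assumptions, not the library's exact internals).
def tokenize_one(example, tokenizer, messages_key="messages"):
    messages = example[messages_key]
    # Token IDs for the full conversation, rendered via the chat template.
    input_ids = tokenizer.apply_chat_template(messages)
    # Token IDs for the prompt alone: all turns except the final assistant
    # reply, with the generation prompt appended.
    input_ids_prompt = tokenizer.apply_chat_template(
        messages[:-1], add_generation_prompt=True
    )
    # Copy the full sequence, then mask every prompt position to -100 so
    # the loss ignores it.
    labels = list(input_ids)
    labels[: len(input_ids_prompt)] = [-100] * len(input_ids_prompt)
    return {
        "input_ids": input_ids,
        "input_ids_prompt": input_ids_prompt,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }
```

Because PyTorch's cross-entropy loss ignores the index -100 by default, this masking restricts the training signal to the assistant's response tokens.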
Usage
Use SFTDatasetProcessor when you need fine-grained control over the tokenization and filtering of SFT datasets. It is used internally by the dataset transformation pipeline but can also be called directly for custom processing workflows.
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/dataset_processor.py
- Lines: L278-322
Signature
```python
class SFTDatasetProcessor(DatasetProcessor):
    def tokenize(self, dataset: Dataset) -> Dataset:
        """Apply chat template, create input_ids, attention_mask, and labels with prompt masking."""
        ...

    def filter(self, dataset: Dataset, need_contain_labels: bool = True) -> Dataset:
        """Filter out examples exceeding length limits or lacking valid labels."""
        ...
```
The parent class DatasetProcessor is initialized with:
```python
class DatasetProcessor:
    def __init__(self, tokenizer: PreTrainedTokenizer, config: DatasetConfig) -> None:
        self.tokenizer = tokenizer
        self.config = config
```
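The real DatasetConfig dataclass carries many more options; as a hypothetical reduction, the fields this processor depends on look roughly like the sketch below (the two length-limit names appear in Basic Usage further down, while sft_messages_key is an assumed name for the conversation-column setting).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reduction of DatasetConfig to the fields SFTDatasetProcessor
# relies on; the real dataclass in open_instruct defines more options.
@dataclass
class DatasetConfig:
    max_token_length: Optional[int] = None         # cap on the full tokenized sequence
    max_prompt_token_length: Optional[int] = None  # cap on the prompt portion alone
    sft_messages_key: str = "messages"             # dataset column holding the conversation
```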
Import
```python
from open_instruct.dataset_processor import SFTDatasetProcessor
```
I/O Contract
Inputs
Constructor (inherited from DatasetProcessor):
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | PreTrainedTokenizer | Yes | A HuggingFace tokenizer with a configured chat template. |
| config | DatasetConfig | Yes | Configuration dataclass controlling filtering thresholds, parallelism, and column names. |
tokenize() method:
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes | A HuggingFace Dataset containing a messages column (default key: "messages"). |
filter() method:
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes | A tokenized HuggingFace Dataset with input_ids, input_ids_prompt, and labels columns. |
| need_contain_labels | bool | No | If True (default), filters out examples where all labels are -100. |
Outputs
tokenize():
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | The input dataset with added columns: input_ids (list[int]), input_ids_prompt (list[int]), attention_mask (list[int]), and labels (list[int] with prompt tokens set to -100). |
filter():
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A filtered dataset with examples exceeding length limits or lacking valid labels removed. |
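Taken together, the tables above amount to a single keep/drop rule per example. The predicate below is a sketch of that documented behavior, not the library's actual code:

```python
# Illustrative predicate mirroring filter()'s documented behavior
# (names and structure are assumptions).
def keep(example, config, need_contain_labels=True):
    # Drop examples whose full sequence exceeds the configured cap.
    if config.max_token_length is not None and len(example["input_ids"]) > config.max_token_length:
        return False
    # Drop examples whose prompt alone exceeds its cap.
    if config.max_prompt_token_length is not None and len(example["input_ids_prompt"]) > config.max_prompt_token_length:
        return False
    # Drop examples with nothing to supervise (every label masked).
    if need_contain_labels and all(t == -100 for t in example["labels"]):
        return False
    return True
```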
Usage Examples
Basic Usage
```python
from datasets import Dataset
from open_instruct.dataset_processor import SFTDatasetProcessor, DatasetConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/Llama-3.1-Tulu-3-8B")
config = DatasetConfig(
    max_token_length=2048,
    max_prompt_token_length=1024,
)
processor = SFTDatasetProcessor(tokenizer=tokenizer, config=config)

# A toy dataset with the expected "messages" column; substitute your own.
raw_dataset = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "What is SFT?"},
        {"role": "assistant", "content": "Supervised fine-tuning on prompt-response pairs."},
    ]}
])

# Tokenize: adds input_ids, input_ids_prompt, attention_mask, and labels columns
tokenized_dataset = processor.tokenize(raw_dataset)

# Filter: remove examples exceeding length limits or lacking valid labels
filtered_dataset = processor.filter(tokenized_dataset)
```
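As a quick sanity check on the result (relying only on the column semantics documented above), you can confirm that prompt positions are masked and at least one response token remains supervised:

```python
first = filtered_dataset[0]
n_prompt = len(first["input_ids_prompt"])
# Every prompt position should be excluded from the loss...
assert all(t == -100 for t in first["labels"][:n_prompt])
# ...and at least one response token should remain supervised.
assert any(t != -100 for t in first["labels"][n_prompt:])
```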