Implementation: Allenai Open Instruct SFTDatasetProcessor
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete processor provided by the Open Instruct library for tokenizing and filtering SFT (supervised fine-tuning) datasets.
Description
The SFTDatasetProcessor class extends the base DatasetProcessor and provides two key methods: tokenize() and filter(). tokenize() applies the tokenizer's chat template to convert each conversation into token IDs, constructs the attention mask, and creates labels with all prompt tokens masked to -100. filter() removes examples that exceed the configured length limits or that contain no valid training labels (i.e., every label is -100). Both methods run through parallelized dataset.map() and dataset.filter() calls, with the worker count scaled automatically to the available CPU count.
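To make the prompt-masking concrete, below is a minimal sketch of the per-example logic behind tokenize(), assuming the standard HuggingFace chat-template API; the function name tokenize_one and the exact template calls are illustrative, not the library's verbatim internals.

```python
# Illustrative sketch of tokenize()'s per-example logic (names are
# assumptions, not the library's exact internals).
def tokenize_one(example, tokenizer, messages_key="messages"):
    messages = example[messages_key]
    # Token IDs for the full conversation, rendered via the chat template.
    input_ids = tokenizer.apply_chat_template(messages)
    # Token IDs for the prompt alone: all turns except the final assistant
    # reply, with the generation prompt appended.
    input_ids_prompt = tokenizer.apply_chat_template(
        messages[:-1], add_generation_prompt=True
    )
    # Copy the full sequence, then mask every prompt position to -100 so
    # the loss ignores it.
    labels = list(input_ids)
    labels[: len(input_ids_prompt)] = [-100] * len(input_ids_prompt)
    return {
        "input_ids": input_ids,
        "input_ids_prompt": input_ids_prompt,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }
```

Because PyTorch's cross-entropy loss ignores the index -100 by default, this masking restricts the training signal to the assistant's response tokens.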
Usage
Use SFTDatasetProcessor when you need fine-grained control over the tokenization and filtering of SFT datasets. It is used internally by the dataset transformation pipeline but can also be called directly for custom processing workflows.
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/dataset_processor.py
- Lines: L278-322
Signature
```python
class SFTDatasetProcessor(DatasetProcessor):
    def tokenize(self, dataset: Dataset) -> Dataset:
        """Apply chat template, create input_ids, attention_mask, and labels with prompt masking."""
        ...

    def filter(self, dataset: Dataset, need_contain_labels: bool = True) -> Dataset:
        """Filter out examples exceeding length limits or lacking valid labels."""
        ...
```
The parent class DatasetProcessor is initialized with:
```python
class DatasetProcessor:
    def __init__(self, tokenizer: PreTrainedTokenizer, config: DatasetConfig) -> None:
        self.tokenizer = tokenizer
        self.config = config
```
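The real DatasetConfig dataclass carries many more options; as a hypothetical reduction, the fields this processor depends on look roughly like the sketch below (the two length-limit names appear in Basic Usage further down, while sft_messages_key is an assumed name for the conversation-column setting).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reduction of DatasetConfig to the fields SFTDatasetProcessor
# relies on; the real dataclass in open_instruct defines more options.
@dataclass
class DatasetConfig:
    max_token_length: Optional[int] = None         # cap on the full tokenized sequence
    max_prompt_token_length: Optional[int] = None  # cap on the prompt portion alone
    sft_messages_key: str = "messages"             # dataset column holding the conversation
```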
Import
```python
from open_instruct.dataset_processor import SFTDatasetProcessor
```
I/O Contract
Inputs
Constructor (inherited from DatasetProcessor):
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | PreTrainedTokenizer | Yes | A HuggingFace tokenizer with a configured chat template. |
| config | DatasetConfig | Yes | Configuration dataclass controlling filtering thresholds, parallelism, and column names. |
tokenize() method:
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes | A HuggingFace Dataset containing a messages column (default key: "messages"). |
filter() method:
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes | A tokenized HuggingFace Dataset with input_ids, input_ids_prompt, and labels columns. |
| need_contain_labels | bool | No | If True (default), filters out examples where all labels are -100. |
Outputs
tokenize():
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | The input dataset with added columns: input_ids (list[int]), input_ids_prompt (list[int]), attention_mask (list[int]), and labels (list[int] with prompt tokens set to -100). |
filter():
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A filtered dataset with examples exceeding length limits or lacking valid labels removed. |
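Taken together, the tables above amount to a single keep/drop rule per example. The predicate below is a sketch of that documented behavior, not the library's actual code:

```python
# Illustrative predicate mirroring filter()'s documented behavior
# (names and structure are assumptions).
def keep(example, config, need_contain_labels=True):
    # Drop examples whose full sequence exceeds the configured cap.
    if config.max_token_length is not None and len(example["input_ids"]) > config.max_token_length:
        return False
    # Drop examples whose prompt alone exceeds its cap.
    if config.max_prompt_token_length is not None and len(example["input_ids_prompt"]) > config.max_prompt_token_length:
        return False
    # Drop examples with nothing to supervise (every label masked).
    if need_contain_labels and all(t == -100 for t in example["labels"]):
        return False
    return True
```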
Usage Examples
Basic Usage
```python
from datasets import Dataset
from open_instruct.dataset_processor import SFTDatasetProcessor, DatasetConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/Llama-3.1-Tulu-3-8B")
config = DatasetConfig(
    max_token_length=2048,
    max_prompt_token_length=1024,
)
processor = SFTDatasetProcessor(tokenizer=tokenizer, config=config)

# A toy dataset with the expected "messages" column; substitute your own.
raw_dataset = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "What is SFT?"},
        {"role": "assistant", "content": "Supervised fine-tuning on prompt-response pairs."},
    ]}
])

# Tokenize: adds input_ids, input_ids_prompt, attention_mask, and labels columns
tokenized_dataset = processor.tokenize(raw_dataset)

# Filter: remove examples exceeding length limits or lacking valid labels
filtered_dataset = processor.filter(tokenized_dataset)
```
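As a quick sanity check on the result (relying only on the column semantics documented above), you can confirm that prompt positions are masked and at least one response token remains supervised:

```python
first = filtered_dataset[0]
n_prompt = len(first["input_ids_prompt"])
# Every prompt position should be excluded from the loss...
assert all(t == -100 for t in first["labels"][:n_prompt])
# ...and at least one response token should remain supervised.
assert any(t != -100 for t in first["labels"][n_prompt:])
```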