Implementation:Allenai Open instruct SFTDatasetProcessor

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Natural Language Processing, Data Engineering
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool, provided by the Open Instruct library, for tokenizing and filtering SFT (Supervised Fine-Tuning) datasets.

Description

The SFTDatasetProcessor class extends the base DatasetProcessor and provides two key methods: tokenize() and filter(). The tokenize() method applies the tokenizer's chat template to convert conversations into token IDs, constructs attention masks, and creates labels with the prompt tokens masked to -100. The filter() method removes examples that exceed configurable length limits or that contain no valid training labels (i.e., all labels are -100). Both methods run as parallelized dataset.map() and dataset.filter() calls that scale with the available CPU count.
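The prompt-masking step can be sketched in plain Python. This is a minimal illustration of the idea, not the library's actual implementation; the helper name and its inputs are hypothetical:

```python
IGNORE_INDEX = -100  # positions with this label are ignored by PyTorch's CrossEntropyLoss


def build_labels(input_ids, input_ids_prompt):
    """Copy the full token sequence, then mask the prompt prefix with -100
    so the training loss is computed only on the response tokens."""
    labels = list(input_ids)
    labels[: len(input_ids_prompt)] = [IGNORE_INDEX] * len(input_ids_prompt)
    return labels


# Toy example: 3 prompt tokens followed by 2 response tokens
print(build_labels([11, 22, 33, 44, 55], [11, 22, 33]))
# → [-100, -100, -100, 44, 55]
```

Note that labels keeps the same length as input_ids, which is what lets the loss be computed position-by-position against the shifted targets.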

Usage

Use SFTDatasetProcessor when you need fine-grained control over the tokenization and filtering of SFT datasets. It is used internally by the dataset transformation pipeline but can also be called directly for custom processing workflows.

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/dataset_processor.py
  • Lines: L278-322

Signature

class SFTDatasetProcessor(DatasetProcessor):
    def tokenize(self, dataset: Dataset) -> Dataset:
        """Apply chat template, create input_ids, attention_mask, and labels with prompt masking."""
        ...

    def filter(self, dataset: Dataset, need_contain_labels: bool = True) -> Dataset:
        """Filter out examples exceeding length limits or lacking valid labels."""
        ...

The parent class DatasetProcessor is initialized with:

class DatasetProcessor:
    def __init__(self, tokenizer: PreTrainedTokenizer, config: DatasetConfig) -> None:
        self.tokenizer = tokenizer
        self.config = config

Import

from open_instruct.dataset_processor import SFTDatasetProcessor

I/O Contract

Inputs

Constructor (inherited from DatasetProcessor):

  • tokenizer (PreTrainedTokenizer, required): a HuggingFace tokenizer with a configured chat template.
  • config (DatasetConfig, required): a configuration dataclass controlling filtering thresholds, parallelism, and column names.

tokenize() method:

  • dataset (Dataset, required): a HuggingFace Dataset containing a messages column (default key: "messages").

filter() method:

  • dataset (Dataset, required): a tokenized HuggingFace Dataset with input_ids, input_ids_prompt, and labels columns.
  • need_contain_labels (bool, optional): if True (default), filters out examples whose labels are all -100.

Outputs

tokenize():

  • dataset (Dataset): the input dataset with added columns: input_ids (list[int]), input_ids_prompt (list[int]), attention_mask (list[int]), and labels (list[int], with prompt tokens set to -100).

filter():

  • dataset (Dataset): a filtered dataset with examples that exceed length limits or lack valid labels removed.
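The filtering criteria above can be sketched as a standalone predicate. This is a hedged sketch of the logic described in the I/O contract, not the library's code; the function name is hypothetical and the threshold arguments mirror the DatasetConfig fields:

```python
IGNORE_INDEX = -100


def keep_example(example, max_token_length, max_prompt_token_length,
                 need_contain_labels=True):
    """Return True if an example survives filtering:
    - the full sequence fits within max_token_length,
    - the prompt fits within max_prompt_token_length,
    - at least one label is unmasked (unless need_contain_labels is False)."""
    if len(example["input_ids"]) > max_token_length:
        return False
    if len(example["input_ids_prompt"]) > max_prompt_token_length:
        return False
    if need_contain_labels and all(t == IGNORE_INDEX for t in example["labels"]):
        return False
    return True


ex = {"input_ids": [1, 2, 3, 4], "input_ids_prompt": [1, 2],
      "labels": [-100, -100, 3, 4]}
print(keep_example(ex, max_token_length=8, max_prompt_token_length=4))  # → True
```

A predicate of this shape is what dataset.filter() would apply per example; returning False drops the row.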

Usage Examples

Basic Usage

from datasets import Dataset
from open_instruct.dataset_processor import SFTDatasetProcessor, DatasetConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/Llama-3.1-Tulu-3-8B")
config = DatasetConfig(
    max_token_length=2048,
    max_prompt_token_length=1024,
)

processor = SFTDatasetProcessor(tokenizer=tokenizer, config=config)

# A minimal illustrative dataset with the expected "messages" column
raw_dataset = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4."},
    ]},
])

# Tokenize: adds input_ids, input_ids_prompt, attention_mask, and labels columns
tokenized_dataset = processor.tokenize(raw_dataset)

# Filter: remove examples exceeding length limits or lacking valid labels
filtered_dataset = processor.filter(tokenized_dataset)

Related Pages

Implements Principle
