
Implementation:Alibaba ROLL SFT Get Encode Function

From Leeroopedia


Knowledge Sources
Domains Data_Processing, Supervised_Learning
Last Updated 2026-02-07 20:00 GMT

Overview

Concrete SFT data encoding and collation functions provided by the Alibaba ROLL library.

Description

get_encode_function builds a callable that encodes instruction-response pairs and masks prompt tokens in the labels, so loss is computed only on the response. DataCollatorForSFT pads each batch and shifts labels for causal language-model training.
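
The masking idea can be sketched as follows. This is a minimal illustration of the general technique, not ROLL's exact implementation; the function name encode_pair is invented here:

```python
def encode_pair(prompt_ids, response_ids, label_pad_token_id=-100):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Cross-entropy ignores positions labeled -100, so only the
    # response span contributes to the training loss.
    labels = [label_pad_token_id] * len(prompt_ids) + list(response_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }
```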

Usage

Called during SFT pipeline initialization.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/sft/sft_pipeline.py
  • Lines: L26-81

Signature

def get_encode_function(
    template_name: str,
    tokenizer,
    prompt_key: str,
    query_key: Optional[str],
    response_key: str,
    system_key: Optional[str] = None
) -> Callable:
    """
    Create encoding function for SFT data with label masking.

    Args:
        template_name: Chat template name
        tokenizer: Tokenizer instance
        prompt_key: Dataset key for prompts/instructions
        query_key: Optional query key
        response_key: Dataset key for responses
        system_key: Optional system prompt key

    Returns:
        Callable encoding function
    """

@dataclass
class DataCollatorForSFT:
    label_pad_token_id: int = -100
    shift_feature: bool = True

    def __call__(self, features: List[Dict]) -> Dict[str, Any]:
        """Pad and shift labels for causal LM training."""

Import

from roll.pipeline.sft.sft_pipeline import get_encode_function
from roll.datasets.collator import DataCollatorForSFT

I/O Contract

Inputs

| Name      | Type                | Required | Description                  |
|-----------|---------------------|----------|------------------------------|
| dataset   | datasets.Dataset    | Yes      | Instruction-response dataset |
| tokenizer | PreTrainedTokenizer | Yes      | Model tokenizer              |

Outputs

| Name              | Type             | Description                                                               |
|-------------------|------------------|---------------------------------------------------------------------------|
| Processed dataset | datasets.Dataset | Dataset with input_ids, attention_mask, and labels (prompt masked with -100) |
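
As an illustration of this contract, the record below uses invented token IDs; the point is that every prompt position in labels is -100, so supervision covers only the response span:

```python
# Illustrative output record (token IDs are made up, not from a real tokenizer).
record = {
    "input_ids":      [5, 17, 9, 42, 31, 2],          # prompt tokens + response tokens
    "attention_mask": [1, 1, 1, 1, 1, 1],
    "labels":         [-100, -100, -100, 42, 31, 2],  # prompt positions masked
}

# Positions labeled -100 are ignored by the cross-entropy loss.
supervised = [tok for tok in record["labels"] if tok != -100]
```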

Usage Examples

from roll.pipeline.sft.sft_pipeline import get_encode_function, preprocess_dataset

encode_fn = get_encode_function("qwen2_5", tokenizer, "instruction", None, "output")
processed = preprocess_dataset(dataset, prompt_len=2048, encode_func=encode_fn, num_proc=8)

Related Pages

Implements Principle

Requires Environment

Environment Dependencies

This implementation requires the following environment constraints:

Heuristics Applied

No specific heuristics apply to this implementation.
