Implementation:Alibaba ROLL SFT Get Encode Function
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Supervised_Learning |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete SFT data encoding and collation functions provided by the Alibaba ROLL library.
Description
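get_encode_function returns a callable that encodes instruction-response pairs and masks the prompt tokens in the labels with -100, so only response tokens contribute to the loss. DataCollatorForSFT pads each batch (labels are padded with label_pad_token_id) and shifts the labels for causal LM training.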
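To make the label-masking idea concrete, here is a minimal sketch in plain Python. It is not ROLL's implementation: the helper name `build_masked_example`, the constant `IGNORE_INDEX`, and the token ids are hypothetical, and the chat template and tokenizer steps are skipped entirely. It only shows how prompt positions are replaced with -100 in the labels while response positions keep their token ids.

```python
from typing import Dict, List

# Hypothetical constant: the ignore value used for masked label positions (-100 on this page).
IGNORE_INDEX = -100


def build_masked_example(prompt_ids: List[int], response_ids: List[int]) -> Dict[str, List[int]]:
    """Concatenate prompt and response ids and mask the prompt part of the labels.

    Illustrative helper only; a real encoder would first apply the chat template
    and tokenizer to produce prompt_ids and response_ids.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt positions get IGNORE_INDEX; response positions keep their token ids.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


# Made-up token ids: three prompt tokens followed by two response tokens.
example = build_masked_example([11, 12, 13], [21, 22])
```

With this layout, a loss that skips -100 positions is computed on the two response tokens only.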
Usage
Called during SFT pipeline initialization.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/sft/sft_pipeline.py
- Lines: L26-81
Signature
def get_encode_function(
    template_name: str,
    tokenizer,
    prompt_key: str,
    query_key: Optional[str],
    response_key: str,
    system_key: str = None
) -> Callable:
    """
    Create encoding function for SFT data with label masking.

    Args:
        template_name: Chat template name
        tokenizer: Tokenizer instance
        prompt_key: Dataset key for prompts/instructions
        query_key: Optional query key
        response_key: Dataset key for responses
        system_key: Optional system prompt key

    Returns:
        Callable encoding function
    """

@dataclass
class DataCollatorForSFT:
    label_pad_token_id: int = -100
    shift_feature: bool = True

    def __call__(self, features: List[Dict]) -> Dict[str, Any]:
        """Pad and shift labels for causal LM training."""
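The padding and shifting step can be sketched as follows. This is an illustration under stated assumptions, not the body of DataCollatorForSFT: the function name `pad_and_shift`, the `pad_token_id` default of 0, right-side padding, and the choice to shift labels left by one position are all assumptions made for the example. Only the -100 label padding value comes from the signature above.

```python
from typing import Dict, List


def pad_and_shift(
    features: List[Dict[str, List[int]]],
    pad_token_id: int = 0,            # assumed pad id, for illustration only
    label_pad_token_id: int = -100,   # same default as in the signature above
    shift: bool = True,
) -> Dict[str, List[List[int]]]:
    """Right-pad a batch to its longest sequence and optionally shift labels left by one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        labels = list(f["labels"])
        if shift:
            # After shifting, position t holds the label for token t + 1;
            # the last position has no next token, so it is ignored.
            labels = labels[1:] + [label_pad_token_id]
        batch["input_ids"].append(list(f["input_ids"]) + [pad_token_id] * pad)
        batch["attention_mask"].append(list(f["attention_mask"]) + [0] * pad)
        batch["labels"].append(labels + [label_pad_token_id] * pad)
    return batch


# Two made-up examples of different lengths, with prompt labels already masked.
batch = pad_and_shift([
    {"input_ids": [1, 2, 3, 4], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 3, 4]},
    {"input_ids": [5, 6], "attention_mask": [1, 1], "labels": [-100, 6]},
])
```

Padding labels with -100 rather than the pad token id keeps padded positions out of the loss, the same way masked prompt positions are excluded.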
Import
from roll.pipeline.sft.sft_pipeline import get_encode_function
from roll.datasets.collator import DataCollatorForSFT
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | datasets.Dataset | Yes | Instruction-response dataset |
| tokenizer | PreTrainedTokenizer | Yes | Model tokenizer |
Outputs
| Name | Type | Description |
|---|---|---|
| Processed dataset | datasets.Dataset | Dataset with input_ids, attention_mask, labels (prompt masked with -100) |
Usage Examples
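from roll.pipeline.sft.sft_pipeline import get_encode_function, preprocess_dataset
# `tokenizer` and `dataset` are assumed to be loaded already.
# Positional arguments: template_name, tokenizer, prompt_key, query_key, response_key.
encode_fn = get_encode_function("qwen2_5", tokenizer, "instruction", None, "output")
processed = preprocess_dataset(dataset, prompt_len=2048, encode_func=encode_fn, num_proc=8)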
Heuristics Applied
No specific heuristics apply to this implementation.