Implementation:Alibaba ROLL Distill Preprocess Dataset
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Knowledge_Distillation |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Concrete distillation data preprocessing functions provided by the Alibaba ROLL library.
Description
The preprocess_dataset and get_encode_function functions in the distillation pipeline tokenize instruction-response data. The distill_on_prompt parameter controls whether prompt tokens are included in the distillation targets or masked out.
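The prompt-masking idea can be illustrated with a minimal sketch. It is not the ROLL implementation: the helper name `build_labels` is hypothetical, and the use of -100 as the masked label value is an assumption based on the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Dict, List

# Assumption: masked positions use -100, the default ignore_index of
# torch.nn.CrossEntropyLoss. The value ROLL actually uses is not stated here.
IGNORE_INDEX = -100


def build_labels(
    prompt_ids: List[int],
    response_ids: List[int],
    distill_on_prompt: bool,
) -> Dict[str, List[int]]:
    """Hypothetical helper: join prompt and response ids and build labels."""
    input_ids = prompt_ids + response_ids
    if distill_on_prompt:
        # Prompt tokens take part in the distillation targets.
        labels = list(input_ids)
    else:
        # Prompt tokens are masked; only response tokens are targets.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }


masked = build_labels([1, 2, 3], [4, 5], distill_on_prompt=False)
unmasked = build_labels([1, 2, 3], [4, 5], distill_on_prompt=True)
```

With `distill_on_prompt=False`, the three prompt positions carry the ignore value and only the two response tokens remain as targets. With `distill_on_prompt=True`, the labels equal the input ids.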
Usage
Called during DistillPipeline initialization.
Code Reference
Source Location
- Repository: Alibaba ROLL
- File: roll/pipeline/distill/distill_pipeline.py
- Lines: L42-153
Signature
def preprocess_dataset(dataset, tokenizer, pipeline_config: DistillConfig) -> datasets.Dataset:
    """Preprocess dataset for distillation."""

def get_encode_function(
    template_name, tokenizer, prompt_key, question_key, answer_key,
    system_key, distill_on_prompt, sequence_length
) -> Callable:
    """Create encoding function with optional prompt inclusion."""
Import
from roll.pipeline.distill.distill_pipeline import preprocess_dataset, get_encode_function
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | datasets.Dataset | Yes | Instruction-response dataset |
| tokenizer | Not annotated in the signature | Yes | Tokenizer used to encode the text |
| pipeline_config | DistillConfig | Yes | Config with distill_on_prompt setting |
Outputs
| Name | Type | Description |
|---|---|---|
| Processed dataset | datasets.Dataset | Dataset with input_ids, attention_mask, labels |
Usage Examples
processed = preprocess_dataset(dataset, tokenizer, distill_config)
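Because get_encode_function returns a Callable, the pattern is presumably a factory whose result is applied to each record. The sketch below illustrates that pattern in plain Python, with no ROLL or `datasets` dependency. Everything in it is hypothetical: `make_encode_function`, the toy whitespace tokenizer, and the `"question"`/`"answer"` record keys are stand-ins, and label construction is left out for brevity.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Encoded = Dict[str, List[int]]


def make_encode_function(
    question_key: str,
    answer_key: str,
    sequence_length: int,
) -> Callable[[Record], Encoded]:
    """Hypothetical factory: returns a closure that encodes one record."""

    def toy_tokenize(text: str) -> List[int]:
        # Toy stand-in for a real tokenizer: one id per word (its length).
        return [len(word) for word in text.split()]

    def encode(record: Record) -> Encoded:
        ids = toy_tokenize(record[question_key]) + toy_tokenize(record[answer_key])
        # Truncate to the configured maximum sequence length.
        ids = ids[:sequence_length]
        return {"input_ids": ids, "attention_mask": [1] * len(ids)}

    return encode


records = [{"question": "What is two plus two", "answer": "It is four"}]
encode = make_encode_function("question", "answer", sequence_length=6)
processed = [encode(record) for record in records]
```

The closure captures the key names and the length limit once, so the same `encode` callable can be reused across every record, which is the kind of function a dataset-wide map step expects.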
Related Pages
Implements Principle
Requires Environment
Environment Dependencies
This implementation requires the following environment constraints:
Heuristics Applied
No specific heuristics apply to this implementation.