
Implementation:Alibaba ROLL Distill Preprocess Dataset

From Leeroopedia


Knowledge Sources
Domains Data_Processing, Knowledge_Distillation
Last Updated 2026-02-07 20:00 GMT

Overview

Concrete distillation data preprocessing functions provided by the Alibaba ROLL library.

Description

The preprocess_dataset and get_encode_function functions in the distillation pipeline tokenize instruction-response data; the distill_on_prompt parameter controls whether prompt tokens contribute to the training labels or are masked out.
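The masking behavior can be sketched as follows. This is a simplified illustration under stated assumptions, not ROLL's actual code: the function name encode_example and the -100 ignore index (the value PyTorch's cross-entropy loss skips by default) are assumptions.

```python
# Hypothetical sketch of the prompt-masking logic controlled by
# distill_on_prompt; the real implementation lives in
# roll/pipeline/distill/distill_pipeline.py and may differ in detail.
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def encode_example(prompt_ids, answer_ids, distill_on_prompt, sequence_length):
    """Concatenate prompt and answer token ids; optionally mask the prompt."""
    input_ids = (prompt_ids + answer_ids)[:sequence_length]
    attention_mask = [1] * len(input_ids)
    if distill_on_prompt:
        # Distill on every token, prompt included.
        labels = list(input_ids)
    else:
        # Mask prompt positions so loss is computed on the answer only.
        n_prompt = min(len(prompt_ids), len(input_ids))
        labels = [IGNORE_INDEX] * n_prompt + input_ids[n_prompt:]
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels}
```

With distill_on_prompt disabled, a 3-token prompt followed by a 2-token answer yields labels of the form [-100, -100, -100, a1, a2], so only answer positions incur loss.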

Usage

Called during DistillPipeline initialization to tokenize the training dataset before distillation begins.

Code Reference

Source Location

  • Repository: Alibaba ROLL
  • File: roll/pipeline/distill/distill_pipeline.py
  • Lines: L42-153

Signature

def preprocess_dataset(dataset, tokenizer, pipeline_config: DistillConfig) -> datasets.Dataset:
    """Preprocess dataset for distillation."""

def get_encode_function(
    template_name, tokenizer, prompt_key, question_key, answer_key,
    system_key, distill_on_prompt, sequence_length
) -> Callable:
    """Create encoding function with optional prompt inclusion."""

Import

from roll.pipeline.distill.distill_pipeline import preprocess_dataset, get_encode_function

I/O Contract

Inputs

Name Type Required Description
dataset datasets.Dataset Yes Instruction-response dataset
tokenizer PreTrainedTokenizer Yes Tokenizer used to encode prompts and responses
pipeline_config DistillConfig Yes Config with distill_on_prompt setting

Outputs

Name Type Description
Processed dataset datasets.Dataset Dataset with input_ids, attention_mask, labels

Usage Examples

processed = preprocess_dataset(dataset, tokenizer, distill_config)
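Downstream of preprocessing, the variable-length examples are typically padded into rectangular batches before training. A minimal collation sketch follows; the pad id of 0, right padding, and the -100 ignore index are assumptions for illustration, not necessarily ROLL's actual collator.

```python
# Hypothetical batch collation for examples produced by the
# preprocessing step; pad id and padding side are assumptions.
def collate(batch, pad_id=0, ignore_index=-100):
    """Right-pad variable-length examples into rectangular lists."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        pad = max_len - len(ex["input_ids"])
        padded["input_ids"].append(ex["input_ids"] + [pad_id] * pad)
        # Zero attention on pad positions; ignore them in the loss too.
        padded["attention_mask"].append(ex["attention_mask"] + [0] * pad)
        padded["labels"].append(ex["labels"] + [ignore_index] * pad)
    return padded
```

Each field comes out as a list of equal-length rows, matching the input_ids, attention_mask, and labels contract described above.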

Related Pages

Implements Principle

Requires Environment

Environment Dependencies

This implementation requires the following environment constraints:

Heuristics Applied

No specific heuristics apply to this implementation.
