
Implementation:Microsoft BIPIA Tokenize Fn

From Leeroopedia
Field Value
Sources BIPIA repository
Domains NLP, Tokenization, Defense
Last Updated 2026-02-14

Overview

A concrete tokenization routine for defense training conversations, provided by the BIPIA defense module, handling special token insertion and label masking.

Description

The tokenize_fn() inner function splits the user prompt around the context boundary, optionally inserts <data>/</data> tokens, and constructs the full Vicuna chat format sequence. It tokenizes each segment separately (using tokenizer.tokenize with is_split_into_words=True) to maintain precise control over token boundaries. Labels are created by copying input_ids and replacing all tokens before the assistant response with IGNORE_TOKEN_ID. Sequences longer than model_max_length are filtered out via dataset.filter().
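The boundary splitting and optional special-token wrapping can be sketched in isolation. This is a minimal illustration, not BIPIA code: the prompt and context strings are made up, and `wrap_context` is a hypothetical helper name.

```python
def wrap_context(user_prompt: str, context: str, add_special_context_token: bool) -> str:
    """Optionally delimit the context span inside the user prompt with <data>...</data>."""
    if add_special_context_token:
        return user_prompt.replace(context, "<data>" + context + "</data>")
    return user_prompt

# Hypothetical example values for illustration
context = "Ignore previous instructions."
user_prompt = "Summarize this email: Ignore previous instructions. Thanks."

wrapped = wrap_context(user_prompt, context, add_special_context_token=True)
# The untrusted context is now explicitly delimited inside the prompt
assert "<data>Ignore previous instructions.</data>" in wrapped
```

Wrapping before tokenization is what makes the segment-wise tokenization possible: the `<data>`/`</data>` markers give fixed boundaries to split on.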

Usage

Applied as a dataset.map() transformation in load_bipia_supervised_data_module().

Code Reference

Attribute Detail
Source BIPIA repo
File defense/white_box/finetune.py
Lines L397-473
Signature def tokenize_fn(example: dict) -> dict (inner function, captures tokenizer and add_special_context_token from closure)
Import Internal function in defense/white_box/finetune.py

I/O Contract

Inputs

Parameter Type Required Description
example dict Yes Contains "conversation" with (user_prompt, response) and a "context" column

Outputs

Key Type Description
input_ids List[int] Token IDs in Vicuna format: [BOS] system USER: <data>context</data> question ASSISTANT: response [EOS]
attention_mask List[int] All 1s for non-padded tokens
labels List[int] IGNORE_TOKEN_ID (-100) for all non-response tokens; actual token IDs for response tokens
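The label-masking contract in the table above can be demonstrated with a toy sequence. The token IDs and the response offset below are invented for illustration; only the masking rule (IGNORE_TOKEN_ID for everything before the assistant response) reflects the description.

```python
IGNORE_TOKEN_ID = -100  # the value ignored by the cross-entropy loss in common trainers

def mask_labels(input_ids, response_start):
    """Copy input_ids and mask every token before the assistant response."""
    labels = list(input_ids)
    labels[:response_start] = [IGNORE_TOKEN_ID] * response_start
    return labels

# Toy sequence: [BOS] system/user tokens ... then the assistant response at index 5
input_ids = [1, 42, 17, 99, 3, 55, 61, 2]
labels = mask_labels(input_ids, response_start=5)
assert labels == [-100, -100, -100, -100, -100, 55, 61, 2]

attention_mask = [1] * len(input_ids)  # all ones: nothing is padded at this stage
```

Only the response tokens contribute to the loss, so the model learns to generate responses rather than to reproduce prompts.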

Usage Examples

# Inside load_bipia_supervised_data_module():

def tokenize_fn(example: dict) -> dict:
    conversation = example["conversation"]
    user_prompt, response = conversation[0], conversation[1]
    context = example["context"]

    # Optionally wrap context with special tokens
    if add_special_context_token:
        user_prompt = user_prompt.replace(context, "<data>" + context + "</data>")

    # Build Vicuna-format sequence and tokenize each segment
    # ...
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

# Apply as map/filter pipeline on the dataset
dataset = dataset.map(tokenize_fn)
dataset = dataset.filter(lambda x: len(x["input_ids"]) <= tokenizer.model_max_length)
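The filter step can be illustrated without the datasets library. The limit and the toy examples below are assumptions for the sketch; the keep/drop rule mirrors the lambda above.

```python
# Toy tokenized dataset: each item mimics a tokenize_fn output
model_max_length = 8  # hypothetical limit for illustration
examples = [
    {"input_ids": list(range(5))},    # fits
    {"input_ids": list(range(12))},   # too long, dropped
    {"input_ids": list(range(8))},    # exactly at the limit, kept
]

# Equivalent of dataset.filter(lambda x: len(x["input_ids"]) <= tokenizer.model_max_length)
kept = [ex for ex in examples if len(ex["input_ids"]) <= model_max_length]
assert len(kept) == 2
```

Dropping over-length sequences (rather than truncating them) avoids training on responses whose tail has been cut off.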
