| Field | Value |
| --- | --- |
| Sources | BIPIA repository |
| Domains | NLP, Tokenization, Defense |
| Last Updated | 2026-02-14 |
Overview
A tokenization routine from the BIPIA defense module that prepares defense training conversations, inserting special context tokens and masking labels.
Description
The tokenize_fn() inner function splits the user prompt around the context boundary, optionally inserts <data>/</data> tokens, and constructs the full Vicuna chat format sequence. It tokenizes each segment separately (using tokenizer.tokenize with is_split_into_words=True) to maintain precise control over token boundaries. Labels are created by copying input_ids and replacing all tokens before the assistant response with IGNORE_TOKEN_ID. Sequences longer than model_max_length are filtered out via dataset.filter().
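The label-masking step can be sketched in isolation. This is a minimal illustration, not the BIPIA implementation: `mask_labels` is a hypothetical helper, and the token IDs are toy values, but the pattern (copy `input_ids`, then overwrite every pre-response position with `IGNORE_TOKEN_ID`) matches the description above.

```python
IGNORE_TOKEN_ID = -100  # the index ignored by PyTorch's cross-entropy loss

def mask_labels(prefix_ids, response_ids):
    """Build (input_ids, labels): labels copy input_ids, but every token
    before the assistant response is replaced with IGNORE_TOKEN_ID, so
    loss is computed only on the response tokens."""
    input_ids = list(prefix_ids) + list(response_ids)
    labels = [IGNORE_TOKEN_ID] * len(prefix_ids) + list(response_ids)
    return input_ids, labels

# Toy example: three prompt tokens followed by two response tokens
input_ids, labels = mask_labels([101, 7, 8], [42, 102])
print(input_ids)  # [101, 7, 8, 42, 102]
print(labels)     # [-100, -100, -100, 42, 102]
```

Masking with -100 rather than truncating keeps `labels` aligned one-to-one with `input_ids`, which is what causal-LM loss functions expect.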
Usage
Applied as a dataset.map() transformation in load_bipia_supervised_data_module().
Code Reference
| Attribute | Detail |
| --- | --- |
| Source | BIPIA repo |
| File | `defense/white_box/finetune.py` |
| Lines | L397-473 |
| Signature | `def tokenize_fn(example: dict) -> dict` (inner function; captures `tokenizer` and `add_special_context_token` from the enclosing closure) |
| Import | Internal function in `defense/white_box/finetune.py` |
I/O Contract
Inputs
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `example` | `dict` | Yes | Contains `"conversation"` with (user_prompt, response) and a `"context"` column |
Outputs
| Key | Type | Description |
| --- | --- | --- |
| `input_ids` | `List[int]` | Token IDs in Vicuna format: `[BOS] system USER: <data>context</data> question ASSISTANT: response [EOS]` |
| `attention_mask` | `List[int]` | All 1s for non-padded tokens |
| `labels` | `List[int]` | `IGNORE_TOKEN_ID` (-100) for all non-response tokens; actual token IDs for response tokens |
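For orientation, the Vicuna-format string from the Outputs table can be assembled as follows. This is an illustrative sketch only: `build_vicuna_prompt` is a hypothetical helper, and the exact separators and whitespace used by the real BIPIA code may differ.

```python
def build_vicuna_prompt(system: str, context: str, question: str,
                        response: str, wrap_context: bool = True) -> str:
    """Assemble the Vicuna chat string described in the Outputs table;
    wrap_context mirrors the add_special_context_token flag."""
    data = f"<data>{context}</data>" if wrap_context else context
    return f"{system} USER: {data} {question} ASSISTANT: {response}"

print(build_vicuna_prompt("You are a helpful assistant.",
                          "email body here", "Summarize it.", "Sure."))
```

Wrapping only the external context in `<data>`/`</data>` is what lets the model learn a boundary between trusted instructions and untrusted retrieved content.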
Usage Examples
```python
# Inside load_bipia_supervised_data_module():
def tokenize_fn(example: dict) -> dict:
    conversation = example["conversation"]
    user_prompt, response = conversation[0], conversation[1]
    context = example["context"]
    # Optionally wrap context with special tokens
    if add_special_context_token:
        user_prompt = user_prompt.replace(context, "<data>" + context + "</data>")
    # Build Vicuna-format sequence and tokenize each segment
    # ...
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

# Apply as map/filter pipeline on the dataset
dataset = dataset.map(tokenize_fn)
dataset = dataset.filter(lambda x: len(x["input_ids"]) <= tokenizer.model_max_length)
```