| Field | Value |
| --- | --- |
| Sources | BIPIA repository |
| Domains | NLP, Tokenization, Defense |
| Last Updated | 2026-02-14 |
Overview
A tokenization routine from the BIPIA defense module that prepares defense training conversations, inserting special context tokens and masking labels.
Description
The tokenize_fn() inner function splits the user prompt around the context boundary, optionally inserts <data>/</data> tokens, and constructs the full Vicuna chat format sequence. It tokenizes each segment separately (using tokenizer.tokenize with is_split_into_words=True) to maintain precise control over token boundaries. Labels are created by copying input_ids and replacing all tokens before the assistant response with IGNORE_TOKEN_ID. Sequences longer than model_max_length are filtered out via dataset.filter().
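The label-masking step can be sketched in isolation. This is a minimal illustration, not the BIPIA implementation: `mask_labels` is a hypothetical helper, and the token IDs are toy values, but the pattern (copy `input_ids`, then overwrite every pre-response position with `IGNORE_TOKEN_ID`) matches the description above.

```python
IGNORE_TOKEN_ID = -100  # the index ignored by PyTorch's cross-entropy loss

def mask_labels(prefix_ids, response_ids):
    """Build (input_ids, labels): labels copy input_ids, but every token
    before the assistant response is replaced with IGNORE_TOKEN_ID, so
    loss is computed only on the response tokens."""
    input_ids = list(prefix_ids) + list(response_ids)
    labels = [IGNORE_TOKEN_ID] * len(prefix_ids) + list(response_ids)
    return input_ids, labels

# Toy example: three prompt tokens followed by two response tokens
input_ids, labels = mask_labels([101, 7, 8], [42, 102])
print(input_ids)  # [101, 7, 8, 42, 102]
print(labels)     # [-100, -100, -100, 42, 102]
```

Masking with -100 rather than truncating keeps `labels` aligned one-to-one with `input_ids`, which is what causal-LM loss functions expect.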
Usage
Applied as a dataset.map() transformation in load_bipia_supervised_data_module().
Code Reference
| Attribute | Detail |
| --- | --- |
| Source | BIPIA repo |
| File | `defense/white_box/finetune.py` |
| Lines | L397-473 |
| Signature | `def tokenize_fn(example: dict) -> dict` (inner function; captures `tokenizer` and `add_special_context_token` from the enclosing closure) |
| Import | Internal function in `defense/white_box/finetune.py` |
I/O Contract
Inputs
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `example` | `dict` | Yes | Contains `"conversation"` with (user_prompt, response) and a `"context"` column |
Outputs
| Key | Type | Description |
| --- | --- | --- |
| `input_ids` | `List[int]` | Token IDs in Vicuna format: `[BOS] system USER: <data>context</data> question ASSISTANT: response [EOS]` |
| `attention_mask` | `List[int]` | All 1s for non-padded tokens |
| `labels` | `List[int]` | `IGNORE_TOKEN_ID` (-100) for all non-response tokens; actual token IDs for response tokens |
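For orientation, the Vicuna-format string from the Outputs table can be assembled as follows. This is an illustrative sketch only: `build_vicuna_prompt` is a hypothetical helper, and the exact separators and whitespace used by the real BIPIA code may differ.

```python
def build_vicuna_prompt(system: str, context: str, question: str,
                        response: str, wrap_context: bool = True) -> str:
    """Assemble the Vicuna chat string described in the Outputs table;
    wrap_context mirrors the add_special_context_token flag."""
    data = f"<data>{context}</data>" if wrap_context else context
    return f"{system} USER: {data} {question} ASSISTANT: {response}"

print(build_vicuna_prompt("You are a helpful assistant.",
                          "email body here", "Summarize it.", "Sure."))
```

Wrapping only the external context in `<data>`/`</data>` is what lets the model learn a boundary between trusted instructions and untrusted retrieved content.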
Usage Examples
```python
# Inside load_bipia_supervised_data_module():
def tokenize_fn(example: dict) -> dict:
    conversation = example["conversation"]
    user_prompt, response = conversation[0], conversation[1]
    context = example["context"]
    # Optionally wrap context with special tokens
    if add_special_context_token:
        user_prompt = user_prompt.replace(context, "<data>" + context + "</data>")
    # Build Vicuna-format sequence and tokenize each segment
    # ...
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }

# Apply as map/filter pipeline on the dataset
dataset = dataset.map(tokenize_fn)
dataset = dataset.filter(lambda x: len(x["input_ids"]) <= tokenizer.model_max_length)
```