Implementation:Microsoft LoRA Utils Multiple Choice

Overview

utils_multiple_choice.py provides data processor classes, dataset wrappers, and feature conversion utilities for multiple-choice tasks, including RACE, SWAG, ARC, and a synonym-selection task.

Description

This utility module defines the data pipeline infrastructure for multiple choice fine-tuning. It includes:

InputExample: A frozen dataclass representing a single training/test example with fields for example_id, question, contexts (list of context strings), endings (list of choice strings), and an optional label.
InputFeatures: A frozen dataclass holding the example_id, the tokenized features (input_ids, attention_mask, token_type_ids) as nested lists (one inner list per choice), and the integer label.
DataProcessor: An abstract base class defining the interface get_train_examples(), get_dev_examples(), get_test_examples(), and get_labels().
RaceProcessor: Reads the RACE dataset from a nested directory layout (for example train/high and train/middle) in which each text file contains JSON. Supports 4-way choices.
SwagProcessor: Reads the SWAG dataset from CSV files. Combines the sent2 column with each ending column to form 4 choices.
ArcProcessor: Reads the ARC dataset from JSONL files. Normalizes labels given as letters (A-D) or numbers (1-4). Keeps only questions with exactly 4 choices.
SynonymProcessor: Reads synonym multiple-choice questions from CSV files. Supports 5-way choices.
MultipleChoiceDataset (PyTorch): A torch.utils.data.Dataset that loads examples through a processor, converts them to features, and caches the features on disk. A FileLock guards the cache so that concurrent processes in distributed training do not build it at the same time.
TFMultipleChoiceDataset (TensorFlow): Equivalent TF dataset using tf.data.Dataset.from_generator.
convert_examples_to_features(): Core function that tokenizes context/question/ending combinations and produces InputFeatures with cloze-style question support (replacing _ placeholder with the ending).
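
The cloze-style handling described for convert_examples_to_features() can be illustrated with a minimal sketch. The helper name build_choice_pairs is hypothetical and not part of the module; the sketch only shows how one (context, question + ending) text pair per choice could be formed before tokenization, assuming the blank is marked with a single underscore.

```python
from typing import List, Tuple


def build_choice_pairs(
    contexts: List[str], question: str, endings: List[str]
) -> List[Tuple[str, str]]:
    """Return one (text_a, text_b) pair per answer choice (hypothetical helper)."""
    pairs = []
    for context, ending in zip(contexts, endings):
        if "_" in question:
            # Cloze-style question: fill the blank with the candidate ending.
            text_b = question.replace("_", ending)
        else:
            # Ordinary question: append the candidate ending to the question.
            text_b = question + " " + ending
        pairs.append((context, text_b))
    return pairs


pairs = build_choice_pairs(
    contexts=["Tom went to the market."] * 2,
    question="Tom bought _ at the market.",
    endings=["apples", "a car"],
)
for text_a, text_b in pairs:
    print(text_a, "|", text_b)
```

Each pair would then be passed to the tokenizer as a sentence pair, giving one row of input_ids per choice.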

The processors registry maps task names to processor classes: {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}.
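
Of the processor behaviours listed above, the ARC label normalization is the easiest to get subtly wrong, so a small sketch may help. The function name normalize_arc_label is hypothetical, and raising on an unknown key is a choice made for this sketch rather than a statement about how the module treats such rows.

```python
def normalize_arc_label(answer_key: str) -> str:
    """Map an ARC answer key ('A'-'D' or '1'-'4') to a zero-based label string."""
    if answer_key in ("A", "B", "C", "D"):
        # Letter keys: A -> "0", B -> "1", C -> "2", D -> "3".
        return str(ord(answer_key) - ord("A"))
    if answer_key in ("1", "2", "3", "4"):
        # Numeric keys are one-based in the data, zero-based as labels.
        return str(int(answer_key) - 1)
    # Assumption for this sketch: reject anything else explicitly.
    raise ValueError(f"Unsupported answer key: {answer_key!r}")


print([normalize_arc_label(key) for key in ["A", "D", "1", "4"]])
```

Both formats end up in the same label space, which is what lets a single label list such as ["0", "1", "2", "3"] serve every 4-way example.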

Usage

Use this module when you need to:

Process RACE, SWAG, ARC, or Synonym datasets for multiple choice training
Build custom multiple choice data pipelines following the DataProcessor pattern
Convert raw text examples into tokenized features with framework-specific dataset wrappers
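
As a rough illustration of the DataProcessor pattern mentioned above, the sketch below defines a custom processor for a made-up 3-way task and registers it under a task key. InputExample and DataProcessor are local stand-ins that mirror the fields and methods described on this page, and QuizProcessor, the "quiz" key, and the CSV column names are all hypothetical.

```python
import csv
import os
import tempfile
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InputExample:
    """Stand-in mirroring the fields described for InputExample."""
    example_id: str
    question: str
    contexts: List[str]
    endings: List[str]
    label: Optional[str]


class DataProcessor:
    """Stand-in for the abstract interface; subclasses override every method."""

    def get_train_examples(self, data_dir: str) -> List[InputExample]:
        raise NotImplementedError()

    def get_dev_examples(self, data_dir: str) -> List[InputExample]:
        raise NotImplementedError()

    def get_test_examples(self, data_dir: str) -> List[InputExample]:
        raise NotImplementedError()

    def get_labels(self) -> List[str]:
        raise NotImplementedError()


class QuizProcessor(DataProcessor):
    """Hypothetical 3-way task stored as train.csv / dev.csv / test.csv."""

    def get_train_examples(self, data_dir: str) -> List[InputExample]:
        return self._read(os.path.join(data_dir, "train.csv"))

    def get_dev_examples(self, data_dir: str) -> List[InputExample]:
        return self._read(os.path.join(data_dir, "dev.csv"))

    def get_test_examples(self, data_dir: str) -> List[InputExample]:
        return self._read(os.path.join(data_dir, "test.csv"))

    def get_labels(self) -> List[str]:
        return ["0", "1", "2"]

    def _read(self, path: str) -> List[InputExample]:
        with open(path, newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
        return [
            InputExample(
                example_id=row["id"],
                question=row["question"],
                contexts=[row["context"]] * 3,  # same context for every choice
                endings=[row["a"], row["b"], row["c"]],
                label=row["label"],
            )
            for row in rows
        ]


# Registry in the same spirit as the module's processors dict.
processors = {"quiz": QuizProcessor}

with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "train.csv"), "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "context", "question", "a", "b", "c", "label"])
        writer.writerow(["q1", "The sky is clear.", "The sky looks _.", "blue", "loud", "salty", "0"])
    examples = processors["quiz"]().get_train_examples(data_dir)

print(examples[0])
```

Keeping the processors behind a task-keyed dictionary means the dataset wrapper only needs the task string to find the right reader.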

Code Reference

Source Location

Property	Value
File	`examples/NLU/examples/multiple-choice/utils_multiple_choice.py`
Lines	579
Module	`utils_multiple_choice`

Signature/CLI

# Key classes and functions
class InputExample(example_id, question, contexts, endings, label)
class InputFeatures(example_id, input_ids, attention_mask, token_type_ids, label)
class DataProcessor  # Abstract base
class RaceProcessor(DataProcessor)
class SwagProcessor(DataProcessor)
class ArcProcessor(DataProcessor)
class SynonymProcessor(DataProcessor)
class MultipleChoiceDataset(Dataset)  # PyTorch
class TFMultipleChoiceDataset  # TensorFlow

def convert_examples_to_features(
    examples: List[InputExample],
    label_list: List[str],
    max_length: int,
    tokenizer: PreTrainedTokenizer,
) -> List[InputFeatures]

processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}

Import

from utils_multiple_choice import (
    MultipleChoiceDataset,
    Split,
    processors,
    convert_examples_to_features,
    InputExample,
    InputFeatures,
)

I/O Contract

Inputs

Parameter	Type	Required	Default	Description
`data_dir`	str	Yes	-	Directory containing task-specific data files
`tokenizer`	PreTrainedTokenizer	Yes	-	Tokenizer for encoding text pairs
`task`	str	Yes	-	Task key: `"race"`, `"swag"`, `"arc"`, or `"syn"`
`max_seq_length`	int	No	None	Max sequence length for tokenization
`overwrite_cache`	bool	No	False	Force re-processing of cached features
`mode`	Split	No	Split.train	One of `Split.train`, `Split.dev`, `Split.test`

Outputs

Output	Type	Description
Features list	`List[InputFeatures]`	Tokenized features with input_ids shaped `[num_choices, seq_len]`
Cached features	Binary file	`cached_{mode}_{tokenizer}_{max_len}_{task}` in data_dir
Dataset (PyTorch)	`MultipleChoiceDataset`	Indexable dataset returning `InputFeatures`
Dataset (TF)	`tf.data.Dataset`	Generator-based TF dataset with cardinality assertion
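
To make the cache naming pattern in the table concrete, here is a minimal sketch of how such a path could be assembled. The function name cached_features_path is hypothetical, and filling the tokenizer placeholder with the tokenizer's class name is an assumption, not something stated on this page.

```python
import os
from typing import Optional


def cached_features_path(
    data_dir: str,
    mode: str,
    tokenizer_name: str,
    max_seq_length: Optional[int],
    task: str,
) -> str:
    """Build a cache path following cached_{mode}_{tokenizer}_{max_len}_{task}."""
    filename = "cached_{}_{}_{}_{}".format(mode, tokenizer_name, str(max_seq_length), task)
    return os.path.join(data_dir, filename)


# Assumption: the tokenizer is identified by its class name.
path = cached_features_path("/path/to/swag_data", "train", "BertTokenizerFast", 128, "swag")
print(path)
```

Because the split, tokenizer, sequence length, and task are all part of the file name, changing any of them points at a different cache file; overwrite_cache is the way to rebuild features when none of them changed.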

Usage Examples

Load SWAG training dataset

from transformers import AutoTokenizer
from utils_multiple_choice import MultipleChoiceDataset, Split

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train_dataset = MultipleChoiceDataset(
    data_dir="/path/to/swag_data",
    tokenizer=tokenizer,
    task="swag",
    max_seq_length=128,
    mode=Split.train,
)
print(f"Number of training examples: {len(train_dataset)}")
print(f"First feature: {train_dataset[0]}")

Convert examples to features directly

from utils_multiple_choice import (
    InputExample,
    convert_examples_to_features,
    processors,
)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
processor = processors["race"]()
examples = processor.get_train_examples("/path/to/RACE")
label_list = processor.get_labels()  # ["0", "1", "2", "3"]

features = convert_examples_to_features(
    examples=examples,
    label_list=label_list,
    max_length=512,
    tokenizer=tokenizer,
)
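
After conversion it can be useful to sanity-check the nested [num_choices, seq_len] layout before handing features to a model. The sketch below uses a local stand-in for InputFeatures and a hypothetical helper, check_feature_shapes; it assumes every choice is padded or truncated to exactly max_length tokens, which may not hold for every tokenizer configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InputFeatures:
    """Stand-in mirroring the fields described for InputFeatures."""
    example_id: str
    input_ids: List[List[int]]
    attention_mask: Optional[List[List[int]]]
    token_type_ids: Optional[List[List[int]]]
    label: Optional[int]


def check_feature_shapes(
    features: List[InputFeatures], num_choices: int, max_length: int
) -> None:
    """Raise ValueError if any feature is not shaped [num_choices, max_length]."""
    for feature in features:
        if len(feature.input_ids) != num_choices:
            raise ValueError(f"{feature.example_id}: expected {num_choices} choices")
        for row in feature.input_ids:
            # Assumption: each choice is padded/truncated to max_length.
            if len(row) != max_length:
                raise ValueError(f"{feature.example_id}: row length {len(row)}")
        if feature.label is not None and not 0 <= feature.label < num_choices:
            raise ValueError(f"{feature.example_id}: label {feature.label} out of range")


# Toy feature with 2 choices of 4 token ids each (values are made up).
toy = InputFeatures(
    example_id="q1",
    input_ids=[[101, 7, 102, 0], [101, 9, 102, 0]],
    attention_mask=[[1, 1, 1, 0], [1, 1, 1, 0]],
    token_type_ids=None,
    label=1,
)
check_feature_shapes([toy], num_choices=2, max_length=4)
```

With the real module, num_choices would be len(label_list) and max_length the value passed to convert_examples_to_features().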

Related Pages

Environment:Microsoft_LoRA_NLU_Conda_Environment
Implementation:Microsoft_LoRA_Run_SWAG - Modern SWAG fine-tuning script using Trainer API
