Implementation:Microsoft LoRA Utils Multiple Choice
Overview
utils_multiple_choice.py provides data processor classes, dataset wrappers, and feature conversion utilities for multiple choice reading comprehension tasks including RACE, SWAG, ARC, and Synonym.
Description
This utility module defines the data pipeline infrastructure for multiple choice fine-tuning. It includes:
- InputExample: A frozen dataclass representing a single training/test example with fields for example_id, question, contexts (list of context strings), endings (list of choice strings), and an optional label.
- InputFeatures: A frozen dataclass holding tokenized features (input_ids, attention_mask, token_type_ids) as nested lists (one per choice), plus the integer label.
- DataProcessor: An abstract base class defining the interface get_train_examples(), get_dev_examples(), get_test_examples(), and get_labels().
- RaceProcessor: Reads the RACE dataset from the hierarchical train/high and train/middle directory structure of JSON text files. Supports 4-way choices.
- SwagProcessor: Reads the SWAG dataset from CSV files. Maps the sent2 and ending columns to 4 choices.
- ArcProcessor: Reads the ARC dataset from JSONL files. Normalizes labels from letter (A-D) or number (1-4) formats. Filters to 4-choice questions only.
- SynonymProcessor: Reads synonym multiple choice data from CSV. Supports 5-way choices.
- MultipleChoiceDataset (PyTorch): A torch.utils.data.Dataset that loads examples through a processor, converts them to features, and caches them using FileLock for distributed training safety.
- TFMultipleChoiceDataset (TensorFlow): Equivalent TF dataset using tf.data.Dataset.from_generator.
- convert_examples_to_features(): Core function that tokenizes context/question/ending combinations and produces InputFeatures, with cloze-style question support (replacing the "_" placeholder with the ending).
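Two of the behaviors listed above, cloze-style question handling and ARC label normalization, can be illustrated with a minimal sketch. The helper names build_choice_text and normalize_arc_label are hypothetical and are not taken from the module; the fallback of joining question and ending with a space, and the choice to return a zero-based integer index, are assumptions made for illustration.

```python
from typing import Optional


def build_choice_text(question: str, ending: str) -> str:
    """Combine a question with one candidate ending (hypothetical helper)."""
    if "_" in question:
        # Cloze-style question: substitute the ending for the "_" placeholder.
        return question.replace("_", ending)
    # Assumed fallback for ordinary questions: append the ending.
    return question + " " + ending


def normalize_arc_label(truth: str) -> Optional[int]:
    """Map 'A'-'D' or '1'-'4' to a zero-based index; None if unrecognized."""
    if len(truth) == 1 and truth in "ABCD":
        return ord(truth) - ord("A")
    if len(truth) == 1 and truth in "1234":
        return int(truth) - 1
    return None


print(build_choice_text("The sky is _ today.", "blue"))     # The sky is blue today.
print(build_choice_text("What color is the sky?", "blue"))  # What color is the sky? blue
print(normalize_arc_label("C"), normalize_arc_label("2"))   # 2 1
```

Each resulting string would then be paired with its context and passed to the tokenizer, once per choice.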
The processors registry maps task names to processor classes: {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}.
Usage
Use this module when you need to:
- Process RACE, SWAG, ARC, or Synonym datasets for multiple choice training
- Build custom multiple choice data pipelines following the DataProcessor pattern
- Convert raw text examples into tokenized features with framework-specific dataset wrappers
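As a rough sketch of the DataProcessor pattern, the block below defines a custom processor for a made-up 3-way task. InputExample and DataProcessor here are stand-ins that mirror the shapes described on this page so the sketch runs on its own; in practice they would be imported from utils_multiple_choice. ToyCsvProcessor, its CSV column names, the file names, and the toy_processors registry are all hypothetical.

```python
import csv
import os
from dataclasses import dataclass
from typing import Dict, List, Optional, Type


@dataclass(frozen=True)
class InputExample:
    # Stand-in with the fields this page lists for the real class.
    example_id: str
    question: str
    contexts: List[str]
    endings: List[str]
    label: Optional[str]


class DataProcessor:
    # Stand-in for the abstract base class and its four-method interface.
    def get_train_examples(self, data_dir: str) -> List[InputExample]:
        raise NotImplementedError()

    def get_dev_examples(self, data_dir: str) -> List[InputExample]:
        raise NotImplementedError()

    def get_test_examples(self, data_dir: str) -> List[InputExample]:
        raise NotImplementedError()

    def get_labels(self) -> List[str]:
        raise NotImplementedError()


class ToyCsvProcessor(DataProcessor):
    """Hypothetical 3-way task; CSV columns: id, context, question, a, b, c, label."""

    def _read(self, path: str) -> List[InputExample]:
        with open(path, newline="", encoding="utf-8") as f:
            return self._create_examples(list(csv.DictReader(f)))

    def _create_examples(self, rows: List[Dict[str, str]]) -> List[InputExample]:
        return [
            InputExample(
                example_id=row["id"],
                question=row["question"],
                # One context per choice, so contexts and endings stay aligned.
                contexts=[row["context"]] * 3,
                endings=[row["a"], row["b"], row["c"]],
                label=row.get("label"),
            )
            for row in rows
        ]

    def get_train_examples(self, data_dir: str) -> List[InputExample]:
        return self._read(os.path.join(data_dir, "train.csv"))

    def get_dev_examples(self, data_dir: str) -> List[InputExample]:
        return self._read(os.path.join(data_dir, "dev.csv"))

    def get_test_examples(self, data_dir: str) -> List[InputExample]:
        return self._read(os.path.join(data_dir, "test.csv"))

    def get_labels(self) -> List[str]:
        return ["0", "1", "2"]


# Hypothetical local registry, shaped like the module's processors dict.
toy_processors: Dict[str, Type[DataProcessor]] = {"toy": ToyCsvProcessor}
```

Keeping file parsing (_read) separate from example construction (_create_examples) makes the conversion logic testable without touching the filesystem.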
Code Reference
Source Location
| Property | Value |
|---|---|
| File | examples/NLU/examples/multiple-choice/utils_multiple_choice.py |
| Lines | 579 |
| Module | utils_multiple_choice |
Signature/CLI
# Key classes and functions
class InputExample(example_id, question, contexts, endings, label)
class InputFeatures(example_id, input_ids, attention_mask, token_type_ids, label)
class DataProcessor # Abstract base
class RaceProcessor(DataProcessor)
class SwagProcessor(DataProcessor)
class ArcProcessor(DataProcessor)
class SynonymProcessor(DataProcessor)
class MultipleChoiceDataset(Dataset) # PyTorch
class TFMultipleChoiceDataset # TensorFlow
def convert_examples_to_features(
    examples: List[InputExample],
    label_list: List[str],
    max_length: int,
    tokenizer: PreTrainedTokenizer,
) -> List[InputFeatures]
processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}
Import
from utils_multiple_choice import (
    MultipleChoiceDataset,
    Split,
    processors,
    convert_examples_to_features,
    InputExample,
    InputFeatures,
)
I/O Contract
Inputs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| data_dir | str | Yes | - | Directory containing task-specific data files |
| tokenizer | PreTrainedTokenizer | Yes | - | Tokenizer for encoding text pairs |
| task | str | Yes | - | Task key: "race", "swag", "arc", or "syn" |
| max_seq_length | int | No | None | Max sequence length for tokenization |
| overwrite_cache | bool | No | False | Force re-processing of cached features |
| mode | Split | No | Split.train | One of Split.train, Split.dev, Split.test |
Outputs
| Output | Type | Description |
|---|---|---|
| Features list | List[InputFeatures] | Tokenized features with input_ids shaped [num_choices, seq_len] |
| Cached features | Binary file | cached_{mode}_{tokenizer}_{max_len}_{task} in data_dir |
| Dataset (PyTorch) | MultipleChoiceDataset | Indexable dataset returning InputFeatures |
| Dataset (TF) | tf.data.Dataset | Generator-based TF dataset with cardinality assertion |
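The cache file naming pattern in the table above can be sketched as follows. The helper cached_features_path is hypothetical, and using the tokenizer's class name for the {tokenizer} slot is an assumption rather than something this page states.

```python
import os


def cached_features_path(data_dir, mode, tokenizer_name, max_seq_length, task):
    """Build a cache path following cached_{mode}_{tokenizer}_{max_len}_{task}."""
    # tokenizer_name is assumed here to be the tokenizer's class name.
    filename = "cached_{}_{}_{}_{}".format(mode, tokenizer_name, max_seq_length, task)
    return os.path.join(data_dir, filename)


path = cached_features_path("/path/to/swag_data", "train", "BertTokenizer", 128, "swag")
print(os.path.basename(path))  # cached_train_BertTokenizer_128_swag
```

Because the name encodes mode, tokenizer, sequence length, and task, changing any of them yields a different cache file, while overwrite_cache forces regeneration of an existing one.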
Usage Examples
Load SWAG training dataset
from transformers import AutoTokenizer
from utils_multiple_choice import MultipleChoiceDataset, Split
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train_dataset = MultipleChoiceDataset(
    data_dir="/path/to/swag_data",
    tokenizer=tokenizer,
    task="swag",
    max_seq_length=128,
    mode=Split.train,
)
print(f"Number of training examples: {len(train_dataset)}")
print(f"First feature: {train_dataset[0]}")
Convert examples to features directly
from utils_multiple_choice import (
    InputExample,
    convert_examples_to_features,
    processors,
)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
processor = processors["race"]()
examples = processor.get_train_examples("/path/to/RACE")
label_list = processor.get_labels() # ["0", "1", "2", "3"]
features = convert_examples_to_features(
    examples=examples,
    label_list=label_list,
    max_length=512,
    tokenizer=tokenizer,
)
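To make the [num_choices, seq_len] layout of the resulting features concrete, here is a minimal sketch that groups a list of features into a batch of nested lists. InputFeatures is a stand-in mirroring the fields listed on this page; collate and nested_shape are hypothetical helpers, and the token ids are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class InputFeatures:
    # Stand-in with the fields this page lists for the real class.
    example_id: str
    input_ids: List[List[int]]
    attention_mask: Optional[List[List[int]]]
    token_type_ids: Optional[List[List[int]]]
    label: Optional[int]


def collate(features: List[InputFeatures]) -> Dict[str, list]:
    """Stack per-example features into batch-level nested lists."""
    return {
        "input_ids": [f.input_ids for f in features],
        "attention_mask": [f.attention_mask for f in features],
        "labels": [f.label for f in features],
    }


def nested_shape(batch: List[List[List[int]]]) -> Tuple[int, int, int]:
    """Return (batch_size, num_choices, seq_len) for a rectangular nested list."""
    return (len(batch), len(batch[0]), len(batch[0][0]))


# Two fake examples, 4 choices each, padded to 6 tokens.
fake = [
    InputFeatures(
        example_id=str(i),
        input_ids=[[101, 7, 8, 102, 0, 0]] * 4,
        attention_mask=[[1, 1, 1, 1, 0, 0]] * 4,
        token_type_ids=None,
        label=i,
    )
    for i in range(2)
]
batch = collate(fake)
print(nested_shape(batch["input_ids"]))  # (2, 4, 6)
```

A multiple choice model typically consumes input of this [batch_size, num_choices, seq_len] shape and scores each choice.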
Related Pages
- Environment:Microsoft_LoRA_NLU_Conda_Environment
- Implementation:Microsoft_LoRA_Run_SWAG - Modern SWAG fine-tuning script using Trainer API