
Implementation:Microsoft LoRA Utils Multiple Choice

From Leeroopedia
Revision as of 15:44, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_LoRA_Utils_Multiple_Choice.md)



Overview

utils_multiple_choice.py provides data processor classes, dataset wrappers, and feature conversion utilities for multiple-choice reading comprehension tasks, including RACE, SWAG, ARC, and Synonym.

Description

This utility module defines the data pipeline infrastructure for multiple choice fine-tuning. It includes:

  • InputExample: A frozen dataclass representing a single training/test example with fields for example_id, question, contexts (list of context strings), endings (list of choice strings), and an optional label.
  • InputFeatures: A frozen dataclass holding the example_id, the tokenized features (input_ids, attention_mask, token_type_ids) as nested lists with one inner list per choice, and the integer label.
  • DataProcessor: An abstract base class defining the interface get_train_examples(), get_dev_examples(), get_test_examples(), and get_labels().
  • RaceProcessor: Reads the RACE dataset from a hierarchical directory layout (for example train/high and train/middle), where each file is a text file containing JSON. Supports 4-way choices.
  • SwagProcessor: Reads the SWAG dataset from CSV files. Builds 4 choices by pairing the sent2 column with each of the ending columns.
  • ArcProcessor: Reads the ARC dataset from JSONL files. Normalizes labels given either as letters (A-D) or as numbers (1-4), and keeps only questions that have exactly 4 choices.
  • SynonymProcessor: Reads synonym multiple-choice data from CSV files. Supports 5-way choices.
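  • MultipleChoiceDataset (PyTorch): A torch.utils.data.Dataset that loads examples through a processor, converts them to features, and caches the result on disk. The cache is guarded by a FileLock so that concurrent processes in distributed training do not build or read it at the same time.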
  • TFMultipleChoiceDataset (TensorFlow): Equivalent TF dataset using tf.data.Dataset.from_generator.
  • convert_examples_to_features(): Core function that tokenizes each context/question/ending combination and produces one InputFeatures per example. It supports cloze-style questions: when the question contains a _ placeholder, the placeholder is replaced with the ending.
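
Two of the text-handling details in the list above can be sketched in plain Python. This is a minimal illustration, not the module's code: the function names build_choice_text and normalize_arc_label are hypothetical, and appending the ending after a space for non-cloze questions, as well as returning None for an unrecognised ARC key, are assumptions.

```python
from typing import Optional


def build_choice_text(question: str, ending: str) -> str:
    """Return the question/ending text paired with the context for one choice."""
    if "_" in question:
        # Cloze-style question: substitute the candidate ending for the blank.
        return question.replace("_", ending)
    # Ordinary question (assumed behaviour): append the ending after the question.
    return question + " " + ending


def normalize_arc_label(truth: str) -> Optional[int]:
    """Map an ARC answer key ("A"-"D" or "1"-"4") to a zero-based index."""
    letters = {"A": 0, "B": 1, "C": 2, "D": 3}
    if truth in letters:
        return letters[truth]
    if truth in {"1", "2", "3", "4"}:
        return int(truth) - 1
    # Unrecognised key (assumed handling): signal it so the caller can skip it.
    return None


print(build_choice_text("The sky is _ today.", "blue"))
print(build_choice_text("What colour is the sky?", "blue"))
print(normalize_arc_label("C"), normalize_arc_label("4"))
```

Mapping both label formats onto the same zero-based index lets one label list serve every ARC question regardless of how its answer key was written.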

The processors registry maps task names to processor classes: {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}.
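
A registry like this is typically used to turn a task string into a processor instance. The sketch below uses stand-in classes so that it runs without the module installed; make_processor is a hypothetical helper, and the label strings for the synonym task are an assumption based on its 5-way choices.

```python
class DataProcessor:
    """Stand-in for the module's abstract base class."""

    def get_labels(self):
        raise NotImplementedError


class RaceProcessor(DataProcessor):
    def get_labels(self):
        return ["0", "1", "2", "3"]  # 4-way choices


class SynonymProcessor(DataProcessor):
    def get_labels(self):
        return ["0", "1", "2", "3", "4"]  # 5-way choices (label strings assumed)


# Reduced copy of the registry, holding only the stand-in classes above.
processors = {"race": RaceProcessor, "syn": SynonymProcessor}


def make_processor(task: str) -> DataProcessor:
    """Hypothetical helper: look up a task key and instantiate its processor."""
    try:
        return processors[task]()
    except KeyError:
        known = ", ".join(sorted(processors))
        raise ValueError(f"Unknown task {task!r}; expected one of: {known}") from None


processor = make_processor("syn")
print(len(processor.get_labels()))
```

Because the registry stores classes rather than instances, a fresh processor is created on every lookup.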

Usage

Use this module when you need to:

  • Process RACE, SWAG, ARC, or Synonym datasets for multiple choice training
  • Build custom multiple choice data pipelines following the DataProcessor pattern
  • Convert raw text examples into tokenized features with framework-specific dataset wrappers
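
For the custom-pipeline case, a processor only needs to turn files in data_dir into InputExample objects and report its label list. The following is a sketch under stated assumptions: QuizProcessor, its CSV column names, and the 3-way label set are all hypothetical, and InputExample is redefined locally as a stand-in so the snippet runs on its own. In real use the class would subclass the module's DataProcessor and also implement the dev and test methods.

```python
import csv
import os
import tempfile
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InputExample:
    """Stand-in mirroring the fields described for the module's InputExample."""
    example_id: str
    question: str
    contexts: List[str]
    endings: List[str]
    label: Optional[str]


class QuizProcessor:
    """Hypothetical processor for a CSV with columns id, context, question, a, b, c, label."""

    def get_labels(self) -> List[str]:
        return ["0", "1", "2"]

    def get_train_examples(self, data_dir: str) -> List[InputExample]:
        return self._read(os.path.join(data_dir, "train.csv"))

    def _read(self, path: str) -> List[InputExample]:
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        return [
            InputExample(
                example_id=row["id"],
                question=row["question"],
                contexts=[row["context"]] * 3,  # same context repeated per choice
                endings=[row["a"], row["b"], row["c"]],
                label=row["label"],
            )
            for row in rows
        ]


# Write a one-row file to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "train.csv"), "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "context", "question", "a", "b", "c", "label"])
        writer.writerow(["q1", "Cats purr.", "Cats make a _ sound.", "purring", "barking", "mooing", "0"])
    examples = QuizProcessor().get_train_examples(data_dir)

print(examples[0].endings)
```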

Code Reference

Source Location

Property Value
File examples/NLU/examples/multiple-choice/utils_multiple_choice.py
Lines 579
Module utils_multiple_choice

Signature/CLI

# Key classes and functions
class InputExample(example_id, question, contexts, endings, label)
class InputFeatures(example_id, input_ids, attention_mask, token_type_ids, label)
class DataProcessor  # Abstract base
class RaceProcessor(DataProcessor)
class SwagProcessor(DataProcessor)
class ArcProcessor(DataProcessor)
class SynonymProcessor(DataProcessor)
class MultipleChoiceDataset(Dataset)  # PyTorch
class TFMultipleChoiceDataset  # TensorFlow

def convert_examples_to_features(
    examples: List[InputExample],
    label_list: List[str],
    max_length: int,
    tokenizer: PreTrainedTokenizer,
) -> List[InputFeatures]

processors = {"race": RaceProcessor, "swag": SwagProcessor, "arc": ArcProcessor, "syn": SynonymProcessor}

Import

from utils_multiple_choice import (
    MultipleChoiceDataset,
    Split,
    processors,
    convert_examples_to_features,
    InputExample,
    InputFeatures,
)

I/O Contract

Inputs

Parameter | Type | Required | Default | Description
data_dir | str | Yes | - | Directory containing task-specific data files
tokenizer | PreTrainedTokenizer | Yes | - | Tokenizer for encoding text pairs
task | str | Yes | - | Task key: "race", "swag", "arc", or "syn"
max_seq_length | int | No | None | Max sequence length for tokenization
overwrite_cache | bool | No | False | Force re-processing of cached features
mode | Split | No | Split.train | One of Split.train, Split.dev, Split.test

Outputs

Output | Type | Description
Features list | List[InputFeatures] | Tokenized features with input_ids shaped [num_choices, seq_len]
Cached features | Binary file | cached_{mode}_{tokenizer}_{max_len}_{task} in data_dir
Dataset (PyTorch) | MultipleChoiceDataset | Indexable dataset returning InputFeatures
Dataset (TF) | tf.data.Dataset | Generator-based TF dataset with cardinality assertion
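
The cache file name pattern above can be illustrated with a small helper. This is a sketch, not the module's code: cached_features_path is a hypothetical name, FakeTokenizer is a placeholder class, and filling the {tokenizer} slot with the tokenizer's class name is an assumption.

```python
import os


class FakeTokenizer:
    """Placeholder standing in for a real tokenizer object."""


def cached_features_path(data_dir, mode, tokenizer, max_seq_length, task):
    """Build a path following the cached_{mode}_{tokenizer}_{max_len}_{task} pattern."""
    # Assumption: the tokenizer slot holds the tokenizer's class name.
    name = "cached_{}_{}_{}_{}".format(
        mode, type(tokenizer).__name__, max_seq_length, task
    )
    return os.path.join(data_dir, name)


path = cached_features_path("data", "train", FakeTokenizer(), 128, "swag")
print(path)
```

Since the split, maximum length, and task are all part of the name, changing any of them points at a different cache file, whereas changing the raw data alone does not, which is the situation overwrite_cache is for.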

Usage Examples

Load SWAG training dataset

from transformers import AutoTokenizer
from utils_multiple_choice import MultipleChoiceDataset, Split

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train_dataset = MultipleChoiceDataset(
    data_dir="/path/to/swag_data",
    tokenizer=tokenizer,
    task="swag",
    max_seq_length=128,
    mode=Split.train,
)
print(f"Number of training examples: {len(train_dataset)}")
print(f"First feature: {train_dataset[0]}")

Convert examples to features directly

from utils_multiple_choice import (
    InputExample,
    convert_examples_to_features,
    processors,
)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
processor = processors["race"]()
examples = processor.get_train_examples("/path/to/RACE")
label_list = processor.get_labels()  # ["0", "1", "2", "3"]

features = convert_examples_to_features(
    examples=examples,
    label_list=label_list,
    max_length=512,
    tokenizer=tokenizer,
)
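
The nested layout of the resulting features (one inner list per choice) can be illustrated without a real tokenizer. In this sketch, toy_encode is a made-up stand-in for the tokenizer call, to_features is a hypothetical helper, and InputFeatures is a local stand-in that omits token_type_ids for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InputFeatures:
    """Stand-in for the module's InputFeatures (token_type_ids omitted)."""
    example_id: str
    input_ids: List[List[int]]
    attention_mask: List[List[int]]
    label: Optional[int]


def toy_encode(text_a: str, text_b: str, max_length: int) -> Tuple[List[int], List[int]]:
    """Toy tokenizer: one id per whitespace token, truncated and padded with 0."""
    tokens = (text_a + " " + text_b).split()[:max_length]
    ids = [len(tok) for tok in tokens]  # arbitrary non-zero ids, for illustration only
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [0] * pad, mask + [0] * pad


def to_features(example_id, context, question, endings, label, label_list, max_length):
    """Encode each (context, question + ending) pair and group the results."""
    label_map = {name: i for i, name in enumerate(label_list)}
    input_ids, attention_mask = [], []
    for ending in endings:
        ids, mask = toy_encode(context, question + " " + ending, max_length)
        input_ids.append(ids)
        attention_mask.append(mask)
    return InputFeatures(example_id, input_ids, attention_mask, label_map[label])


feat = to_features(
    "q1", "Cats purr.", "Cats sound like", ["purring", "barking"], "1", ["0", "1"], 8
)
print(len(feat.input_ids), len(feat.input_ids[0]), feat.label)
```

Each example therefore carries a [num_choices, seq_len] grid of ids, and the string label is mapped to its index in the label list.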
