
Implementation:Microsoft DeepSpeedExamples GLUE Classifier BERT Large

From Leeroopedia
Revision as of 15:41, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_DeepSpeedExamples_GLUE_Classifier_BERT_Large.md)


Knowledge Sources
Domains: NLP, Fine-tuning
Last Updated: 2026-02-07 12:00 GMT

Overview

BERT-large fine-tuning script covering all nine GLUE benchmark tasks, with DeepSpeed distributed training and checkpoint save/restore for long-running jobs.

Description

run_glue_classifier_bert_large.py extends the BERT-base GLUE classifier with the checkpoint management that the larger BERT-large model needs. Like the base version, it implements data processors for all nine GLUE tasks (MRPC, MNLI, CoLA, SST-2, STS-B, QQP, QNLI, RTE, WNLI) and supports DeepSpeed distributed training.

The key addition over the BERT-base version is a pair of checkpoint functions, checkpoint_model() and load_checkpoint(). checkpoint_model() saves model state through DeepSpeed's model.save_checkpoint() method and persists the current epoch, global step count, and global data sample count alongside it. load_checkpoint() restores that training state from a previous checkpoint, so an interrupted run can resume where it stopped. This matters for BERT-large, whose training runs take significantly longer than BERT-base runs.
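The save/restore round trip described above can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the engine exposes save_checkpoint(save_dir, tag, client_state=...) and load_checkpoint(load_dir, tag) returning a (path, client_state) pair, as DeepSpeed's engine API documents. StubEngine is a hypothetical stand-in so the example runs without DeepSpeed or a GPU, and the tuple returned by load_checkpoint() is an assumption.

```python
import json
import os
import tempfile


class StubEngine:
    """Hypothetical stand-in for a DeepSpeed engine; it stores only the client state."""

    def save_checkpoint(self, save_dir, tag, client_state=None):
        ckpt_dir = os.path.join(save_dir, str(tag))
        os.makedirs(ckpt_dir, exist_ok=True)
        with open(os.path.join(ckpt_dir, "client_state.json"), "w") as f:
            json.dump(client_state or {}, f)
        return True

    def load_checkpoint(self, load_dir, tag):
        path = os.path.join(load_dir, str(tag), "client_state.json")
        with open(path) as f:
            return path, json.load(f)


def checkpoint_model(PATH, ckpt_id, model, epoch, last_global_step,
                     last_global_data_samples, **kwargs):
    # Bundle the training progress counters so they travel with the weights.
    state = {
        "epoch": epoch,
        "last_global_step": last_global_step,
        "last_global_data_samples": last_global_data_samples,
    }
    state.update(kwargs)
    return model.save_checkpoint(PATH, ckpt_id, client_state=state)


def load_checkpoint(model, PATH, ckpt_id):
    # The engine returns the path it loaded from plus the saved client state.
    _, state = model.load_checkpoint(PATH, ckpt_id)
    return (state["epoch"], state["last_global_step"],
            state["last_global_data_samples"])


with tempfile.TemporaryDirectory() as tmp:
    engine = StubEngine()
    checkpoint_model(tmp, "epoch1", engine, epoch=1,
                     last_global_step=500, last_global_data_samples=16000)
    restored = load_checkpoint(engine, tmp, "epoch1")
```

With a real DeepSpeed engine, the same calls would also persist model and optimizer state; the counters in the client state are what let a resumed run skip work it has already done.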

The training pipeline follows the same structure as the base version: data loading with task-specific processors, feature extraction with WordPiece tokenization, DeepSpeed-wrapped training with the BertAdam optimizer and learning-rate warmup scheduling, and task-specific evaluation metrics. The script also supports FocalLoss for class-imbalanced tasks and builds on the pytorch_pretrained_bert library.
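To illustrate why a focal loss helps with class imbalance, the sketch below computes the standard per-example focal loss, FL = -(1 - p_t)^gamma * log(p_t), in plain Python. It is not the script's FocalLoss class: the function names, the gamma=2.0 default, and the absence of class weighting are all assumptions made for this example.

```python
import math


def cross_entropy(probs, target):
    """Standard cross-entropy for one example, given class probabilities."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one example: down-weights well-classified examples.

    gamma=2.0 is an illustrative default, not a value taken from the script.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An "easy" example (true class predicted with p=0.9) and a "hard" one (p=0.3).
easy, hard = [0.9, 0.1], [0.3, 0.7]

easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)  # (1 - 0.9) ** 2
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)  # (1 - 0.3) ** 2
```

The easy example keeps a much smaller share of its cross-entropy loss than the hard one, so abundant, confidently classified examples contribute less to the gradient. With gamma set to 0 the focal loss reduces to ordinary cross-entropy.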

Usage

Use this script to fine-tune BERT-large on GLUE benchmark tasks with DeepSpeed. The checkpoint management makes it suitable for long-running training jobs that may need to be interrupted and resumed.

Code Reference

Source Location

Signature

def checkpoint_model(PATH, ckpt_id, model, epoch, last_global_step,
                     last_global_data_samples, **kwargs):
    ...

def load_checkpoint(model, PATH, ckpt_id):
    ...

class InputExample(object):
    def __init__(self, guid, text_a, text_b=None, label=None):
        ...

class InputFeatures(object):
    def __init__(self, input_ids, input_mask, segment_ids, label_id):
        ...

class DataProcessor(object):
    def get_train_examples(self, data_dir): ...
    def get_dev_examples(self, data_dir): ...
    def get_labels(self): ...

def convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode):
    ...

def compute_metrics(task_name, preds, labels):
    ...

def main():
    ...
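The fields of InputFeatures follow the usual BERT input layout. The sketch below shows how one example could be turned into input_ids, input_mask, and segment_ids padded to max_seq_length. It is an illustration under stated assumptions: build_features and truncate_pair are hypothetical helpers, and a whitespace split with a toy vocabulary stands in for the WordPiece tokenizer the script uses.

```python
def truncate_pair(tokens_a, tokens_b, max_length):
    """Trim the longer sequence one token at a time until the pair fits."""
    while len(tokens_a) + len(tokens_b) > max_length:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()


def build_features(text_a, text_b, max_seq_length, vocab):
    # Whitespace tokenization is a stand-in for WordPiece (assumption).
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split() if text_b else []

    if tokens_b:
        truncate_pair(tokens_a, tokens_b, max_seq_length - 3)  # [CLS], [SEP], [SEP]
    else:
        tokens_a = tokens_a[:max_seq_length - 2]               # [CLS], [SEP]

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)

    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    input_mask = [1] * len(input_ids)  # 1 marks real tokens, 0 marks padding

    padding = [0] * (max_seq_length - len(input_ids))
    return input_ids + padding, input_mask + padding, segment_ids + padding


vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "movie": 5, "was": 6, "great": 7}

single = build_features("The movie was great", None, 8, vocab)
pair = build_features("the movie", "was great fun", 8, vocab)
```

Single-sentence tasks such as SST-2 produce all-zero segment IDs, while sentence-pair tasks such as MRPC mark the second sentence with 1s.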

Import

# This is a standalone training script, run via DeepSpeed launcher:
# deepspeed run_glue_classifier_bert_large.py --deepspeed_config ds_config.json ...
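The launcher command references a ds_config.json file that this page does not show. As a rough sketch, the snippet below writes a minimal configuration using standard DeepSpeed config keys. All values are illustrative assumptions, not the repository's settings.

```python
import json
import os
import tempfile

# Illustrative values only (assumptions); tune them for your hardware.
ds_config = {
    "train_batch_size": 16,     # global batch size across all GPUs
    "steps_per_print": 100,     # logging interval in optimizer steps
    "fp16": {"enabled": True},  # mixed-precision training
}

config_path = os.path.join(tempfile.mkdtemp(), "ds_config.json")
with open(config_path, "w") as f:
    json.dump(ds_config, f, indent=2)

# Read the file back to confirm it is valid JSON.
with open(config_path) as f:
    loaded = json.load(f)
```

DeepSpeed requires train_batch_size to equal the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs, so the global value must be divisible by the GPU count.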

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
--data_dir | str | Yes | Directory containing GLUE task TSV data files
--bert_model | str | Yes | Pretrained BERT model name (e.g., bert-large-uncased)
--task_name | str | Yes | GLUE task name: mrpc, mnli, cola, sst-2, sts-b, qqp, qnli, rte, wnli
--output_dir | str | Yes | Directory for model predictions and checkpoints
--max_seq_length | int | No | Maximum tokenized sequence length (default: 128)
--do_train | flag | No | Run training phase
--do_eval | flag | No | Run evaluation phase
--train_batch_size | int | No | Training batch size (default: 32)
--learning_rate | float | No | Initial learning rate for Adam (default: 5e-5)
--num_train_epochs | float | No | Number of training epochs (default: 3.0)
--local_rank | int | No | Local rank for distributed training (default: -1)
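The arguments listed above map directly onto an argparse parser. The sketch below declares them with the documented types and defaults. build_parser is a hypothetical helper, the script's own parser may define more flags than this page lists, and the --deepspeed_config flag is registered separately by DeepSpeed, so it is omitted here.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the flags documented on this page."""
    parser = argparse.ArgumentParser(description="BERT-large GLUE fine-tuning (sketch)")
    # Required arguments
    parser.add_argument("--data_dir", type=str, required=True)
    parser.add_argument("--bert_model", type=str, required=True)
    parser.add_argument("--task_name", type=str, required=True)
    parser.add_argument("--output_dir", type=str, required=True)
    # Optional arguments with the documented defaults
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--do_train", action="store_true")
    parser.add_argument("--do_eval", action="store_true")
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--num_train_epochs", type=float, default=3.0)
    parser.add_argument("--local_rank", type=int, default=-1)
    return parser


args = build_parser().parse_args([
    "--data_dir", "/data/glue/SST-2",
    "--bert_model", "bert-large-uncased",
    "--task_name", "sst-2",
    "--output_dir", "/output/sst2",
    "--do_train",
    "--train_batch_size", "16",
])
```

Flags that are not passed fall back to their defaults, so args.max_seq_length is 128 and args.do_eval is False in this call.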

Outputs

Name | Type | Description
--- | --- | ---
eval_results.txt | file | Evaluation metrics per task (accuracy, F1, MCC, or correlation)
model checkpoint | directory | DeepSpeed checkpoint with model state, epoch, global step, and data sample count
training logs | stdout | Training loss, checkpoint status, and evaluation results
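The metrics named for eval_results.txt can be sketched in plain Python as follows. The task-to-metric mapping follows the common GLUE convention (MCC for CoLA, accuracy plus F1 for MRPC and QQP, correlation for STS-B, accuracy elsewhere) and is an assumption about this script. compute_metrics_sketch is a hypothetical name, and only Pearson correlation is shown for STS-B.

```python
import math


def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


def confusion(preds, labels):
    """Binary confusion counts: (tp, tn, fp, fn)."""
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    tn = sum(p == 0 and l == 0 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    return tp, tn, fp, fn


def f1_binary(preds, labels):
    tp, _, fp, fn = confusion(preds, labels)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def matthews_corrcoef(preds, labels):
    tp, tn, fp, fn = confusion(preds, labels)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def compute_metrics_sketch(task_name, preds, labels):
    if task_name == "cola":
        return {"mcc": matthews_corrcoef(preds, labels)}
    if task_name in ("mrpc", "qqp"):
        return {"acc": accuracy(preds, labels), "f1": f1_binary(preds, labels)}
    if task_name == "sts-b":
        return {"pearson": pearson(preds, labels)}
    if task_name in ("sst-2", "mnli", "qnli", "rte", "wnli"):
        return {"acc": accuracy(preds, labels)}
    raise KeyError(task_name)


preds, labels = [1, 0, 1, 1], [1, 0, 0, 1]
mrpc = compute_metrics_sketch("mrpc", preds, labels)
cola = compute_metrics_sketch("cola", preds, labels)
stsb = compute_metrics_sketch("sts-b", [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In practice these metrics are usually taken from scikit-learn and SciPy; the hand-written versions here only keep the sketch dependency-free.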

Usage Examples

Fine-tune BERT-large on SST-2 with Checkpointing

# Launch with DeepSpeed for SST-2 sentiment analysis
deepspeed run_glue_classifier_bert_large.py \
    --deepspeed_config ds_config.json \
    --data_dir /data/glue/SST-2 \
    --bert_model bert-large-uncased \
    --task_name sst-2 \
    --output_dir /output/sst2 \
    --do_train \
    --do_eval \
    --do_lower_case \
    --max_seq_length 128 \
    --train_batch_size 16 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0
