Overview
TextSentiment is a PyTorch nn.Module for text classification using a bag-of-embeddings architecture. It composes an nn.EmbeddingBag layer with a fully connected nn.Linear layer to classify text into one of num_class sentiment categories (4 by default). The model is memory-efficient because nn.EmbeddingBag averages the embeddings of each sample directly, without padded input sequences and without materializing the intermediate per-token embedding tensor.
Description
The TextSentiment class implements a lightweight text classification model with two layers: an nn.EmbeddingBag (constructed with sparse=True) that aggregates each sample's token embeddings into a single mean vector, and a single nn.Linear projection from that vector to class logits. The default configuration uses a vocabulary of 1,308,843 tokens, 32-dimensional embeddings, and 4 output classes.
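To make the scale of the default configuration concrete, the parameter count can be worked out by hand. The sketch below is plain arithmetic on the defaults stated above; the float32 storage size is an assumption (it is PyTorch's default dtype), and the variable names are illustrative only.

```python
# Default hyperparameters from the description above.
vocab_size, embed_dim, num_class = 1_308_843, 32, 4

# nn.EmbeddingBag holds one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# nn.Linear holds a (num_class x embed_dim) weight matrix plus a bias vector.
fc_params = embed_dim * num_class + num_class

total_params = embedding_params + fc_params

# Assumption: parameters are stored as float32 (4 bytes each).
total_mib = total_params * 4 / 2**20

print(f"embedding: {embedding_params:,}")  # embedding: 41,882,976
print(f"fc:        {fc_params:,}")         # fc:        132
print(f"total:     {total_params:,} (~{total_mib:.0f} MiB)")
```

Almost all of the roughly 160 MiB sits in the embedding table, which is why the sparse=True setting on that layer matters far more than anything about the linear layer.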
Key Responsibilities
- Embedding Aggregation: Uses nn.EmbeddingBag with sparse=True to compute mean embeddings per sample without padding, using offsets to delimit variable-length sequences
- Classification: Projects aggregated embedding through a single nn.Linear layer to produce class logits
- Weight Initialization: Uniform initialization in the range [-0.5, 0.5] for embedding and linear weights, with bias zeroed
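The role of offsets in the aggregation step can be illustrated without PyTorch. The following is a minimal pure-Python sketch, not the library's implementation: split_bags and mean_vector are hypothetical helpers, the embedding table holds made-up values, and every bag is assumed to be non-empty.

```python
def split_bags(tokens, offsets):
    """Split a flat token list into per-sample bags using start offsets."""
    # Each bag runs from its own offset up to the next one; the last bag
    # runs to the end of the token list.
    ends = list(offsets[1:]) + [len(tokens)]
    return [tokens[start:end] for start, end in zip(offsets, ends)]


def mean_vector(bag, table):
    """Average the embedding rows of one bag, dimension by dimension."""
    rows = [table[token] for token in bag]
    return [sum(column) / len(rows) for column in zip(*rows)]


# Toy 2-dimensional embedding table (hypothetical values).
table = {1: [1.0, 0.0], 2: [3.0, 2.0], 3: [5.0, 4.0], 4: [0.0, 6.0]}

bags = split_bags([1, 2, 3, 4], offsets=[0, 3])
means = [mean_vector(bag, table) for bag in bags]
print(bags)   # [[1, 2, 3], [4]]
print(means)  # [[3.0, 2.0], [0.0, 6.0]]
```

Each output row corresponds to one sample, regardless of how many tokens it contained, which is what lets the linear layer consume a fixed-size input.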
Architecture
| Layer | Type | Input Dim | Output Dim | Notes |
|---|---|---|---|---|
| self.embedding | nn.EmbeddingBag | vocab_size (1,308,843) | embed_dim (32) | sparse=True, mode="mean" (default) |
| self.fc | nn.Linear | embed_dim (32) | num_class (4) | Final classification layer |
Usage
import torch
from model import TextSentiment

# Create model with default parameters
model = TextSentiment()

# Or with custom parameters
model = TextSentiment(vocab_size=50000, embed_dim=64, num_class=5)

# Forward pass
# text: 1-D tensor of token indices (concatenated, no padding)
# offsets: 1-D tensor holding the start index of each sample within text
text = torch.tensor([1, 2, 3, 4, 5, 6])
offsets = torch.tensor([0, 3])  # Two samples: [1,2,3] and [4,5,6]
logits = model(text, offsets)
# logits shape: (2, num_class), e.g. (2, 5) for the custom model above
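In practice the text and offsets inputs are usually built from a batch of variable-length token lists. The sketch below shows one way to do that in plain Python; collate_bags is a hypothetical helper that is not part of model.py, and it assumes a non-empty batch.

```python
from itertools import accumulate


def collate_bags(sequences):
    """Flatten variable-length token sequences into (text, offsets) lists."""
    # Concatenate all token IDs in batch order, with no padding.
    text = [token for seq in sequences for token in seq]
    # Each offset is the running total of the lengths of the preceding
    # sequences, so the first offset is always 0.
    lengths = [len(seq) for seq in sequences]
    offsets = [0] + list(accumulate(lengths))[:-1]
    return text, offsets


text, offsets = collate_bags([[1, 2, 3], [4, 5, 6]])
print(text)     # [1, 2, 3, 4, 5, 6]
print(offsets)  # [0, 3]
```

Wrapping both lists in torch.tensor(...) yields inputs in the form that forward(text, offsets) expects.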
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| examples/text_classification/model.py | L1-40 | Full module (40 lines) |
| examples/text_classification/model.py | L19-40 | TextSentiment class definition |
| examples/text_classification/model.py | L20-24 | __init__(vocab_size, embed_dim, num_class) -- layer construction |
| examples/text_classification/model.py | L26-30 | init_weights() -- uniform initialization |
| examples/text_classification/model.py | L32-39 | forward(text, offsets) -- embedding + linear projection |
Signature
class TextSentiment(nn.Module):
    def __init__(self, vocab_size=1308843, embed_dim=32, num_class=4):
        """
        Construct EmbeddingBag + Linear text classifier.

        Args:
            vocab_size (int): Size of the vocabulary. Default: 1,308,843.
            embed_dim (int): Dimensionality of embeddings. Default: 32.
            num_class (int): Number of output classes. Default: 4.
        """
        ...

    def init_weights(self):
        """
        Initialize weights uniformly in [-0.5, 0.5].

        Sets the embedding and linear weights to uniform(-0.5, 0.5)
        and the linear bias to zero.
        """
        ...

    def forward(self, text, offsets):
        """
        Forward pass: EmbeddingBag aggregation followed by linear projection.

        Args:
            text (Tensor): 1-D tensor of token indices (all sequences in the
                batch concatenated, no padding needed).
            offsets (Tensor): 1-D tensor of start offsets delimiting the
                individual sequences within the text tensor.

        Returns:
            Tensor: Class logits of shape (batch_size, num_class).
        """
        ...
Import
from model import TextSentiment
I/O Contract
| Method | Input | Output | Notes |
|---|---|---|---|
| __init__(vocab_size, embed_dim, num_class) | Default: 1308843, 32, 4 | None | Creates nn.EmbeddingBag(sparse=True) and nn.Linear; calls init_weights() |
| init_weights() | None | None | uniform_(-0.5, 0.5) for embedding and FC weights; zero_() for FC bias |
| forward(text, offsets) | text: 1-D Tensor of token IDs; offsets: 1-D Tensor of sample boundaries | Tensor of shape (batch_size, num_class) | No padding required; offsets delimit variable-length sequences |
Usage Examples
Example 1: Model Construction and Weight Init
# From model.py L19-30: TextSentiment with EmbeddingBag
import torch.nn as nn


class TextSentiment(nn.Module):
    def __init__(self, vocab_size=1308843, embed_dim=32, num_class=4):
        super(TextSentiment, self).__init__()
        # sparse=True makes the embedding gradient a sparse tensor
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
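One practical consequence of sparse=True is that the embedding layer receives a sparse gradient, which only some optimizers accept (torch.optim.SGD is one of them). The following sketch of a single training step uses small stand-in layers rather than the real class; the layer sizes, labels, and learning rate are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in for the two layers above (hypothetical sizes, not the defaults).
embedding = nn.EmbeddingBag(100, 8, sparse=True)
fc = nn.Linear(8, 4)

# SGD accepts the sparse gradient produced by sparse=True.
params = list(embedding.parameters()) + list(fc.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

text = torch.tensor([1, 2, 3, 4, 5, 6])
offsets = torch.tensor([0, 3])
labels = torch.tensor([0, 2])  # hypothetical class labels

before = embedding.weight.detach().clone()

optimizer.zero_grad()
loss = nn.functional.cross_entropy(fc(embedding(text, offsets)), labels)
loss.backward()

# With sparse=True the embedding gradient is a sparse tensor.
grad_is_sparse = embedding.weight.grad.is_sparse
optimizer.step()

# Rows for tokens that never appeared in the batch are left untouched.
changed = (embedding.weight.detach() != before).any(dim=1)
changed_rows = changed.nonzero().flatten().tolist()
print(grad_is_sparse)
print(changed_rows)
```

Only the rows for tokens seen in the batch are updated, so a step does not have to touch the full vocabulary-sized table.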
Example 2: Forward Pass with Offsets
# From model.py L32-39: forward() uses EmbeddingBag + FC
def forward(self, text, offsets):
    """
    Args:
        text: 1-D tensor representing a bag of text tensors
        offsets: a list of offsets to delimit the 1-D text tensor
            into the individual sequences.
    """
    return self.fc(self.embedding(text, offsets))

# Example usage with variable-length inputs:
import torch

model = TextSentiment()

# Three samples with lengths 2, 3, and 1
text = torch.tensor([10, 20, 30, 40, 50, 60])
offsets = torch.tensor([0, 2, 5])  # Sample boundaries
logits = model(text, offsets)
# logits.shape == (3, 4)
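The logits returned by forward() are unnormalized scores, so a caller typically converts them to probabilities or class indices. This sketch uses hard-coded stand-in logits (hypothetical values) instead of a real model output.

```python
import torch

# Stand-in logits for 3 samples and 4 classes (hypothetical values).
logits = torch.tensor([
    [2.0, 0.5, -1.0, 0.0],
    [0.1, 0.2, 3.0, -0.5],
    [-1.0, 1.5, 0.0, 0.3],
])

# Softmax turns each row of logits into probabilities that sum to 1.
probs = torch.softmax(logits, dim=1)

# argmax picks the index of the highest-scoring class per sample.
predicted = logits.argmax(dim=1)
print(predicted.tolist())  # [0, 2, 1]
```

Because softmax preserves ordering, taking argmax over the logits or over the probabilities gives the same predicted class.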
Example 3: Why EmbeddingBag Over Embedding
# nn.EmbeddingBag computes the mean of 'bags' of embeddings.
# Unlike nn.Embedding + mean(), it:
# 1. Requires no padding (uses offsets instead)
# 2. Computes the average without materializing the per-token embeddings
# 3. Is typically faster and more memory-efficient for variable-length sequences
#
# With nn.Embedding, you would need:
# padded_input = pad_sequence(sequences, batch_first=True)
# embeddings = embedding(padded_input)  # (batch, max_len, embed_dim)
# mean_embeddings = embeddings.mean(dim=1)  # also averages padding positions unless they are masked out
#
# With nn.EmbeddingBag:
# mean_embeddings = embedding_bag(concatenated_tokens, offsets)
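The comparison can be checked numerically. The sketch below copies one embedding table into both layer types and confirms that a masked mean over padded nn.Embedding output matches nn.EmbeddingBag; the vocabulary size, token IDs, and the use of index 0 as the padding token are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(0)

# Hypothetical small vocabulary; index 0 is treated as the padding token.
bag = nn.EmbeddingBag(20, 4, mode="mean")
emb = nn.Embedding(20, 4)
with torch.no_grad():
    emb.weight.copy_(bag.weight)  # both layers use the same embedding table

sequences = [torch.tensor([10, 11]), torch.tensor([12, 13, 14]), torch.tensor([15])]

# EmbeddingBag path: concatenated tokens plus start offsets.
text = torch.cat(sequences)
offsets = torch.tensor([0, 2, 5])
bag_means = bag(text, offsets)

# Embedding path: pad, embed, then average over real tokens only.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
mask = (padded != 0).unsqueeze(-1).float()  # (batch, max_len, 1)
summed = (emb(padded) * mask).sum(dim=1)    # (batch, embed_dim)
masked_means = summed / mask.sum(dim=1)     # divide by the true lengths

same = torch.allclose(bag_means, masked_means, atol=1e-6)
print(same)
```

The padded path needs an explicit mask and a (batch, max_len, embed_dim) intermediate tensor to reach the same result, which is the overhead nn.EmbeddingBag avoids.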
Related Pages