Implementation:Online ml River Preprocessing OneHotEncoder
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Preprocessing, Categorical_Encoding |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Online one-hot encoding for categorical features with support for both streaming and mini-batch processing.
Description
OneHotEncoder converts categorical variables into binary indicator features. Each distinct value observed for a feature becomes its own output feature, named by joining the feature name and the value with an underscore (for example, c1_u), and set to 1 when that value is present and 0 otherwise. The encoder learns its vocabulary incrementally, adding new categories as they are observed. A feature's value may also be a list or a set, in which case each element is encoded as its own indicator. Two options control the output: drop_zeros omits the zero-valued indicators, and drop_first produces k-1 dummies from k categories, which is useful for avoiding multicollinearity. The encoder works on single observations (dicts) and on pandas DataFrames for mini-batch processing.
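The incremental vocabulary described above can be sketched in plain Python. This is a simplified illustration, not River's implementation: the class name TinyOneHot and its internals are hypothetical, and it handles only scalar values.

```python
from collections import defaultdict


class TinyOneHot:
    """Toy incremental one-hot encoder (illustrative only, not River's code)."""

    def __init__(self, drop_zeros=False):
        self.drop_zeros = drop_zeros
        # feature name -> set of category values observed so far
        self.seen = defaultdict(set)

    def learn_one(self, x):
        # Grow the vocabulary with the values in this observation.
        for name, value in x.items():
            self.seen[name].add(value)

    def transform_one(self, x):
        out = {}
        if not self.drop_zeros:
            # Start with a 0 for every category learned so far.
            out = {f"{n}_{v}": 0 for n, vals in self.seen.items() for v in vals}
        # Switch on the indicator for each value present in x.
        for name, value in x.items():
            out[f"{name}_{value}"] = 1
        return out


enc = TinyOneHot()
enc.learn_one({"c1": "u", "c2": "d"})
first = enc.transform_one({"c1": "u", "c2": "d"})
enc.learn_one({"c1": "a", "c2": "x"})
second = enc.transform_one({"c1": "a", "c2": "x"})

print(first)                        # {'c1_u': 1, 'c2_d': 1}
print(dict(sorted(second.items())))
# {'c1_a': 1, 'c1_u': 0, 'c2_d': 0, 'c2_x': 1}
```

The second output is wider than the first because the vocabulary grew between the two calls. With drop_zeros=True the sketch would emit only the entries equal to 1.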
Usage
Use this encoder to prepare categorical features for linear models and other algorithms that require numeric inputs, in particular string-valued features in machine learning pipelines. Setting drop_zeros=True keeps each output dict small when a feature has many categories, because only the indicators equal to 1 are emitted. Setting drop_first=True helps prevent perfect collinearity in statistical models. Combine the encoder with compose.Select to encode specific features while passing the others through unchanged.
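The collinearity point can be illustrated without any library. The sketch below uses a hypothetical fixed list of three categories and the common convention of dropping the first one. It shows the general k-1 dummy idea only, and makes no claim about which column River itself drops.

```python
categories = ["red", "green", "blue"]  # k = 3 hypothetical categories


def dummies(value, drop_first=False):
    """Return the indicator row for `value` as a list of 0/1."""
    kept = categories[1:] if drop_first else categories
    return [1 if value == c else 0 for c in kept]


full = [dummies(v) for v in categories]
reduced = [dummies(v, drop_first=True) for v in categories]

# With k dummies every row sums to 1, which duplicates an intercept column.
print([sum(row) for row in full])  # [1, 1, 1]
# With k-1 dummies the dropped category becomes the all-zeros baseline.
print(reduced)                     # [[0, 0], [1, 0], [0, 1]]
```

Because the full indicator columns always sum to 1, a model with an intercept cannot identify their coefficients uniquely. Dropping one column removes that redundancy.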
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/preprocessing/one_hot.py
Signature
class OneHotEncoder(base.MiniBatchTransformer):
    def __init__(self, drop_zeros=False, drop_first=False): ...
Import
from river import preprocessing
I/O Contract
| Input | Output |
|---|---|
| Dict[str, Union[str, List, Set]] - Categorical features | Dict[str, int] - Binary indicator features |
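To make the contract concrete, here is a minimal stateless sketch of the input-to-output mapping, including a list-valued feature. The function one_hot and the sample feature names are hypothetical. It mimics the feature_value key naming seen in this page's examples and emits only the 1s, as with drop_zeros enabled.

```python
def one_hot(x):
    """Encode a dict of categorical features, emitting only the 1s."""
    out = {}
    for name, value in x.items():
        # A list or set contributes one indicator per element.
        values = value if isinstance(value, (list, set)) else [value]
        for v in values:
            out[f"{name}_{v}"] = 1
    return out


encoded = one_hot({"color": "red", "tags": ["new", "sale"]})
print(encoded)
# {'color_red': 1, 'tags_new': 1, 'tags_sale': 1}
```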
Usage Examples
from pprint import pprint
import random
import string
random.seed(42)
alphabet = list(string.ascii_lowercase)
# Four observations with two randomly drawn categorical features each
X = [
    {
        'c1': random.choice(alphabet),
        'c2': random.choice(alphabet),
    }
    for _ in range(4)
]
from river import preprocessing
oh = preprocessing.OneHotEncoder()
for x in X[:2]:
    oh.learn_one(x)
    pprint(oh.transform_one(x))
# Output (categories learned earlier appear as 0):
# {'c1_u': 1, 'c2_d': 1}
# {'c1_a': 1, 'c1_u': 0, 'c2_d': 0, 'c2_x': 1}
# With drop_zeros enabled
oh = preprocessing.OneHotEncoder(drop_zeros=True)
for x in X:
    oh.learn_one(x)
    pprint(oh.transform_one(x))
# {'c1_u': 1, 'c2_d': 1}
# {'c1_a': 1, 'c2_x': 1}
# {'c1_i': 1, 'c2_h': 1}
# {'c1_h': 1, 'c2_e': 1}
# Using compose.Select to encode specific features
from river import compose
pp = compose.Select('c1') | preprocessing.OneHotEncoder()
pp += compose.Select('c2')
for x in X:
    pp.learn_one(x)
    pprint(pp.transform_one(x))
# {'c1_u': 1, 'c2': 'd'}
# {'c1_a': 1, 'c1_u': 0, 'c2': 'x'}
# {'c1_a': 0, 'c1_i': 1, 'c1_u': 0, 'c2': 'h'}
# {'c1_a': 0, 'c1_h': 1, 'c1_i': 0, 'c1_u': 0, 'c2': 'e'}