Implementation:Online ml River Preprocessing OneHotEncoder

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Preprocessing, Categorical_Encoding
Last Updated: 2026-02-08 16:00 GMT

Overview

Online one-hot encoding for categorical features with support for both streaming and mini-batch processing.

Description

OneHotEncoder converts categorical variables into binary indicator features. Each unique value encountered becomes a separate binary feature, named by joining the feature name and the value with an underscore (for example c1_u), with value 1 when present and 0 otherwise. The encoder learns the vocabulary incrementally as new categories are observed. A single feature may hold a list or set of values, in which case each element is encoded as its own indicator. With drop_zeros enabled, indicators for absent categories are left out of the output instead of being emitted as 0. With drop_first enabled, k categories produce k-1 dummies, which is useful for avoiding multicollinearity. The encoder works with both single observations and pandas DataFrames for batch processing.
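
The incremental behavior described above can be illustrated with a minimal pure-Python sketch. The class name SketchOneHot and its internals are hypothetical and written only for illustration; this is not River's implementation and it omits drop_first and mini-batch support. It shows a vocabulary that grows as observations arrive, list-valued features, and the effect of a drop_zeros flag.

```python
from collections import defaultdict


class SketchOneHot:
    """Hypothetical, simplified stand-in; NOT River's implementation."""

    def __init__(self, drop_zeros=False):
        self.drop_zeros = drop_zeros
        # Maps each feature name to the set of category values seen so far.
        self.seen = defaultdict(set)

    @staticmethod
    def _as_values(value):
        # A list or set contributes several categories at once.
        return value if isinstance(value, (list, set)) else [value]

    def learn_one(self, x):
        # Grow the vocabulary with the categories in this observation.
        for name, value in x.items():
            self.seen[name].update(self._as_values(value))

    def transform_one(self, x):
        out = {}
        if not self.drop_zeros:
            # Emit a 0 for every category learned so far.
            for name, values in self.seen.items():
                for v in values:
                    out[f"{name}_{v}"] = 0
        # Overwrite with 1 for the categories present in this observation.
        for name, value in x.items():
            for v in self._as_values(value):
                out[f"{name}_{v}"] = 1
        return out


# Default mode: previously seen categories show up as zeros.
enc = SketchOneHot()
enc.learn_one({"c1": "u"})
enc.learn_one({"c1": "a"})
dense = enc.transform_one({"c1": "a"})

# drop_zeros mode with a list-valued feature: only present categories remain.
sparse_enc = SketchOneHot(drop_zeros=True)
sparse_enc.learn_one({"c1": "u"})
sparse_enc.learn_one({"tags": ["x", "y"]})
sparse = sparse_enc.transform_one({"tags": ["x", "y"]})
```

In this sketch the output dict widens over time in the default mode, since every learned category is emitted, while the drop_zeros variant returns only the indicators that are set.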

Usage

Use this encoder to prepare categorical features for linear models and other algorithms that require numeric inputs, for example when a pipeline receives string-valued features. Setting drop_zeros=True keeps each output dict small when there are many categories, because absent categories are omitted rather than emitted as 0. Setting drop_first=True helps prevent perfect collinearity in statistical models. Combine with compose.Select to encode specific features while preserving others.
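
The collinearity that drop_first is meant to avoid can be seen in a small worked example. The sketch below uses plain Python with hypothetical data and a hypothetical helper named dummies; it does not call River, and the choice of which category to drop is arbitrary here. With all k indicators, every row sums to 1, so the columns are a linear combination of an intercept term. With k-1 indicators that constant sum disappears, and the dropped category is represented by a row of zeros.

```python
# Hypothetical categories and observations, for illustration only.
categories = ["red", "green", "blue"]
rows = ["red", "blue", "green", "blue"]


def dummies(value, cats):
    # One indicator per category in cats: 1 if it matches, else 0.
    return [1 if value == c else 0 for c in cats]


# Full encoding with k = 3 indicators per row.
full = [dummies(r, categories) for r in rows]
row_sums = [sum(r) for r in full]

# Reduced encoding with k - 1 = 2 indicators: the first category is dropped.
reduced = [dummies(r, categories[1:]) for r in rows]
reduced_sums = [sum(r) for r in reduced]

print(row_sums)      # every entry is 1: perfectly collinear with an intercept
print(reduced_sums)  # no longer constant: "red" rows are all zeros
```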

Code Reference

Signature

class OneHotEncoder(base.MiniBatchTransformer):
    def __init__(self, drop_zeros=False, drop_first=False)

Import

from river import preprocessing

I/O Contract

Input: Dict[str, Union[str, List, Set]] - Categorical features
Output: Dict[str, int] - Binary indicator features

Usage Examples

from pprint import pprint
import random
import string

random.seed(42)
alphabet = list(string.ascii_lowercase)
X = [
    {
        'c1': random.choice(alphabet),
        'c2': random.choice(alphabet),
    }
    for _ in range(4)
]

from river import preprocessing

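# Default mode: every category learned so far appears in the output,
# with 0 for those absent from the current observation.
oh = preprocessing.OneHotEncoder()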
for x in X[:2]:
    oh.learn_one(x)
    pprint(oh.transform_one(x))
# {'c1_u': 1, 'c2_d': 1}
# {'c1_a': 1, 'c1_u': 0, 'c2_d': 0, 'c2_x': 1}

# With drop_zeros enabled, only the categories present in the
# current observation are emitted.
oh = preprocessing.OneHotEncoder(drop_zeros=True)
for x in X:
    oh.learn_one(x)
    pprint(oh.transform_one(x))
# {'c1_u': 1, 'c2_d': 1}
# {'c1_a': 1, 'c2_x': 1}
# {'c1_i': 1, 'c2_h': 1}
# {'c1_h': 1, 'c2_e': 1}

# Using compose.Select to one-hot encode c1 while passing c2 through unchanged
from river import compose

pp = compose.Select('c1') | preprocessing.OneHotEncoder()
pp += compose.Select('c2')

for x in X:
    pp.learn_one(x)
    pprint(pp.transform_one(x))
# {'c1_u': 1, 'c2': 'd'}
# {'c1_a': 1, 'c1_u': 0, 'c2': 'x'}
# {'c1_a': 0, 'c1_i': 1, 'c1_u': 0, 'c2': 'h'}
# {'c1_a': 0, 'c1_h': 1, 'c1_i': 0, 'c1_u': 0, 'c2': 'e'}
