Implementation:Online ml River Datasets MovieLens100K

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Datasets, Regression, Recommender_Systems
Last Updated: 2026-02-08 16:00 GMT

Overview

MovieLens100K is a concrete dataset class provided by the River library for regression (rating prediction) and recommender-system experiments.

Description

The MovieLens datasets were collected by the GroupLens Research Project at the University of Minnesota. The 100K version consists of 100,000 ratings (1-5) from 943 users on 1,682 movies. Each user has rated at least 20 movies, and user and movie information is provided. The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998.

In River, each of the 100,000 ratings is one sample with 10 features. The task is regression: predicting the rating value.

Usage

This dataset is useful for:

  • Collaborative filtering and recommender systems
  • Rating prediction tasks
  • Personalization algorithms
  • Matrix factorization techniques
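
As a rough illustration of the matrix factorization use case, the sketch below runs plain SGD factorization over a stream of (user, item, rating) triples. It is ordinary Python rather than a River model, and the hyperparameters, the 3.0 starting offset, and the four ratings are all invented for the example.

```python
import random
from collections import defaultdict

random.seed(0)
N_FACTORS, LR, L2 = 2, 0.05, 0.01  # assumed values, not tuned


def new_vector():
    """Small random latent vector for a user or item seen for the first time."""
    return [random.gauss(0, 0.1) for _ in range(N_FACTORS)]


user_factors = defaultdict(new_vector)
item_factors = defaultdict(new_vector)
global_mean = 3.0  # assumed offset: the middle of the 1-5 scale


def predict(user, item):
    p, q = user_factors[user], item_factors[item]
    return global_mean + sum(pf * qf for pf, qf in zip(p, q))


def learn(user, item, rating):
    """One SGD step on the squared error with L2 regularization."""
    p, q = user_factors[user], item_factors[item]
    error = rating - predict(user, item)
    for f in range(N_FACTORS):
        pf, qf = p[f], q[f]
        p[f] += LR * (error * qf - L2 * pf)
        q[f] += LR * (error * pf - L2 * qf)


# Invented (user, item, rating) triples, replayed several times.
ratings = [("u1", "m1", 5.0), ("u1", "m2", 1.0), ("u2", "m1", 4.0), ("u2", "m2", 2.0)]
error_before = sum(abs(r - predict(u, i)) for u, i, r in ratings)
for _ in range(200):
    for user, item, rating in ratings:
        learn(user, item, rating)
error_after = sum(abs(r - predict(u, i)) for u, i, r in ratings)
print(f"total absolute error: {error_before:.2f} -> {error_after:.2f}")
```

Because each update touches only one user vector and one item vector, the same loop could in principle consume ratings one at a time as they arrive, which is the setting this dataset is meant for.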

Code Reference

Source Location

Signature

from river import stream
from river.datasets import base


class MovieLens100K(base.RemoteDataset):
    def __init__(self, unpack_user_and_item=False):
        super().__init__(
            n_samples=100_000,
            n_features=10,
            task=base.REG,
            url="https://maxhalford.github.io/files/datasets/ml_100k.zip",
            size=11_057_876,
            filename="ml_100k.csv",
        )
        self.unpack_user_and_item = unpack_user_and_item

    def _iter(self):
        X_y = stream.iter_csv(
            self.path,
            target="rating",
            converters={
                "timestamp": int,
                "release_date": int,
                "age": float,
                "rating": float,
            },
            delimiter="\t",
        )
        if self.unpack_user_and_item:
            for x, y in X_y:
                user = x.pop("user")
                item = x.pop("item")
                yield x, y, {"user": user, "item": item}
        else:
            yield from X_y
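
To make the behaviour of _iter easier to follow without downloading the file, here is a minimal sketch that mimics it with the standard csv module. It is not River's stream.iter_csv. The helper name iter_rows is hypothetical, and the two sample rows are invented and include only the columns named in the code above.

```python
import csv
import io

# Made-up rows in a tab-delimited layout; the values are invented.
SAMPLE = (
    "user\titem\ttimestamp\trelease_date\tage\trating\n"
    "u1\tm10\t1000\t500\t21\t4\n"
    "u2\tm11\t2000\t600\t35\t2\n"
)

CONVERTERS = {"timestamp": int, "release_date": int, "age": float, "rating": float}


def iter_rows(text, unpack_user_and_item=False):
    """Yield (x, y) or (x, y, extra) from tab-delimited text."""
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        # Apply the type converters; columns not listed stay as strings.
        for name, convert in CONVERTERS.items():
            row[name] = convert(row[name])
        y = row.pop("rating")  # the target leaves the feature dict
        if unpack_user_and_item:
            extra = {"user": row.pop("user"), "item": row.pop("item")}
            yield row, y, extra
        else:
            yield row, y


default_rows = list(iter_rows(SAMPLE))
unpacked_rows = list(iter_rows(SAMPLE, unpack_user_and_item=True))
print(default_rows[0])
print(unpacked_rows[0])
```

In the unpacked form the user and item keys no longer appear in the feature dict, so a model that only looks at x will not see them.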

Import

from river import datasets
dataset = datasets.MovieLens100K()
# Or with unpacked user/item:
dataset = datasets.MovieLens100K(unpack_user_and_item=True)

I/O Contract

Inputs

Name | Type | Required | Description
unpack_user_and_item | bool | No | If True, user and item are removed from the feature dict and yielded as a separate third element (default: False)

Outputs

Name | Type | Description
Iteration (default) | tuple(dict, float) | Yields (features_dict, rating) pairs
Iteration (unpack_user_and_item=True) | tuple(dict, float, dict) | Yields (features_dict, rating, {"user": user, "item": item}); user and item are removed from features_dict
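
The third element of the unpacked output is a dict, so it can be splatted into a call as keyword arguments. The sketch below shows that pattern with a hypothetical UserMeanRegressor class. Its method signatures are an assumption made for illustration and are not taken from River, and the three stream entries are invented.

```python
from collections import defaultdict


class UserMeanRegressor:
    """Hypothetical model: predicts each user's running mean rating."""

    def __init__(self, default=3.0):
        self.default = default  # fallback for users with no ratings yet
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def predict_one(self, x, user, item):
        if self.counts[user] == 0:
            return self.default
        return self.sums[user] / self.counts[user]

    def learn_one(self, x, y, user, item):
        self.sums[user] += y
        self.counts[user] += 1


# Invented triples in the unpacked shape: (features, rating, extra).
stream = [
    ({"age": 21.0}, 4.0, {"user": "u1", "item": "m1"}),
    ({"age": 21.0}, 2.0, {"user": "u1", "item": "m2"}),
    ({"age": 35.0}, 5.0, {"user": "u2", "item": "m1"}),
]

model = UserMeanRegressor()
for x, y, extra in stream:
    model.learn_one(x, y, **extra)  # extra supplies user=... and item=...

prediction = model.predict_one({"age": 21.0}, user="u1", item="m3")
print(prediction)
```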

Dataset Properties

Property | Value
Number of samples | 100,000
Number of features | 10
Task | Regression (rating prediction)
Format | CSV (tab-delimited)
Size | 11,057,876 bytes
Number of users | 943
Number of items | 1,682
Rating scale | 1-5
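
The counts above imply a sparse user-by-item rating matrix: only about 6.3% of the possible user/movie pairs carry a rating. The short calculation below derives that figure and the average number of ratings per user and per movie, using only the numbers listed in the table.

```python
n_ratings = 100_000
n_users = 943
n_items = 1_682

possible_pairs = n_users * n_items      # every user/movie combination
density = n_ratings / possible_pairs    # fraction of pairs that have a rating
ratings_per_user = n_ratings / n_users
ratings_per_item = n_ratings / n_items

print(f"possible pairs:   {possible_pairs:,}")
print(f"density:          {density:.2%}")
print(f"ratings per user: {ratings_per_user:.1f}")
print(f"ratings per item: {ratings_per_item:.1f}")
```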

Features

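The 10 features cover:

  • User information (user ID in the user field, age, other demographics)
  • Movie information (movie ID in the item field, release date, genre)
  • Interaction data (timestamp)

The target variable is rating, the user's rating of the movie, parsed as a float from 1 to 5. The loader also parses timestamp and release_date as integers and age as a float; every other field is left as a string.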

Usage Examples

from river import datasets

# Standard usage
dataset = datasets.MovieLens100K()
for x, y in dataset:
    print(x, y)
    break

# With user and item unpacked
dataset = datasets.MovieLens100K(unpack_user_and_item=True)
for x, y, extra in dataset:
    print(f"Features: {x}")
    print(f"Rating: {y}")
    print(f"User: {extra['user']}, Item: {extra['item']}")
    break
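
Online datasets like this one are often evaluated in a test-then-train fashion: each rating is predicted before the model learns from it. The sketch below shows the idea with a running-mean baseline over (x, y) pairs in the default output shape. The function name progressive_mae, the 3.0 fallback, and the three sample pairs are assumptions made for the example.

```python
def progressive_mae(stream, default=3.0):
    """Test-then-train: predict each rating before learning from it."""
    total, count, abs_error = 0.0, 0, 0.0
    for x, y in stream:
        y_pred = total / count if count else default  # 1. predict first
        abs_error += abs(y - y_pred)                  # 2. score the prediction
        total += y                                    # 3. then learn
        count += 1
    return abs_error / count if count else 0.0


# Invented (features, rating) pairs; the baseline ignores the features.
sample = [({"age": 21.0}, 4.0), ({"age": 35.0}, 2.0), ({"age": 50.0}, 3.0)]
mae = progressive_mae(sample)
print(f"MAE: {mae:.2f}")
```

With River installed, datasets.MovieLens100K() yields pairs of the same shape and could be passed in place of sample, giving a baseline error against which real models can be compared.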

References

  • Harper, F.M. and Konstan, J.A., 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), pp.1-19.
