

Implementation:Scikit learn Scikit learn FetchRcv1

From Leeroopedia
Revision as of 16:35, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Scikit_learn_Scikit_learn_FetchRcv1.md)


Knowledge Sources
Domains: Data Loading, Text Classification
Last Updated: 2026-02-08 15:00 GMT

Overview

fetch_rcv1 is the scikit-learn function for downloading and loading the RCV1 (Reuters Corpus Volume I) text classification dataset.

Description

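This module implements the fetch_rcv1 function, which downloads the RCV1 dataset on first use, caches it locally, and loads it. RCV1 is a benchmark corpus for text categorization research. The version distributed by scikit-learn (RCV1-v2) contains 804,414 Reuters newswire stories, each represented by 47,236 features and labeled with one or more of 103 topic categories. The feature vectors are pre-computed, cosine-normalized log TF-IDF values stored in sparse format (the source files are in svmlight format); the raw article text is not included. The function can load the training subset (the first 23,149 samples), the test subset (the remaining 781,265 samples), or all samples, with optional shuffling.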
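To make "cosine-normalized log TF-IDF" concrete, here is a minimal pure-Python sketch of one common variant of that weighting. The formula, the function name log_tfidf_cosine, and the toy documents are assumptions chosen for illustration; the exact weighting used to build the RCV1 vectors may differ in detail.

```python
import math
from collections import Counter


def log_tfidf_cosine(docs):
    """Return one {term: weight} dict per tokenized document.

    Illustrative weighting (an assumption, not the RCV1 recipe):
    weight = (1 + ln(tf)) * ln(N / df), then each vector is scaled
    to unit Euclidean length (cosine normalization).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        raw = {
            term: (1.0 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        if norm == 0.0:
            # Every term occurs in every document, so all weights are zero.
            vectors.append({})
        else:
            # Keep only non-zero entries, as a sparse representation would.
            vectors.append({t: w / norm for t, w in raw.items() if w > 0.0})
    return vectors


# Toy corpus (hypothetical, not taken from RCV1).
docs = [
    ["oil", "prices", "rise", "oil"],
    ["bank", "rates", "rise"],
    ["oil", "exports", "fall"],
]
vectors = log_tfidf_cosine(docs)
```

Because every vector has unit length, the dot product of two documents is their cosine similarity, which is one reason this representation works well with linear models.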

Usage

Use this function when you need the RCV1 dataset for large-scale text classification experiments, for evaluating multi-label classification algorithms, or for benchmarking linear models on high-dimensional sparse data.
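As a rough sketch of the linear-model use case, the example below fits a binary classifier for a single category column of a sparse indicator matrix. To avoid the large download it runs on small synthetic data that only mimics the layout of RCV1 (sparse features, sparse 0/1 targets). The helper name fit_single_topic and the choice of SGDClassifier are assumptions made for illustration, not part of fetch_rcv1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier


def fit_single_topic(X, Y, topic_index, random_state=0):
    """Fit a linear classifier for one column of a sparse indicator matrix Y."""
    # Extract one label column as a dense 0/1 vector; X itself stays sparse.
    y = Y[:, [topic_index]].toarray().ravel().astype(int)
    clf = SGDClassifier(loss="hinge", random_state=random_state)
    clf.fit(X, y)
    return clf


# Synthetic stand-in data (NOT RCV1): 200 sparse rows, 50 features, 2 label columns.
rng = np.random.default_rng(0)
dense = rng.random((200, 50)) * (rng.random((200, 50)) < 0.1)
X_toy = csr_matrix(dense)
# Label j is "on" when feature j is non-zero, so the labels are learnable.
Y_toy = csr_matrix((dense[:, :2] > 0).astype(np.int8))

clf = fit_single_topic(X_toy, Y_toy, topic_index=0)
predictions = clf.predict(X_toy)

# With the real dataset the call would look like:
#   rcv1 = fetch_rcv1(subset="train")
#   clf = fit_single_topic(rcv1.data, rcv1.target, topic_index=0)
```

For a full multi-label evaluation, one such classifier per category (for example via a one-vs-rest wrapper) is a common design; the single-column version keeps the sketch short.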

Code Reference

Source Location

Signature

def fetch_rcv1(
    *,
    data_home=None,
    subset="all",
    download_if_missing=True,
    random_state=None,
    shuffle=False,
    return_X_y=False,
    n_retries=3,
    delay=1.0,
)
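The n_retries and delay parameters control how download failures are retried. The sketch below illustrates the general retry-with-pause pattern under the assumption that n_retries counts additional attempts after the first failure. The helper call_with_retries, the use of OSError, and the simulated download are hypothetical and do not reproduce scikit-learn's internal code.

```python
import time


def call_with_retries(action, n_retries=3, delay=1.0, sleep=time.sleep):
    """Call action(); on OSError, retry up to n_retries more times.

    Waits `delay` seconds between attempts. The `sleep` argument exists
    only so the pause can be replaced in tests or demos.
    """
    attempts_left = n_retries
    while True:
        try:
            return action()
        except OSError:
            if attempts_left == 0:
                raise  # retries exhausted: surface the last error
            attempts_left -= 1
            sleep(delay)


# Simulated download that fails twice before succeeding (hypothetical).
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("simulated network failure")
    return "ok"


pauses = []  # record the requested pauses instead of actually sleeping
result = call_with_retries(flaky_download, n_retries=3, delay=1.0, sleep=pauses.append)
```

In this run the third attempt succeeds, so two pauses are recorded and one retry remains unused.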

Import

from sklearn.datasets import fetch_rcv1

I/O Contract

Inputs

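Name | Type | Required | Description
data_home | str or PathLike or None | No | Directory where the dataset is downloaded and cached (default: None, which uses scikit-learn's default data directory)
subset | str | No | Subset to load: 'train', 'test', or 'all' (default: 'all')
download_if_missing | bool | No | Whether to download the data if it is not already cached; if False and the data is missing, an OSError is raised (default: True)
random_state | int, RandomState instance or None | No | Seed or random number generator used for shuffling; pass an int for reproducible results (default: None)
shuffle | bool | No | Whether to shuffle the dataset (default: False)
return_X_y | bool | No | If True, return a (data, target) tuple instead of a Bunch (default: False)
n_retries | int | No | Number of retries when the download encounters HTTP errors (default: 3)
delay | float | No | Number of seconds to wait between retries (default: 1.0)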

Outputs

Name | Type | Description
dataset | Bunch | Dictionary-like object with the attributes data (sparse feature matrix), target (sparse multi-label indicator matrix), sample_id, target_names, and DESCR
(X, y) | tuple | Returned instead of the Bunch when return_X_y=True; the sparse feature matrix and the sparse target matrix
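Because the target is a sparse indicator matrix rather than a list of labels, a small helper is often useful for reading it. The sketch below shows one way to recover category names per document and document counts per category. It uses a toy matrix in place of the real target so that it runs without the download. The helper labels_for_row and the toy values are illustrative assumptions, and the code assumes the matrix is in CSR format.

```python
import numpy as np
from scipy.sparse import csr_matrix


def labels_for_row(target, target_names, row):
    """Return the category names marked in one row of a CSR indicator matrix."""
    # In CSR format, indptr delimits each row's slice of indices and data.
    start, end = target.indptr[row], target.indptr[row + 1]
    columns = target.indices[start:end]
    values = target.data[start:end]
    # Skip explicitly stored zeros, then map column indices to names.
    return [str(target_names[j]) for j in sorted(columns[values != 0])]


# Toy stand-in for rcv1.target and rcv1.target_names (3 documents, 4 categories).
toy_names = np.array(["C15", "CCAT", "GCAT", "M14"])
toy_target = csr_matrix(np.array([
    [1, 1, 0, 0],  # document 0 belongs to two categories
    [0, 0, 1, 0],  # document 1 belongs to one category
    [0, 0, 0, 0],  # document 2 has no category in this toy example
], dtype=np.uint8))

row_labels = [labels_for_row(toy_target, toy_names, i) for i in range(3)]
# Column sums give the number of documents per category.
docs_per_category = np.asarray(toy_target.sum(axis=0)).ravel()
```

With the real dataset, the same helper would be called as labels_for_row(rcv1.target, rcv1.target_names, i).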

Usage Examples

Basic Usage

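from sklearn.datasets import fetch_rcv1

# The first call downloads and caches the dataset; later calls read the cached copy.
rcv1 = fetch_rcv1(subset='train')
# Both matrices are sparse, with one row per document.
print("Feature matrix shape:", rcv1.data.shape)
print("Target matrix shape:", rcv1.target.shape)
# target_names holds one category code per column of rcv1.target.
print("Number of categories:", len(rcv1.target_names))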
