Implementation:Scikit learn Scikit learn FetchRcv1

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Data Loading, Text Classification
Last Updated	2026-02-08 15:00 GMT

Overview

Concrete tool, provided by scikit-learn, for fetching the RCV1 (Reuters Corpus Volume I) multilabel text classification dataset.

Description

This module implements the fetch_rcv1 function, which downloads the RCV1 dataset on first use, caches it, and loads it. RCV1 is a benchmark corpus for text categorization research. It contains 804,414 Reuters newswire stories labeled with topics drawn from 103 categories. The feature vectors are pre-computed, cosine-normalized log TF-IDF representations with 47,236 dimensions, distributed in svmlight format and loaded as a sparse matrix. The function can load the train subset (the first 23,149 samples), the test subset (the remaining 781,265 samples), or all samples, with optional shuffling.
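The subset selection and shuffling described above can be sketched on toy data without downloading anything. This is a minimal sketch, not the library's implementation: select_subset is a hypothetical helper, the toy matrices are invented for illustration, and the split point assumes the training set is the leading block of rows.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils import shuffle

# Assumed split point: the training subset is taken to be the first
# 23,149 samples of the real corpus.
N_TRAIN_RCV1 = 23149


def select_subset(X, y, sample_id, subset="all", n_train=N_TRAIN_RCV1):
    """Hypothetical helper: slice row-aligned arrays into train/test/all."""
    if subset == "train":
        rows = slice(0, n_train)
    elif subset == "test":
        rows = slice(n_train, None)
    elif subset == "all":
        rows = slice(None)
    else:
        raise ValueError(f"Unknown subset: {subset!r}")
    return X[rows], y[rows], sample_id[rows]


# Toy stand-in for the corpus: 10 documents, 4 features, 3 categories.
# Every feature of document i holds the value i, so alignment is easy to check.
ids = np.arange(10)
X_toy = csr_matrix(np.tile(ids[:, None], (1, 4)).astype(float))
y_toy = csr_matrix((ids[:, None] % np.array([2, 3, 5]) == 0).astype(int))

X_tr, y_tr, id_tr = select_subset(X_toy, y_toy, ids, subset="train", n_train=3)
X_te, y_te, id_te = select_subset(X_toy, y_toy, ids, subset="test", n_train=3)

# Shuffle all three arrays with one shared permutation so rows stay aligned.
X_sh, y_sh, id_sh = shuffle(X_te, y_te, id_te, random_state=0)
```

Passing the feature matrix, the target matrix, and the sample IDs to a single shuffle call is what keeps each document's features, labels, and ID on the same row.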

Usage

Use this function to load the RCV1 dataset for large-scale text classification experiments, evaluating multi-label classification algorithms, or benchmarking linear models on high-dimensional sparse data.
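As an illustration of the multi-label use case, the sketch below trains a one-vs-rest linear classifier on a small synthetic dataset that mimics the RCV1 layout (sparse features, binary indicator targets). The data, sizes, and choice of SGDClassifier are assumptions made for the example; they are not part of fetch_rcv1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in with the same layout as RCV1: sparse non-negative
# features and a binary indicator matrix with one column per category.
n_samples, n_features, n_labels = 300, 50, 4
mask = rng.random((n_samples, n_features)) < 0.1
dense_X = rng.random((n_samples, n_features)) * mask
X = csr_matrix(dense_X)

# Label j is set when feature j is non-zero, so each label is learnable.
y = csr_matrix((dense_X[:, :n_labels] > 0).astype(int))

# One binary linear classifier is fitted per category column.
clf = OneVsRestClassifier(SGDClassifier(loss="hinge", random_state=0))
# The small toy target is densified here for simplicity.
clf.fit(X, y.toarray())

pred = clf.predict(X)
print("Prediction matrix shape:", pred.shape)
```

With the real corpus, X and y would come from fetch_rcv1(return_X_y=True) instead of the synthetic arrays.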

Code Reference

Source Location

Repository: scikit-learn
File: sklearn/datasets/_rcv1.py

Signature

def fetch_rcv1(
    *,
    data_home=None,
    subset="all",
    download_if_missing=True,
    random_state=None,
    shuffle=False,
    return_X_y=False,
    n_retries=3,
    delay=1.0,
)
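The leading * in the signature makes every parameter keyword-only. A quick way to confirm this on an installed scikit-learn, without triggering a download, is to inspect the signature. The exact parameter list printed depends on the installed version.

```python
import inspect

from sklearn.datasets import fetch_rcv1

# Inspecting the signature does not download or load any data.
params = inspect.signature(fetch_rcv1).parameters

# Collect the parameters that can only be passed by keyword.
keyword_only = [
    name
    for name, param in params.items()
    if param.kind is inspect.Parameter.KEYWORD_ONLY
]
print(keyword_only)
```

In practice this means calls should be written as fetch_rcv1(subset="train") rather than with positional arguments.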

Import

from sklearn.datasets import fetch_rcv1

I/O Contract

Inputs

Name	Type	Required	Description
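data_home	str or PathLike or None	No	Custom directory for downloading and caching the data (default: None, which uses the '~/scikit_learn_data' folder)
subset	str	No	Subset to load: 'train', 'test', or 'all' (default: 'all')
download_if_missing	bool	No	Whether to download the data if it is not cached locally; if False, an error is raised when the data is unavailable (default: True)
random_state	int, RandomState instance or None	No	Random state for reproducible shuffling; only relevant when shuffle=True (default: None)
shuffle	bool	No	Whether to shuffle the dataset (default: False)
return_X_y	bool	No	If True, return a (data, target) tuple instead of a Bunch (default: False)
n_retries	int	No	Number of retries when HTTP errors are encountered (default: 3)
delay	float	No	Number of seconds between retries (default: 1.0)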

Outputs

Name	Type	Description
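dataset	Bunch	Dictionary-like object returned by default, with attributes data (sparse feature matrix), target (sparse multilabel indicator matrix, one column per category), sample_id (document IDs, row-aligned with data), target_names (category names, column-aligned with target), and DESCR (dataset description)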
(X, y)	tuple	Returned when return_X_y=True; sparse feature matrix and sparse target matrix
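Because the target is a sparse indicator matrix, per-category document counts and matrix density can be computed without densifying anything. The sketch below uses a toy target; category_counts and density are hypothetical helpers, and the three category names are placeholders chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def category_counts(target, target_names):
    """Hypothetical helper: number of documents tagged with each category."""
    # Summing a sparse indicator matrix over rows gives per-column counts.
    counts = np.asarray(target.sum(axis=0)).ravel()
    return {str(name): int(count) for name, count in zip(target_names, counts)}


def density(matrix):
    """Fraction of stored (non-zero) entries in a sparse matrix."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


# Toy target: 4 documents, 3 categories (names are placeholders).
toy_target = csr_matrix(np.array([[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 1]]))
toy_names = np.array(["C15", "ECAT", "GCAT"])

counts = category_counts(toy_target, toy_names)
toy_density = density(toy_target)
print(counts, toy_density)
```

The same helpers could be applied to the target and target_names attributes of the returned Bunch, assuming they have the row and column alignment described in the table above.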

Usage Examples

Basic Usage

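from sklearn.datasets import fetch_rcv1

# The first call downloads and caches the corpus; later calls read the cache.
rcv1 = fetch_rcv1(subset='train')

# data is the sparse feature matrix; target is the sparse indicator matrix.
print("Feature matrix shape:", rcv1.data.shape)
print("Target matrix shape:", rcv1.target.shape)
print("Number of categories:", len(rcv1.target_names))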

Related Pages

Principle:Scikit_learn_Scikit_learn_Dataset_Loading
