Implementation:Scikit learn Scikit learn FetchRcv1
| Knowledge Sources | |
|---|---|
| Domains | Data Loading, Text Classification |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
A concrete tool, provided by scikit-learn, for fetching the RCV1 (Reuters Corpus Volume I) text classification dataset.
Description
This module implements the fetch_rcv1 function, which downloads the RCV1 dataset on first use and loads it from the local cache afterwards. RCV1 is a benchmark corpus for text categorization research: it contains 804,414 Reuters newswire stories, each labeled with one or more of 103 topic categories. The feature vectors are precomputed, cosine-normalized log TF-IDF representations with 47,236 dimensions; the source files are distributed in svmlight format and loaded as sparse matrices. The function can return the training subset, the test subset, or the full dataset, with optional shuffling.
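The subset selection can be pictured as a row slice over the full, unshuffled corpus. The sketch below assumes the chronological split documented by scikit-learn (the first 23,149 samples form the training set, the remaining 781,265 the test set). `subset_rows` and `load_subset` are hypothetical helper names used for illustration only; they are not part of the library and do not reproduce its internal code.

```python
# Split sizes as documented by scikit-learn for RCV1 (treated as constants here).
N_SAMPLES = 804_414
N_TRAIN = 23_149


def subset_rows(subset="all"):
    """Return the row slice of the unshuffled corpus for a subset name.

    Hypothetical helper: it only illustrates where the train/test boundary lies.
    """
    if subset == "train":
        return slice(0, N_TRAIN)
    if subset == "test":
        return slice(N_TRAIN, N_SAMPLES)
    if subset == "all":
        return slice(0, N_SAMPLES)
    raise ValueError(f"Unknown subset {subset!r}; expected 'train', 'test' or 'all'")


def load_subset(subset="train", data_home=None):
    """Load one subset with fetch_rcv1 (large download on first use)."""
    # Imported lazily so that defining this helper does not require scikit-learn.
    from sklearn.datasets import fetch_rcv1

    return fetch_rcv1(subset=subset, data_home=data_home)
```

Calling `load_subset("train")` would return the same rows that `subset_rows("train")` describes, provided shuffling is left off.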
Usage
Use this function to load the RCV1 dataset for large-scale text classification experiments, evaluating multi-label classification algorithms, or benchmarking linear models on high-dimensional sparse data.
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/datasets/_rcv1.py
Signature
def fetch_rcv1(
    *,
    data_home=None,
    subset="all",
    download_if_missing=True,
    random_state=None,
    shuffle=False,
    return_X_y=False,
    n_retries=3,
    delay=1.0,
)
Import
from sklearn.datasets import fetch_rcv1
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
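| data_home | str or path-like or None | No | Directory in which to download and cache the dataset (default: None, which uses the scikit-learn data directory) |
| subset | str | No | Subset to load: 'train', 'test', or 'all' (default: 'all') |
| download_if_missing | bool | No | Whether to download the data if it is not cached locally; if False and the data is missing, an OSError is raised (default: True) |
| random_state | int, RandomState instance or None | No | Seed or generator controlling the shuffle; pass an int for reproducible results. Only used when shuffle=True (default: None) |
| shuffle | bool | No | Whether to shuffle the dataset (default: False) |
| return_X_y | bool | No | If True, return a (data, target) tuple instead of a Bunch (default: False) |
| n_retries | int | No | Number of retries when HTTP errors are encountered during download (default: 3) |
| delay | float | No | Number of seconds to wait between retries (default: 1.0) |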
Outputs
| Name | Type | Description |
|---|---|---|
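| dataset | Bunch | Returned when return_X_y=False. Dictionary-like object with data (sparse feature matrix, one row per document), target (sparse multi-label indicator matrix, one column per category), sample_id (document identifiers), target_names (category names), and DESCR (dataset description) |
| (X, y) | tuple | Returned when return_X_y=True; the sparse feature matrix and the sparse target indicator matrix |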
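Because the target is a sparse multi-label indicator matrix, a single-category experiment first needs one column pulled out as an ordinary label vector. The following is a minimal sketch of that step; `binary_labels` is a hypothetical helper name, and it assumes the category names are available from the Bunch's target_names field (the (X, y) tuple does not carry them).

```python
import numpy as np
from scipy import sparse


def binary_labels(y, target_names, category):
    """Return a dense 0/1 vector for one category of a sparse indicator matrix.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    names = list(target_names)
    col = names.index(category)  # raises ValueError if the category is unknown
    # Index with a list so the result stays two-dimensional before flattening.
    return y[:, [col]].toarray().ravel().astype(int)
```

With the real dataset, this could be used as `binary_labels(rcv1.target, rcv1.target_names, "CCAT")`, where `rcv1` is the Bunch returned by `fetch_rcv1` and "CCAT" is one of the top-level RCV1 topic codes.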
Usage Examples
Basic Usage
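from sklearn.datasets import fetch_rcv1

# Downloads the corpus on the first call, then reads it from the local cache.
rcv1 = fetch_rcv1(subset='train')

# data and target are sparse matrices with one row per document.
print("Feature matrix shape:", rcv1.data.shape)
print("Target matrix shape:", rcv1.target.shape)
print("Number of categories:", len(rcv1.target_names))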
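For the linear-model benchmarking use case, a baseline can be fitted directly on the sparse feature matrix without densifying it. The sketch below is one possible setup, not a recommended benchmark protocol: `fit_linear_baseline` is a hypothetical helper, the choice of `SGDClassifier` with hinge loss is an assumption, and a tiny synthetic matrix stands in for RCV1 so the example runs without downloading the corpus.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDClassifier


def fit_linear_baseline(X, y_binary, random_state=0):
    """Fit a linear classifier on a sparse feature matrix and 0/1 labels.

    Hypothetical helper for illustration; the estimator choice is an assumption.
    """
    clf = SGDClassifier(loss="hinge", random_state=random_state)
    clf.fit(X, y_binary)  # accepts scipy sparse input directly
    return clf


# Synthetic stand-in with the same layout as RCV1 features (sparse rows).
X_demo = sparse.csr_matrix(
    np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
)
y_demo = np.array([1, 1, 0, 0])  # labels for a single, made-up category

clf_demo = fit_linear_baseline(X_demo, y_demo)
print("Learned coefficient shape:", clf_demo.coef_.shape)
```

With the real data, `X` would be `rcv1.data` and `y_binary` a 0/1 vector derived from one column of `rcv1.target`.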