Implementation:Scikit learn Scikit learn Fetch20Newsgroups

From Leeroopedia


Knowledge Sources
Domains Data Loading, Text Classification
Last Updated 2026-02-08 15:00 GMT

Overview

A concrete scikit-learn tool for fetching the 20 Newsgroups text classification dataset.

Description

This module provides two functions for loading the 20 Newsgroups dataset: fetch_20newsgroups returns the raw text documents, and fetch_20newsgroups_vectorized returns pre-vectorized token-count features (built with CountVectorizer defaults and scaled to unit norm unless normalize=False). The dataset contains around 18,000 newsgroup posts spread across 20 newsgroups. The train and test sets are split "by date", that is, by whether a post was made before or after a specific date. Both loaders can strip headers, footers, and quotes from the documents; fetch_20newsgroups additionally supports category filtering.

Usage

Use fetch_20newsgroups for text classification, topic modeling, or NLP experiments where you want to control feature extraction yourself. Use fetch_20newsgroups_vectorized when you need ready-made token-count features without running your own feature extraction pipeline.

Code Reference

Source Location

Signature

def fetch_20newsgroups(
    *,
    data_home=None,
    subset="train",
    categories=None,
    shuffle=True,
    random_state=42,
    remove=(),
    download_if_missing=True,
    return_X_y=False,
    n_retries=3,
    delay=1.0,
)

def fetch_20newsgroups_vectorized(
    *,
    subset="train",
    remove=(),
    data_home=None,
    download_if_missing=True,
    return_X_y=False,
    normalize=True,
    as_frame=False,
    n_retries=3,
    delay=1.0,
)

Import

from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized

I/O Contract

Inputs

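Name | Type | Required | Description
data_home | str or None | No | Custom directory for downloading and caching the data (default: None)
subset | str | No | Subset to load: 'train', 'test', or 'all' (default: 'train')
categories | list or None | No | fetch_20newsgroups only: list of newsgroup names to load; None for all (default: None)
shuffle | bool | No | fetch_20newsgroups only: whether to shuffle the data (default: True)
random_state | int | No | fetch_20newsgroups only: random state for reproducible shuffling (default: 42)
remove | tuple | No | Parts to strip: any of 'headers', 'footers', 'quotes' (default: ())
download_if_missing | bool | No | If False, raise OSError when the data is not available locally instead of downloading it (default: True)
return_X_y | bool | No | If True, return a (data, target) tuple instead of a Bunch (default: False)
normalize | bool | No | For vectorized: scale each document's feature vector to unit norm (default: True)
as_frame | bool | No | For vectorized: return the data as a pandas DataFrame (default: False)
n_retries | int | No | Number of retries when the download hits HTTP errors (default: 3)
delay | float | No | Seconds to wait between retries (default: 1.0)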
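
The caching-related parameters from the signature (data_home, download_if_missing) can be combined to load the dataset only when it is already on disk. The sketch below is one way to do that; the helper name load_if_cached is hypothetical, and the approach assumes fetch_20newsgroups raises OSError when the cache is missing and downloading is disabled.

```python
from sklearn.datasets import fetch_20newsgroups


def load_if_cached(data_home, **kwargs):
    """Return the dataset only if it is already cached under data_home, else None."""
    try:
        return fetch_20newsgroups(
            data_home=data_home,
            download_if_missing=False,  # never touch the network
            **kwargs,
        )
    except OSError:
        # Raised when the cache is missing and downloading is disabled.
        return None
```

When downloads are allowed, n_retries and delay control how often and how quickly a failed download is retried, for example fetch_20newsgroups(n_retries=5, delay=2.0).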

Outputs

Name | Type | Description
data | Bunch | Dictionary-like object with data (list of texts, or a sparse matrix for the vectorized loader), target, target_names, DESCR, and filenames (raw-text loader only)
(X, y) | tuple | Returned when return_X_y=True; documents/features and target labels
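
To illustrate how target and target_names relate, the sketch below maps the integer labels back to category names and counts documents per category. The helper names documents_per_category and demo_on_real_data are hypothetical; calling demo_on_real_data downloads the dataset on first use.

```python
from collections import Counter


def documents_per_category(target, target_names):
    """Count documents per category name from integer labels."""
    counts = Counter(target_names[int(label)] for label in target)
    return dict(counts)


def demo_on_real_data():
    """Hypothetical demo; calling it downloads the dataset on first use."""
    from sklearn.datasets import fetch_20newsgroups

    categories = ["alt.atheism", "sci.space"]
    bunch = fetch_20newsgroups(subset="train", categories=categories)
    print(documents_per_category(bunch.target, bunch.target_names))

    # With return_X_y=True the Bunch is skipped and a
    # (documents, labels) tuple is returned instead.
    texts, labels = fetch_20newsgroups(
        subset="train", categories=categories, return_X_y=True
    )
    print(len(texts), len(labels))
```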

Usage Examples

Basic Usage

from sklearn.datasets import fetch_20newsgroups

# Load only two categories
categories = ['alt.atheism', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers',))
print("Number of documents:", len(newsgroups.data))
print("Categories:", newsgroups.target_names)
print("First document:", newsgroups.data[0][:100])
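
Text Classification Sketch

Building on the loading step above, a classifier could be trained on the 'train' subset and scored on the 'test' subset. This is a minimal sketch, not a recommended model: the choice of TfidfVectorizer with MultinomialNB and the helper names build_text_classifier and evaluate_on_newsgroups are assumptions made for illustration. Calling evaluate_on_newsgroups downloads the dataset on first use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_text_classifier():
    """TF-IDF features followed by multinomial naive Bayes (an illustrative choice)."""
    return make_pipeline(TfidfVectorizer(), MultinomialNB())


def evaluate_on_newsgroups(categories, remove=("headers", "footers", "quotes")):
    """Train on the 'train' subset and return accuracy on the 'test' subset."""
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset="train", categories=categories, remove=remove)
    test = fetch_20newsgroups(subset="test", categories=categories, remove=remove)
    model = build_text_classifier().fit(train.data, train.target)
    return model.score(test.data, test.target)
```

For example, evaluate_on_newsgroups(['alt.atheism', 'sci.space']) would report test accuracy on the two categories used above. Stripping headers, footers, and quotes keeps the model from leaning on metadata rather than the message text.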
