Implementation:Scikit learn Scikit learn Fetch20Newsgroups

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Data Loading, Text Classification
Last Updated	2026-02-08 15:00 GMT

Overview

Concrete tool for fetching the 20 Newsgroups text classification dataset provided by scikit-learn.

Description

This module provides two functions for loading the 20 Newsgroups dataset: fetch_20newsgroups returns raw text documents, and fetch_20newsgroups_vectorized returns pre-vectorized features. The vectorized features are token counts produced by CountVectorizer with default settings, scaled to unit norm per document unless normalize=False; they are not TF-IDF weights. The dataset contains roughly 18,000 newsgroup posts spread across 20 newsgroups. The train and test sets are split by date (posts before and after a fixed date). Both functions can optionally strip headers, footers, and quotes from the documents, and fetch_20newsgroups also supports category filtering.
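As a rough illustration of the kind of features the vectorized loader returns, the sketch below applies CountVectorizer with default settings and then sklearn.preprocessing.normalize to a tiny in-memory corpus. The corpus and the helper name vectorize_like_20news are illustrative assumptions, not part of scikit-learn; the real loader works on the full dataset and caches its result.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize


def vectorize_like_20news(docs, normalize_rows=True):
    """Hypothetical helper: token counts, optionally scaled to unit L2 norm per document."""
    vectorizer = CountVectorizer()           # default tokenization, no stop-word filtering
    X = vectorizer.fit_transform(docs)       # sparse matrix of raw token counts
    if normalize_rows:
        X = normalize(X.astype(np.float64))  # rescale each row to unit L2 norm
    return X, vectorizer


# Toy stand-in for newsgroup posts (not the real dataset)
toy_docs = [
    "the shuttle launch was delayed",
    "orbit and launch windows for the shuttle",
    "the debate about belief continues",
]
X_toy, toy_vectorizer = vectorize_like_20news(toy_docs)
print(X_toy.shape)
```

With normalize_rows=False the helper returns the raw integer counts, which mirrors the effect of passing normalize=False to the loader.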

Usage

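Use fetch_20newsgroups for text classification, topic modeling, or other NLP experiments that start from raw text. Use fetch_20newsgroups_vectorized when you need ready-made numeric features (normalized token counts) without running your own feature extraction pipeline. For TF-IDF weighting, stop-word filtering, or n-grams, combine fetch_20newsgroups with TfidfVectorizer or another extractor.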
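One way to pair the raw-text loader with your own feature extraction is sketched below. The helper names build_text_classifier and train_on_newsgroups, and the choice of TfidfVectorizer with MultinomialNB, are assumptions made for illustration; any extractor and estimator could be substituted.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_text_classifier():
    """Illustrative pipeline: TF-IDF features followed by multinomial naive Bayes."""
    return make_pipeline(TfidfVectorizer(), MultinomialNB())


def train_on_newsgroups(categories):
    """Fit on the train split and score on the test split.

    Downloads the dataset on first use if it is not cached locally.
    """
    remove = ("headers", "footers", "quotes")
    train = fetch_20newsgroups(subset="train", categories=categories, remove=remove)
    test = fetch_20newsgroups(subset="test", categories=categories, remove=remove)
    model = build_text_classifier().fit(train.data, train.target)
    return model, model.score(test.data, test.target)
```

Calling train_on_newsgroups(["alt.atheism", "sci.space"]) returns the fitted pipeline and its accuracy on the test split. Stripping headers, footers, and quotes makes the task harder but keeps the model from keying on metadata.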

Code Reference

Source Location

Repository: scikit-learn
File: sklearn/datasets/_twenty_newsgroups.py

Signature

def fetch_20newsgroups(
    *,
    data_home=None,
    subset="train",
    categories=None,
    shuffle=True,
    random_state=42,
    remove=(),
    download_if_missing=True,
    return_X_y=False,
    n_retries=3,
    delay=1.0,
)

def fetch_20newsgroups_vectorized(
    *,
    subset="train",
    remove=(),
    data_home=None,
    download_if_missing=True,
    return_X_y=False,
    normalize=True,
    as_frame=False,
    n_retries=3,
    delay=1.0,
)

Import

from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized

I/O Contract

Inputs

Name	Type	Required	Description
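data_home	str, path-like, or None	No	Directory for downloading and caching the dataset; None uses the default scikit-learn data directory (default: None)
subset	str	No	Subset to load: 'train', 'test', or 'all' (default: 'train')
categories	list of str or None	No	Raw-text loader only: newsgroup names to load; None loads all 20 (default: None)
shuffle	bool	No	Raw-text loader only: whether to shuffle the data (default: True)
random_state	int, RandomState, or None	No	Raw-text loader only: seed for reproducible shuffling (default: 42)
remove	tuple	No	Any subset of ('headers', 'footers', 'quotes') to strip from each post (default: ())
download_if_missing	bool	No	If False, raise OSError when the data is not available locally instead of downloading it (default: True)
return_X_y	bool	No	If True, return a (data, target) tuple instead of a Bunch (default: False)
normalize	bool	No	Vectorized loader only: scale each document's count vector to unit norm (default: True)
as_frame	bool	No	Vectorized loader only: return the data as pandas objects (default: False)
n_retries	int	No	Number of retries when the download hits HTTP errors (default: 3)
delay	float	No	Seconds to wait between retries (default: 1.0)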
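The download-related parameters can be combined to load the dataset only when a local copy already exists. The sketch below is a minimal example under that assumption; the helper name load_if_cached is hypothetical, and the temporary directory stands in for a data_home that holds no cached copy.

```python
import tempfile

from sklearn.datasets import fetch_20newsgroups


def load_if_cached(data_home=None, **kwargs):
    """Return the dataset Bunch if it is cached locally, else None (never downloads)."""
    try:
        return fetch_20newsgroups(
            data_home=data_home, download_if_missing=False, **kwargs
        )
    except OSError:
        # Raised when the data is not available locally and downloading is disabled
        return None


# An empty temporary directory has no cached copy, so nothing is loaded
with tempfile.TemporaryDirectory() as empty_home:
    result = load_if_cached(data_home=empty_home, subset="train")
print("cached" if result is not None else "not cached")
```

This pattern is useful in offline environments or test suites where a network download should never be triggered implicitly.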

Outputs

Name	Type	Description
data	Bunch	Dictionary-like object with data (list of str for the raw-text loader, sparse matrix for the vectorized loader), target, target_names, and DESCR; the raw-text loader also provides filenames
(X, y)	tuple	Returned when return_X_y=True; documents/features and target labels
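To show how the Bunch fields fit together, the sketch below counts documents per category using target and target_names. The Bunch here is a hand-built stand-in with made-up contents, and summarize_bunch is a hypothetical helper; passing the object returned by fetch_20newsgroups should work the same way.

```python
import numpy as np
from sklearn.utils import Bunch


def summarize_bunch(bunch):
    """Count documents per category name in a Bunch shaped like the loader's output."""
    counts = np.bincount(bunch.target, minlength=len(bunch.target_names))
    return {name: int(n) for name, n in zip(bunch.target_names, counts)}


# Hand-built stand-in with the same fields as the raw-text loader's Bunch (not real data)
toy = Bunch(
    data=["post about orbits", "post about belief", "another orbit post"],
    target=np.array([1, 0, 1]),
    target_names=["alt.atheism", "sci.space"],
)
summary = summarize_bunch(toy)
print(summary)
```

Each integer in target indexes into target_names. With return_X_y=True only the data and target are returned, so the category names have to be obtained from a separate call that returns the Bunch.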

Usage Examples

Basic Usage

from sklearn.datasets import fetch_20newsgroups

# Load only two categories
categories = ['alt.atheism', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers',))
print("Number of documents:", len(newsgroups.data))
print("Categories:", newsgroups.target_names)
print("First document:", newsgroups.data[0][:100])

Related Pages

Principle:Scikit_learn_Scikit_learn_Dataset_Loading
