Implementation:Scikit learn Scikit learn Fetch20Newsgroups
| Knowledge Sources | |
|---|---|
| Domains | Data Loading, Text Classification |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Concrete tool for fetching the 20 Newsgroups text classification dataset provided by scikit-learn.
Description
This module provides two functions for loading the 20 Newsgroups dataset: fetch_20newsgroups returns the raw text documents, and fetch_20newsgroups_vectorized returns pre-vectorized features, built as token counts with the default CountVectorizer settings and scaled to unit norm by default. The version shipped with scikit-learn contains 18,846 newsgroup posts (11,314 train, 7,532 test) spread across 20 newsgroups. The train and test sets follow a "by date" split, so test posts were written after the training posts. Both functions can strip headers, footers, and quotes from the documents, and fetch_20newsgroups additionally supports category filtering.
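The train/test split and the stripping options can be combined in one small wrapper. The sketch below is illustrative only: load_train_test is a hypothetical helper, not part of scikit-learn, and its fetch parameter is an assumption added so the helper can be exercised with a stand-in loader instead of a download.

```python
from sklearn.datasets import fetch_20newsgroups


def load_train_test(
    categories=None,
    remove=("headers", "footers", "quotes"),
    fetch=fetch_20newsgroups,
):
    """Hypothetical helper: load both by-date subsets with identical filters.

    `fetch` defaults to the real loader; it is a parameter only so that a
    stand-in can be passed in when no download is wanted.
    """
    train = fetch(subset="train", categories=categories, remove=remove)
    test = fetch(subset="test", categories=categories, remove=remove)
    return train, test
```

Calling load_train_test(categories=["alt.atheism", "sci.space"]) with the default loader downloads and caches the dataset on first use. Applying the same remove tuple to both subsets keeps the training and evaluation text comparable.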
Usage
Use fetch_20newsgroups for text classification, topic modeling, or other NLP experiments that start from raw text. Use fetch_20newsgroups_vectorized when you need ready-made token-count features without running your own feature extraction pipeline; pass normalize=False to get raw counts instead of unit-norm vectors.
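When the ready-made features do not fit, the raw text can go through your own feature extraction pipeline. The following is a minimal sketch of such a pipeline, assuming a TF-IDF representation and a logistic regression classifier; the four documents and their labels are made-up stand-ins for newsgroups.data and newsgroups.target so that the example runs without downloading anything.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for newsgroups.data and newsgroups.target (hypothetical content).
docs = [
    "The shuttle launch was delayed by weather",
    "NASA plans a new orbit for the satellite",
    "The goalie stopped every shot in the playoff game",
    "The team traded its best defenseman before the season",
]
labels = [0, 0, 1, 1]  # 0 = space, 1 = hockey

# Turn raw text into a sparse TF-IDF matrix with one row per document.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# TfidfVectorizer applies L2 normalization by default, so each row has unit norm.
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

# Chain vectorizer and classifier so new text is transformed the same way.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(docs, labels)
predicted = model.predict(["The satellite reached a stable orbit"])
print("Matrix shape:", X.shape)
print("Predicted label:", predicted[0])
```

With the real dataset, docs and labels would be replaced by the data and target fields returned by fetch_20newsgroups.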
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/datasets/_twenty_newsgroups.py
Signature
def fetch_20newsgroups(
    *,
    data_home=None,
    subset="train",
    categories=None,
    shuffle=True,
    random_state=42,
    remove=(),
    download_if_missing=True,
    return_X_y=False,
    n_retries=3,
    delay=1.0,
)

def fetch_20newsgroups_vectorized(
    *,
    subset="train",
    remove=(),
    data_home=None,
    download_if_missing=True,
    return_X_y=False,
    normalize=True,
    as_frame=False,
    n_retries=3,
    delay=1.0,
)
Import
from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
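| data_home | str, path-like, or None | No | Custom directory for downloading and caching the data (default: None) |
| subset | str | No | Subset to load: 'train', 'test', or 'all' (default: 'train') |
| categories | list or None | No | fetch_20newsgroups only: list of newsgroup names to load; None for all (default: None) |
| shuffle | bool | No | fetch_20newsgroups only: whether to shuffle the data (default: True) |
| random_state | int, RandomState, or None | No | fetch_20newsgroups only: seed for reproducible shuffling (default: 42) |
| remove | tuple | No | Parts to strip, any of 'headers', 'footers', 'quotes' (default: ()) |
| download_if_missing | bool | No | If False, raise an error instead of downloading when the data is not available locally (default: True) |
| return_X_y | bool | No | If True, return a (data, target) tuple instead of a Bunch (default: False) |
| normalize | bool | No | Vectorized only: scale each document's feature vector to unit norm (default: True) |
| as_frame | bool | No | Vectorized only: return the data as a pandas DataFrame (default: False) |
| n_retries | int | No | Number of retries when HTTP errors occur during download (default: 3) |
| delay | float | No | Seconds to wait between retries (default: 1.0) |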
Outputs
| Name | Type | Description |
|---|---|---|
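| bunch | Bunch | Dictionary-like object with data (list of str for fetch_20newsgroups, sparse matrix for the vectorized loader), target, target_names, and DESCR; fetch_20newsgroups also provides filenames, and the vectorized loader adds frame when as_frame=True |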
| (X, y) | tuple | Returned when return_X_y=True; documents/features and target labels |
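The returned Bunch supports both attribute access and dictionary-style key access. The sketch below illustrates this with a hand-built Bunch holding toy values, so it runs without downloading the dataset; the documents and labels are made up, and the real loader returns target as a NumPy array, not a list.

```python
from sklearn.utils import Bunch

# Hand-built stand-in for the object returned by fetch_20newsgroups (toy values).
newsgroups = Bunch(
    data=["Is there life on Mars?", "Who won the game last night?"],
    target=[1, 0],
    target_names=["rec.sport.hockey", "sci.space"],
)

# Attribute access and key access return the same object.
first_doc = newsgroups.data[0]
same_object = newsgroups["data"] is newsgroups.data

# target holds integer indices into target_names.
first_label = newsgroups.target_names[newsgroups.target[0]]
print(first_doc, "->", first_label)

# With return_X_y=True the loader returns the equivalent of this tuple.
X, y = newsgroups.data, newsgroups.target
```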
Usage Examples
Basic Usage
from sklearn.datasets import fetch_20newsgroups
# Load only two categories
categories = ['alt.atheism', 'sci.space']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers',))
print("Number of documents:", len(newsgroups.data))
print("Categories:", newsgroups.target_names)
print("First document:", newsgroups.data[0][:100])