Implementation:Datajuicer Data juicer DiversityAnalysis

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Analysis
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for analyzing the lexical diversity of text data.

Description

DiversityAnalysis measures the lexical diversity of text data by extracting root-verb and direct-object-noun pairs from each sample and summarizing the most frequent verb-noun combinations. A spaCy language model parses each sample's text into a dependency tree, and find_root_verb_and_its_dobj extracts the root verb and its direct object from that tree. get_diversity then aggregates the pairs: it groups by verb, selects the top-k verbs, and for each of those verbs keeps the top-k co-occurring nouns. The results are exported as CSV and Markdown files. The approach is inspired by the Self-Instruct methodology.
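The extraction step can be illustrated without spaCy. The following is a minimal sketch, not the Data-Juicer implementation: the Node class and the find_verb_and_dobj function are hypothetical stand-ins for a parsed token and for find_root_verb_and_its_dobj, and the dependency tree is written by hand rather than produced by a parser.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Toy stand-in for a parsed token (hypothetical, not a spaCy type)."""
    lemma: str
    pos: str   # coarse part of speech, e.g. "VERB" or "NOUN"
    dep: str   # dependency label relative to the parent, e.g. "dobj"
    children: List["Node"] = field(default_factory=list)


def find_verb_and_dobj(node: Node) -> Tuple[Optional[str], Optional[str]]:
    """Return (verb, noun) for the first verb found walking down from node."""
    if node.pos == "VERB":
        for child in node.children:
            if child.dep == "dobj" and child.pos == "NOUN":
                return node.lemma, child.lemma
        return node.lemma, None  # a verb with no direct-object noun
    for child in node.children:
        verb, noun = find_verb_and_dobj(child)
        if verb is not None:
            return verb, noun
    return None, None  # no verb anywhere in this subtree


# Hand-built tree for "Write a poem about the sea."
sentence = Node("write", "VERB", "ROOT", [
    Node("poem", "NOUN", "dobj", [Node("a", "DET", "det")]),
    Node("about", "ADP", "prep", [Node("sea", "NOUN", "pobj")]),
])
pair = find_verb_and_dobj(sentence)
print(pair)  # ('write', 'poem')
```

Only the direct object counts as the noun here, so "sea" (an object of the preposition) is ignored even though it is also a noun.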

Usage

Use when you need to measure and understand the variety of actions and topics in training data, particularly for instruction-following datasets where verb-noun diversity is an indicator of data quality.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/analysis/diversity_analysis.py

Signature

class DiversityAnalysis:
    def __init__(self, dataset, output_path, lang_or_model="en"):
        ...

    def compute(self, lang_or_model=None, column_name="text"):
        ...

    def analyze(self, lang_or_model=None, column_name="text",
                postproc_func=get_diversity, **postproc_kwarg):
        ...

Import

from data_juicer.analysis.diversity_analysis import DiversityAnalysis

I/O Contract

Inputs

Name	Type	Required	Description
dataset	Dataset	Yes	The dataset to be analyzed
output_path	str	Yes	Path to store the analysis results (CSV and Markdown)
lang_or_model	str or spacy.Language	No	A language code (e.g. "en") used to load the spaCy model, or an already loaded spaCy model. Default: "en"
column_name	str	No	The name of the text column to analyze. Default: "text"
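top_k_verbs	int	No	Number of most frequent verbs to keep. Passed to analyze as a keyword argument and forwarded to postproc_func (get_diversity by default). Default: 20
top_k_nouns	int	No	Number of most frequent nouns to keep for each retained verb. Passed to analyze as a keyword argument and forwarded to postproc_func (get_diversity by default). Default: 4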
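
The effect of top_k_verbs and top_k_nouns can be sketched in plain Python. This is an illustration under stated assumptions, not the get_diversity code itself: the function name top_verb_noun_pairs and the sample pairs are hypothetical, collections.Counter stands in for the DataFrame grouping, and dropping pairs with a missing verb or noun before counting is assumed.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def top_verb_noun_pairs(
    pairs: List[Tuple[Optional[str], Optional[str]]],
    top_k_verbs: int = 20,
    top_k_nouns: int = 4,
) -> List[Dict[str, object]]:
    """Keep the top_k_nouns most frequent nouns for each of the top_k_verbs verbs."""
    # Assumption: samples without both a verb and a noun are ignored.
    valid = [(v, n) for v, n in pairs if v is not None and n is not None]

    # Rank verbs by how many valid pairs they appear in.
    verb_counts = Counter(v for v, _ in valid)
    top_verbs = [v for v, _ in verb_counts.most_common(top_k_verbs)]

    # Count each (verb, noun) pair, then keep the top nouns per retained verb.
    pair_counts = Counter(valid)
    rows: List[Dict[str, object]] = []
    for verb in top_verbs:
        nouns = Counter({n: c for (v, n), c in pair_counts.items() if v == verb})
        for noun, count in nouns.most_common(top_k_nouns):
            rows.append({"verb": verb, "noun": noun, "count": count})
    return rows


# Hypothetical extracted pairs, including two samples with missing parts.
pairs = (
    [("write", "poem")] * 3
    + [("write", "essay")] * 2
    + [("write", "story"), ("give", "example"), ("give", "example"),
       ("explain", "concept"), ("run", None), (None, None)]
)
rows = top_verb_noun_pairs(pairs, top_k_verbs=2, top_k_nouns=2)
for row in rows:
    print(row)
```

With top_k_verbs=2 the verb "explain" is dropped, and with top_k_nouns=2 the pair ("write", "story") is dropped, leaving three rows with verb, noun, and count fields.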

Outputs

Name	Type	Description
df	pandas.DataFrame	DataFrame containing the diversity analysis results with verb, noun, and count columns

Usage Examples

from data_juicer.analysis.diversity_analysis import DiversityAnalysis

# Analyze diversity of a dataset
analyzer = DiversityAnalysis(
    dataset=my_dataset,
    output_path="./analysis_output",
    lang_or_model="en"
)

# Run full analysis pipeline
diversity_df = analyzer.analyze(
    column_name="text",
    top_k_verbs=20,
    top_k_nouns=4
)
# Results saved to analysis_output/diversity.csv and diversity.md

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
