Implementation:Datajuicer Data juicer DiversityAnalysis

From Leeroopedia
Knowledge Sources
Domains Data_Analysis
Last Updated 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for analyzing the lexical diversity of text data.

Description

DiversityAnalysis measures the lexical diversity of text data by extracting pairs of root verb and direct-object noun from each sample and summarizing the most frequent combinations. It parses each sample's text into a dependency tree with a spaCy language model and uses find_root_verb_and_its_dobj to pull out the root verb and its direct object. get_diversity then aggregates the pairs: it groups them by verb, keeps the top-k verbs, and for each of those verbs keeps the top-k co-occurring nouns. The results are exported as CSV and Markdown files. The approach is inspired by the Self-Instruct methodology.
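The extraction step described above can be sketched without loading a spaCy model. The Token class below is a hypothetical stand-in that mimics only the spaCy token attributes the traversal needs (lemma_, pos_, dep_, children), and root_verb_and_dobj is an illustrative name, not the library's find_root_verb_and_its_dobj, whose exact behavior may differ. The "dobj" label assumes the dependency scheme used by spaCy's English models.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Token:
    """Hypothetical stand-in for a spaCy token (not the real spaCy class)."""
    lemma_: str
    pos_: str
    dep_: str
    children: List["Token"] = field(default_factory=list)


def root_verb_and_dobj(node: Token) -> Tuple[Optional[str], Optional[str]]:
    """Return (verb, noun) for the first verb found from `node` downward.

    The noun is None when the verb has no direct-object noun child.
    """
    if node.pos_ == "VERB":
        for child in node.children:
            if child.dep_ == "dobj" and child.pos_ == "NOUN":
                return node.lemma_, child.lemma_
        return node.lemma_, None
    # The current node is not a verb: search its subtrees depth-first.
    for child in node.children:
        verb, noun = root_verb_and_dobj(child)
        if verb is not None:
            return verb, noun
    return None, None


# Hand-built parse of "Write a poem." with "write" as the sentence root.
poem = Token("poem", "NOUN", "dobj", [Token("a", "DET", "det")])
root = Token("write", "VERB", "ROOT", [poem])
pair = root_verb_and_dobj(root)
print(pair)
```

With a real spaCy pipeline, the same traversal would start from the root token of each parsed sentence instead of a hand-built tree.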

Usage

Use when you need to measure and understand the variety of actions and topics in training data, particularly for instruction-following datasets where verb-noun diversity is an indicator of data quality.

Code Reference

Source Location

Signature

# Signatures only; method bodies are omitted.
class DiversityAnalysis:
    def __init__(self, dataset, output_path, lang_or_model="en"):
        ...

    def compute(self, lang_or_model=None, column_name="text"):
        ...

    def analyze(self, lang_or_model=None, column_name="text",
                postproc_func=get_diversity, **postproc_kwarg):
        ...

Import

from data_juicer.analysis.diversity_analysis import DiversityAnalysis

I/O Contract

Inputs

Name | Type | Required | Description
dataset | Dataset | Yes | The dataset to be analyzed
output_path | str | Yes | Path where the analysis results (CSV and Markdown) are stored
lang_or_model | str or spacy.Language | No | A language code (e.g. "en") used to load a spaCy model, or an already loaded spaCy model. Default: "en"
column_name | str | No | Name of the text column to analyze. Default: "text"
top_k_verbs | int | No | Number of most frequent verbs to keep; passed to get_diversity through analyze. Default: 20
top_k_nouns | int | No | Number of most frequent nouns to keep per verb; passed to get_diversity through analyze. Default: 4
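To make the meaning of top_k_verbs and top_k_nouns concrete, here is a minimal sketch of the two-level top-k selection using only the standard library. It is not the implementation of get_diversity (which returns a pandas DataFrame); the function name top_verb_noun_pairs, the sample pairs, and the choice to drop pairs with a missing verb or noun are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def top_verb_noun_pairs(
    pairs: Iterable[Tuple[Optional[str], Optional[str]]],
    top_k_verbs: int = 20,
    top_k_nouns: int = 4,
) -> List[Tuple[str, str, int]]:
    """Keep the top-k verbs, then the top-k nouns for each kept verb."""
    # Assumption: samples missing either the verb or the noun are dropped.
    complete = [(v, n) for v, n in pairs if v is not None and n is not None]
    verb_counts = Counter(v for v, _ in complete)
    rows = []
    for verb, _ in verb_counts.most_common(top_k_verbs):
        noun_counts = Counter(n for v, n in complete if v == verb)
        for noun, count in noun_counts.most_common(top_k_nouns):
            rows.append((verb, noun, count))
    return rows


# Made-up (verb, noun) pairs, as if extracted from seven samples.
sample = [
    ("write", "poem"), ("write", "poem"), ("write", "essay"),
    ("give", "example"), ("give", "example"),
    ("explain", "concept"), ("run", None),
]
rows = top_verb_noun_pairs(sample, top_k_verbs=2, top_k_nouns=1)
print(rows)  # [('write', 'poem', 2), ('give', 'example', 2)]
```

Here "explain" is cut by top_k_verbs=2 and "essay" is cut by top_k_nouns=1, which is the same pruning the two parameters control in the real analysis.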

Outputs

Name | Type | Description
df | pandas.DataFrame | DataFrame containing the diversity analysis results, with verb, noun, and count columns

Usage Examples

from data_juicer.analysis.diversity_analysis import DiversityAnalysis

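# Analyze the diversity of a dataset.
# `my_dataset` is assumed to be a dataset that has already been loaded.
analyzer = DiversityAnalysis(
    dataset=my_dataset,
    output_path="./analysis_output",
    lang_or_model="en",
)

# Run the full analysis pipeline; top_k_verbs and top_k_nouns are
# forwarded to the post-processing function (get_diversity by default).
diversity_df = analyzer.analyze(
    column_name="text",
    top_k_verbs=20,
    top_k_nouns=4,
)
# Results are saved to analysis_output/diversity.csv and analysis_output/diversity.md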
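Once the analysis has run, the exported CSV can be inspected without pandas. The helper below is a hypothetical convenience, not part of Data-Juicer; it assumes the file has verb, noun, and count columns as listed in the Outputs table and ignores any other columns.

```python
import csv
from typing import List, Tuple


def load_diversity_rows(csv_path: str) -> List[Tuple[str, str, int]]:
    """Read (verb, noun, count) rows, sorted by count from high to low."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Assumed column names: "verb", "noun", "count".
        rows = [(r["verb"], r["noun"], int(r["count"])) for r in reader]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

For the example above, it would be called as load_diversity_rows("./analysis_output/diversity.csv").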
