Implementation:Datajuicer Data juicer DiversityAnalysis
| Knowledge Sources | |
|---|---|
| Domains | Data_Analysis |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for analyzing the lexical diversity of text data.
Description
DiversityAnalysis measures the lexical diversity of text data by counting verb-noun pairs. It parses each sample's text into a dependency tree with a spaCy language model, extracts the root verb and its direct-object noun via find_root_verb_and_its_dobj, then aggregates the pairs with get_diversity, which groups by verb, keeps the top-k verbs, and keeps the top-k co-occurring nouns for each of those verbs. Results are exported as CSV and Markdown files. The approach is inspired by the Self-Instruct methodology.
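The extraction step can be illustrated with a simplified sketch. This is not the library's find_root_verb_and_its_dobj; it is a reduced version written for illustration that only inspects the sentence root and its immediate children. It is duck-typed on the attributes a spaCy Token exposes (pos_, dep_, lemma_, children), and FakeToken is a hypothetical stand-in so the example runs without downloading a spaCy model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FakeToken:
    """Hypothetical stand-in exposing the spaCy Token attributes used below."""
    lemma_: str
    pos_: str
    dep_: str
    children: List["FakeToken"] = field(default_factory=list)


def root_verb_and_dobj(root) -> Tuple[Optional[str], Optional[str]]:
    """Return (verb lemma, direct-object noun lemma) for a sentence root.

    Returns (None, None) when the root is not a verb, and (verb, None)
    when the verb has no direct-object noun among its children.
    """
    if root.pos_ != "VERB":
        return None, None
    for child in root.children:
        if child.dep_ == "dobj" and child.pos_ == "NOUN":
            return root.lemma_, child.lemma_
    return root.lemma_, None


# "Write a poem about spring": root verb "write", direct object "poem".
poem = FakeToken("poem", "NOUN", "dobj")
write = FakeToken("write", "VERB", "ROOT", children=[poem])
pair = root_verb_and_dobj(write)

# A root verb without a direct object yields (verb, None).
sleep = FakeToken("sleep", "VERB", "ROOT")
no_object = root_verb_and_dobj(sleep)
```

With a real spaCy pipeline, the sentence root would typically come from the parsed document (for example, the root attribute of each sentence span) rather than from hand-built tokens.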
Usage
Use when you need to measure and understand the variety of actions and topics in training data, particularly for instruction-following datasets where verb-noun diversity is an indicator of data quality.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/analysis/diversity_analysis.py
Signature
class DiversityAnalysis:
    def __init__(self, dataset, output_path, lang_or_model="en"):
    def compute(self, lang_or_model=None, column_name="text"):
    def analyze(self, lang_or_model=None, column_name="text",
                postproc_func=get_diversity, **postproc_kwarg):
Import
from data_juicer.analysis.diversity_analysis import DiversityAnalysis
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes | The dataset to be analyzed |
| output_path | str | Yes | Path to store the analysis results (CSV and Markdown) |
| lang_or_model | str or spacy.Language | No | A loaded spaCy model, or a language code (e.g. "en") used to load one. Default: "en" |
| column_name | str | No | The name of the text column to analyze. Default: "text" |
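| top_k_verbs | int | No | Number of most frequent verbs to keep. Passed to analyze as a keyword argument for the post-processing function. Default: 20 |
| top_k_nouns | int | No | Number of most frequent nouns to keep for each retained verb. Passed to analyze as a keyword argument for the post-processing function. Default: 4 |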
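The effect of top_k_verbs and top_k_nouns can be sketched with plain pandas. The function below is an illustrative approximation of the aggregation described for get_diversity, not the library's code; the name top_verb_noun_pairs and the sample data are hypothetical, and it assumes the extracted pairs arrive as a DataFrame with verb and noun columns.

```python
import pandas as pd


def top_verb_noun_pairs(pairs: pd.DataFrame,
                        top_k_verbs: int = 20,
                        top_k_nouns: int = 4) -> pd.DataFrame:
    """Keep the most frequent verbs, then the most frequent nouns per verb."""
    # Drop samples where no verb or no direct-object noun was found.
    valid = pairs.dropna(subset=["verb", "noun"])
    # Select the top_k_verbs most frequent verbs.
    top_verbs = valid["verb"].value_counts().nlargest(top_k_verbs).index
    # Count each (verb, noun) pair among the retained verbs.
    counts = (
        valid[valid["verb"].isin(top_verbs)]
        .groupby(["verb", "noun"])
        .size()
        .reset_index(name="count")
        .sort_values(["verb", "count"], ascending=[True, False])
    )
    # Within each verb, keep only the top_k_nouns most frequent nouns.
    return counts.groupby("verb").head(top_k_nouns).reset_index(drop=True)


# Hand-made pairs for illustration only.
pairs = pd.DataFrame({
    "verb": ["write", "write", "write", "give", "give", "make", None],
    "noun": ["poem", "poem", "story", "example", "example", "list", "idea"],
})
result = top_verb_noun_pairs(pairs, top_k_verbs=2, top_k_nouns=1)
```

Here "make" is dropped because it is not among the two most frequent verbs, and "story" is dropped because only one noun is kept per verb.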
Outputs
| Name | Type | Description |
|---|---|---|
| df | pandas.DataFrame | DataFrame containing the diversity analysis results with verb, noun, and count columns |
Usage Examples
from data_juicer.analysis.diversity_analysis import DiversityAnalysis

# Analyze diversity of a dataset
# (my_dataset is assumed to be a dataset loaded beforehand)
analyzer = DiversityAnalysis(
    dataset=my_dataset,
    output_path="./analysis_output",
    lang_or_model="en",
)

# Run full analysis pipeline
diversity_df = analyzer.analyze(
    column_name="text",
    top_k_verbs=20,
    top_k_nouns=4,
)
# Results saved to analysis_output/diversity.csv and diversity.md
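Because analyze accepts a postproc_func argument, a custom aggregation could in principle replace get_diversity. The sketch below is hypothetical: verb_frequencies and min_count are names invented for illustration, and it assumes the post-processor receives a DataFrame with verb and noun columns plus the extra keyword arguments given to analyze. It is exercised here on hand-made data only, so it does not depend on Data-Juicer or spaCy.

```python
import pandas as pd


def verb_frequencies(raw_pairs: pd.DataFrame,
                     min_count: int = 1,
                     **_ignored) -> pd.DataFrame:
    """Hypothetical post-processor: count root verbs, ignoring nouns."""
    counts = (
        raw_pairs.dropna(subset=["verb"])
        .groupby("verb")
        .size()
        .reset_index(name="count")
    )
    # Keep only verbs seen at least min_count times, most frequent first.
    counts = counts[counts["count"] >= min_count]
    return counts.sort_values("count", ascending=False).reset_index(drop=True)


# Stand-alone check on hand-made pairs.
sample = pd.DataFrame({
    "verb": ["write", "write", "give", None],
    "noun": ["poem", None, "example", "idea"],
})
verb_counts = verb_frequencies(sample, min_count=2)
```

Under the assumption above, such a function would be passed as analyzer.analyze(postproc_func=verb_frequencies, min_count=2); whether the exported files keep the same layout depends on the library and is not covered here.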