Implementation:Datajuicer Data juicer Visualize App

From Leeroopedia
Knowledge Sources
Domains: Web Application, Data Visualization, Streamlit
Last Updated: 2026-02-14 16:00 GMT

Overview

Streamlit web application providing an interactive UI for data analysis, processing, and visualization using Data-Juicer's Analyzer and Executor components.

Description

This module implements the main interactive web interface for Data-Juicer. It creates a Streamlit dashboard where users can:

  • Upload or specify YAML configuration files to define processing pipelines
  • Analyze original datasets with statistical summaries and distribution charts
  • Process datasets using configured operator pipelines
  • Interactively adjust filter thresholds using sidebar sliders
  • Visualize retained vs. discarded sample ratios with stacked bar charts and histograms
  • Perform diversity analysis on CFT datasets using spaCy models with sunburst chart visualization
  • Download processed data as CSV or JSONL
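The threshold exploration described in the list above can be illustrated with a small, standard-library-only sketch. It is not taken from the module: the function names `split_by_threshold` and `retention_summary` are hypothetical, and the assumption is that each filter exposes one numeric statistic per sample with an inclusive `[min_val, max_val]` keep range, as a sidebar slider would supply.

```python
from typing import Dict, List, Tuple


def split_by_threshold(
    stats: List[float], min_val: float, max_val: float
) -> Tuple[List[int], List[int]]:
    """Return (retained, discarded) sample indices for one statistic.

    Hypothetical helper: a sample is retained when its statistic lies
    inside the inclusive range [min_val, max_val].
    """
    retained: List[int] = []
    discarded: List[int] = []
    for idx, value in enumerate(stats):
        if min_val <= value <= max_val:
            retained.append(idx)
        else:
            discarded.append(idx)
    return retained, discarded


def retention_summary(retained: List[int], discarded: List[int]) -> Dict[str, float]:
    """Compute the retained/discarded ratios that a stacked bar would display."""
    total = len(retained) + len(discarded)
    if total == 0:
        # Avoid division by zero on an empty dataset.
        return {"retained": 0.0, "discarded": 0.0}
    return {
        "retained": len(retained) / total,
        "discarded": len(discarded) / total,
    }
```

In a dashboard, the two ratios would feed the stacked bar chart, while the raw statistics and the chosen cutoffs would feed the histogram.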

The Visualize class is the central orchestrator containing static methods for each UI section: `setup()` configures page layout, `parser()` handles configuration input, `analyze_process()` runs analysis and processing, `filter()` provides interactive filter exploration, `diversity()` performs diversity analysis, and `auxiliary()` provides placeholder quality scoring tools.

Helper functions include cached data converters (`convert_to_csv`, `convert_to_jsonl`), configuration parsers (`parse_cfg`, `pretty_out`), and analysis orchestrators (`analyze_and_show_res`, `process_and_show_res`).
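As a rough illustration of what such data converters do, the sketch below serializes a list of row dictionaries to CSV and JSONL bytes using only the standard library. The names `rows_to_csv` and `rows_to_jsonl` are hypothetical stand-ins, not the module's `convert_to_csv` and `convert_to_jsonl`, which may well operate on pandas DataFrames and be wrapped in a Streamlit caching decorator.

```python
import csv
import io
import json
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, object]]) -> bytes:
    """Serialize rows to UTF-8 CSV bytes.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return b""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def rows_to_jsonl(rows: List[Dict[str, object]]) -> bytes:
    """Serialize rows to UTF-8 JSON Lines bytes, one JSON object per line."""
    if not rows:
        return b""
    lines = [json.dumps(row, ensure_ascii=False) for row in rows]
    return ("\n".join(lines) + "\n").encode("utf-8")
```

Returning bytes keeps the result ready to hand to a download widget; caching the conversion avoids recomputing it on every Streamlit rerun.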

Usage

Run the application with Streamlit to launch the interactive data processing dashboard. Users configure processing pipelines via YAML files and explore their effects on datasets in real time.

Code Reference

Source Location

Signature

class Visualize:
    @staticmethod
    def setup(): ...
    @staticmethod
    def parser(): ...
    @staticmethod
    def analyze_process(): ...
    @staticmethod
    def filter(): ...
    @staticmethod
    def diversity(): ...
    @staticmethod
    def auxiliary(): ...
    @staticmethod
    def visualize(): ...
    @staticmethod
    def filter_dataset(dataset): ...
    @staticmethod
    def draw_hist(data, cutoff=None): ...
    @staticmethod
    def draw_stack_bar(bar_sizes, bar_labels, total_num, title=""): ...
    @staticmethod
    def draw_sunburst(df, path, values): ...

def main(): ...

Import

import streamlit as st
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from data_juicer.config import init_configs
from data_juicer.core import Analyzer, DefaultExecutor
from data_juicer.analysis.diversity_analysis import DiversityAnalysis, get_diversity

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| config file | YAML file | Yes | Data-Juicer configuration specifying dataset path, export path, and processing operators |
| config command | str | No | Command-line style config specification (e.g., "--config path/to/config.yaml") |
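A command-line style string such as the config command input has to be split into an argument list before a parser can consume it. The sketch below shows one way to do that with the standard library's `shlex`; the helper names `split_config_command` and `find_config_path` are hypothetical, the `--export_path` flag is used only for illustration, and whether the app tokenizes its input this way is an assumption.

```python
import shlex
from typing import List, Optional


def split_config_command(command: str) -> List[str]:
    """Split a shell-like command string into arguments, honoring quotes."""
    return shlex.split(command)


def find_config_path(args: List[str]) -> Optional[str]:
    """Return the value given to --config, or None if it is absent.

    Hypothetical helper supporting both '--config PATH' and '--config=PATH'.
    """
    for i, arg in enumerate(args):
        if arg == "--config" and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith("--config="):
            return arg.split("=", 1)[1]
    return None


# Example: a quoted path containing a space stays one argument.
args = split_config_command('--config "my configs/demo.yaml" --export_path out.jsonl')
config_path = find_config_path(args)
```

Using `shlex.split` rather than `str.split` keeps quoted paths with spaces intact, which matters when users paste paths from a file browser.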

Outputs

| Name | Type | Description |
|---|---|---|
| Streamlit dashboard | Web UI | Interactive visualization of dataset statistics, filter effects, and diversity analysis |
| Retained/Discarded data | JSONL files | Downloadable filtered dataset results |
| Diversity CSV | CSV file | Downloadable diversity analysis results |

Usage Examples

# Launch the Streamlit app from the command line:
# streamlit run app.py

# The app provides interactive widgets for:
# 1. Uploading or specifying a YAML config file
# 2. Parsing and displaying the configuration
# 3. Analyzing original data with statistical charts
# 4. Processing data through configured operators
# 5. Exploring filter effects with adjustable sliders
# 6. Performing diversity analysis with sunburst charts
