Implementation:Datajuicer Data juicer Visualize App
| Knowledge Sources | |
|---|---|
| Domains | Web Application, Data Visualization, Streamlit |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Streamlit web application providing an interactive UI for data analysis, processing, and visualization using Data-Juicer's Analyzer and Executor components.
Description
This module implements the main interactive web interface for Data-Juicer. It creates a Streamlit dashboard where users can:
- Upload or specify YAML configuration files to define processing pipelines
- Analyze original datasets with statistical summaries and distribution charts
- Process datasets using configured operator pipelines
- Interactively adjust filter thresholds using sidebar sliders
- Visualize retained vs. discarded sample ratios with stacked bar charts and histograms
- Perform diversity analysis on CFT datasets using spaCy models with sunburst chart visualization
- Download processed data as CSV or JSONL
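The threshold exploration listed above comes down to partitioning samples by whether a per-sample statistic falls inside a chosen range. The following is a minimal sketch of that idea in plain Python, not the app's actual code: `split_by_threshold`, `FilterSplit`, and the `text_len` field are hypothetical names, and the two bounds stand in for values that would come from sidebar sliders (for example `st.sidebar.slider`).

```python
from dataclasses import dataclass


@dataclass
class FilterSplit:
    """Hypothetical container for the two partitions of a filtered dataset."""
    retained: list
    discarded: list

    @property
    def retained_ratio(self) -> float:
        total = len(self.retained) + len(self.discarded)
        return len(self.retained) / total if total else 0.0


def split_by_threshold(samples, stat_key, min_value, max_value):
    """Keep samples whose statistic lies in [min_value, max_value]."""
    retained, discarded = [], []
    for sample in samples:
        in_range = min_value <= sample[stat_key] <= max_value
        (retained if in_range else discarded).append(sample)
    return FilterSplit(retained, discarded)


# Made-up samples; in the app the bounds would be slider values.
samples = [
    {"text": "hi", "text_len": 2},
    {"text": "a longer sentence", "text_len": 17},
    {"text": "x" * 500, "text_len": 500},
]
split = split_by_threshold(samples, "text_len", min_value=5, max_value=100)
print(len(split.retained), len(split.discarded), round(split.retained_ratio, 2))
```

The retained and discarded counts are the quantities a stacked bar chart of the two groups would display.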
The `Visualize` class is the central orchestrator. It exposes one static method per UI section: `setup()` configures the page layout, `parser()` handles configuration input, `analyze_process()` runs analysis and processing, `filter()` provides interactive filter exploration, `diversity()` performs diversity analysis, and `auxiliary()` provides placeholder quality-scoring tools.
Helper functions include cached data converters (`convert_to_csv`, `convert_to_jsonl`), configuration parsers (`parse_cfg`, `pretty_out`), and analysis orchestrators (`analyze_and_show_res`, `process_and_show_res`).
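To illustrate what the data converters produce, here is a rough standard-library sketch. It is an assumption-laden stand-in: the real helpers are described as cached and most likely operate on pandas DataFrames, whereas these versions take a list of dicts and omit caching entirely. Only the function names come from this page.

```python
import csv
import io
import json


def convert_to_csv(rows):
    """Serialize a list of dicts to UTF-8 CSV bytes with a header row."""
    if not rows:
        return b""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def convert_to_jsonl(rows):
    """Serialize a list of dicts to UTF-8 JSON Lines bytes, one object per line."""
    lines = (json.dumps(row, ensure_ascii=False) + "\n" for row in rows)
    return "".join(lines).encode("utf-8")


# Made-up rows standing in for processed samples.
rows = [{"text": "hello", "lang": "en"}, {"text": "bonjour", "lang": "fr"}]
csv_bytes = convert_to_csv(rows)
jsonl_bytes = convert_to_jsonl(rows)
```

Returning bytes matches what a download widget typically expects as its payload.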
Usage
Run the application with Streamlit (`streamlit run app.py`) to launch the interactive data processing dashboard. Users configure processing pipelines via YAML files and explore their effects on datasets in real time.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: app.py
- Lines: 1-698
Signature
class Visualize:
    @staticmethod
    def setup(): ...

    @staticmethod
    def parser(): ...

    @staticmethod
    def analyze_process(): ...

    @staticmethod
    def filter(): ...

    @staticmethod
    def diversity(): ...

    @staticmethod
    def auxiliary(): ...

    @staticmethod
    def visualize(): ...

    @staticmethod
    def filter_dataset(dataset): ...

    @staticmethod
    def draw_hist(data, cutoff=None): ...

    @staticmethod
    def draw_stack_bar(bar_sizes, bar_labels, total_num, title=""): ...

    @staticmethod
    def draw_sunburst(df, path, values): ...


def main(): ...
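One plausible way these pieces fit together is for `visualize()` to call the section methods in the order the Description lists them, with `main()` delegating to it. The sketch below assumes that wiring; it is not confirmed by the signature alone. `VisualizeSketch` is a hypothetical stand-in whose sections only record their names instead of rendering Streamlit widgets.

```python
class VisualizeSketch:
    """Hypothetical stand-in for Visualize; sections only log that they ran."""

    calls = []

    @staticmethod
    def setup():
        VisualizeSketch.calls.append("setup")

    @staticmethod
    def parser():
        VisualizeSketch.calls.append("parser")

    @staticmethod
    def analyze_process():
        VisualizeSketch.calls.append("analyze_process")

    @staticmethod
    def filter():
        VisualizeSketch.calls.append("filter")

    @staticmethod
    def diversity():
        VisualizeSketch.calls.append("diversity")

    @staticmethod
    def auxiliary():
        VisualizeSketch.calls.append("auxiliary")

    @staticmethod
    def visualize():
        # Assumed order: page setup first, then each UI section top to bottom.
        for section in (
            VisualizeSketch.setup,
            VisualizeSketch.parser,
            VisualizeSketch.analyze_process,
            VisualizeSketch.filter,
            VisualizeSketch.diversity,
            VisualizeSketch.auxiliary,
        ):
            section()


def main():
    VisualizeSketch.visualize()


main()
print(VisualizeSketch.calls)
```

Keeping each section as a static method means no instance state is needed; in a Streamlit app, state that must survive reruns would normally live in the session state instead.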
Import
import streamlit as st
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from data_juicer.config import init_configs
from data_juicer.core import Analyzer, DefaultExecutor
from data_juicer.analysis.diversity_analysis import DiversityAnalysis, get_diversity
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config file | YAML file | Yes | Data-Juicer configuration specifying dataset path, export path, and processing operators |
| config command | str | No | Command-line style config specification (e.g., "--config path/to/config.yaml") |
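A command-line style config string has to be split into arguments before a parser can consume it. The sketch below shows one way to do that with the standard library; it does not call Data-Juicer's `init_configs`, and `parse_config_command` is a hypothetical helper. The extra `--np 4` flag is only illustrative input.

```python
import argparse
import shlex


def parse_config_command(command):
    """Split a command-line style string and extract the --config path."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--config", required=True)
    # shlex.split honors quotes, so paths containing spaces stay intact.
    # parse_known_args tolerates extra flags this sketch does not model.
    known, extra = parser.parse_known_args(shlex.split(command))
    return known.config, extra


config_path, extra_args = parse_config_command(
    '--config "configs/my demo.yaml" --np 4'
)
print(config_path, extra_args)
```

Using `shlex.split` rather than `str.split` is a design choice for this sketch: it avoids breaking quoted paths that contain whitespace.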
Outputs
| Name | Type | Description |
|---|---|---|
| Streamlit dashboard | Web UI | Interactive visualization of dataset statistics, filter effects, and diversity analysis |
| Retained/Discarded data | JSONL files | Downloadable filtered dataset results |
| Diversity CSV | CSV file | Downloadable diversity analysis results |
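The diversity output can be pictured as a table of counted categories that a hierarchical chart consumes. The sketch below assumes the analysis yields (verb, noun) pairs per sample and shows only the counting step; the spaCy parsing and the `DiversityAnalysis` class are not reproduced, and `count_verb_noun_pairs`, its parameters, and the sample pairs are all hypothetical.

```python
from collections import Counter


def count_verb_noun_pairs(pairs, top_k_verbs=2, top_k_nouns=2):
    """Return {verb, noun, count} rows for the top verbs and their top nouns."""
    pair_counts = Counter(pairs)

    # Total occurrences per verb, summed across all of its nouns.
    verb_counts = Counter()
    for (verb, _noun), count in pair_counts.items():
        verb_counts[verb] += count

    rows = []
    for verb, _total in verb_counts.most_common(top_k_verbs):
        nouns = Counter(
            {n: c for (v, n), c in pair_counts.items() if v == verb}
        )
        for noun, count in nouns.most_common(top_k_nouns):
            rows.append({"verb": verb, "noun": noun, "count": count})
    return rows


# Made-up pairs, as if extracted from instruction texts.
pairs = (
    [("write", "essay")] * 3
    + [("write", "poem")] * 2
    + [("write", "code")]
    + [("explain", "concept")] * 2
    + [("translate", "sentence")]
)
rows = count_verb_noun_pairs(pairs)
for row in rows:
    print(row)
```

Rows of this shape map naturally onto a sunburst with a two-level path such as `["verb", "noun"]` and `count` as the values column, which is consistent with the `draw_sunburst(df, path, values)` signature.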
Usage Examples
# Launch the Streamlit app from the command line:
# streamlit run app.py
# The app provides interactive widgets for:
# 1. Uploading or specifying a YAML config file
# 2. Parsing and displaying the configuration
# 3. Analyzing original data with statistical charts
# 4. Processing data through configured operators
# 5. Exploring filter effects with adjustable sliders
# 6. Performing diversity analysis with sunburst charts