Implementation:Ucbepic Docetl Demo EpsteinEmailPipeline

From Leeroopedia


Knowledge Sources
Domains: Pipeline_Configuration, Data_Processing
Last Updated: 2026-02-08 00:00 GMT

Overview

YAML pipeline configuration defining a multi-step DocETL pipeline for extracting structured metadata from Jeffrey Epstein email correspondence released by the House Oversight Committee.

Description

This file defines a DocETL pipeline that passes raw email text through a sequence of LLM-powered map operations and Python code operations (code_map and code_filter) to extract structured metadata. The pipeline uses azure/gpt-4.1-mini as the default model, with groq/openai/gpt-oss-120b configured as a fallback model. Its operations cover document classification, email header extraction, participant cleaning, date/time standardization, entity and topic extraction, name standardization, entity deduplication, inference of indirectly referenced people, and content/legality analysis. It demonstrates how DocETL chains LLM map operations with Python code operations for data cleaning and standardization in an investigative journalism use case.

Usage

This pipeline configuration is stored in the website/public/demos directory and is served as a demo pipeline on the DocETL website. It showcases a real-world investigative analysis use case where unstructured email data is systematically processed into structured, searchable metadata.

Code Reference

Source Location

Pipeline Structure

default_model: azure/gpt-4.1-mini
bypass_cache: false

fallback_models:
  - groq/openai/gpt-oss-120b

system_prompt:
  dataset_description: Email correspondence from Jeffrey Epstein's estate...
  persona: You are a data analyst...

datasets:
  epstein_emails:
    type: file
    path: "epstein-files/emails_dataset.json"

operations:
  - name: sample_documents        # (0) Uniform sampling for testing
    type: sample

  - name: classify_document_type  # (1) Classify as email vs non-email
    type: map

  - name: filter_emails_only      # (1b) Filter to only actual emails
    type: code_filter

  - name: extract_email_headers   # (2) Extract headers and metadata
    type: map

  - name: clean_participants      # (3) Clean participant list
    type: code_map

  - name: standardize_datetime    # (4) Normalize date/time formats
    type: code_map

  - name: extract_entities_and_topics  # (5) NER and topic classification
    type: map

  - name: standardize_names       # (6) Normalize name variants
    type: code_map

  - name: deduplicate_entities    # (7) Deduplicate entity lists
    type: code_map

  - name: infer_people            # (8) Infer indirectly referenced people
    type: map

  - name: analyze_content_and_legality  # (9) Content and legality analysis
    type: map

pipeline:
  steps:
    - name: extract_metadata
      input: epstein_emails
      operations: [classify_document_type, filter_emails_only, ...]

  output:
    type: file
    path: "epstein-files/emails_with_metadata.json"
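
The bodies of the code_filter and code_map operations are elided in the outline above. The following is a minimal sketch of what two of them might look like, assuming that each code operation supplies a Python function that takes one document as a dict: a filter returns a bool, and a map returns a dict of fields to merge into the document. The field name `is_email` and the exact cleaning rules are hypothetical and are not taken from the actual pipeline file.

```python
def filter_emails_only(doc: dict) -> bool:
    """Keep only documents the classifier marked as emails.

    `is_email` is a hypothetical field name for the classifier's output.
    """
    return bool(doc.get("is_email", False))


def clean_participants(doc: dict) -> dict:
    """Trim whitespace, lowercase addresses, and drop empty or repeated entries."""
    seen = set()
    cleaned = []
    for person in doc.get("participants") or []:
        name = (person.get("name") or "").strip()
        email = (person.get("email") or "").strip().lower()
        if not name and not email:
            continue  # nothing usable in this entry
        key = (name.lower(), email)
        if key in seen:
            continue  # same person already recorded
        seen.add(key)
        cleaned.append({"name": name, "email": email})
    return {"participants": cleaned}


# Illustrative documents (made-up values, not from the dataset).
docs = [
    {"is_email": True, "participants": [
        {"name": " Jane Doe ", "email": "JANE@example.com"},
        {"name": "jane doe", "email": "jane@example.com"},
        {"name": "", "email": ""},
    ]},
    {"is_email": False, "participants": []},
]

# Apply the filter first, then merge the map output into each surviving document.
kept = [{**d, **clean_participants(d)} for d in docs if filter_emails_only(d)]
print(kept)
```

Keeping these steps as plain Python rather than LLM calls makes them deterministic and free to run, which is the usual reason for mixing code operations into an LLM pipeline.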

I/O Contract

Pipeline Operations

Step | Operation Name | Type | Description
0 | sample_documents | sample | Uniform sampling of 1000 documents for testing
1 | classify_document_type | map | LLM classifies documents as email or non-email based on header presence
1b | filter_emails_only | code_filter | Python filter retaining only documents classified as emails
2 | extract_email_headers | map | LLM extracts participants, date, time, subject, and attachment info
3 | clean_participants | code_map | Python deduplicates and cleans participant lists
4 | standardize_datetime | code_map | Python normalizes date (YYYY-MM-DD) and time (HH:MM:SS) formats
5 | extract_entities_and_topics | map | LLM extracts people, organizations, locations, URLs, and classifies topics
6 | standardize_names | code_map | Python normalizes known name variants (e.g., "Jeff Epstein" to "Jeffrey Epstein")
7 | deduplicate_entities | code_map | Python deduplicates all entity arrays case-insensitively
8 | infer_people | map | LLM infers indirectly referenced people from nicknames and coded language
9 | analyze_content_and_legality | map | LLM summarizes content, extracts key quotes, and assesses potential illegality
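
Steps 4, 6, and 7 in the table are deterministic clean-up passes. The sketch below shows one way such passes could be written in plain Python. The list of accepted input formats, the helper names, and the variant table (beyond the "Jeff Epstein" example given above) are assumptions for illustration, not the pipeline's actual code.

```python
from datetime import datetime

# Candidate input formats to try, in order (assumed; the real list is not shown).
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%b %d, %Y", "%d %B %Y"]
TIME_FORMATS = ["%H:%M:%S", "%H:%M", "%I:%M %p", "%I:%M:%S %p"]

# Known variant -> canonical name; only this entry comes from the table above.
NAME_VARIANTS = {"jeff epstein": "Jeffrey Epstein"}


def _reformat(value, formats, out_fmt):
    """Return value re-rendered as out_fmt, or None if no format matches."""
    text = (value or "").strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).strftime(out_fmt)
        except ValueError:
            continue
    return None


def standardize_datetime(doc: dict) -> dict:
    """Normalize date to YYYY-MM-DD and time to HH:MM:SS."""
    return {
        "date": _reformat(doc.get("date"), DATE_FORMATS, "%Y-%m-%d"),
        "time": _reformat(doc.get("time"), TIME_FORMATS, "%H:%M:%S"),
    }


def standardize_names(names: list) -> list:
    """Map known variants to their canonical form; leave other names trimmed."""
    return [NAME_VARIANTS.get(n.strip().lower(), n.strip()) for n in names]


def dedupe_case_insensitive(values: list) -> list:
    """Drop case-insensitive repeats, keeping the first spelling seen."""
    seen = set()
    unique = []
    for value in values:
        key = value.strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(value.strip())
    return unique


print(standardize_datetime({"date": "March 5, 2011", "time": "3:07 PM"}))
print(dedupe_case_insensitive(standardize_names(["Jeff Epstein", "jeffrey epstein"])))
```

Running name standardization before deduplication, as the table orders them, lets variants collapse to one canonical spelling before the duplicate check.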

Input Schema

Field | Type | Description
email_text | string | Raw text of the email document

Output Schema

Field | Type | Description
participants | list[object] | Normalized participant list with name and email fields
date | string | Standardized date in YYYY-MM-DD format
time | string | Standardized time in HH:MM:SS format
subject | string | Email subject line
has_attachments | boolean | Whether attachments were present
attachment_names | list[string] | List of attachment filenames
people_mentioned | list[string] | Explicitly mentioned people
organizations | list[string] | Organizations mentioned
locations | list[string] | Locations referenced
notable_figures | list[string] | High-profile individuals mentioned
primary_topic | string | Primary topic classification
topics | list[string] | All applicable topic tags
inferred_people | list[object] | Indirectly referenced people with reasoning
summary | string | Content summary
key_quotes | list[string] | Significant direct quotes
tone | string | Conversation tone classification
evidence_strength | string | Illegality evidence strength rating
crime_types | list[string] | Potential crime type classifications
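
A consumer of the output file may want to check records against the schema above before relying on them. The following is a minimal sketch of such a check. It assumes every listed field is required and non-null, which may be stricter than what the pipeline actually guarantees (for example, an unparseable date could plausibly be emitted as null). The function name `schema_problems` is hypothetical.

```python
# Top-level Python types corresponding to the Output Schema table.
EXPECTED_TYPES = {
    "participants": list,
    "date": str,
    "time": str,
    "subject": str,
    "has_attachments": bool,
    "attachment_names": list,
    "people_mentioned": list,
    "organizations": list,
    "locations": list,
    "notable_figures": list,
    "primary_topic": str,
    "topics": list,
    "inferred_people": list,
    "summary": str,
    "key_quotes": list,
    "tone": str,
    "evidence_strength": str,
    "crime_types": list,
}


def schema_problems(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type: {field}")
    return problems


# Build a placeholder record with an empty value of the right type for each field.
example = {field: expected() for field, expected in EXPECTED_TYPES.items()}
print(schema_problems(example))                        # no problems expected
print(schema_problems({**example, "has_attachments": "yes"}))
```

This only checks top-level types; it does not look inside list elements such as the name and email fields of each participant.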

Usage Examples

# Run the pipeline using the DocETL CLI
docetl run website/public/demos/epstein_email_pipeline.yaml

# Inspect the pipeline configuration from Python (requires PyYAML)
import yaml

with open("website/public/demos/epstein_email_pipeline.yaml") as f:
    pipeline = yaml.safe_load(f)

print(f"Default model: {pipeline['default_model']}")
print(f"Number of operations: {len(pipeline['operations'])}")
for op in pipeline['operations']:
    print(f"  {op['name']} ({op['type']})")
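
Once the pipeline has run, its output can be summarized with a few lines of Python. The sketch below assumes the output file is a JSON array of record objects using the field names from the Output Schema. The helper names and the topic labels in the in-memory sample are hypothetical.

```python
import json
from collections import Counter


def load_records(path: str) -> list:
    """Load pipeline output, assumed to be a JSON array of record objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def topic_counts(records: list) -> Counter:
    """Count records per primary_topic, grouping missing values as 'unknown'."""
    return Counter(r.get("primary_topic") or "unknown" for r in records)


# In-memory stand-in for load_records("epstein-files/emails_with_metadata.json");
# the topic labels here are made up for illustration.
sample = [
    {"primary_topic": "finance"},
    {"primary_topic": "travel"},
    {"primary_topic": "finance"},
    {"primary_topic": None},
]

for topic, count in topic_counts(sample).most_common():
    print(f"{topic}: {count}")
```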
