Implementation:Ucbepic Docetl Demo EpsteinEmailPipeline

Knowledge Sources	Ucbepic_Docetl
Domains	Pipeline_Configuration, Data_Processing
Last Updated	2026-02-08 00:00 GMT

Overview

YAML pipeline configuration defining a multi-step DocETL pipeline for extracting structured metadata from Jeffrey Epstein email correspondence released by the House Oversight Committee.

Description

This file defines a DocETL pipeline that processes raw email text through a sequence of sample, map, code_filter, and code_map operations to extract structured metadata. The pipeline uses azure/gpt-4.1-mini as the default model, with groq/openai/gpt-oss-120b listed as a fallback. Its operations cover document classification, email header extraction, participant cleaning, date/time standardization, entity and topic extraction, name standardization, entity deduplication, inference of indirectly referenced people, and content/legality analysis. It demonstrates how DocETL chains LLM-powered map operations with Python code operations for data cleaning and standardization in an investigative journalism use case.

Usage

This pipeline configuration is stored in the website/public/demos directory and is served as a demo pipeline on the DocETL website. It showcases a real-world investigative analysis use case where unstructured email data is systematically processed into structured, searchable metadata.

Code Reference

Source Location

Repository: Ucbepic_Docetl
File: website/public/demos/epstein_email_pipeline.yaml
Lines: 557

Pipeline Structure

default_model: azure/gpt-4.1-mini
bypass_cache: false

fallback_models:
  - groq/openai/gpt-oss-120b

system_prompt:
  dataset_description: Email correspondence from Jeffrey Epstein's estate...
  persona: You are a data analyst...

datasets:
  epstein_emails:
    type: file
    path: "epstein-files/emails_dataset.json"

operations:
  - name: sample_documents        # (0) Uniform sampling for testing
    type: sample

  - name: classify_document_type  # (1) Classify as email vs non-email
    type: map

  - name: filter_emails_only      # (1b) Filter to only actual emails
    type: code_filter

  - name: extract_email_headers   # (2) Extract headers and metadata
    type: map

  - name: clean_participants      # (3) Clean participant list
    type: code_map

  - name: standardize_datetime    # (4) Normalize date/time formats
    type: code_map

  - name: extract_entities_and_topics  # (5) NER and topic classification
    type: map

  - name: standardize_names       # (6) Normalize name variants
    type: code_map

  - name: deduplicate_entities    # (7) Deduplicate entity lists
    type: code_map

  - name: infer_people            # (8) Infer indirectly referenced people
    type: map

  - name: analyze_content_and_legality  # (9) Content and legality analysis
    type: map

pipeline:
  steps:
    - name: extract_metadata
      input: epstein_emails
      operations: [classify_document_type, filter_emails_only, ...]

  output:
    type: file
    path: "epstein-files/emails_with_metadata.json"
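
Because each pipeline step refers to operations by name, a quick consistency check can catch a step that names an operation missing from the operations list. The following is a minimal sketch, not part of DocETL: the helper undefined_operations and the abbreviated example_config are hypothetical, and the sketch assumes every entry in a step's operations list is a plain string.

```python
from __future__ import annotations


def undefined_operations(config: dict) -> list[str]:
    """Return operation names used in pipeline steps but not defined under `operations`."""
    defined = {op["name"] for op in config.get("operations", [])}
    missing: list[str] = []
    for step in config.get("pipeline", {}).get("steps", []):
        for name in step.get("operations", []):
            # Assumption: each entry is a plain string naming an operation.
            if name not in defined and name not in missing:
                missing.append(name)
    return missing


# Hypothetical, abbreviated config that mirrors the layout shown above.
example_config = {
    "operations": [
        {"name": "classify_document_type", "type": "map"},
        {"name": "filter_emails_only", "type": "code_filter"},
    ],
    "pipeline": {
        "steps": [
            {
                "name": "extract_metadata",
                "input": "epstein_emails",
                "operations": [
                    "classify_document_type",
                    "filter_emails_only",
                    "extract_email_headers",  # deliberately left undefined here
                ],
            }
        ]
    },
}

print(undefined_operations(example_config))  # ['extract_email_headers']
```

The same function can be applied to the dictionary returned by yaml.safe_load on the real file.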

I/O Contract

Pipeline Operations

Step	Operation Name	Type	Description
0	sample_documents	sample	Uniform sampling of 1000 documents for testing
1	classify_document_type	map	LLM classifies documents as email or non-email based on header presence
1b	filter_emails_only	code_filter	Python filter retaining only documents classified as emails
2	extract_email_headers	map	LLM extracts participants, date, time, subject, and attachment info
3	clean_participants	code_map	Python deduplicates and cleans participant lists
4	standardize_datetime	code_map	Python normalizes date (YYYY-MM-DD) and time (HH:MM:SS) formats
5	extract_entities_and_topics	map	LLM extracts people, organizations, locations, URLs, and classifies topics
6	standardize_names	code_map	Python normalizes known name variants (e.g., "Jeff Epstein" to "Jeffrey Epstein")
7	deduplicate_entities	code_map	Python deduplicates all entity arrays case-insensitively
8	infer_people	map	LLM infers indirectly referenced people from nicknames and coded language
9	analyze_content_and_legality	map	LLM summarizes content, extracts key quotes, and assesses potential illegality
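
The Python cleaning steps in the table (4, 6, and 7) can be illustrated with plain functions. This is a sketch under stated assumptions rather than the code in the pipeline file: the function names, the list of accepted date formats, and the NAME_VARIANTS mapping are hypothetical, with only the "Jeff Epstein" example taken from the table. In an actual code_map operation, logic like this would sit inside the function that the operation's code field defines.

```python
from __future__ import annotations

from datetime import datetime
from typing import Optional

# Assumed input formats; the formats the real pipeline handles are not shown here.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")

# Hypothetical variant table; only this one mapping appears in the description.
NAME_VARIANTS = {"jeff epstein": "Jeffrey Epstein"}


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def standardize_name(name: str) -> str:
    """Map a known variant to its canonical form; otherwise return the trimmed name."""
    cleaned = name.strip()
    return NAME_VARIANTS.get(cleaned.casefold(), cleaned)


def dedupe_case_insensitive(values: list[str]) -> list[str]:
    """Drop case-insensitive duplicates, keeping the first spelling seen."""
    seen: set[str] = set()
    result: list[str] = []
    for value in values:
        key = value.strip().casefold()
        if key and key not in seen:
            seen.add(key)
            result.append(value.strip())
    return result


names = [standardize_name(n) for n in ["Jeff Epstein", "jeffrey epstein", "Palm Beach"]]
print(dedupe_case_insensitive(names))  # ['Jeffrey Epstein', 'Palm Beach']
print(normalize_date("03/05/2015"))    # 2015-03-05
```

Running name standardization before deduplication, as the pipeline's step order does, lets variants of the same name collapse into a single entry.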

Input Schema

Field	Type	Description
email_text	string	Raw text of the email document

Output Schema

Field	Type	Description
participants	list[object]	Normalized participant list with name and email fields
date	string	Standardized date in YYYY-MM-DD format
time	string	Standardized time in HH:MM:SS format
subject	string	Email subject line
has_attachments	boolean	Whether attachments were present
attachment_names	list[string]	List of attachment filenames
people_mentioned	list[string]	Explicitly mentioned people
organizations	list[string]	Organizations mentioned
locations	list[string]	Locations referenced
notable_figures	list[string]	High-profile individuals mentioned
primary_topic	string	Primary topic classification
topics	list[string]	All applicable topic tags
inferred_people	list[object]	Indirectly referenced people with reasoning
summary	string	Content summary
key_quotes	list[string]	Significant direct quotes
tone	string	Conversation tone classification
evidence_strength	string	Illegality evidence strength rating
crime_types	list[string]	Potential crime type classifications
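
A downstream consumer may want to confirm that each output record matches the schema above before indexing it. The sketch below is one possible check, not part of the pipeline: schema_problems is a hypothetical helper, the field names and types come from the table, and it assumes every field is present and non-null, which real output may not guarantee.

```python
from __future__ import annotations

import re

# Field names and Python types corresponding to the Output Schema table.
EXPECTED_TYPES = {
    "participants": list,
    "date": str,
    "time": str,
    "subject": str,
    "has_attachments": bool,
    "attachment_names": list,
    "people_mentioned": list,
    "organizations": list,
    "locations": list,
    "notable_figures": list,
    "primary_topic": str,
    "topics": list,
    "inferred_people": list,
    "summary": str,
    "key_quotes": list,
    "tone": str,
    "evidence_strength": str,
    "crime_types": list,
}

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
TIME_RE = re.compile(r"\d{2}:\d{2}:\d{2}")


def schema_problems(record: dict) -> list[str]:
    """List the ways a record deviates from the expected fields and formats."""
    problems: list[str] = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"unexpected type for {field}")
    # Format checks apply only when the value is a string.
    if isinstance(record.get("date"), str) and not DATE_RE.fullmatch(record["date"]):
        problems.append("date is not YYYY-MM-DD")
    if isinstance(record.get("time"), str) and not TIME_RE.fullmatch(record["time"]):
        problems.append("time is not HH:MM:SS")
    return problems
```

An empty list means the record passed every check; otherwise each string names one deviation.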

Usage Examples

# Run the pipeline using the DocETL CLI
docetl run website/public/demos/epstein_email_pipeline.yaml

# Inspect the pipeline configuration from Python (requires PyYAML)
import yaml

with open("website/public/demos/epstein_email_pipeline.yaml") as f:
    pipeline = yaml.safe_load(f)
print(f"Default model: {pipeline['default_model']}")
print(f"Number of operations: {len(pipeline['operations'])}")
for op in pipeline['operations']:
    print(f"  {op['name']} ({op['type']})")

Related Pages

Environment:Ucbepic_Docetl_Python_Runtime
