Implementation:Ucbepic Docetl Demo EpsteinEmailPipeline
| Knowledge Sources | |
|---|---|
| Domains | Pipeline_Configuration, Data_Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
YAML configuration defining a multi-step DocETL pipeline that extracts structured metadata from Jeffrey Epstein email correspondence released by the House Oversight Committee.
Description
This file defines a DocETL pipeline that processes raw email text through a sequence of sample, map, code_filter, and code_map operations to extract structured metadata. The pipeline uses azure/gpt-4.1-mini as the default model, with groq/openai/gpt-oss-120b listed as a fallback, and covers document classification, email header extraction, participant cleaning, date/time standardization, entity and topic extraction, name standardization, entity deduplication, inference of indirectly referenced people, and content/legality analysis. It shows how DocETL chains LLM-powered map operations with Python code operations for data cleaning and standardization in an investigative journalism use case.
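The Python cleaning steps mentioned above (name standardization and case-insensitive deduplication) can be pictured with a minimal sketch. The variant table and the helper names `normalize_name` and `dedupe_case_insensitive` are hypothetical; the actual mapping and code in the pipeline file are not reproduced on this page.

```python
# Hypothetical variant table; the real pipeline's mapping is not shown on this page.
NAME_VARIANTS = {
    "jeff epstein": "Jeffrey Epstein",
}


def normalize_name(name: str) -> str:
    """Map a known variant to its canonical form; otherwise just tidy whitespace."""
    cleaned = " ".join(name.split())
    return NAME_VARIANTS.get(cleaned.lower(), cleaned)


def dedupe_case_insensitive(values: list[str]) -> list[str]:
    """Drop repeats that differ only by case, keeping the first spelling seen."""
    seen: set[str] = set()
    result: list[str] = []
    for value in values:
        key = value.casefold()
        if key not in seen:
            seen.add(key)
            result.append(value)
    return result


people = ["Jeff Epstein", "jeffrey epstein", "Jeffrey  Epstein"]
cleaned_people = dedupe_case_insensitive([normalize_name(p) for p in people])
print(cleaned_people)
```

Normalizing before deduplicating matters here: variants collapse to one canonical spelling first, so the case-insensitive pass only has to remove exact repeats.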
Usage
This pipeline configuration is stored in the website/public/demos directory and is served as a demo pipeline on the DocETL website. It showcases a real-world investigative analysis use case where unstructured email data is systematically processed into structured, searchable metadata.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: website/public/demos/epstein_email_pipeline.yaml
- Lines: 557
Pipeline Structure
default_model: azure/gpt-4.1-mini
bypass_cache: false
fallback_models:
  - groq/openai/gpt-oss-120b
system_prompt:
  dataset_description: Email correspondence from Jeffrey Epstein's estate...
  persona: You are a data analyst...
datasets:
  epstein_emails:
    type: file
    path: "epstein-files/emails_dataset.json"
operations:
  - name: sample_documents              # (0) Uniform sampling for testing
    type: sample
  - name: classify_document_type        # (1) Classify as email vs non-email
    type: map
  - name: filter_emails_only            # (1b) Filter to only actual emails
    type: code_filter
  - name: extract_email_headers         # (2) Extract headers and metadata
    type: map
  - name: clean_participants            # (3) Clean participant list
    type: code_map
  - name: standardize_datetime          # (4) Normalize date/time formats
    type: code_map
  - name: extract_entities_and_topics   # (5) NER and topic classification
    type: map
  - name: standardize_names             # (6) Normalize name variants
    type: code_map
  - name: deduplicate_entities          # (7) Deduplicate entity lists
    type: code_map
  - name: infer_people                  # (8) Infer indirectly referenced people
    type: map
  - name: analyze_content_and_legality  # (9) Content and legality analysis
    type: map
pipeline:
  steps:
    - name: extract_metadata
      input: epstein_emails
      operations: [classify_document_type, filter_emails_only, ...]
  output:
    type: file
    path: "epstein-files/emails_with_metadata.json"
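To illustrate the `filter_emails_only` step listed above, here is a minimal sketch of what the body of a `code_filter` operation could look like. It assumes the convention that a code operation supplies a function named `transform` that receives one document and, for a filter, returns a boolean. The field name `is_email` is hypothetical; the actual output field of `classify_document_type` is not shown on this page.

```python
def transform(doc: dict) -> bool:
    """Keep a document only if the earlier classification step marked it as an email.

    `is_email` is a hypothetical field name used for illustration.
    """
    return bool(doc.get("is_email", False))


# Illustrative documents, not taken from the real dataset.
docs = [
    {"email_text": "From: a@example.com\nTo: b@example.com\nSubject: hi", "is_email": True},
    {"email_text": "Scanned invoice, no headers", "is_email": False},
]
kept = [d for d in docs if transform(d)]
print(len(kept))
```

Doing the filtering in plain Python rather than with another LLM call keeps this step deterministic and free of model cost.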
I/O Contract
Pipeline Operations
| Step | Operation Name | Type | Description |
|---|---|---|---|
| 0 | sample_documents | sample | Uniform sampling of 1000 documents for testing |
| 1 | classify_document_type | map | LLM classifies documents as email or non-email based on header presence |
| 1b | filter_emails_only | code_filter | Python filter retaining only documents classified as emails |
| 2 | extract_email_headers | map | LLM extracts participants, date, time, subject, and attachment info |
| 3 | clean_participants | code_map | Python deduplicates and cleans participant lists |
| 4 | standardize_datetime | code_map | Python normalizes date (YYYY-MM-DD) and time (HH:MM:SS) formats |
| 5 | extract_entities_and_topics | map | LLM extracts people, organizations, locations, URLs, and classifies topics |
| 6 | standardize_names | code_map | Python normalizes known name variants (e.g., "Jeff Epstein" to "Jeffrey Epstein") |
| 7 | deduplicate_entities | code_map | Python deduplicates all entity arrays case-insensitively |
| 8 | infer_people | map | LLM infers indirectly referenced people from nicknames and coded language |
| 9 | analyze_content_and_legality | map | LLM summarizes content, extracts key quotes, and assesses potential illegality |
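As a sketch of the kind of logic step 4 (`standardize_datetime`) describes, the function below normalizes a date to YYYY-MM-DD and a time to HH:MM:SS. The lists of accepted input formats and the choice to leave unparseable values unchanged are assumptions made for illustration, not the behavior of the actual pipeline code.

```python
from datetime import datetime

# Candidate input formats are assumptions; the real operation may handle others.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %b %Y"]
TIME_FORMATS = ["%H:%M:%S", "%H:%M", "%I:%M %p", "%I:%M:%S %p"]


def _first_match(text, formats):
    """Return the first successful strptime parse, or None if no format fits."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def transform(doc: dict) -> dict:
    """Return standardized `date` and `time`; leave unparseable values unchanged."""
    parsed_date = _first_match(doc.get("date") or "", DATE_FORMATS)
    parsed_time = _first_match(doc.get("time") or "", TIME_FORMATS)
    return {
        "date": parsed_date.strftime("%Y-%m-%d") if parsed_date else doc.get("date"),
        "time": parsed_time.strftime("%H:%M:%S") if parsed_time else doc.get("time"),
    }


print(transform({"date": "March 5, 2011", "time": "3:07 PM"}))
```

Month names and AM/PM markers are parsed according to the current locale, so this sketch assumes an English (or C) locale.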
Input Schema
| Field | Type | Description |
|---|---|---|
| email_text | string | Raw text of the email document |
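A quick pre-flight check against this input contract might look like the sketch below. It assumes the dataset file is a JSON array of objects, and uses an inline sample string in place of the real file; the helper name `find_invalid_records` is hypothetical.

```python
import json

# Inline sample standing in for the dataset file; contents are illustrative only.
sample_json = (
    '[{"email_text": "From: a@example.com\\nSubject: hello"},'
    ' {"email_text": ""},'
    ' {"body": "x"}]'
)


def find_invalid_records(records: list) -> list[int]:
    """Return indexes of records lacking a non-empty string `email_text`."""
    bad = []
    for index, record in enumerate(records):
        text = record.get("email_text") if isinstance(record, dict) else None
        if not isinstance(text, str) or not text.strip():
            bad.append(index)
    return bad


records = json.loads(sample_json)
invalid = find_invalid_records(records)
print(invalid)
```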
Output Schema
| Field | Type | Description |
|---|---|---|
| participants | list[object] | Normalized participant list with name and email fields |
| date | string | Standardized date in YYYY-MM-DD format |
| time | string | Standardized time in HH:MM:SS format |
| subject | string | Email subject line |
| has_attachments | boolean | Whether attachments were present |
| attachment_names | list[string] | List of attachment filenames |
| people_mentioned | list[string] | Explicitly mentioned people |
| organizations | list[string] | Organizations mentioned |
| locations | list[string] | Locations referenced |
| notable_figures | list[string] | High-profile individuals mentioned |
| primary_topic | string | Primary topic classification |
| topics | list[string] | All applicable topic tags |
| inferred_people | list[object] | Indirectly referenced people with reasoning |
| summary | string | Content summary |
| key_quotes | list[string] | Significant direct quotes |
| tone | string | Conversation tone classification |
| evidence_strength | string | Illegality evidence strength rating |
| crime_types | list[string] | Potential crime type classifications |
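The output schema above can be turned into a lightweight check on records read from the output file. This is only a sketch: the top-level Python types mirror the table, element types inside lists are not inspected, and the helper name `schema_problems` is hypothetical.

```python
# Expected top-level Python types, taken from the output schema table.
EXPECTED_TYPES = {
    "participants": list,
    "date": str,
    "time": str,
    "subject": str,
    "has_attachments": bool,
    "attachment_names": list,
    "people_mentioned": list,
    "organizations": list,
    "locations": list,
    "notable_figures": list,
    "primary_topic": str,
    "topics": list,
    "inferred_people": list,
    "summary": str,
    "key_quotes": list,
    "tone": str,
    "evidence_strength": str,
    "crime_types": list,
}


def schema_problems(record: dict) -> list[str]:
    """List fields that are missing or hold an unexpected top-level type."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type: {field}")
    return problems


# Deliberately incomplete record to show what the check reports.
record = {"date": "2011-03-05", "has_attachments": "no"}
problems = schema_problems(record)
print(problems)
```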
Usage Examples
# Run the pipeline using the DocETL CLI
docetl run website/public/demos/epstein_email_pipeline.yaml

# Inspect the pipeline configuration from Python (requires PyYAML)
import yaml

with open("website/public/demos/epstein_email_pipeline.yaml") as f:
    pipeline = yaml.safe_load(f)

print(f"Default model: {pipeline['default_model']}")
print(f"Number of operations: {len(pipeline['operations'])}")
# List each operation with its type, in definition order
for op in pipeline['operations']:
    print(f"  {op['name']} ({op['type']})")