Implementation:Ucbepic Docetl Dataset Debate Transcripts
| Knowledge Sources | |
|---|---|
| Domains | Sample_Data, Data_Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A JSON dataset of curated U.S. presidential and vice-presidential debate transcripts from 1960 to 2024, used as input data for the DocETL website demo.
Description
This file contains full-text transcripts of notable U.S. presidential and vice-presidential debates organized chronologically. Each record includes the debate year, date, title, and complete transcript text. The collection spans over six decades of American political discourse, from the Kennedy-Nixon debates of 1960 through the Biden-Trump and Harris-Trump debates of 2024. This dataset serves as the primary input for the debate theme evolution analysis pipeline demonstrated on the DocETL website.
Usage
This dataset is stored in the website/public directory and served as a static asset on the DocETL website. It is the input for the presidential debate theme evolution analysis demo, which runs the transcripts through DocETL map and reduce operations to produce thematic analyses of how political viewpoints have evolved over time.
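The map-then-reduce flow described above can be sketched in plain Python. This is only a conceptual analogue and not the DocETL API: the keyword table stands in for the LLM-driven theme extraction, the `transcript` key is an assumed field name (the source does not name the transcript field), and the toy records are placeholder strings rather than real transcript text.

```python
from collections import Counter, defaultdict

# Hypothetical keyword table standing in for an LLM-based theme extractor.
THEME_KEYWORDS = {
    "economy": ["tax", "jobs"],
    "foreign policy": ["war", "treaty"],
}


def map_extract_themes(record, text_field="transcript"):
    """Map step: tag one record with the themes found in its text.

    `text_field` is an assumed key name, not confirmed by the dataset docs.
    """
    text = record.get(text_field, "").lower()
    themes = [
        theme
        for theme, words in THEME_KEYWORDS.items()
        if any(word in text for word in words)
    ]
    return {"year": record["year"], "themes": themes}


def reduce_by_year(mapped):
    """Reduce step: count how often each theme appears per year."""
    counts = defaultdict(Counter)
    for item in mapped:
        counts[item["year"]].update(item["themes"])
    return {year: dict(counter) for year, counter in sorted(counts.items())}


# Toy placeholder records, NOT real transcript text.
toy_records = [
    {"year": 1960, "transcript": "We must avoid war and honor the treaty."},
    {"year": 2024, "transcript": "Jobs and tax policy dominate."},
    {"year": 2024, "transcript": "The war abroad matters."},
]

summary = reduce_by_year(map_extract_themes(r) for r in toy_records)
print(summary)
```

Keeping the per-record step separate from the per-year aggregation mirrors the map/reduce split: the map step sees one debate at a time, and only the reduce step compares across debates.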
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: website/public/debate_transcripts.json
- Lines: 394
Data Structure
    [
      {
        "year": 2024,
        "date": "June 27, 2024",
        "title": "The Biden-Trump Presidential Debate"
      },
      {
        "year": 2024,
        "date": "September 10, 2024",
        "title": "The Harris-Trump Presidential Debate"
      },
      {
        "year": 2016,
        "date": "October 9, 2016",
        "title": "The Second Clinton-Trump Presidential Debate"
      }
    ]
I/O Contract
Schema
| Field | Type | Description |
|---|---|---|
| year | integer | Year of the debate (ranges from 1960 to 2024) |
| date | string | Human-readable date of the debate (e.g., "June 27, 2024") |
| title | string | Full descriptive title of the debate (e.g., "The Biden-Trump Presidential Debate") |
Note: Each record also contains the full transcript text, which is omitted from the structure sample above due to length.
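The schema table above can be turned into a lightweight check before the data is fed to a pipeline. The following is a minimal sketch, not part of the repository: `validate_record` and `REQUIRED_FIELDS` are hypothetical names, and the transcript field is left unchecked because its key name is not given here.

```python
# Expected fields and types, taken from the schema table.
REQUIRED_FIELDS = {"year": int, "date": str, "title": str}


def validate_record(record):
    """Return a list of problems found in one debate record (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # The 1960-2024 range comes from the schema description of `year`.
    year = record.get("year")
    if isinstance(year, int) and not 1960 <= year <= 2024:
        problems.append(f"year out of range: {year}")
    return problems


good = {"year": 2024, "date": "June 27, 2024",
        "title": "The Biden-Trump Presidential Debate"}
bad = {"year": "2016", "title": "The Second Clinton-Trump Presidential Debate"}

print(validate_record(good))
print(validate_record(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record at once.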
Debates Included
The dataset includes transcripts from the following debates (partial list):
- 2024: Biden-Trump, Harris-Trump
- 2016: First, Second, and Third Clinton-Trump
- 2008: Second and Third McCain-Obama
- 2004: Second Bush-Kerry, Cheney-Edwards VP
- 2000: First and Third Gore-Bush, Lieberman-Cheney VP
- 1996: First and Second Clinton-Dole, Gore-Kemp VP
- 1992: First and Second Clinton-Bush-Perot (halves)
- 1988: First and Second Bush-Dukakis, Bentsen-Quayle VP
- 1984: Second Reagan-Mondale
- 1980: Anderson-Reagan
- 1976: First, Second, and Third Carter-Ford
- 1960: Second, Third, and Fourth Kennedy-Nixon
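A per-year listing like the one above could be derived from the records themselves. The sketch below assumes only the `year` and `title` fields from the schema; `titles_by_year` is a hypothetical helper, and the sample reuses the three records shown under Data Structure.

```python
from collections import defaultdict


def titles_by_year(records):
    """Group debate titles by year, with years in ascending order."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["year"]].append(record["title"])
    return dict(sorted(grouped.items()))


sample = [
    {"year": 2024, "date": "June 27, 2024",
     "title": "The Biden-Trump Presidential Debate"},
    {"year": 2024, "date": "September 10, 2024",
     "title": "The Harris-Trump Presidential Debate"},
    {"year": 2016, "date": "October 9, 2016",
     "title": "The Second Clinton-Trump Presidential Debate"},
]

result = titles_by_year(sample)
for year, titles in result.items():
    print(f"{year}: {len(titles)} debate(s)")
```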
Usage Examples
    import json

    # Load the dataset; the path is relative to the repository root.
    with open("website/public/debate_transcripts.json", encoding="utf-8") as f:
        data = json.load(f)

    # data is a list of debate records with year, date, and title fields,
    # plus the full transcript text.
    print(f"Total debates: {len(data)}")
    for debate in data:
        print(f"{debate['year']}: {debate['title']} ({debate['date']})")