Implementation:Ucbepic Docetl Dataset Medical Transcripts
| Knowledge Sources | |
|---|---|
| Domains | Sample_Data, Data_Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
JSON dataset providing medical consultation transcripts paired with structured clinical notes for use as sample data in DocETL documentation and tutorials.
Description
This file contains sample medical transcription data in which each record pairs a doctor-patient conversation transcript (src) with its corresponding structured clinical note (tgt). The transcripts simulate real clinical encounters covering conditions such as congestive heart failure, diabetes, kidney transplant management, and arthritis, among others. Each structured note follows a standard medical documentation format with sections for Chief Complaint, History of Present Illness, Review of Systems, Physical Examination, Results, and Assessment and Plan. The dataset is used to demonstrate how DocETL can turn unstructured conversational text into structured output.
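The section layout described above can be recovered from a tgt string with a small helper. The following is a minimal sketch, not part of DocETL: the function name `split_note_sections` is hypothetical, and it assumes that every section header sits on its own line in capital letters, as in the sample records shown on this page. Notes that use a different header style would need a different rule.

```python
import re

# Assumption: a header line consists only of capital letters, spaces,
# and a few separators (e.g. "HISTORY OF PRESENT ILLNESS").
HEADER_RE = re.compile(r"^[A-Z][A-Z /&-]*$")


def split_note_sections(note: str) -> dict[str, str]:
    """Split a clinical note into {header: body}, keeping header order."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in note.splitlines():
        stripped = line.strip()
        if HEADER_RE.match(stripped):
            # Start a new section; text lines that follow belong to it.
            current = stripped
            sections.setdefault(current, [])
        elif current is not None and stripped:
            sections[current].append(stripped)
        # Blank lines and any text before the first header are skipped.
    return {name: "\n".join(body) for name, body in sections.items()}


# Truncated excerpt copied from the sample record on this page.
sample = (
    "CHIEF COMPLAINT\n\nAnnual exam.\n\n"
    "HISTORY OF PRESENT ILLNESS\n\n"
    "Martha Collins is a 50-year-old female ..."
)
sections = split_note_sections(sample)
print(list(sections))
print(sections["CHIEF COMPLAINT"])
```

Returning a plain dictionary keeps the helper easy to combine with whatever downstream check is needed, for example confirming that a generated note contains the same headers as the reference note.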
Usage
This dataset is stored in the docs/assets directory and serves as input for the DocETL documentation tutorials that demonstrate transcript-to-note transformation pipelines, in which LLM-powered document processing converts an unstructured medical conversation into standardized clinical documentation.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docs/assets/medical_transcripts.json
- Lines: 437
Data Structure
[
  {
    "src": "[doctor] hi , martha . how are you ?\n[patient] i'm doing okay ...",
    "tgt": "CHIEF COMPLAINT\n\nAnnual exam.\n\nHISTORY OF PRESENT ILLNESS\n\nMartha Collins is a 50-year-old female ...",
    "file": "D2N001-virtassist"
  },
  {
    "src": "[doctor] hi , andrew , how are you ?\n[patient] hi . good to see you ...",
    "tgt": "CHIEF COMPLAINT\n\nJoint pain.\n\nHISTORY OF PRESENT ILLNESS\n\nAndrew Perez is a 62-year-old male ...",
    "file": "D2N002-virtassist"
  }
]
I/O Contract
Schema
| Field | Type | Description |
|---|---|---|
| src | string | Raw doctor-patient conversation transcript with speaker labels ([doctor], [patient]) |
| tgt | string | Structured clinical note following standard medical documentation format (Chief Complaint, HPI, ROS, Physical Exam, Results, Assessment and Plan) |
| file | string | Source file identifier (e.g., "D2N001-virtassist") |
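Before feeding records into a pipeline, it can help to confirm that they match the schema above. The following is a minimal sketch under that assumption; `validate_records` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration and are not part of DocETL.

```python
# Field names taken from the schema table; all three are expected to be strings.
REQUIRED_FIELDS = ("src", "tgt", "file")


def validate_records(records: list) -> list[str]:
    """Return a list of human-readable problems; an empty list means all records pass."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: expected an object, got {type(rec).__name__}")
            continue
        for field in REQUIRED_FIELDS:
            value = rec.get(field)
            if not isinstance(value, str):
                problems.append(f"record {i}: field '{field}' is missing or not a string")
            elif not value.strip():
                problems.append(f"record {i}: field '{field}' is empty")
    return problems


# Illustrative input: one well-formed record and one that lacks 'tgt'.
records = [
    {"src": "[doctor] hi , martha . how are you ?", "tgt": "CHIEF COMPLAINT\n\nAnnual exam.", "file": "D2N001-virtassist"},
    {"src": "[doctor] hi , andrew , how are you ?", "file": "D2N002-virtassist"},
]
for problem in validate_records(records):
    print(problem)
```

Collecting every problem instead of raising on the first one makes it easier to review a whole file in a single pass.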
Usage Examples
import json

# Path is relative to the repository root
with open("docs/assets/medical_transcripts.json", encoding="utf-8") as f:
    data = json.load(f)

# data is a list of medical transcript records with fields: src, tgt, file
print(f"Total transcripts: {len(data)}")
print(f"First transcript source file: {data[0]['file']}")

# Access the raw conversation
conversation = data[0]["src"]

# Access the structured clinical note
clinical_note = data[0]["tgt"]
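Because each src string marks speakers with bracketed labels, a conversation can be broken into individual turns. The following is a minimal sketch: `parse_turns` is a hypothetical helper, and it assumes that each turn starts on its own line with a label such as [doctor] or [patient], as in the sample records.

```python
import re

# Assumption: a turn starts with a bracketed speaker label at the beginning of a line.
TURN_RE = re.compile(r"^\[(\w+)\]\s*(.*)$")


def parse_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a transcript into (speaker, text) tuples in conversation order."""
    turns = []
    for line in transcript.splitlines():
        stripped = line.strip()
        match = TURN_RE.match(stripped)
        if match:
            turns.append((match.group(1), match.group(2).strip()))
        elif turns and stripped:
            # An unlabeled line is treated as a continuation of the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {stripped}")
    return turns


# Truncated excerpt copied from the sample record on this page.
sample = "[doctor] hi , martha . how are you ?\n[patient] i'm doing okay ..."
turns = parse_turns(sample)
for speaker, text in turns:
    print(f"{speaker}: {text}")
```

With the file loaded as in the example above, the same helper could be applied to `data[0]["src"]` to count turns per speaker or to inspect a single exchange.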