Implementation:Ucbepic Docetl Dataset Medical Transcripts

From Leeroopedia


Knowledge Sources
Domains: Sample_Data, Data_Processing
Last Updated: 2026-02-08 00:00 GMT

Overview

JSON dataset providing medical consultation transcripts paired with structured clinical notes for use as sample data in DocETL documentation and tutorials.

Description

This file contains sample medical transcription data in which each record pairs a doctor-patient conversation transcript (src) with its corresponding structured clinical note (tgt). The transcripts simulate clinical encounters covering conditions such as congestive heart failure, diabetes, kidney transplant management, and arthritis. Each structured note follows a standard medical documentation format with sections for Chief Complaint, History of Present Illness, Review of Systems, Physical Examination, Results, and Assessment and Plan. The dataset serves as example input for showing how DocETL processes unstructured conversational text into structured outputs.
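The sectioned layout of a tgt note can be illustrated with a small parser. This is a minimal sketch, not part of DocETL: the function name split_note_sections is hypothetical, and it assumes that every section heading sits on its own line in all capitals, as in the excerpts shown on this page. Notes that format headings differently would need another heuristic.

```python
def split_note_sections(note: str) -> dict[str, str]:
    """Split a clinical note into {heading: body}.

    Assumption: a heading is a line that contains letters and no
    lowercase characters (e.g. "CHIEF COMPLAINT").
    """
    sections: dict[str, list[str]] = {}
    current = None
    for line in note.splitlines():
        stripped = line.strip()
        is_heading = (
            bool(stripped)
            and stripped == stripped.upper()
            and any(c.isalpha() for c in stripped)
        )
        if is_heading:
            current = stripped
            sections.setdefault(current, [])
        elif current is not None:
            # Text before the first heading is ignored.
            sections[current].append(line)
    return {heading: "\n".join(body).strip() for heading, body in sections.items()}


# Excerpt of a tgt value, truncated exactly as on this page.
sample = (
    "CHIEF COMPLAINT\n\nAnnual exam.\n\n"
    "HISTORY OF PRESENT ILLNESS\n\nMartha Collins is a 50-year-old female ..."
)
sections = split_note_sections(sample)
print(list(sections))
```

The returned dictionary preserves the order in which headings appear, so it can be used to check whether a note covers all of the expected sections.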

Usage

This dataset is stored in the docs/assets directory and is used in DocETL documentation tutorials to demonstrate transcript-to-note transformation pipelines. It showcases how LLM-powered document processing can convert unstructured medical conversations into standardized clinical documentation.

Code Reference

Source Location

Data Structure

[
  {
    "src": "[doctor] hi , martha . how are you ?\n[patient] i'm doing okay ...",
    "tgt": "CHIEF COMPLAINT\n\nAnnual exam.\n\nHISTORY OF PRESENT ILLNESS\n\nMartha Collins is a 50-year-old female ...",
    "file": "D2N001-virtassist"
  },
  {
    "src": "[doctor] hi , andrew , how are you ?\n[patient] hi . good to see you ...",
    "tgt": "CHIEF COMPLAINT\n\nJoint pain.\n\nHISTORY OF PRESENT ILLNESS\n\nAndrew Perez is a 62-year-old male ...",
    "file": "D2N002-virtassist"
  }
]
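The src strings in the records above mark each speaker turn with a bracketed label at the start of a line. As a rough sketch of how such a transcript could be split into turns, the helper below (parse_turns, a hypothetical name, not a DocETL function) assumes one turn per line and treats any unlabeled non-empty line as a continuation of the previous turn.

```python
import re

# Matches a line such as "[doctor] hi , martha . how are you ?"
TURN_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]\s*(?P<text>.*)$")


def parse_turns(transcript: str) -> list[tuple[str, str]]:
    """Return (speaker, text) tuples in the order they occur."""
    turns: list[tuple[str, str]] = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group("speaker"), match.group("text")))
        elif turns and line:
            # Assumption: an unlabeled line continues the previous turn.
            speaker, text = turns[-1]
            turns[-1] = (speaker, f"{text} {line}")
    return turns


# Excerpt of a src value, truncated exactly as on this page.
sample = "[doctor] hi , martha . how are you ?\n[patient] i'm doing okay ..."
turns = parse_turns(sample)
for speaker, text in turns:
    print(f"{speaker}: {text}")
```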

I/O Contract

Schema

Field | Type | Description
src | string | Raw doctor-patient conversation transcript with speaker labels ([doctor], [patient])
tgt | string | Structured clinical note following standard medical documentation format (Chief Complaint, HPI, ROS, Physical Exam, Results, Assessment and Plan)
file | string | Source file identifier (e.g., "D2N001-virtassist")
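The schema above can be checked before records are fed into a pipeline. The following is an illustrative sketch only: validate_records and REQUIRED_FIELDS are hypothetical names, the sample records are made up for the demonstration, and the check assumes that all three fields should be non-empty strings.

```python
REQUIRED_FIELDS = ("src", "tgt", "file")


def validate_records(records):
    """Return (index, problem) tuples; an empty list means no problems were found."""
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((index, "record is not an object"))
            continue
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            # Assumption: every field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append((index, f"missing or empty string field: {field}"))
    return problems


# Made-up records: the first is well-formed, the second is deliberately broken.
records = [
    {"src": "[doctor] hi", "tgt": "CHIEF COMPLAINT", "file": "D2N001-virtassist"},
    {"src": "[doctor] hi", "tgt": ""},
]
problems = validate_records(records)
for index, problem in problems:
    print(f"record {index}: {problem}")
```

Collecting every problem instead of raising on the first one makes it easier to review a whole file in a single pass.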

Usage Examples

import json

# Assumes the working directory is the root of the DocETL repository.
with open("docs/assets/medical_transcripts.json", encoding="utf-8") as f:
    data = json.load(f)

# data is a list of records, each with the string fields src, tgt, and file
print(f"Total transcripts: {len(data)}")
print(f"First transcript source file: {data[0]['file']}")

# Access the raw conversation
conversation = data[0]["src"]

# Access the structured clinical note
clinical_note = data[0]["tgt"]
