Implementation:Ucbepic Docetl ExperimentStructuredOutputs

Knowledge Sources	Ucbepic_Docetl
Domains	Data_Processing, Experimentation, Benchmarking
Last Updated	2026-02-08 00:00 GMT

Overview

Concrete tool, provided by DocETL, for benchmarking structured output methods (JSON schema vs. tool calling) across LLM models.

Description

The structured_outputs experiment module compares two approaches for extracting structured data from LLMs: JSON structured output (using response_format) and tool calling (using function calling with tool_choice). It injects known fruits and vegetables into presidential debate transcripts, asks each model to extract them, and measures precision, recall, F1 score, runtime, and cost for each extraction method across configurable models (e.g., "azure/gpt-4o-mini", "deepseek/deepseek-chat"). Documents are processed in parallel with concurrent.futures.ThreadPoolExecutor. Detailed per-model, per-document results are output as JSON, and summary tables are printed to the console with Rich.
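To make the difference between the two methods concrete, the following is a minimal sketch of the request arguments each one might pass to an OpenAI-compatible chat completion call. It is not taken from the module: the helper names (structured_output_kwargs, tool_calling_kwargs, parse_items), the tool name report_found_items, the hand-written schema, and the lowercase normalization are all assumptions made for illustration. Only the response_format and tool_choice mechanisms themselves come from the description above.

```python
import json

# Hand-written JSON Schema mirroring the FoundItems model (assumption:
# the real module may derive its schema from the Pydantic class instead).
FOUND_ITEMS_SCHEMA = {
    "type": "object",
    "properties": {
        "fruits_and_vegetables": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["fruits_and_vegetables"],
    "additionalProperties": False,
}


def structured_output_kwargs() -> dict:
    """Extra request arguments for the JSON structured output method."""
    return {
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "found_items",  # hypothetical schema name
                "schema": FOUND_ITEMS_SCHEMA,
                "strict": True,
            },
        }
    }


def tool_calling_kwargs() -> dict:
    """Extra request arguments for the tool calling method."""
    tool_name = "report_found_items"  # hypothetical tool name
    return {
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": tool_name,
                    "description": "Report the fruits and vegetables found in the text.",
                    "parameters": FOUND_ITEMS_SCHEMA,
                },
            }
        ],
        # Force the model to call this specific function.
        "tool_choice": {"type": "function", "function": {"name": tool_name}},
    }


def parse_items(raw_json: str) -> set[str]:
    """Parse either the message content or the tool-call arguments string."""
    data = json.loads(raw_json)
    # Lowercasing and stripping are assumed normalization steps.
    return {item.strip().lower() for item in data.get("fruits_and_vegetables", [])}
```

In both cases the model returns a JSON string with the same shape (in the message content for structured output, in the tool call arguments for tool calling), so one parser can serve both methods in this sketch.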

Usage

Use this experiment to evaluate which structured output method (JSON schema or tool calling) performs better for a given LLM model in terms of extraction accuracy, latency, and cost. Results inform the choice of output strategy in DocETL pipeline operations.

Code Reference

Source Location

Repository: Ucbepic_Docetl
File: experiments/structured_outputs.py
Lines: 1-374

Signature

FRUITS_VEGETABLES: list[str]  # 40 items used for injection
MODELS: list[str]  # Models to benchmark
SYSTEM_PROMPT: str
STRUCTURED_SYSTEM_PROMPT: str
PROMPT_TEMPLATE: str

class FoundItems(BaseModel):
    fruits_and_vegetables: list[str]

def load_and_augment_debates(filepath: str, num_samples: int = 20, frac_doc_content: float = 0.5) -> list[dict[str, Any]]: ...

def evaluate_structured_output(model: str, text: str) -> tuple[set[str], float, float]: ...

def evaluate_tool_calling(model: str, text: str) -> tuple[set[str], float, float]: ...

def calculate_metrics(extracted: set[str], ground_truth: set[str]) -> dict[str, float]: ...

def process_document(args) -> dict[str, Any]: ...

def run_experiment(debates_file: str, num_samples: int = 20, max_workers: int = 64): ...

Import

from experiments.structured_outputs import (
    run_experiment,
    evaluate_structured_output,
    evaluate_tool_calling,
    calculate_metrics,
    load_and_augment_debates,
)

I/O Contract

Inputs

Name	Type	Required	Description
debates_file	str	Yes	Path to JSON file containing presidential debate transcripts
num_samples	int	No	Number of debate documents to sample (default: 20)
max_workers	int	No	Maximum parallel workers for concurrent processing (default: 64)
frac_doc_content	float	No	Fraction of document content to use (default: 0.5)
model	str	Yes	LLM model identifier (e.g., "azure/gpt-4o-mini")
text	str	Yes	Augmented debate text with injected items for extraction
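The augmented text input can be pictured with a small sketch of the injection step. This is an illustration under stated assumptions, not the module's code: the function name inject_items, the word-level truncation used to model frac_doc_content, and the random insertion positions are all hypothetical. The point it shows is that the injected items double as the ground truth against which extractions are later scored.

```python
import random


def inject_items(
    text: str,
    items: list[str],
    k: int,
    frac_doc_content: float,
    rng: random.Random,
) -> tuple[str, set[str]]:
    """Keep a leading fraction of the text and insert k known items into it.

    Returns the augmented text and the set of injected items (the ground truth).
    Assumes single-word items; multi-word items would need phrase-level handling.
    """
    words = text.split()
    # Assumption: the fraction is applied as a prefix of the document, by words.
    keep = max(1, int(len(words) * frac_doc_content))
    words = words[:keep]

    chosen = rng.sample(items, k)  # k distinct items, no repeats
    for item in chosen:
        position = rng.randint(0, len(words))  # inclusive; len(words) appends
        words.insert(position, item)

    return " ".join(words), set(chosen)


rng = random.Random(0)  # seeded so the sketch is reproducible
augmented, truth = inject_items(
    "the quick brown fox", ["apple", "kiwi", "leek"], k=2, frac_doc_content=0.5, rng=rng
)
```

Passing an explicit random.Random instance keeps the injection reproducible across runs, which matters when comparing methods on the same documents.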

Outputs

Name	Type	Description
extracted_items	set[str]	Set of extracted fruit/vegetable names
runtime	float	Wall-clock time for the LLM call in seconds
cost	float	LLM API cost for the call
metrics	dict[str, float]	Dictionary with precision, recall, and f1 scores
results	dict	Full experiment results per model, fraction, and method
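The parallel fan-out that produces these outputs can be sketched with the standard library alone. The worker below is a stand-in that makes no LLM call, and the task tuple layout, the run_all helper, and the method labels are assumptions for illustration; the actual process_document argument structure is not documented on this page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_process_document(args: tuple[str, str, int]) -> dict:
    """Stand-in for process_document: unpacks one task and returns a result row."""
    model, method, doc_id = args
    # A real worker would call the LLM here and compute metrics, runtime, and cost.
    return {"model": model, "method": method, "doc_id": doc_id}


def run_all(tasks: list[tuple[str, str, int]], max_workers: int = 4) -> list[dict]:
    """Submit every task to a thread pool and collect results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_process_document, task) for task in tasks]
        for future in as_completed(futures):
            results.append(future.result())  # re-raises any worker exception
    return results


tasks = [
    (model, method, doc_id)
    for model in ["azure/gpt-4o-mini", "deepseek/deepseek-chat"]
    for method in ["structured_output", "tool_calling"]  # hypothetical labels
    for doc_id in range(3)
]
rows = run_all(tasks)
```

Threads suit this workload because each task spends most of its time waiting on a network response. Results arrive in completion order, so rows need to carry their own model, method, and document identifiers for later grouping.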

Usage Examples

from experiments.structured_outputs import run_experiment, calculate_metrics

# Run the full benchmark experiment
run_experiment(
    debates_file="data/presidential_debates.json",
    num_samples=20,
    max_workers=32,
)

# Calculate metrics for a single extraction
extracted = {"apple", "banana", "carrot"}
ground_truth = {"apple", "banana", "fig"}
metrics = calculate_metrics(extracted, ground_truth)
print(f"Precision: {metrics['precision']:.2f}, Recall: {metrics['recall']:.2f}, F1: {metrics['f1']:.2f}")
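For reference, the set-based metrics can be reproduced with a short standalone sketch. The function name sketch_metrics is hypothetical, and returning 0.0 when a denominator is zero is an assumption; calculate_metrics may handle empty sets differently. With the inputs from the example above, two of the three extracted items are correct and two of the three true items are found, so precision, recall, and F1 all equal 2/3.

```python
def sketch_metrics(extracted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 over two sets of item names."""
    true_positives = len(extracted & ground_truth)

    # Assumption: a zero denominator yields 0.0 rather than raising.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    return {"precision": precision, "recall": recall, "f1": f1}


example = sketch_metrics({"apple", "banana", "carrot"}, {"apple", "banana", "fig"})
```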

Related Pages

Environment:Ucbepic_Docetl_Python_Runtime
