Implementation:Ucbepic Docetl ExperimentStructuredOutputs
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Experimentation, Benchmarking |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool, provided by DocETL, for benchmarking structured output methods (JSON schema vs. tool calling) across LLMs.
Description
The structured_outputs experiment module compares two approaches for extracting structured data from LLMs: JSON structured output (via response_format) and tool calling (function calling with tool_choice). It injects known fruit and vegetable names into presidential debate transcripts, so each document carries a ground-truth set, then measures precision, recall, F1 score, runtime, and cost for each extraction method across configurable models (e.g., "azure/gpt-4o-mini", "deepseek/deepseek-chat"). Documents are processed in parallel with concurrent.futures.ThreadPoolExecutor. Detailed per-model, per-document results are output as JSON, and summary tables are printed to the console with Rich.
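The injection step described above can be sketched roughly as follows. This is a minimal illustration, not the module's code: `inject_items`, its `seed` argument, the word-level insertion strategy, and the reading of `frac_doc_content` as "keep the leading fraction of the words" are all assumptions made for the example.

```python
import random


def inject_items(text, items, k, frac_doc_content=0.5, seed=0):
    """Hypothetical helper: return (augmented_text, ground_truth_set)."""
    rng = random.Random(seed)
    words = text.split()
    # Assumed meaning of frac_doc_content: keep only the leading fraction of the words.
    words = words[: max(1, int(len(words) * frac_doc_content))]
    # Pick k distinct items and drop each one at a random word boundary.
    chosen = rng.sample(items, k)
    for item in chosen:
        words.insert(rng.randint(0, len(words)), item)
    return " ".join(words), set(chosen)


transcript = "Thank you for the question. Our plan cuts taxes for working families."
augmented, truth = inject_items(transcript, ["apple", "banana", "carrot", "fig"], k=2)
print(augmented)
print(sorted(truth))
```

Because the injected names are known in advance, the returned set can serve as the ground truth that the extracted items are later scored against.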
Usage
Use this experiment to evaluate which structured output method (JSON schema or tool calling) performs better for a given LLM model in terms of extraction accuracy, latency, and cost. Results inform the choice of output strategy in DocETL pipeline operations.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: experiments/structured_outputs.py
- Lines: 1-374
Signature
FRUITS_VEGETABLES: list[str] # 40 items used for injection
MODELS: list[str] # Models to benchmark
SYSTEM_PROMPT: str
STRUCTURED_SYSTEM_PROMPT: str
PROMPT_TEMPLATE: str
class FoundItems(BaseModel):
    fruits_and_vegetables: list[str]
def load_and_augment_debates(filepath: str, num_samples: int = 20, frac_doc_content: float = 0.5) -> list[dict[str, Any]]: ...
def evaluate_structured_output(model: str, text: str) -> tuple[set[str], float, float]: ...
def evaluate_tool_calling(model: str, text: str) -> tuple[set[str], float, float]: ...
def calculate_metrics(extracted: set[str], ground_truth: set[str]) -> dict[str, float]: ...
def process_document(args) -> dict[str, Any]: ...
def run_experiment(debates_file: str, num_samples: int = 20, max_workers: int = 64): ...
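To make the difference between evaluate_structured_output and evaluate_tool_calling more concrete, the sketch below builds the two kinds of request arguments using OpenAI-style chat completion conventions. It only constructs dictionaries and makes no API call. The helper names, the tool name `report_found_items`, the schema name `found_items`, and the hand-written JSON schema (mirroring the `FoundItems` model) are assumptions for illustration; the experiment may build its requests differently.

```python
import json

# Hand-written JSON schema mirroring FoundItems (assumed shape).
ITEMS_SCHEMA = {
    "type": "object",
    "properties": {
        "fruits_and_vegetables": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["fruits_and_vegetables"],
    "additionalProperties": False,
}


def structured_output_kwargs(schema):
    """Method 1: ask for a JSON message body that follows the schema."""
    return {
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "found_items", "schema": schema, "strict": True},
        }
    }


def tool_calling_kwargs(schema, tool_name="report_found_items"):
    """Method 2: declare a function and force the model to call it."""
    return {
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": tool_name,
                    "description": "Report fruits and vegetables found in the text.",
                    "parameters": schema,
                },
            }
        ],
        "tool_choice": {"type": "function", "function": {"name": tool_name}},
    }


def parse_items(payload: str) -> set[str]:
    """Both methods yield a JSON string; turn it into a set of item names."""
    return set(json.loads(payload)["fruits_and_vegetables"])


print(parse_items('{"fruits_and_vegetables": ["apple", "fig", "apple"]}'))
```

In the OpenAI-style response format, the JSON string arrives in the message content for the first method and in the tool call's arguments for the second, so a single parser like `parse_items` can serve both.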
Import
from experiments.structured_outputs import (
    run_experiment,
    evaluate_structured_output,
    evaluate_tool_calling,
    calculate_metrics,
    load_and_augment_debates,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| debates_file | str | Yes | Path to JSON file containing presidential debate transcripts |
| num_samples | int | No | Number of debate documents to sample (default: 20) |
| max_workers | int | No | Maximum parallel workers for concurrent processing (default: 64) |
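| frac_doc_content | float | No | Fraction of each document's content to use (default: 0.5); parameter of load_and_augment_debates |
| model | str | Yes | LLM model identifier (e.g., "azure/gpt-4o-mini"); parameter of evaluate_structured_output and evaluate_tool_calling |
| text | str | Yes | Augmented debate text containing the injected items; parameter of evaluate_structured_output and evaluate_tool_calling |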
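The role of max_workers can be illustrated with a small fan-out sketch built on concurrent.futures.ThreadPoolExecutor. Everything here except the executor API is hypothetical: `fake_evaluate` stands in for a real LLM call, the method labels "structured" and "tool" are made up, and the task layout (one task per model, document, and method) is an assumption about how the work might be divided.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product


def fake_evaluate(task):
    """Stand-in for one LLM extraction call; returns a result record."""
    model, doc_id, method = task
    return {"model": model, "doc_id": doc_id, "method": method, "f1": 1.0}


def run_all(models, doc_ids, methods, max_workers=4):
    """Run every (model, document, method) combination on a thread pool."""
    tasks = list(product(models, doc_ids, methods))
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fake_evaluate, task) for task in tasks]
        # Collect results as they finish; completion order is not guaranteed.
        for future in as_completed(futures):
            results.append(future.result())
    return results


results = run_all(
    models=["azure/gpt-4o-mini", "deepseek/deepseek-chat"],
    doc_ids=[0, 1, 2],
    methods=["structured", "tool"],
)
print(len(results))
```

Threads suit this workload because each task spends most of its time waiting on a network response, so a large pool such as the default of 64 can keep many requests in flight at once, subject to provider rate limits.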
Outputs
| Name | Type | Description |
|---|---|---|
| extracted_items | set[str] | Set of extracted fruit/vegetable names |
| runtime | float | Wall-clock time for the LLM call in seconds |
| cost | float | LLM API cost for the call |
| metrics | dict[str, float] | Dictionary with precision, recall, and f1 scores |
| results | dict | Full experiment results per model, fraction, and method |
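The metrics entry can be read through the standard set-based definitions of precision, recall, and F1. The function below is a sketch under assumptions, not the module's calculate_metrics: the name `sketch_metrics` is hypothetical, and returning 0.0 when a denominator is zero and comparing names without any case normalization are choices made for this example.

```python
def sketch_metrics(extracted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Hypothetical re-derivation of precision, recall, and F1 on two sets."""
    true_positives = len(extracted & ground_truth)
    # Assumed edge-case handling: an empty denominator yields 0.0.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Two of three extracted items are correct, and two of three true items were found.
scores = sketch_metrics({"apple", "banana", "carrot"}, {"apple", "banana", "fig"})
print({name: round(value, 2) for name, value in scores.items()})
```

With two correct items out of three extracted and three expected, precision, recall, and F1 all come to 2/3.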
Usage Examples
from experiments.structured_outputs import run_experiment, calculate_metrics
# Run the full benchmark experiment
run_experiment(
    debates_file="data/presidential_debates.json",
    num_samples=20,
    max_workers=32,
)
# Calculate metrics for a single extraction
extracted = {"apple", "banana", "carrot"}
ground_truth = {"apple", "banana", "fig"}
metrics = calculate_metrics(extracted, ground_truth)
print(f"Precision: {metrics['precision']:.2f}, Recall: {metrics['recall']:.2f}, F1: {metrics['f1']:.2f}")