Implementation:Ucbepic Docetl ExperimentStructuredOutputs

From Leeroopedia
Revision as of 17:00, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ucbepic_Docetl_ExperimentStructuredOutputs.md)


Knowledge Sources
Domains Data_Processing, Experimentation, Benchmarking
Last Updated 2026-02-08 00:00 GMT

Overview

Concrete tool, provided by DocETL, for benchmarking structured output methods (JSON schema vs. tool calling) across LLMs.

Description

The structured_outputs experiment module compares two approaches for extracting structured data from LLMs: JSON structured output (using response_format) and tool calling (function calling with tool_choice). It injects known fruits and vegetables into presidential debate transcripts, then measures precision, recall, F1 score, runtime, and cost for each extraction method across a configurable list of models (e.g., "azure/gpt-4o-mini", "deepseek/deepseek-chat"). Documents are processed in parallel with concurrent.futures.ThreadPoolExecutor. Detailed per-model, per-document results are written as JSON, and summary tables are printed to the console with Rich.
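The two request shapes being compared can be sketched as follows. This is a minimal illustration, assuming an OpenAI-style chat completions interface (which the response_format and tool_choice parameters suggest). The schema constant, the tool name report_found_items, and the helper functions are hypothetical and are not taken from the module.

```python
import json

# JSON Schema mirroring the FoundItems model (a single list-of-strings field).
FOUND_ITEMS_SCHEMA = {
    "type": "object",
    "properties": {
        "fruits_and_vegetables": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["fruits_and_vegetables"],
    "additionalProperties": False,
}


def structured_output_kwargs() -> dict:
    """Extra request arguments for the JSON structured output path."""
    return {
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "found_items",  # hypothetical schema name
                "schema": FOUND_ITEMS_SCHEMA,
                "strict": True,
            },
        }
    }


def tool_calling_kwargs() -> dict:
    """Extra request arguments for the tool calling path."""
    tool_name = "report_found_items"  # hypothetical tool name
    return {
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": tool_name,
                    "description": "Report the fruits and vegetables found in the text.",
                    "parameters": FOUND_ITEMS_SCHEMA,
                },
            }
        ],
        # Force the model to call this specific tool instead of replying in prose.
        "tool_choice": {"type": "function", "function": {"name": tool_name}},
    }


def parse_items(payload: str) -> set[str]:
    """Parse a JSON string (message content or tool call arguments) into a set."""
    return set(json.loads(payload)["fruits_and_vegetables"])
```

In both paths the model is asked for the same schema. What differs is where the JSON comes back: in the message content for structured output, and in the tool call arguments for tool calling.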

Usage

Use this experiment to evaluate which structured output method (JSON schema or tool calling) performs better for a given model in terms of extraction accuracy, latency, and cost. The results inform the choice of output strategy in DocETL pipeline operations.
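One way to turn per-document results into a method choice is to aggregate them per method, as in the sketch below. The record keys and the numbers are placeholders invented for illustration, not measurements and not the module's actual output format.

```python
from collections import defaultdict
from statistics import mean

# Placeholder per-document records (hypothetical keys and values).
records = [
    {"method": "structured_output", "f1": 0.90, "runtime": 1.2, "cost": 0.0004},
    {"method": "structured_output", "f1": 0.80, "runtime": 1.0, "cost": 0.0004},
    {"method": "tool_calling", "f1": 0.70, "runtime": 1.5, "cost": 0.0006},
    {"method": "tool_calling", "f1": 0.90, "runtime": 1.3, "cost": 0.0006},
]


def summarize(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Group records by method and compute mean F1, mean runtime, and total cost."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        grouped[row["method"]].append(row)
    return {
        method: {
            "mean_f1": mean(r["f1"] for r in group),
            "mean_runtime": mean(r["runtime"] for r in group),
            "total_cost": sum(r["cost"] for r in group),
        }
        for method, group in grouped.items()
    }


summary = summarize(records)
# Pick the method with the highest mean F1; latency and cost can break ties.
best_method = max(summary, key=lambda m: summary[m]["mean_f1"])
print(best_method, summary[best_method])
```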

Code Reference

Source Location

Signature

FRUITS_VEGETABLES: list[str]  # 40 items used for injection
MODELS: list[str]  # Models to benchmark
SYSTEM_PROMPT: str
STRUCTURED_SYSTEM_PROMPT: str
PROMPT_TEMPLATE: str

class FoundItems(BaseModel):
    fruits_and_vegetables: list[str]

def load_and_augment_debates(filepath: str, num_samples: int = 20, frac_doc_content: float = 0.5) -> list[dict[str, any]]: ...

def evaluate_structured_output(model: str, text: str) -> tuple[set[str], float, float]: ...

def evaluate_tool_calling(model: str, text: str) -> tuple[set[str], float, float]: ...

def calculate_metrics(extracted: set[str], ground_truth: set[str]) -> dict[str, float]: ...

def process_document(args) -> dict[str, any]: ...

def run_experiment(debates_file: str, num_samples: int = 20, max_workers: int = 64): ...
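The calculate_metrics signature takes two sets and returns precision, recall, and F1. A minimal sketch of the standard set-based definitions is shown below. The function name calculate_metrics_sketch is hypothetical, and the handling of empty sets (returning 0.0) is an assumption; the module's own implementation may differ, for example in how it normalizes case.

```python
def calculate_metrics_sketch(extracted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 (assumed definitions)."""
    true_positives = len(extracted & ground_truth)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Precision penalizes items the model reports that were never injected, while recall penalizes injected items the model missed.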

Import

from experiments.structured_outputs import (
    run_experiment,
    evaluate_structured_output,
    evaluate_tool_calling,
    calculate_metrics,
    load_and_augment_debates,
)

I/O Contract

Inputs

Name              Type   Required  Description
debates_file      str    Yes       Path to JSON file containing presidential debate transcripts
num_samples       int    No        Number of debate documents to sample (default: 20)
max_workers       int    No        Maximum parallel workers for concurrent processing (default: 64)
frac_doc_content  float  No        Fraction of document content to use (default: 0.5)
model             str    Yes       LLM model identifier (e.g., "azure/gpt-4o-mini")
text              str    Yes       Augmented debate text with injected items for extraction
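How the augmented text input might be produced can be sketched as follows. This is an assumption-heavy illustration: it treats frac_doc_content as the fraction of words kept from the start of the document and inserts items at random word boundaries. The function name augment_text, the seed parameter, and the return shape are hypothetical, and load_and_augment_debates may work differently.

```python
import random


def augment_text(
    text: str,
    items: list[str],
    frac_doc_content: float = 0.5,
    seed: int = 0,
) -> tuple[str, set[str]]:
    """Truncate the text, then inject known items at random word positions.

    Returns the augmented text and the ground-truth set of injected items.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    words = text.split()
    # Assumption: keep the leading fraction of the document, at least one word.
    kept = words[: max(1, int(len(words) * frac_doc_content))]
    for item in items:
        position = rng.randint(0, len(kept))  # inclusive bounds, so append is possible
        kept.insert(position, item)
    return " ".join(kept), set(items)


augmented, truth = augment_text("a b c d e f g h", ["kiwi", "leek"])
print(augmented)
```

Because the injected items are known in advance, the ground-truth set needed for scoring comes for free.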

Outputs

Name             Type              Description
extracted_items  set[str]          Set of extracted fruit/vegetable names
runtime          float             Wall-clock time for the LLM call in seconds
cost             float             LLM API cost for the call
metrics          dict[str, float]  Dictionary with precision, recall, and f1 scores
results          dict              Full experiment results per model, fraction, and method

Usage Examples

from experiments.structured_outputs import run_experiment, calculate_metrics

# Run the full benchmark experiment
run_experiment(
    debates_file="data/presidential_debates.json",
    num_samples=20,
    max_workers=32,
)

# Calculate metrics for a single extraction
extracted = {"apple", "banana", "carrot"}
ground_truth = {"apple", "banana", "fig"}
metrics = calculate_metrics(extracted, ground_truth)
print(f"Precision: {metrics['precision']:.2f}, Recall: {metrics['recall']:.2f}, F1: {metrics['f1']:.2f}")
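The parallel fan-out over models and documents can be sketched with concurrent.futures.ThreadPoolExecutor. In this sketch fake_extract is a hypothetical stand-in for an LLM call that returns the same (items, runtime, cost) tuple shape as the evaluation functions. The result keys and run_parallel itself are also hypothetical and are not the module's process_document or run_experiment.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

KNOWN_ITEMS = {"apple", "kale"}  # tiny stand-in for FRUITS_VEGETABLES


def fake_extract(model: str, text: str) -> tuple[set[str], float, float]:
    """Stand-in for an LLM call: returns (items, runtime in seconds, cost)."""
    start = time.perf_counter()
    items = {word for word in text.lower().split() if word in KNOWN_ITEMS}
    return items, time.perf_counter() - start, 0.0


def run_parallel(models: list[str], docs: list[str], max_workers: int = 4) -> list[dict]:
    """Submit one task per (model, document) pair and collect the results."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(fake_extract, model, doc): (model, doc_id)
            for model in models
            for doc_id, doc in enumerate(docs)
        }
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(futures):
            model, doc_id = futures[future]
            items, runtime, cost = future.result()
            results.append(
                {"model": model, "doc_id": doc_id, "items": items,
                 "runtime": runtime, "cost": cost}
            )
    # Sort so the output is deterministic regardless of completion order.
    return sorted(results, key=lambda r: (r["model"], r["doc_id"]))
```

Threads suit this workload because each task spends most of its time waiting on a network response rather than on CPU work.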
