Implementation: ucbepic/docetl SemanticAccessor Usage
| Knowledge Sources | |
|---|---|
| Domains | Data_Science, LLM_Operations |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
A Pandas DataFrame accessor, provided by DocETL, that exposes LLM-powered semantic operations directly on DataFrames.
Description
The SemanticAccessor class is registered as a Pandas DataFrame accessor at df.semantic. It provides methods for map, filter, reduce, agg, merge, split, gather, and unnest operations. Each method constructs a DocETL operation config, runs it through DSLRunner, records the operation in history, and returns a new DataFrame with results.
Usage
Import docetl.apis.pd_accessors to register the accessor. Then use df.semantic.map(), df.semantic.filter(), etc. on any Pandas DataFrame. Set the model with df.semantic.set_config(default_model="...").
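The registration mechanism itself is standard pandas: importing the module triggers a register_dataframe_accessor decorator, after which the accessor is available on every DataFrame. A minimal sketch of the same pattern with a toy accessor (not part of DocETL, and no LLM calls involved):

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Toy accessor illustrating the registration pattern DocETL uses."""

    def __init__(self, df: pd.DataFrame):
        self._df = df

    def upper(self, column: str) -> pd.DataFrame:
        # Return a new DataFrame, mirroring how df.semantic methods
        # return fresh frames rather than mutating the original.
        out = self._df.copy()
        out[column] = out[column].str.upper()
        return out

df = pd.DataFrame({"text": ["hello", "world"]})
result = df.demo.upper("text")
```

Because registration is a side effect of the import, `import docetl.apis.pd_accessors` must run before the first use of `df.semantic`, even though the module name is never referenced again.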
Code Reference
Source Location
- Repository: docetl
- File: docetl/apis/pd_accessors.py
- Lines: L61-1069
Signature
@pd.api.extensions.register_dataframe_accessor("semantic")
class SemanticAccessor:
    def __init__(self, df: pd.DataFrame): ...
    def set_config(self, **config): ...
    def map(self, prompt: str, output: dict | None = None, **kwargs) -> pd.DataFrame: ...
    def filter(self, prompt: str, **kwargs) -> pd.DataFrame: ...
    def reduce(self, prompt: str, output: dict | None = None,
               reduce_keys: str | list[str] = ["_all"], **kwargs) -> pd.DataFrame: ...
    def agg(self, reduce_prompt: str, ...) -> pd.DataFrame: ...
    def merge(self, right: pd.DataFrame, comparison_prompt: str, **kwargs) -> pd.DataFrame: ...
    def split(self, split_key: str, method: str, method_kwargs: dict, **kwargs) -> pd.DataFrame: ...
    def gather(self, ...) -> pd.DataFrame: ...
    def unnest(self, ...) -> pd.DataFrame: ...

    @property
    def total_cost(self) -> float: ...

    @property
    def history(self) -> list[OpHistory]: ...
Import
import pandas as pd
import docetl.apis.pd_accessors  # Registers the .semantic accessor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | Jinja2 template for LLM operations |
| output | dict | No | Output schema definition |
| model | str | No | Set via set_config(default_model=...) |
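The prompt parameter is a Jinja2 template in which the current row is exposed as input (as the `{{ input.text }}` placeholders in the Usage Examples show). A minimal sketch of how such a template renders against a row, using the jinja2 package directly rather than going through DocETL:

```python
from jinja2 import Template

# A row of the DataFrame, as a plain dict; Jinja2's attribute lookup
# falls back to item lookup, so {{ input.text }} resolves to row["text"].
row = {"text": "Document 1 content"}

template = Template("Extract entities from: {{ input.text }}")
rendered = template.render(input=row)
```

Each row is rendered into its own prompt, so column names referenced in the template must exist in the DataFrame.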
Outputs
| Name | Type | Description |
|---|---|---|
| returns | pd.DataFrame | DataFrame with LLM-derived columns added |
| total_cost | float | Cumulative LLM cost across all operations |
| history | list[OpHistory] | Record of all applied operations |
Usage Examples
import pandas as pd
import docetl.apis.pd_accessors
df = pd.DataFrame({"text": ["Document 1 content", "Document 2 content"]})
df.semantic.set_config(default_model="gpt-4o-mini")
# Map: extract entities
result = df.semantic.map(
prompt="Extract entities from: {{ input.text }}",
output={"schema": {"entities": "list[str]"}}
)
# Filter: keep relevant documents
filtered = result.semantic.filter(
prompt="Is this document about technology? {{ input.text }}"
)
# Check costs
print(f"Total cost: ${filtered.semantic.total_cost:.2f}")