Implementation:Ucbepic Docetl SampleOperation Execute
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Data_Sampling |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for selecting document subsets using various sampling strategies with optional stratification, provided by DocETL.
Description
The SampleOperation class extends BaseOperation to select subsets of documents from the input using six distinct sampling methods: "uniform" (random sampling), "first" (take first N), "outliers" (embedding-based outlier detection via distance from centroid), "custom" (user-provided key-based selection), "top_embedding" (most similar to a query by embedding cosine similarity), and "top_fts" (BM25-based full-text search scoring). All methods except "custom" support stratified sampling by grouping documents on one or more keys and allocating sample budgets proportionally or per group.
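The "outliers" method is described above as measuring each document's distance from the embedding centroid. The following is a minimal sketch of that idea using only the standard library. It assumes embeddings are already available as plain lists of floats, and it assumes a cutoff of mean distance plus `std` standard deviations; the function names are hypothetical and the exact thresholding DocETL applies may differ.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def outlier_indices(vectors, std=2.0):
    """Return indices of vectors lying unusually far from the centroid.

    Assumption for illustration: a vector is an outlier when its Euclidean
    distance to the centroid exceeds mean + std * (population std dev)
    of all distances.
    """
    center = centroid(vectors)
    distances = [math.dist(v, center) for v in vectors]
    mean = sum(distances) / len(distances)
    spread = math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))
    cutoff = mean + std * spread
    return [i for i, d in enumerate(distances) if d > cutoff]


# Nine points at the origin and one far away: only the last is flagged.
vectors = [[0.0, 0.0]] * 9 + [[10.0, 10.0]]
flagged = outlier_indices(vectors, std=2.0)
```

Depending on configuration, an operation built on this idea could either return the flagged documents or return everything except them.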
Usage
Use this operation when you need to reduce dataset size for cost or performance reasons, select representative subsets for analysis, identify outlier documents, or retrieve the most relevant documents by similarity to a query. Typical scenarios include downsampling large datasets before expensive LLM operations, selecting diverse representatives via stratified sampling, or finding documents most related to a search query.
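Retrieving the documents most related to a query, as in the last scenario above, amounts to ranking by cosine similarity between embeddings. The sketch below illustrates that ranking step only. It assumes the document and query embeddings have already been computed by some embedding model; `cosine_similarity` and `top_k_by_similarity` are hypothetical helper names, not part of the DocETL API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(docs, doc_vectors, query_vector, k):
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(
        zip(docs, doc_vectors),
        key=lambda pair: cosine_similarity(pair[1], query_vector),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]


docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k_by_similarity(docs, doc_vectors, query_vector=[1.0, 0.1], k=2)
```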
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/operations/sample.py
- Lines: 1-682
Signature
class SampleOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "sample"
        method: Literal["uniform", "outliers", "custom", "first", "top_embedding", "top_fts"]
        samples: Union[int, float, list] | None = None
        stratify_key: Union[str, list[str]] | None = None
        samples_per_group: bool = False
        method_kwargs: dict[str, Any] | None = Field(default_factory=dict)
        random_state: int | None = Field(None, ge=0)

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...
    def _sample_first(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_uniform(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_outliers(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_custom(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_top_embedding(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_top_fts(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_with_stratification(self, input_data, method) -> tuple[list[dict], float]: ...
Import
from docetl.operations.sample import SampleOperation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_data | List[Dict] | Yes | Documents to sample from |
| method | str | Yes | Sampling method: "uniform", "first", "outliers", "custom", "top_embedding", or "top_fts" |
| samples | int, float, or list | Conditional | Number of samples (int), fraction (float), or custom selection list; required for most methods |
| stratify_key | str or List[str] | No | Key(s) to group documents by for stratified sampling |
| samples_per_group | bool | No | If True, sample N items per group instead of dividing total (default False) |
| method_kwargs | Dict | No | Method-specific parameters (e.g., embedding_keys, std, query, keys) |
| random_state | int | No | Random seed for reproducibility |
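To make the interaction between `samples`, `stratify_key`, and `samples_per_group` concrete, here is a minimal sketch of stratified uniform sampling. It is an illustration under stated assumptions, not DocETL's implementation: an int `samples` is treated as an absolute count, a float as a fraction, and the proportional budget is rounded per group, so group totals may differ slightly from the requested total. The function names are hypothetical.

```python
import random
from collections import defaultdict


def resolve_count(samples, population):
    """Interpret samples as a fraction (float) or absolute count (int)."""
    if isinstance(samples, float):
        return min(int(samples * population), population)
    return min(samples, population)


def stratified_uniform(docs, stratify_key, samples,
                       samples_per_group=False, random_state=None):
    """Uniformly sample within groups defined by one or more keys."""
    rng = random.Random(random_state)
    keys = [stratify_key] if isinstance(stratify_key, str) else list(stratify_key)

    # Group documents by the tuple of their stratification values.
    groups = defaultdict(list)
    for doc in docs:
        groups[tuple(doc.get(k) for k in keys)].append(doc)

    total = resolve_count(samples, len(docs))
    picked = []
    for members in groups.values():
        if samples_per_group:
            # Each group gets its own budget of `samples`.
            n = resolve_count(samples, len(members))
        else:
            # Split the overall budget in proportion to group size.
            n = min(len(members), round(total * len(members) / len(docs)))
        picked.extend(rng.sample(members, n))
    return picked


docs = ([{"cat": "a", "i": i} for i in range(6)]
        + [{"cat": "b", "i": i} for i in range(4)])
proportional = stratified_uniform(docs, "cat", 5, random_state=0)
per_group = stratified_uniform(docs, "cat", 2, samples_per_group=True, random_state=0)
```

With six "a" documents and four "b" documents, a budget of 5 splits 3 to 2 in proportional mode, while `samples_per_group=True` with `samples=2` draws two from each group.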
Outputs
| Name | Type | Description |
|---|---|---|
| output | Tuple[List[Dict], float] | Sampled documents and total cost (cost is 0 for non-embedding methods) |
Usage Examples
# YAML pipeline configuration for uniform sampling
operations:
  - name: downsample
    type: sample
    method: uniform
    samples: 100
    random_state: 42

# Stratified sampling with top embedding retrieval
operations:
  - name: retrieve_relevant
    type: sample
    method: top_embedding
    samples: 10
    stratify_key: category
    samples_per_group: true
    method_kwargs:
      keys: ["title", "content"]
      query: "machine learning applications in healthcare"
      embedding_model: "text-embedding-3-small"
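# Python API usage
from docetl.operations.sample import SampleOperation

config = {
    "name": "find_outliers",
    "type": "sample",
    "method": "outliers",
    "method_kwargs": {
        "embedding_keys": ["text"],
        "std": 2.0,
        "keep": True,
    },
}

# runner, default_model, max_threads, and input_data are assumed to be
# supplied by the surrounding DocETL pipeline; they are not defined here.
sample_op = SampleOperation(runner, config, default_model, max_threads)

# execute returns the sampled documents and the total cost
outliers, cost = sample_op.execute(input_data)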
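
The "top_fts" method listed in the Description ranks documents by BM25 score. As a rough illustration of what such scoring involves, the sketch below implements the common Okapi BM25 formula over whitespace-tokenized text. The tokenizer, the `k1` and `b` defaults, and the function name `bm25_scores` are assumptions made for this example; DocETL may use a different tokenizer, parameters, or a library implementation.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document string against the query with Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization penalizes longer-than-average documents.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "cats chase mice",
    "dogs chase cats and cats",
    "birds sing songs",
]
scores = bm25_scores("cats", documents)
```

Selecting the top N documents then reduces to sorting document indices by these scores in descending order and keeping the first N.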