
Implementation:Ucbepic Docetl SampleOperation Execute

From Leeroopedia


Knowledge Sources
Domains: Data_Processing, Data_Sampling
Last Updated: 2026-02-08 00:00 GMT

Overview

Concrete tool for selecting document subsets using various sampling strategies with optional stratification, provided by DocETL.

Description

The SampleOperation class extends BaseOperation to select subsets of documents from the input using six distinct sampling methods: "uniform" (random sampling), "first" (take first N), "outliers" (embedding-based outlier detection via distance from centroid), "custom" (user-provided key-based selection), "top_embedding" (most similar to a query by embedding cosine similarity), and "top_fts" (BM25-based full-text search scoring). All methods except "custom" support stratified sampling by grouping documents on one or more keys and allocating sample budgets proportionally or per group.
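The "outliers" idea (distance from the centroid of the embeddings) can be illustrated with a small, dependency-free sketch. This is not DocETL's implementation: the function name `centroid_outliers`, the threshold rule (mean distance plus `std` standard deviations), and the meaning given to `keep` are assumptions made for illustration, and the toy vectors stand in for real embeddings.

```python
import math


def centroid_outliers(vectors, std=2.0, keep=True):
    """Return indices of vectors that lie far from the centroid.

    Assumed rule: a vector is an outlier when its Euclidean distance to the
    centroid exceeds mean_distance + std * stdev_distance.
    keep=True returns the outliers; keep=False returns everything else.
    """
    if not vectors:
        return []
    dim = len(vectors[0])
    # Centroid is the per-dimension mean of all vectors.
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    distances = [math.dist(v, centroid) for v in vectors]
    mean = sum(distances) / len(distances)
    # Population standard deviation of the distances.
    spread = math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))
    cutoff = mean + std * spread
    return [i for i, d in enumerate(distances) if (d > cutoff) == keep]


# Toy "embeddings": five points near the origin and one far away.
toy_vectors = [[0, 0], [0.1, 0], [0, 0.1], [-0.1, 0], [0, -0.1], [10, 10]]
outlier_indices = centroid_outliers(toy_vectors, std=1.5)
```

In a real pipeline the vectors would come from an embedding model applied to the configured document keys, which is where the operation's cost would arise.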

Usage

Use this operation when you need to reduce dataset size for cost or performance reasons, select representative subsets for analysis, identify outlier documents, or retrieve the most relevant documents by similarity to a query. Typical scenarios include downsampling large datasets before expensive LLM operations, selecting diverse representatives via stratified sampling, or finding documents most related to a search query.
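The retrieval scenario above (most relevant documents by similarity to a query) amounts to ranking documents by cosine similarity and keeping the top results. The sketch below shows that ranking on precomputed toy vectors; the helper names `cosine_similarity` and `top_k_by_similarity` are hypothetical and are not part of DocETL, which would obtain the vectors from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(docs, doc_vectors, query_vector, k):
    """Return the k documents whose vectors are most similar to the query vector."""
    ranked = sorted(
        zip(docs, doc_vectors),
        key=lambda pair: cosine_similarity(pair[1], query_vector),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]


# Toy data: vectors are made up; a real run would embed the documents and the query.
docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top_docs = top_k_by_similarity(docs, doc_vectors, query_vector=[1.0, 0.0], k=2)
```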

Code Reference

Source Location

Signature

class SampleOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "sample"
        method: Literal["uniform", "outliers", "custom", "first", "top_embedding", "top_fts"]
        samples: Union[int, float, list] | None = None
        stratify_key: Union[str, list[str]] | None = None
        samples_per_group: bool = False
        method_kwargs: dict[str, Any] | None = Field(default_factory=dict)
        random_state: int | None = Field(None, ge=0)

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...
    def _sample_first(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_uniform(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_outliers(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_custom(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_top_embedding(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_top_fts(self, input_data) -> tuple[list[dict], float]: ...
    def _sample_with_stratification(self, input_data, method) -> tuple[list[dict], float]: ...

Import

from docetl.operations.sample import SampleOperation

I/O Contract

Inputs

Name | Type | Required | Description
input_data | List[Dict] | Yes | Documents to sample from
method | str | Yes | Sampling method: "uniform", "first", "outliers", "custom", "top_embedding", or "top_fts"
samples | int, float, or list | Conditional | Number of samples (int), fraction (float), or custom selection list; required for most methods
stratify_key | str or List[str] | No | Key(s) to group documents by for stratified sampling
samples_per_group | bool | No | If True, sample N items per group instead of dividing the total across groups (default False)
method_kwargs | Dict | No | Method-specific parameters (e.g., embedding_keys, std, query, keys)
random_state | int | No | Random seed for reproducibility
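How `samples`, `stratify_key`, `samples_per_group`, and `random_state` might interact can be sketched in plain Python. This is an illustrative approximation rather than DocETL's code: the helpers `resolve_count` and `stratified_uniform_sample` are hypothetical, and the proportional allocation with simple rounding is an assumption (with rounding, the group budgets need not always sum exactly to the requested total).

```python
import random
from collections import defaultdict


def resolve_count(samples, population_size):
    """Treat an int as an absolute count and a float as a fraction of the population."""
    if isinstance(samples, float):
        count = int(samples * population_size)
    else:
        count = int(samples)
    # Never request more items than exist, and never a negative number.
    return max(0, min(count, population_size))


def stratified_uniform_sample(docs, stratify_key, samples,
                              samples_per_group=False, random_state=None):
    """Uniformly sample within groups defined by one or more keys."""
    rng = random.Random(random_state)  # seeded generator for reproducibility
    keys = [stratify_key] if isinstance(stratify_key, str) else list(stratify_key)

    groups = defaultdict(list)
    for doc in docs:
        groups[tuple(doc.get(k) for k in keys)].append(doc)

    selected = []
    for members in groups.values():
        if samples_per_group:
            # The budget applies to each group independently.
            budget = resolve_count(samples, len(members))
        else:
            # Assumed rule: split the total budget in proportion to group size.
            total = resolve_count(samples, len(docs))
            budget = min(len(members), round(total * len(members) / len(docs)))
        selected.extend(rng.sample(members, budget))
    return selected


# Toy dataset: six documents in category "a" and four in category "b".
toy_docs = [{"id": i, "category": "a" if i < 6 else "b"} for i in range(10)]
proportional = stratified_uniform_sample(toy_docs, "category", 5, random_state=42)
per_group = stratified_uniform_sample(
    toy_docs, "category", 2, samples_per_group=True, random_state=42
)
```

With this toy data, the proportional call draws three documents from "a" and two from "b", while the per-group call draws two from each category.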

Outputs

Name | Type | Description
output | Tuple[List[Dict], float] | Sampled documents and total cost (cost is 0 for non-embedding methods)

Usage Examples

# YAML pipeline configuration for uniform sampling
operations:
  - name: downsample
    type: sample
    method: uniform
    samples: 100
    random_state: 42

# Stratified sampling with top embedding retrieval
operations:
  - name: retrieve_relevant
    type: sample
    method: top_embedding
    samples: 10
    stratify_key: category
    samples_per_group: true
    method_kwargs:
      keys: ["title", "content"]
      query: "machine learning applications in healthcare"
      embedding_model: "text-embedding-3-small"

# Python API usage
from docetl.operations.sample import SampleOperation

config = {
    "name": "find_outliers",
    "type": "sample",
    "method": "outliers",
    "method_kwargs": {
        "embedding_keys": ["text"],
        "std": 2.0,
        "keep": True,
    },
}
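# runner, default_model, max_threads, and input_data are assumed to be
# provided by the surrounding pipeline; they are not defined in this snippet.
sample_op = SampleOperation(runner, config, default_model, max_threads)
outliers, cost = sample_op.execute(input_data)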
