Principle: Eventual Inc Daft AI Text Embedding
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Machine_Learning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Technique for computing dense vector embeddings of text data within a distributed dataframe.
Description
Text embedding converts natural language text into fixed-dimensional numerical vectors that capture semantic meaning. In Daft, this is implemented as an expression-level operation that can be applied to any string column, producing a fixed-size list of floating point values (an embedding vector) for each row.
Key capabilities include:
- Multi-provider support: Embeddings can be computed using local models (via the `transformers` provider) or remote API services (via the `openai` provider or other OpenAI-compatible APIs).
- Configurable dimensions: For models and providers that support it, the output embedding dimensionality can be explicitly specified via the `dimensions` parameter.
- Batch processing: Embeddings are computed in batches for efficiency, with configurable batch sizes determined by the provider's UDF options.
- Async/sync execution: The embedding function automatically selects between synchronous and asynchronous execution based on the underlying provider (e.g., async for API-based providers, sync for local model inference).
- GPU support: For local model inference (e.g., via `transformers`), GPU resources can be allocated to the UDF workers.
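The multi-provider design with a configurable `dimensions` parameter can be sketched in plain Python. Everything below (`ToyEmbedder`, the `PROVIDERS` registry, `resolve_embedder`) is a hypothetical illustration of the shape of such a design, not Daft's actual API:

```python
import hashlib

class ToyEmbedder:
    """Deterministic stand-in for a local or API-backed embedding model."""

    def __init__(self, dimensions: int):
        self.dimensions = dimensions

    def embed(self, text: str) -> list[float]:
        # Hash-derived pseudo-embedding: fixed-size output for any input length.
        digest = hashlib.sha256(text.encode()).digest()
        return [digest[i % len(digest)] / 255.0 for i in range(self.dimensions)]

# Provider registry: name -> embedder factory. Local vs. remote inference is
# hidden behind the same interface, mirroring the multi-provider design.
PROVIDERS = {
    "transformers": ToyEmbedder,  # stands in for local model inference
    "openai": ToyEmbedder,        # stands in for an OpenAI-compatible API
}

def resolve_embedder(provider: str = "transformers", dimensions: int = 384):
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return PROVIDERS[provider](dimensions)

vec = resolve_embedder("openai", dimensions=8).embed("hello world")
```

The key design point is that callers pick a provider by name and a target dimensionality, and always get back a fixed-size vector regardless of which backend served the request.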
Usage
Use this technique when you need to generate vector representations of text for:
- Semantic search and similarity matching
- Clustering text documents by topic or meaning
- Building recommendation systems based on content similarity
- Creating feature vectors for downstream ML models
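As a concrete illustration of the semantic-search use case, here is a minimal nearest-neighbor lookup over toy vectors using cosine similarity (pure Python; the hand-written 3-dimensional vectors stand in for model-produced embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings"; in practice these come from an embedding model.
corpus = {
    "cats and dogs": [0.9, 0.1, 0.0],
    "pets at home": [0.8, 0.2, 0.1],
    "stock market news": [0.0, 0.1, 0.95],
}

query = [0.85, 0.15, 0.05]  # e.g., the embedding of "animal companions"

# Nearest neighbor = document whose vector is most similar to the query.
best = max(corpus, key=lambda doc: cosine_similarity(query, corpus[doc]))
```

The semantically related documents score close to 1.0 while the unrelated one scores near 0, which is exactly the property nearest-neighbor search exploits.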
Theoretical Basis
Text embedding is based on dense vector representation learning where semantically similar texts have proximate vector representations in the embedding space:
- Contextual encoding: Modern embedding models (BERT, sentence-transformers, etc.) encode the full context of a text passage into a single vector, capturing nuances of meaning beyond individual words.
- Metric space properties: The resulting embedding space typically preserves semantic similarity through distance metrics (cosine similarity, Euclidean distance), enabling nearest-neighbor search for related content.
- Fixed-dimensional output: Regardless of input text length, the output vector has a fixed number of dimensions (e.g., 384, 768, 1536), enabling uniform storage and efficient computation.
- Transfer learning: Pre-trained embedding models encode general language understanding and can be applied to domain-specific text without fine-tuning.
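One useful consequence of the metric-space properties above: on unit-normalized vectors, Euclidean distance and cosine similarity induce the same nearest-neighbor ordering, because ||a - b||^2 = 2 - 2 cos(a, b). A small pure-Python check of both the identity and the ranking agreement:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # For unit vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

q = normalize([0.7, 0.3, 0.1])
docs = [normalize(v) for v in ([0.6, 0.4, 0.0], [0.1, 0.2, 0.9], [0.5, 0.5, 0.2])]

# Ranking by smallest distance equals ranking by largest similarity.
by_distance = sorted(range(len(docs)), key=lambda i: euclidean(q, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(q, docs[i]))
```

This is why vector stores can freely offer either metric over normalized embeddings without changing retrieval results.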
Pseudocode:
1. Resolve provider (explicit -> session -> environment -> default "transformers")
2. Load text embedder descriptor from provider
3. Determine output dtype: FixedSizeList[Float32; dimensions]
4. Create class-based UDF with concurrency and GPU config
5. For each batch of text in partition:
a. If async provider: send batch to API, await responses
b. If sync provider: run local model inference on batch
c. Return embedding vectors as FixedSizeList column
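The pseudocode above can be sketched in plain Python, with `asyncio` standing in for an API-based provider. The provider names, batch size, and embedding function here are illustrative stand-ins, not Daft's internal implementation:

```python
import asyncio

DEFAULT_DIMS = 4
BATCH_SIZE = 2

def local_embed_batch(texts):
    # Sync path: stand-in for local model inference (step 5b).
    return [[float(len(t) % 7)] * DEFAULT_DIMS for t in texts]

async def api_embed_batch(texts):
    # Async path: stand-in for a remote embedding API call (step 5a).
    await asyncio.sleep(0)  # simulate a network round-trip
    return [[float(len(t) % 7)] * DEFAULT_DIMS for t in texts]

def embed_column(texts, provider="transformers"):
    """Resolve provider (step 1), batch the input, and dispatch sync/async."""
    batches = [texts[i:i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]
    if provider == "openai":  # async, API-based provider
        async def run():
            results = await asyncio.gather(*(api_embed_batch(b) for b in batches))
            return [vec for batch in results for vec in batch]
        out = asyncio.run(run())
    else:  # sync, local inference
        out = []
        for b in batches:
            out.extend(local_embed_batch(b))
    # Step 3: every row yields a fixed-size vector (FixedSizeList-like output).
    assert all(len(v) == DEFAULT_DIMS for v in out)
    return out

vectors = embed_column(["a", "bb", "ccc"], provider="openai")
```

The batching plus sync/async split mirrors steps 4 and 5: API providers fan batches out concurrently, while local inference runs batch-by-batch on the worker.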