Implementation:Eventual Inc Daft AI Embed Text
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Machine_Learning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete tool, provided by the Daft library, for computing text embeddings on DataFrame columns.
Description
The embed_text function returns an expression that embeds text using a specified embedding model and provider. It supports both local model inference (via the transformers provider) and remote API-based embedding (via the openai provider or other compatible APIs). The function automatically selects between synchronous and asynchronous execution based on the provider, and supports configurable output dimensions, batch sizes, GPU allocation, and concurrency.
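The provider-dependent choice between synchronous and asynchronous execution can be pictured with a toy sketch. This is not Daft's implementation: `ASYNC_PROVIDERS`, `fake_local_embed`, `fake_remote_embed`, and `embed_batch` are hypothetical names, and the embedding functions return dummy vectors rather than real model output.

```python
import asyncio

# Hypothetical set: providers assumed to call a remote API, where async I/O helps.
ASYNC_PROVIDERS = {"openai"}


def fake_local_embed(texts, dimensions):
    # Stand-in for local model inference: one dummy fixed-size vector per text.
    return [[float(len(t))] * dimensions for t in texts]


async def fake_remote_embed(texts, dimensions):
    # Stand-in for a remote API call; sleep(0) yields to the event loop once.
    await asyncio.sleep(0)
    return [[float(len(t))] * dimensions for t in texts]


def embed_batch(texts, provider="transformers", dimensions=4):
    """Dispatch to a sync or async path depending on the provider."""
    if provider in ASYNC_PROVIDERS:
        return asyncio.run(fake_remote_embed(texts, dimensions))
    return fake_local_embed(texts, dimensions)


local = embed_batch(["hi", "daft"], provider="transformers")
remote = embed_batch(["hi", "daft"], provider="openai", dimensions=2)
print(local)
print(remote)
```

In both paths the caller sees the same shape of result, one vector of `dimensions` floats per input string, which is the property that lets a single function front several providers.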
Usage
Import and use this function when you need to compute dense vector embeddings of text data for semantic search, clustering, or downstream ML tasks.
Code Reference
Source Location
- Repository: Daft
- File: daft/functions/ai/__init__.py
- Lines: L72-154
Signature
def embed_text(
    text: Expression,
    *,
    provider: str | Provider | None = None,
    model: str | None = None,
    dimensions: int | None = None,
    **options: Unpack[EmbedTextOptions],
) -> Expression
Import
from daft.functions.ai import embed_text
# or
from daft.functions import embed_text
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | Expression (String) | Yes | The input text column expression to embed. |
| provider | str \| Provider \| None | No | The embedding provider (e.g., "transformers", "openai"). Defaults to "transformers" when not specified. |
| model | str \| None | No | The embedding model name (e.g., "sentence-transformers/all-MiniLM-L6-v2"). If None, the provider's default model is used. |
| dimensions | int \| None | No | Number of output embedding dimensions, if the provider and model support specifying it. If None, the model's default is used. |
| **options | EmbedTextOptions | No | Additional provider-specific options (e.g., batch_size, concurrency). |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Expression (FixedSizeList[Float32]) | An Embedding expression containing fixed-size float vectors representing the text embeddings. |
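Because every row holds a vector of the same length, embeddings can be compared directly, which is what makes them useful for semantic search. The following is a minimal sketch in plain Python that uses made-up 3-dimensional vectors in place of real model output; `cosine_similarity` and `rank_by_similarity` are illustrative helpers, not Daft functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, docs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up embeddings; real ones would come from the "embeddings" column.
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}
query = [1.0, 0.1, 0.0]

ranking = rank_by_similarity(query, docs)
print([name for name, _ in ranking])
```

The length check mirrors the fixed-size guarantee of the output type: vectors produced by different models or with different `dimensions` values are not comparable.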
Usage Examples
Basic Usage
import daft
from daft.functions import embed_text

df = daft.from_pydict({"text": ["Hello world", "Daft is a distributed dataframe"]})

# Add an embedding column computed locally via the transformers provider.
df = df.with_column(
    "embeddings",
    embed_text(
        daft.col("text"),
        provider="transformers",
        model="sentence-transformers/all-MiniLM-L6-v2",
    ),
)
df.show()
Using OpenAI Provider
import daft
from daft.functions import embed_text

df = daft.from_pydict({"text": ["semantic search query", "document to embed"]})

# Request 256-dimensional embeddings from the remote openai provider.
df = df.with_column(
    "embeddings",
    embed_text(
        daft.col("text"),
        provider="openai",
        model="text-embedding-3-small",
        dimensions=256,
    ),
)
df.show()