Principle: Eventual Inc Daft AI Image Embedding
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Computer_Vision |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Technique for computing dense vector embeddings of image data within a distributed dataframe.
Description
Image embedding converts visual data into fixed-dimensional numerical vectors that capture visual semantics. In Daft, this is implemented as an expression-level operation applied to Image columns, producing a fixed-size list of floating point values for each image.
Key capabilities include:
- Multi-provider support: Image embeddings can be computed using local models (via the `transformers` provider with models like `apple/aimv2-large-patch14-224-lit` or CLIP variants) or remote API services.
- Image preprocessing pipeline: Images typically need to be decoded, converted to RGB, and resized before embedding. Daft's expression chaining (e.g., `.convert_image("RGB").resize(288, 288)`) enables this as a single pipeline.
- Batch processing: Images are embedded in batches for efficiency, leveraging GPU parallelism for local models or API batching for remote services.
- Async/sync execution: As with text embeddings, the operation automatically selects synchronous execution for local models and asynchronous execution for API-based providers.
- GPU support: Local vision model inference benefits from GPU acceleration, with configurable GPU allocation per UDF worker.
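The preprocess-then-batch flow described above can be sketched in plain Python. This is a minimal illustration, not Daft's implementation: the `Image` dataclass and the `convert_to_rgb`, `resize`, and `batched` helpers are hypothetical stand-ins for Daft's `.convert_image("RGB")` and `.resize(288, 288)` expressions and its internal batching.

```python
from dataclasses import dataclass

# Hypothetical in-memory image record; Daft stores these in an Image column.
@dataclass
class Image:
    mode: str    # channel layout, e.g. "RGB", "RGBA", "L"
    width: int
    height: int

def convert_to_rgb(img: Image) -> Image:
    """Normalize channel layout so every image has the same 3 channels."""
    return Image(mode="RGB", width=img.width, height=img.height)

def resize(img: Image, w: int, h: int) -> Image:
    """Force a fixed resolution so the model sees uniform input dimensions."""
    return Image(mode=img.mode, width=w, height=h)

def batched(items, batch_size):
    """Yield fixed-size batches; the embedder processes one batch per call."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Mirrors the single-pipeline chain .convert_image("RGB").resize(288, 288):
images = [Image("RGBA", 640, 480), Image("L", 1024, 768), Image("RGB", 288, 288)]
prepped = [resize(convert_to_rgb(img), 288, 288) for img in images]
batches = list(batched(prepped, batch_size=2))
```

Running the preprocessing before batching keeps every batch homogeneous, which is what lets local models exploit GPU parallelism on a whole batch at once.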
Usage
Use this technique when you need to generate vector representations of images for:
- Visual search and image similarity matching
- Multimodal applications combining image and text embeddings
- Image clustering and deduplication
- Building visual recommendation systems
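All four use cases above reduce to nearest-neighbor search over embedding vectors. A minimal sketch of that step, using stdlib-only cosine similarity (the `top_k` helper and the toy 3-dimensional vectors are illustrative assumptions, not part of any Daft API):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Rank corpus embeddings by similarity to the query embedding."""
    scored = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(corpus)]
    return [idx for _, idx in sorted(scored, reverse=True)[:k]]

query = [1.0, 0.0, 0.0]
corpus = [
    [0.9, 0.1, 0.0],  # near-duplicate of the query image
    [0.0, 1.0, 0.0],  # unrelated image
    [0.7, 0.7, 0.0],  # partially similar image
]
matches = top_k(query, corpus, k=2)  # -> [0, 2]
```

For deduplication, the same similarity score is compared against a threshold (e.g., treat pairs above 0.99 as duplicates) instead of taking the top-k.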
Theoretical Basis
Image embedding is based on visual feature extraction using pre-trained vision models that map images to dense vector spaces:
- Vision transformer encoding: Modern image embedding models (ViT, CLIP, AIMv2) divide images into patches, process them through transformer layers, and produce a single vector representation capturing high-level visual features.
- Contrastive learning: Many image embedding models are trained using contrastive objectives (e.g., CLIP) that align image and text representations in a shared embedding space, enabling cross-modal similarity.
- Resolution normalization: Input images are resized to a fixed resolution (e.g., 224x224, 288x288) before encoding, ensuring consistent input dimensions for the model.
- Spatial feature aggregation: The model aggregates spatial features from all image patches into a single global vector, capturing both local details and global composition.
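The patch-and-aggregate mechanics above can be made concrete with some small arithmetic. This sketch assumes mean pooling as the aggregation step; actual models vary (CLIP-style encoders instead use a class token or attention pooling), and both helper functions are illustrative, not from any library:

```python
def patch_grid(image_size: int, patch_size: int):
    """Number of patches a ViT-style encoder produces per side and in total."""
    assert image_size % patch_size == 0, "resolution must divide evenly into patches"
    per_side = image_size // patch_size
    return per_side, per_side * per_side

def mean_pool(patch_features):
    """Aggregate per-patch feature vectors into a single global embedding
    by averaging each dimension across all patches."""
    dim = len(patch_features[0])
    n = len(patch_features)
    return [sum(f[d] for f in patch_features) / n for d in range(dim)]

# A 224x224 image with 14x14 patches yields a 16x16 grid = 256 patch vectors,
# which is why resolution normalization matters: the patch count is fixed.
per_side, total = patch_grid(224, 14)  # -> (16, 256)

# Toy 2-dimensional features for 3 patches; real models use hundreds of dims.
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
global_vec = mean_pool(patches)  # -> [3.0, 4.0]
```

The pooled vector has the model's embedding dimension regardless of how many patches the image produced, which is what makes the output a fixed-size list.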
Pseudocode:
1. Resolve provider (explicit -> session -> environment -> default "transformers")
2. Load image embedder descriptor from provider
3. Determine output dtype: FixedSizeList[Float32; dimensions]
4. Create class-based UDF with concurrency and GPU config
5. For each batch of images in partition:
a. Preprocess images (resize, normalize)
b. If async provider: send batch to API, await responses
c. If sync provider: run local vision model inference on batch
d. Return embedding vectors as FixedSizeList column
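The pseudocode's provider-resolution chain and sync/async dispatch can be sketched as follows. Everything here is a stand-in: the `EMBED_PROVIDER` environment variable, the two `embed_batch_*` stubs, and the toy string "images" are assumptions for illustration, not Daft internals.

```python
import asyncio
import os

DEFAULT_PROVIDER = "transformers"

def resolve_provider(explicit=None, session=None):
    """Step 1: explicit arg -> session setting -> environment -> default."""
    return explicit or session or os.environ.get("EMBED_PROVIDER") or DEFAULT_PROVIDER

def embed_batch_local(batch):
    """Sync path (step 5c): stand-in for local vision-model inference."""
    return [[float(len(img))] for img in batch]

async def embed_batch_api(batch):
    """Async path (step 5b): stand-in for a remote embedding API call."""
    await asyncio.sleep(0)  # placeholder for network latency
    return [[float(len(img))] for img in batch]

def embed_partition(images, provider, batch_size=2):
    """Steps 5a-5d: batch the partition and dispatch sync vs. async,
    collecting one fixed-size vector per input image."""
    out = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size]
        if provider == "transformers":
            out.extend(embed_batch_local(batch))
        else:
            out.extend(asyncio.run(embed_batch_api(batch)))
    return out
```

Both paths return the same shape of output, so downstream steps (writing the `FixedSizeList` column) are identical regardless of which provider was resolved.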