Principle: Eventual Inc Daft AI Image Embedding
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Computer_Vision |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Technique for computing dense vector embeddings of image data within a distributed dataframe.
Description
Image embedding converts visual data into fixed-dimensional numerical vectors that capture visual semantics. In Daft, this is implemented as an expression-level operation applied to Image columns, producing a fixed-size list of floating point values for each image.
Key capabilities include:
- Multi-provider support: Image embeddings can be computed using local models (via the `transformers` provider with models like `apple/aimv2-large-patch14-224-lit` or CLIP variants) or remote API services.
- Image preprocessing pipeline: Images typically need to be decoded, converted to RGB, and resized before embedding. Daft's expression chaining (e.g., `.convert_image("RGB").resize(288, 288)`) enables this as a single pipeline.
- Batch processing: Images are embedded in batches for efficiency, leveraging GPU parallelism for local models or API batching for remote services.
- Async/sync execution: As with text embeddings, the operation automatically selects synchronous execution for local models and asynchronous execution for API-based providers.
- GPU support: Local vision model inference benefits from GPU acceleration, with configurable GPU allocation per UDF worker.
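The preprocess-then-batch flow described above can be sketched in plain Python. This is a minimal illustration, not Daft's implementation: the `Image` dataclass and the `convert_to_rgb`, `resize`, and `batched` helpers are hypothetical stand-ins for Daft's `.convert_image("RGB")` and `.resize(288, 288)` expressions and its internal batching.

```python
from dataclasses import dataclass

# Hypothetical in-memory image record; Daft stores these in an Image column.
@dataclass
class Image:
    mode: str    # channel layout, e.g. "RGB", "RGBA", "L"
    width: int
    height: int

def convert_to_rgb(img: Image) -> Image:
    """Normalize channel layout so every image has the same 3 channels."""
    return Image(mode="RGB", width=img.width, height=img.height)

def resize(img: Image, w: int, h: int) -> Image:
    """Force a fixed resolution so the model sees uniform input dimensions."""
    return Image(mode=img.mode, width=w, height=h)

def batched(items, batch_size):
    """Yield fixed-size batches; the embedder processes one batch per call."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Mirrors the single-pipeline chain .convert_image("RGB").resize(288, 288):
images = [Image("RGBA", 640, 480), Image("L", 1024, 768), Image("RGB", 288, 288)]
prepped = [resize(convert_to_rgb(img), 288, 288) for img in images]
batches = list(batched(prepped, batch_size=2))
```

Running the preprocessing before batching keeps every batch homogeneous, which is what lets local models exploit GPU parallelism on a whole batch at once.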
Usage
Use this technique when you need to generate vector representations of images for:
- Visual search and image similarity matching
- Multimodal applications combining image and text embeddings
- Image clustering and deduplication
- Building visual recommendation systems
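All four use cases above reduce to nearest-neighbor search over embedding vectors. A minimal sketch of that step, using stdlib-only cosine similarity (the `top_k` helper and the toy 3-dimensional vectors are illustrative assumptions, not part of any Daft API):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Rank corpus embeddings by similarity to the query embedding."""
    scored = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(corpus)]
    return [idx for _, idx in sorted(scored, reverse=True)[:k]]

query = [1.0, 0.0, 0.0]
corpus = [
    [0.9, 0.1, 0.0],  # near-duplicate of the query image
    [0.0, 1.0, 0.0],  # unrelated image
    [0.7, 0.7, 0.0],  # partially similar image
]
matches = top_k(query, corpus, k=2)  # -> [0, 2]
```

For deduplication, the same similarity score is compared against a threshold (e.g., treat pairs above 0.99 as duplicates) instead of taking the top-k.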
Theoretical Basis
Image embedding is based on visual feature extraction using pre-trained vision models that map images to dense vector spaces:
- Vision transformer encoding: Modern image embedding models (ViT, CLIP, AIMv2) divide images into patches, process them through transformer layers, and produce a single vector representation capturing high-level visual features.
- Contrastive learning: Many image embedding models are trained using contrastive objectives (e.g., CLIP) that align image and text representations in a shared embedding space, enabling cross-modal similarity.
- Resolution normalization: Input images are resized to a fixed resolution (e.g., 224x224, 288x288) before encoding, ensuring consistent input dimensions for the model.
- Spatial feature aggregation: The model aggregates spatial features from all image patches into a single global vector, capturing both local details and global composition.
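The patch-and-aggregate mechanics above can be made concrete with some small arithmetic. This sketch assumes mean pooling as the aggregation step; actual models vary (CLIP-style encoders instead use a class token or attention pooling), and both helper functions are illustrative, not from any library:

```python
def patch_grid(image_size: int, patch_size: int):
    """Number of patches a ViT-style encoder produces per side and in total."""
    assert image_size % patch_size == 0, "resolution must divide evenly into patches"
    per_side = image_size // patch_size
    return per_side, per_side * per_side

def mean_pool(patch_features):
    """Aggregate per-patch feature vectors into a single global embedding
    by averaging each dimension across all patches."""
    dim = len(patch_features[0])
    n = len(patch_features)
    return [sum(f[d] for f in patch_features) / n for d in range(dim)]

# A 224x224 image with 14x14 patches yields a 16x16 grid = 256 patch vectors,
# which is why resolution normalization matters: the patch count is fixed.
per_side, total = patch_grid(224, 14)  # -> (16, 256)

# Toy 2-dimensional features for 3 patches; real models use hundreds of dims.
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
global_vec = mean_pool(patches)  # -> [3.0, 4.0]
```

The pooled vector has the model's embedding dimension regardless of how many patches the image produced, which is what makes the output a fixed-size list.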
Pseudocode:
1. Resolve provider (explicit -> session -> environment -> default "transformers")
2. Load image embedder descriptor from provider
3. Determine output dtype: FixedSizeList[Float32; dimensions]
4. Create class-based UDF with concurrency and GPU config
5. For each batch of images in partition:
a. Preprocess images (resize, normalize)
b. If async provider: send batch to API, await responses
c. If sync provider: run local vision model inference on batch
d. Return embedding vectors as FixedSizeList column
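The pseudocode's provider-resolution chain and sync/async dispatch can be sketched as follows. Everything here is a stand-in: the `EMBED_PROVIDER` environment variable, the two `embed_batch_*` stubs, and the toy string "images" are assumptions for illustration, not Daft internals.

```python
import asyncio
import os

DEFAULT_PROVIDER = "transformers"

def resolve_provider(explicit=None, session=None):
    """Step 1: explicit arg -> session setting -> environment -> default."""
    return explicit or session or os.environ.get("EMBED_PROVIDER") or DEFAULT_PROVIDER

def embed_batch_local(batch):
    """Sync path (step 5c): stand-in for local vision-model inference."""
    return [[float(len(img))] for img in batch]

async def embed_batch_api(batch):
    """Async path (step 5b): stand-in for a remote embedding API call."""
    await asyncio.sleep(0)  # placeholder for network latency
    return [[float(len(img))] for img in batch]

def embed_partition(images, provider, batch_size=2):
    """Steps 5a-5d: batch the partition and dispatch sync vs. async,
    collecting one fixed-size vector per input image."""
    out = []
    for i in range(0, len(images), batch_size):
        batch = images[i:i + batch_size]
        if provider == "transformers":
            out.extend(embed_batch_local(batch))
        else:
            out.extend(asyncio.run(embed_batch_api(batch)))
    return out
```

Both paths return the same shape of output, so downstream steps (writing the `FixedSizeList` column) are identical regardless of which provider was resolved.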