Workflow:Eventual Inc Daft Multimodal AI Batch Inference

Knowledge Sources	Daft, Daft Docs, AI Functions, Batch Inference Guide
Domains	AI_Ops, Data_Engineering, Multimodal_Processing
Last Updated	2026-02-08 14:00 GMT

Overview

End-to-end process for running AI inference at scale on multimodal data (text, images, audio) using Daft's built-in AI functions and DataFrame API.

Description

This workflow outlines the standard procedure for performing batch AI inference on datasets containing mixed data modalities. It leverages Daft's native multimodal type system (Image, Audio, Tensor) alongside built-in AI functions (prompt, embed_text, embed_image, classify_text, classify_image) to process data through LLM and embedding models. The process covers data ingestion from any supported source, multimodal preprocessing (URL download, image decoding, text extraction), AI model invocation with automatic batching and parallelism, structured output extraction, and result persistence. Daft handles batching, retries, and parallelization automatically, so the same pipeline scales from a handful of records to millions.

Usage

Execute this workflow when you have a dataset containing text, images, or other media and need to enrich it with AI-generated outputs such as classifications, embeddings, structured extractions, or free-form text generation. Typical triggers include:

You have product images and need to classify or describe them at scale
You have a text corpus and need to generate embeddings for semantic search
You have documents and need structured data extraction via LLM prompts
You need to run inference across thousands or millions of records with automatic parallelism

Execution Steps

Step 1: Data Ingestion

Load the source dataset into a Daft DataFrame from any supported format. Daft supports reading from Parquet, CSV, JSON, HuggingFace Hub, Iceberg, Delta Lake, SQL databases, and cloud storage (S3, GCS, Azure). The DataFrame is created lazily, meaning no data is loaded until an action triggers execution.

Key considerations:

Choose the appropriate reader function for your data format
Configure IOConfig for cloud storage credentials if needed
Use column projection (select) early to avoid loading unnecessary data
Apply limit() for iterative development before scaling to the full dataset

Step 2: Data Preprocessing

Transform and prepare the raw data for AI model consumption. This step handles multimodal preprocessing including downloading remote files, decoding binary data into typed objects (images, audio), extracting text from structured fields, and filtering or cleaning records.

Key considerations:

Use url.download() to fetch remote files (images, audio, PDFs)
Use decode_image() to convert binary data into the Daft Image type for visual display and processing
Use regexp_extract() or string operations to parse structured text fields
Filter out null or malformed records before sending to AI models
Chain with_column() calls to build the preprocessing pipeline lazily

Step 3: AI Model Invocation

Apply Daft's built-in AI functions to the preprocessed data. Daft provides prompt() for LLM text generation, embed_text() and embed_image() for embedding generation, and classify_text() and classify_image() for classification tasks. These functions automatically handle batching, rate limiting, retries, and parallelization.

Key considerations:

Use Pydantic models with the return_format parameter for structured outputs from prompt()
Configure the model and provider parameters (OpenAI, Google, HuggingFace, vLLM, LM Studio)
Set API keys via environment variables or the api_key parameter
Use max_output_tokens to control generation length and cost
Multimodal prompts can combine text and images in a single call

Step 4: Result Extraction

Extract and reshape the AI model outputs into usable columns. When using structured outputs (Pydantic), the results are returned as struct columns that can be decomposed into individual fields. Embedding outputs are fixed-size lists suitable for vector operations.

Key considerations:

Access struct fields using bracket notation on the column expression (e.g., col("result")["field_name"])
Use explode() if a single input generates multiple output items
Cast or rename columns as needed for downstream consumption
Validate outputs with where() filters to handle model failures or unexpected results

Step 5: Result Materialization and Storage

Trigger execution and persist the enriched data. Until this step, all operations are lazy. Calling collect(), show(), or a write function materializes the computation graph and produces results. Write to Parquet, CSV, JSON, Iceberg, Delta Lake, LanceDB, Turbopuffer, or other destinations.

Key considerations:

Use write_parquet() for efficient columnar storage of enriched datasets
Use write_mode="overwrite" to replace existing output or "append" to add incrementally
For vector search use cases, write embeddings directly to LanceDB or Turbopuffer
Use show() for interactive development and verification before full writes
Consider partitioning large outputs for downstream query performance

Step 6: Optional Scaling with Ray

For datasets that exceed single-machine capacity, enable distributed execution by switching to the Ray runner. The same DataFrame code runs without modification across a Ray cluster, automatically distributing data loading, preprocessing, and AI inference across worker nodes.

Key considerations:

Call daft.set_runner_ray() before building the DataFrame pipeline
Ensure the Ray cluster is configured with sufficient resources (for example, GPUs for inference)
Daft handles data partitioning, shuffling, and backpressure automatically
Monitor execution via the Ray dashboard for resource utilization
