Workflow:Eventual Inc Daft Multimodal AI Batch Inference
| Knowledge Sources | |
|---|---|
| Domains | AI_Ops, Data_Engineering, Multimodal_Processing |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
End-to-end process for running AI inference at scale on multimodal data (text, images, audio) using Daft's built-in AI functions and DataFrame API.
Description
This workflow describes the standard procedure for batch AI inference on datasets that mix data modalities. It combines Daft's native multimodal types (Image, Audio, Tensor) with its built-in AI functions (prompt, embed_text, embed_image, classify_text, classify_image) to run data through LLM and embedding models. The process covers data ingestion from any supported source, multimodal preprocessing (URL download, image decoding, text extraction), AI model invocation, structured output extraction, and result persistence. Daft handles batching, retries, and parallelization automatically, so the same pipeline scales from a handful of records to millions.
Usage
Execute this workflow when you have a dataset containing text, images, or other media and need to enrich it with AI-generated outputs such as classifications, embeddings, structured extractions, or free-form text generation. Typical triggers include:
- You have product images and need to classify or describe them at scale
- You have a text corpus and need to generate embeddings for semantic search
- You have documents and need structured data extraction via LLM prompts
- You need to run inference across thousands or millions of records with automatic parallelism
Execution Steps
Step 1: Data Ingestion
Load the source dataset into a Daft DataFrame from any supported source. Daft reads Parquet, CSV, and JSON files; Iceberg and Delta Lake tables; SQL databases; and HuggingFace Hub datasets, from local disk or cloud storage (S3, GCS, Azure). The DataFrame is created lazily: no data is loaded until an action triggers execution.
Key considerations:
- Choose the appropriate reader function for your data format
- Configure IOConfig for cloud storage credentials if needed
- Use column projection (select) early to avoid loading unnecessary data
- Apply limit() for iterative development before scaling to the full dataset
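The considerations above can be sketched as follows. This is a minimal sketch that assumes Daft is installed and that the data lives in Parquet, CSV, or JSON files. The names `READERS`, `reader_for`, `load_sample`, and `s3_region` are hypothetical helpers for illustration, not part of Daft. The Daft import is deferred into the function so the extension lookup can be exercised on its own.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the name of a Daft reader function.
READERS = {".parquet": "read_parquet", ".csv": "read_csv", ".json": "read_json"}


def reader_for(path):
    """Pick a reader name from the file extension (works for globs and URLs too)."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no reader configured for extension {suffix!r}")
    return READERS[suffix]


def load_sample(path, columns, sample_rows=100, s3_region=None):
    """Build a lazy DataFrame: read, project early, cap rows for development."""
    import daft  # deferred: only needed when a plan is actually built
    from daft.io import IOConfig, S3Config

    # Assumption: S3 access needs only a region; add credentials as required.
    io_config = IOConfig(s3=S3Config(region_name=s3_region)) if s3_region else None
    df = getattr(daft, reader_for(path))(path, io_config=io_config)
    return df.select(*columns).limit(sample_rows)  # still lazy, nothing is read yet
```

Dropping the `limit()` call (or raising `sample_rows`) is the only change needed to move from a development sample to the full dataset.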
Step 2: Data Preprocessing
Transform and prepare the raw data for AI model consumption. This step handles multimodal preprocessing including downloading remote files, decoding binary data into typed objects (images, audio), extracting text from structured fields, and filtering or cleaning records.
Key considerations:
- Use url.download() to fetch remote files (images, audio, PDFs)
- Use decode_image() to convert binary data into Daft Image type for visual display and processing
- Use regexp_extract() or string operations to parse structured text fields
- Filter out null or malformed records before sending to AI models
- Chain with_column() calls to build the preprocessing pipeline lazily
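A preprocessing chain along these lines might look like the sketch below. It assumes a DataFrame with string columns `image_url` and `title`, both hypothetical, and a made-up `SKU-<digits>` convention in the title. It uses the expression-namespace spellings (`.url.download`, `.image.decode`, `.str.extract`); some Daft releases expose the same operations as functions such as `decode_image()` and `regexp_extract()`, so check which spelling your installed version expects. `extract_sku` is a plain-Python reference for the regex step, not a Daft function.

```python
import re

# Hypothetical pattern: pull a numeric SKU out of text such as "SKU-12345 red mug".
SKU_PATTERN = r"SKU-(\d+)"


def extract_sku(text):
    """Plain-Python reference for the value the regex step should produce."""
    match = re.search(SKU_PATTERN, text)
    return match.group(1) if match else None


def preprocess(df):
    """Lazily add downloaded bytes, a decoded image, and a parsed SKU column."""
    from daft import col  # deferred: requires Daft to be installed

    return (
        df.with_column("image_bytes", col("image_url").url.download(on_error="null"))
        .with_column("image", col("image_bytes").image.decode(on_error="null"))
        .with_column("sku", col("title").str.extract(SKU_PATTERN, 1))
        # Failed downloads and undecodable images become nulls; drop them here
        # so they never reach the model.
        .where(col("image").not_null())
    )
```

Using `on_error="null"` keeps one bad URL from failing the whole batch, at the cost of having to filter nulls explicitly.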
Step 3: AI Model Invocation
Apply Daft's built-in AI functions to the preprocessed data. Daft provides prompt() for LLM text generation, embed_text() and embed_image() for embedding generation, and classify_text() and classify_image() for classification tasks. These functions automatically handle batching, rate limiting, retries, and parallelization.
Key considerations:
- Use Pydantic models with the return_format parameter for structured outputs from prompt()
- Set the provider and model parameters; supported providers include OpenAI, Google, HuggingFace, vLLM, and LM Studio
- Set API keys via environment variables or the api_key parameter
- Use max_output_tokens to control generation length and cost
- Multimodal prompts can combine text and images in a single call
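A structured-output call could be sketched as below. This assumes Daft and Pydantic are installed, that `prompt` is importable from `daft.functions`, and that it accepts the `return_format`, `provider`, `model`, and `max_output_tokens` parameters named above; verify the exact signature against your installed version. The `Review` schema, the `text` column, the `"openai"` provider string, and the helper names are illustrative assumptions.

```python
import os


def require_api_key(var="OPENAI_API_KEY", env=None):
    """Fail fast if the provider key is missing, before any rows are processed."""
    env = os.environ if env is None else env
    key = env.get(var)
    if not key:
        raise RuntimeError(f"set {var} before running inference")
    return key


def add_review_summary(df, model):
    """Attach a struct column produced by an LLM (sketch; parameters assumed)."""
    import daft
    from daft.functions import prompt
    from pydantic import BaseModel

    class Review(BaseModel):  # hypothetical output schema
        sentiment: str
        summary: str

    require_api_key()
    return df.with_column(
        "review",
        prompt(
            daft.col("text"),
            return_format=Review,   # structured output, returned as a struct column
            provider="openai",      # assumed provider identifier
            model=model,            # supplied by the caller
            max_output_tokens=256,  # cap generation length and cost
        ),
    )
```

Checking the key up front is a design choice: because the pipeline is lazy, a missing key would otherwise surface only when execution starts.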
Step 4: Result Extraction
Extract and reshape the AI model outputs into usable columns. When using structured outputs (Pydantic), the results are returned as struct columns that can be decomposed into individual fields. Embedding outputs are fixed-size lists suitable for vector operations.
Key considerations:
- Access struct fields using bracket notation on the column expression (e.g., col("result")["field_name"], where "result" is the struct column)
- Use explode() if a single input generates multiple output items
- Cast or rename columns as needed for downstream consumption
- Validate outputs with where() filters to handle model failures or unexpected results
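To make the reshaping concrete, the sketch below pairs a plain-Python reference with a possible Daft equivalent. It assumes a struct column named `result` with a string field `category` and a list field `tags`, plus an `id` column; all of these names are hypothetical. The reference models only the common case and does not cover edge cases such as empty lists, whose handling in `explode()` should be checked against the Daft documentation.

```python
def flatten_rows(rows):
    """Plain-Python reference: one output row per tag, skipping failed rows."""
    out = []
    for row in rows:
        result = row["result"]
        if result is None:  # model failure or unparseable output
            continue
        for tag in result["tags"]:
            out.append({"id": row["id"], "category": result["category"], "tag": tag})
    return out


def flatten_daft(df):
    """Possible Daft equivalent (sketch; requires Daft)."""
    from daft import col

    return (
        df.where(col("result").not_null())  # drop model failures
        # Bracket access as described above; older releases may need
        # col("result").struct.get("category") instead.
        .with_column("category", col("result")["category"])
        .with_column("tag", col("result")["tags"])
        .explode("tag")  # one row per list item
        .select("id", "category", "tag")
    )
```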
Step 5: Result Materialization and Storage
Trigger execution and persist the enriched data. All operations up to this point are lazy; calling collect(), show(), or a write function executes the computation graph and produces results. Supported destinations include Parquet, CSV, JSON, Iceberg, Delta Lake, LanceDB, and Turbopuffer, among others.
Key considerations:
- Use write_parquet() for efficient columnar storage of enriched datasets
- Use write_mode="overwrite" to replace existing output or "append" to add incrementally
- For vector search use cases, write embeddings directly to LanceDB or Turbopuffer
- Use show() for interactive development and verification before full writes
- Consider partitioning large outputs for downstream query performance
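A preview-then-write step might be sketched as follows. The helper `persist` and its defaults are hypothetical; it relies only on the DataFrame's `show()` and `write_parquet()` methods, with the `write_mode` values named above and a `partition_cols` argument that should be verified against your Daft version.

```python
VALID_WRITE_MODES = ("append", "overwrite")


def persist(df, root_dir, write_mode="append", partition_cols=None, preview_rows=5):
    """Preview a few rows, then write the enriched DataFrame as Parquet."""
    if write_mode not in VALID_WRITE_MODES:
        raise ValueError(f"write_mode must be one of {VALID_WRITE_MODES}")
    # The preview triggers execution for a small number of rows. Those rows
    # are computed again during the write unless the result is collected first.
    df.show(preview_rows)
    # The write executes the full pipeline.
    return df.write_parquet(
        root_dir, write_mode=write_mode, partition_cols=partition_cols
    )
```

For pipelines with paid model calls, it may be preferable to skip the preview in production runs, since previewed rows are sent to the model twice.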
Step 6: Optional Scaling with Ray
For datasets that exceed single-machine capacity, enable distributed execution by switching to the Ray runner. The same DataFrame code runs without modification across a Ray cluster, automatically distributing data loading, preprocessing, and AI inference across worker nodes.
Key considerations:
- Call daft.set_runner_ray() before building the DataFrame pipeline
- Ensure the Ray cluster has sufficient resources (for example, GPUs for inference)
- Daft handles data partitioning, shuffling, and backpressure automatically
- Monitor execution via the Ray dashboard for resource utilization
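Runner selection can be isolated in one place, as in the sketch below. It assumes Daft and Ray are installed when the distributed path is taken and that `daft.set_runner_ray()` accepts an optional `address` argument. The helper `configure_runner` and its return labels are hypothetical.

```python
def configure_runner(distributed, address=None):
    """Choose the runner once, before any DataFrame pipeline is built."""
    if not distributed:
        return "local"  # leave Daft's default single-machine runner in place
    import daft  # deferred: only needed on the distributed path

    # address=None is assumed to let Ray pick up its default or ambient cluster.
    daft.set_runner_ray(address=address)
    return "ray"
```

Calling `configure_runner(distributed=True, address=...)` at the top of the script, with the cluster address filled in, leaves the rest of the pipeline code unchanged.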