Workflow:FMInference FlexLLMGen Data Wrangling Batch Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Data_Wrangling, Batch_Processing |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for running high-throughput LLM-based data wrangling tasks (entity matching, data imputation, and error detection) over structured datasets using FlexLLMGen's offloaded inference engine.
Description
This workflow demonstrates how to apply large OPT models to structured data wrangling tasks, following the HazyResearch fm_data_tasks benchmark. It covers loading structured CSV datasets, constructing few-shot prompts from training examples, running batched inference through FlexLLMGen's offloading engine, and evaluating predictions with precision, recall, and F1 metrics. Three task types are supported: entity matching (determining whether two records refer to the same entity), data imputation (predicting missing attribute values), and error detection (identifying incorrect values in records). Inference runs in either single-query mode (for correctness verification) or batched mode (for throughput measurement).
Usage
Execute this workflow when you need to apply a large language model to structured data quality tasks such as deduplication, missing value prediction, or data cleaning, and you want to maximize throughput on hardware with limited GPU memory. The workflow is designed for batch processing of datasets with long input sequences (123-1274 tokens) and short output sequences (3-10 tokens).
Execution Steps
Step 1: Install Dependencies and Download Datasets
Install additional Python libraries required for data processing (pandas, sentence-transformers, rich, pyarrow) and download the fm_data_tasks benchmark datasets from HazyResearch. The datasets include entity matching pairs, data imputation records, and error detection examples across 10 benchmark tasks.
Key considerations:
- Run the install.sh script to install dependencies and download datasets
- Datasets cover 7 entity matching tasks (Fodors-Zagats, Beer, iTunes-Amazon, etc.), 2 data imputation tasks (Restaurant, Buy), and 1 error detection task (Hospital)
- Each dataset includes train/test splits in CSV format
Step 2: Configure Task and Model Parameters
Select the data wrangling task (entity matching, data imputation, or error detection), the specific dataset, the OPT model size, and the FlexLLMGen offloading policy. Also configure prompt construction parameters such as the number of few-shot examples and the prompting strategy (manual, random, or embedding-based selection).
Key considerations:
- Task type determines the prompt format and expected output format
- Choose between single-query mode (--run_single_query) for correctness verification and batch mode for throughput
- The --num_run parameter controls how many test examples to evaluate
- Offloading policy must account for the long input sequences typical of data wrangling tasks
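The choices in this step can be gathered into one validated settings object. The sketch below is only an illustration: the `WranglingConfig` class, its field names, and the string spellings of the task and strategy values are assumptions made for this example, not part of FlexLLMGen or fm_data_tasks.

```python
from dataclasses import dataclass

# Task types and prompting strategies named in this workflow. The string
# spellings are illustrative, not the script's actual argument values.
TASKS = ("entity_matching", "data_imputation", "error_detection")
STRATEGIES = ("manual", "random", "embedding")


@dataclass(frozen=True)
class WranglingConfig:  # hypothetical container, not a FlexLLMGen class
    task: str
    dataset: str
    model: str
    num_shots: int          # number of few-shot examples per prompt
    strategy: str           # how few-shot examples are selected
    num_run: int            # how many test examples to evaluate
    run_single_query: bool = False  # True: correctness check, False: batch mode

    def __post_init__(self) -> None:
        # Fail early on values the rest of the pipeline cannot interpret.
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task!r}")
        if self.strategy not in STRATEGIES:
            raise ValueError(f"unknown prompting strategy: {self.strategy!r}")
        if self.num_shots < 0 or self.num_run <= 0:
            raise ValueError("num_shots must be >= 0 and num_run must be > 0")
```

Validating once up front keeps a mistyped task name from surfacing only after the model weights have been loaded.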
Step 3: Load and Serialize Dataset
Load the structured CSV dataset and serialize each record or record pair into text format suitable for LLM inference. The serialization converts tabular data into natural language descriptions, handling different column schemas and task-specific formatting rules for each dataset.
Key considerations:
- Entity matching serializes two records side-by-side for comparison
- Data imputation masks the target column and asks the model to predict it
- Error detection presents a record and asks the model to identify incorrect values
- Dataset-specific constants define column schemas, instructions, and output formats
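A minimal sketch of the serialization idea follows, assuming a simple "column: value" text layout. The function names, the "Record A / Record B" wording, and the trailing question are hypothetical. The real templates come from the dataset-specific constants and may differ.

```python
from typing import Mapping, Optional


def serialize_record(record: Mapping[str, object],
                     mask_column: Optional[str] = None) -> str:
    """Turn one table row into 'col: value. col: value' text.

    If mask_column is given, its value is withheld and the column name is
    appended as a question, which is the imputation-style layout.
    """
    parts = []
    for column, value in record.items():
        if column == mask_column:
            continue  # never leak the value the model must predict
        # Render missing cells as empty text rather than the string 'None'.
        text = "" if value is None else str(value)
        parts.append(f"{column}: {text}".rstrip())
    body = ". ".join(parts)
    if mask_column is not None:
        return f"{body}. {mask_column}?"
    return body


def serialize_pair(left: Mapping[str, object],
                   right: Mapping[str, object]) -> str:
    """Place two serialized records side by side for entity matching."""
    return (f"Record A is {serialize_record(left)}. "
            f"Record B is {serialize_record(right)}. "
            "Are Record A and Record B the same?")
```

For example, `serialize_record({"name": "Cafe Luna", "city": "Boston"}, mask_column="city")` yields `name: Cafe Luna. city?`.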
Step 4: Construct Few-shot Prompts
Build few-shot prompts by prepending training examples to each test query. Three prompting strategies are available: manually crafted examples, randomly sampled training examples, or embedding-based selection of the most relevant training examples using sentence-transformers for similarity matching.
Key considerations:
- Few-shot examples provide in-context learning signal for the model
- The number of examples is configurable (typically 3-5)
- Embedding-based selection finds the hardest or most relevant examples for each query
- Prompts include task-specific instructions and output format specifications
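The random and embedding-based strategies can be sketched as below. This is an illustration under stated assumptions: embeddings are passed in as plain lists of floats (in the workflow they would come from sentence-transformers), similarity is taken to be cosine similarity, and all function names and the prompt layout are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (serialized input, label text)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_random(train: Sequence[Example], k: int, seed: int = 0) -> List[Example]:
    """Sample k training examples without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(list(train), min(k, len(train)))


def select_by_embedding(query_vec: Sequence[float],
                        train: Sequence[Example],
                        train_vecs: Sequence[Sequence[float]],
                        k: int) -> List[Example]:
    """Return the k training examples whose vectors are closest to the query."""
    ranked = sorted(range(len(train)),
                    key=lambda i: cosine(query_vec, train_vecs[i]),
                    reverse=True)
    return [train[i] for i in ranked[:k]]


def build_prompt(instruction: str, examples: Sequence[Example], query: str) -> str:
    """Prepend the instruction and labeled examples to the unlabeled query."""
    blocks = [instruction] if instruction else []
    blocks.extend(f"{text} {label}" for text, label in examples)
    blocks.append(query)
    return "\n\n".join(blocks)
```

Because every selected example is prepended to every query, the number of examples directly drives the input length that the offloading policy has to accommodate.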
Step 5: Run Batched Inference
Process all evaluation queries through FlexLLMGen's generation pipeline. In batch mode, queries are padded to uniform length and processed in batches through the offloaded inference engine. In single-query mode, each query is processed individually for correctness verification.
Key considerations:
- Batch mode groups queries by --gpu-batch-size and --num-gpu-batches
- Queries are padded to --pad-to-seq-len for uniform batching
- Output length is short (3-10 tokens) compared to input length (roughly 100-1300 tokens)
- The model is reinitialized per batch to handle varying padded sequence lengths
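The padding and grouping described above might look like the following sketch. It assumes that one batch holds gpu_batch_size × num_gpu_batches queries, that padding is applied on the left, and that a batch without a fixed pad length is padded to its own longest query. The function names are hypothetical and operate on plain token-id lists rather than on FlexLLMGen's own data structures.

```python
from typing import List, Optional, Sequence


def pad_left(ids: Sequence[int], length: int, pad_id: int) -> List[int]:
    """Left-pad a token-id sequence to exactly `length` tokens."""
    if len(ids) > length:
        raise ValueError(f"sequence of {len(ids)} tokens exceeds pad length {length}")
    return [pad_id] * (length - len(ids)) + list(ids)


def make_batches(queries: Sequence[Sequence[int]],
                 gpu_batch_size: int,
                 num_gpu_batches: int,
                 pad_id: int,
                 pad_to_seq_len: Optional[int] = None) -> List[List[List[int]]]:
    """Group tokenized queries into uniformly padded batches."""
    # Assumption: one generation call consumes this many queries.
    block = gpu_batch_size * num_gpu_batches
    batches = []
    for start in range(0, len(queries), block):
        chunk = queries[start:start + block]
        # Use the fixed length if given, else the longest query in the chunk.
        target = pad_to_seq_len if pad_to_seq_len is not None else max(len(q) for q in chunk)
        batches.append([pad_left(q, target, pad_id) for q in chunk])
    return batches
```

When the pad length differs from batch to batch, each batch has its own input shape, which is consistent with the model being reinitialized per batch.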
Step 6: Evaluate Predictions
Parse the model's generated outputs, extract predictions, and compute evaluation metrics (precision, recall, accuracy, and F1 score) by comparing against ground truth labels. Results are logged with throughput measurements that count both input and output tokens per second.
Key considerations:
- Entity matching outputs are parsed as Yes/No responses
- Data imputation outputs are compared against the masked attribute value
- Error detection outputs identify the erroneous column
- Throughput is measured as (input_tokens + output_tokens) / total_time; input tokens are counted because prefill dominates runtime
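For the Yes/No case, the metric and throughput computations can be sketched as follows. The parsing rule (any output beginning with "yes", case-insensitive, counts as a positive prediction) and the function names are assumptions for illustration. The actual evaluation code may normalize outputs differently.

```python
from typing import Dict, Sequence


def parse_yes_no(generated: str) -> bool:
    """Treat any output that starts with 'yes' (case-insensitive) as a match."""
    return generated.strip().lower().startswith("yes")


def binary_metrics(outputs: Sequence[str], golds: Sequence[bool]) -> Dict[str, float]:
    """Precision, recall, F1, and accuracy for Yes/No predictions."""
    tp = fp = fn = tn = 0
    for out, gold in zip(outputs, golds):
        pred = parse_yes_no(out)
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif not pred and gold:
            fn += 1
        else:
            tn += 1
    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(golds) if golds else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def throughput(input_tokens: int, output_tokens: int, total_time_s: float) -> float:
    """Tokens per second, counting prompt (prefill) and generated tokens."""
    return (input_tokens + output_tokens) / total_time_s
```

Counting only output tokens would understate the work done here, since each query generates a handful of tokens after processing a prompt that is one to two orders of magnitude longer.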