Principle:FMInference FlexLLMGen Dataset Loading And Serialization
Metadata
| Field | Value |
|---|---|
| Sources | FlexLLMGen (https://github.com/FMInference/FlexLLMGen), fm_data_tasks (https://github.com/HazyResearch/fm_data_tasks) |
| Domains | Data_Preprocessing, NLP |
| Last updated | 2026-02-09 00:00 GMT |
Overview
A data preprocessing pipeline that loads structured datasets from CSV files and serializes tabular rows into natural language text suitable for few-shot prompting of language models.
Description
Data wrangling tasks (entity matching, data imputation, error detection) operate on structured tabular data, but language models process text. The serialization step converts each row into a text representation by concatenating attribute-value pairs (e.g., "title: iPhone 12. brand: Apple. price: $799"). Different task types require different serialization formats: entity matching serializes pairs of records side by side, data imputation highlights the missing attribute, and error detection presents the row for inspection. The read_data() function handles loading train/test/validation splits, optional class balancing, and sampling.
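The task-specific formats described above can be sketched as follows. This is a minimal illustration, not the code from FlexLLMGen or fm_data_tasks: the function names, the "Product A / Product B" labels, and the question wording are all hypothetical, chosen only to show how one attribute-value serializer can be reused across the three task types.

```python
from typing import Mapping


def serialize_row(row: Mapping[str, object], sep: str = ".") -> str:
    """Join 'attribute: value' pairs, separated by `sep` plus a space."""
    return f"{sep} ".join(f"{key}: {value}" for key, value in row.items())


def serialize_match(left: Mapping[str, object],
                    right: Mapping[str, object],
                    sep: str = ".") -> str:
    """Entity matching: two records side by side, then a question.

    The labels and question text are hypothetical placeholders.
    """
    return (
        f"Product A is {serialize_row(left, sep)}{sep} "
        f"Product B is {serialize_row(right, sep)}{sep} "
        "Are Product A and Product B the same?"
    )


def serialize_imputation(row: Mapping[str, object],
                         target: str,
                         sep: str = ".") -> str:
    """Data imputation: serialize the known attributes, then name the missing one."""
    known = {key: value for key, value in row.items() if key != target}
    return f"{serialize_row(known, sep)}{sep} {target}?"


def serialize_error_detection(row: Mapping[str, object],
                              column: str,
                              sep: str = ".") -> str:
    """Error detection: present the full row, then ask about one attribute."""
    return f"{serialize_row(row, sep)}{sep} Is there an error in {column}?"


if __name__ == "__main__":
    row = {"title": "iPhone 12", "brand": "Apple", "price": "$799"}
    print(serialize_row(row))
    print(serialize_imputation(row, "brand"))
    print(serialize_match(row, {"title": "Apple iPhone 12", "price": "$799"}))
```

Keeping the separator as a parameter means the same helpers can be reused if a different delimiter turns out to work better for a given model.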
Usage
Use read_data() at the start of a data wrangling workflow to load and prepare datasets from the HazyResearch fm_data_tasks benchmark format.
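To make the loading step concrete, here is a rough standard-library sketch of what loading splits with optional class balancing and sampling might look like. It does not reproduce the real read_data() signature or behavior: the function name load_splits, its parameters, the file names train.csv / valid.csv / test.csv, and the "label" column are all assumptions made for illustration.

```python
import csv
import random
from pathlib import Path


def load_split(path: Path) -> list:
    """Read one CSV file into a list of dict rows; a missing file yields []."""
    if not path.exists():
        return []
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def balance_classes(rows: list, label_col: str, rng: random.Random) -> list:
    """Downsample every class to the size of the rarest class."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_col], []).append(row)
    if not by_label:
        return []
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def load_splits(data_dir, label_col="label", class_balanced=False,
                max_train_samples=None, seed=0) -> dict:
    """Load the three splits; balancing and sampling apply to train only.

    File names and parameter names are assumptions, not the benchmark's spec.
    """
    rng = random.Random(seed)
    root = Path(data_dir)
    splits = {name: load_split(root / f"{name}.csv")
              for name in ("train", "valid", "test")}
    if class_balanced:
        splits["train"] = balance_classes(splits["train"], label_col, rng)
    if max_train_samples is not None and len(splits["train"]) > max_train_samples:
        splits["train"] = rng.sample(splits["train"], max_train_samples)
    return splits
```

Balancing and sampling are applied only to the training split here so that validation and test rows stay untouched for evaluation. Whether the actual function makes the same choice should be checked against its source.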
Theoretical Basis
Tabular-to-text serialization is a key step in using LLMs for structured data tasks. Because the model sees only the serialized string, the serialization format can directly affect model accuracy. An attribute-value pair format with a configurable separator (default ".") provides a simple, consistent representation that LLMs can parse.