Principle:FMInference FlexLLMGen Dataset Loading And Serialization

Metadata

Field	Value
Sources	FlexLLMGen (https://github.com/FMInference/FlexLLMGen), fm_data_tasks (https://github.com/HazyResearch/fm_data_tasks)
Domains	Data_Preprocessing, NLP
Last updated	2026-02-09 00:00 GMT

Overview

A data preprocessing pipeline that loads structured datasets from CSV files and serializes tabular rows into natural language text suitable for few-shot prompting of language models.

Description

Data wrangling tasks (entity matching, data imputation, error detection) operate on structured tabular data, but language models process text. The serialization step converts each row into a text representation by concatenating attribute-value pairs (e.g., "title: iPhone 12. brand: Apple. price: $799"). Different task types require different serialization formats: entity matching serializes pairs of records side by side, data imputation highlights the missing attribute, and error detection presents the row for inspection. The read_data() function handles loading train/test/validation splits, optional class balancing, and sampling.
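The task-specific formats described above can be sketched roughly as follows. This is a minimal illustration, not the code from fm_data_tasks: the function names, the "Product A/Product B" labels, and the question wording are all hypothetical and chosen only to show how one row serializer can be reused across the three task types.

```python
from typing import Dict, Optional

Row = Dict[str, Optional[str]]


def serialize_row(row: Row) -> str:
    """Join 'attribute: value' pairs with '. ', skipping missing (None) values."""
    parts = [f"{attr}: {value}" for attr, value in row.items() if value is not None]
    return ". ".join(parts)


def serialize_match_pair(left: Row, right: Row) -> str:
    """Entity matching: place two serialized records side by side (hypothetical wording)."""
    return (
        f"Product A is {serialize_row(left)}. "
        f"Product B is {serialize_row(right)}. "
        "Are Product A and Product B the same?"
    )


def serialize_imputation(row: Row, missing_attr: str) -> str:
    """Data imputation: serialize the known attributes, then ask for the missing one."""
    known = {k: v for k, v in row.items() if k != missing_attr}
    return f"{serialize_row(known)}. {missing_attr}?"


def serialize_error_detection(row: Row, attr: str) -> str:
    """Error detection: present the whole row and ask about one attribute."""
    return f"{serialize_row(row)}. Is there an error in {attr}?"


example = {"title": "iPhone 12", "brand": "Apple", "price": "$799"}
print(serialize_row(example))
print(serialize_imputation(example, "brand"))
```

Keeping a single row serializer and wrapping it per task means that a change to the attribute-value format applies to all three task types at once.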

Usage

Use read_data() at the start of a data wrangling workflow to load and prepare datasets from the HazyResearch fm_data_tasks benchmark format.
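To make the loading steps concrete, here is a standard-library sketch of loading splits, balancing classes, and sampling. It does not reproduce the actual read_data() signature. The helper names (load_splits, balance_classes), the file names train.csv, valid.csv, and test.csv, and the "label" column are assumptions made for illustration.

```python
import csv
import random
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Optional

Row = Dict[str, str]


def read_csv(path: Path) -> List[Row]:
    """Read one CSV file into a list of dicts keyed by the header row."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def balance_classes(rows: List[Row], label_col: str, rng: random.Random) -> List[Row]:
    """Downsample every class to the size of the smallest class."""
    by_label: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_label[row[label_col]].append(row)
    if not by_label:
        return []
    n = min(len(group) for group in by_label.values())
    balanced: List[Row] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced


def load_splits(
    data_dir: str,
    label_col: str = "label",          # assumed label column name
    class_balanced: bool = False,
    max_train_samples: Optional[int] = None,
    seed: int = 0,
) -> Dict[str, List[Row]]:
    """Load whichever of train/valid/test CSVs exist; balance and sample train only."""
    rng = random.Random(seed)
    base = Path(data_dir)
    splits: Dict[str, List[Row]] = {}
    for name in ("train", "valid", "test"):  # assumed file names
        path = base / f"{name}.csv"
        if path.exists():
            splits[name] = read_csv(path)
    if "train" in splits:
        train = splits["train"]
        if class_balanced:
            train = balance_classes(train, label_col, rng)
        if max_train_samples is not None and len(train) > max_train_samples:
            train = rng.sample(train, max_train_samples)
        splits["train"] = train
    return splits
```

In this sketch, balancing and sampling touch only the training split, so the validation and test splits remain comparable across runs. The fixed seed makes the sampled few-shot pool reproducible.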

Theoretical Basis

Tabular-to-text serialization is a key step in using LLMs for structured data tasks. Because the model sees only the serialized text, serialization choices such as attribute naming, attribute order, and separators can affect model accuracy. An attribute-value pair format with a configurable separator (default ".") provides a simple, consistent representation that LLMs can parse.
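A small sketch of what a configurable separator could look like is given below. The function name serialize_with_separator and the sep_tok and nan_tok parameters are hypothetical names for this illustration. Only the "." default comes from the text above; the "nan" placeholder for missing values is an assumption.

```python
from typing import Mapping, Optional


def serialize_with_separator(
    row: Mapping[str, Optional[str]],
    sep_tok: str = ".",      # default separator, as stated in the text
    nan_tok: str = "nan",    # assumed placeholder for missing values
) -> str:
    """Render 'attr: value' pairs joined by the separator token plus a space."""
    parts = []
    for attr, value in row.items():
        text = nan_tok if value is None or value == "" else str(value)
        parts.append(f"{attr}: {text}")
    return f"{sep_tok} ".join(parts)


row = {"title": "iPhone 12", "brand": None}
print(serialize_with_separator(row))
print(serialize_with_separator(row, sep_tok=";"))
```

Exposing the separator as a parameter makes it possible to compare serialization variants on the same data without changing the loading code.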

Related Pages

Implementation:FMInference_FlexLLMGen_Read_Data
