Principle:FMInference FlexLLMGen Dataset Loading And Serialization

From Leeroopedia


Metadata

Sources: FlexLLMGen (https://github.com/FMInference/FlexLLMGen), fm_data_tasks (https://github.com/HazyResearch/fm_data_tasks)
Domains: Data_Preprocessing, NLP
Last updated: 2026-02-09 00:00 GMT

Overview

A data preprocessing pipeline that loads structured datasets from CSV files and serializes tabular rows into natural language text suitable for few-shot prompting of language models.

Description

Data wrangling tasks (entity matching, data imputation, error detection) operate on structured tabular data, but language models process text. The serialization step converts each row into a text representation by concatenating attribute-value pairs (e.g., "title: iPhone 12. brand: Apple. price: $799"). Different task types require different serialization formats: entity matching serializes pairs of records side by side, data imputation highlights the missing attribute, and error detection presents the row for inspection. The read_data() function handles loading train/test/validation splits, optional class balancing, and sampling.
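
The serialization described above can be sketched roughly as follows. This is a minimal illustration, not the code from FlexLLMGen or fm_data_tasks: the function names (serialize_row, serialize_match_pair, serialize_imputation, serialize_error_detection), the sep_tok and nan_tok parameters, and the question wording in each prompt are all assumptions made for the example.

```python
from typing import Mapping, Optional


def serialize_row(row: Mapping[str, Optional[str]],
                  sep_tok: str = ".",
                  nan_tok: str = "nan") -> str:
    """Join attribute-value pairs into one line of text (hypothetical helper)."""
    parts = []
    for column, value in row.items():
        # Treat None and blank strings as missing values.
        missing = value is None or str(value).strip() == ""
        text = nan_tok if missing else str(value).strip()
        parts.append(f"{column}: {text}")
    return f"{sep_tok} ".join(parts)


def serialize_match_pair(left, right, sep_tok: str = ".") -> str:
    """Entity matching: place two records side by side (wording is illustrative)."""
    return (f"Record A is {serialize_row(left, sep_tok)}{sep_tok} "
            f"Record B is {serialize_row(right, sep_tok)}{sep_tok} "
            "Are Record A and Record B the same?")


def serialize_imputation(row, target: str, sep_tok: str = ".") -> str:
    """Data imputation: serialize the known attributes, then ask for the target."""
    known = {k: v for k, v in row.items() if k != target}
    return f"{serialize_row(known, sep_tok)}{sep_tok} {target}?"


def serialize_error_detection(row, column: str, sep_tok: str = ".") -> str:
    """Error detection: show the whole row and name the column to inspect."""
    return f"{serialize_row(row, sep_tok)}{sep_tok} Is there an error in {column}?"


phone = {"title": "iPhone 12", "brand": "Apple", "price": "$799"}
print(serialize_row(phone))
print(serialize_imputation({**phone, "brand": None}, target="brand"))
```

Keeping one shared row serializer and wrapping it per task means that a change of separator or missing-value token applies consistently to every prompt format.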

Usage

Use read_data() at the start of a data wrangling workflow to load and prepare datasets stored in the HazyResearch fm_data_tasks benchmark format.
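
To make the loading step concrete, here is a standalone sketch of what a loader of this kind might do: read the available splits, optionally balance classes in the training split, and optionally cap its size. It does not reproduce read_data() or its signature. The function load_splits, its parameters, the file names train.csv, valid.csv, and test.csv, the "label" column, and the downsampling strategy are all assumptions for illustration.

```python
import csv
import random
from collections import defaultdict
from pathlib import Path


def load_split(path):
    """Read one CSV file into a list of dict rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def balance_classes(rows, label_column, seed=0):
    """Downsample every class to the size of the smallest class."""
    if not rows:
        return []
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_column]].append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def load_splits(data_dir, label_column="label", class_balanced=False,
                max_train_samples=None, seed=0):
    """Load whichever of train/valid/test exist (hypothetical layout)."""
    data_dir = Path(data_dir)
    splits = {}
    for name in ("train", "valid", "test"):
        path = data_dir / f"{name}.csv"
        if path.exists():
            splits[name] = load_split(path)

    if "train" in splits:
        train = splits["train"]
        if class_balanced:
            train = balance_classes(train, label_column, seed)
        # Cap the training split only when it exceeds the requested size.
        if max_train_samples is not None and len(train) > max_train_samples:
            train = random.Random(seed).sample(train, max_train_samples)
        splits["train"] = train
    return splits
```

Balancing and sampling are applied only to the training split in this sketch, so the validation and test splits stay untouched and evaluation numbers remain comparable across runs.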

Theoretical Basis

Tabular-to-text serialization is a key step in using LLMs for structured data tasks, because the model sees only the serialized text; the quality of serialization therefore directly affects model accuracy. An attribute-value pair format with a configurable separator (default ".") provides a simple, consistent representation that LLMs can parse.
