Implementation:FMInference FlexLLMGen Read Data

Metadata

Field	Value
Sources	FlexLLMGen (https://github.com/FMInference/FlexLLMGen)
Domains	Data_Preprocessing, NLP
Last updated	2026-02-09 00:00 GMT

Overview

Concrete tool, provided by the FlexLLMGen data wrangling application, for loading structured datasets and serializing them to text for LLM-based data wrangling.

Description

read_data() loads the train, test, and validation CSV splits from a data directory and looks up the task type in constants.DATA2TASK. It then delegates to a task-specific serializer (serialize_row for general tasks, serialize_match_pair for entity matching, serialize_imputation for data imputation), optionally balances the training examples by class, shuffles the training data, and returns a dict of DataFrames, each with "text" and "label_str" columns.
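The serialization step can be illustrated with a minimal sketch. The helper below, serialize_row_sketch, is hypothetical and is not the FlexLLMGen serialize_row; it assumes that each row becomes a sequence of "column: value" pairs joined by sep_tok, with nan_tok standing in for missing values, which matches the example text shown under Usage Examples.

```python
from typing import Mapping, Sequence


def serialize_row_sketch(
    row: Mapping[str, object],
    columns: Sequence[str],
    sep_tok: str = ".",
    nan_tok: str = "nan",
) -> str:
    """Hypothetical stand-in for serialize_row: join 'column: value' pairs."""
    parts = []
    for col in columns:
        value = row.get(col)
        # Treat None, empty strings, and float NaN (NaN != NaN) as missing.
        is_missing = value is None or value == "" or value != value
        text = nan_tok if is_missing else str(value).strip()
        parts.append(f"{col}: {text}")
    # Separate attribute-value pairs with sep_tok followed by a space.
    return f"{sep_tok} ".join(parts)


row = {"title": "iPhone 12", "brand": "Apple", "price": None}
print(serialize_row_sketch(row, ["title", "brand", "price"]))
# title: iPhone 12. brand: Apple. price: nan
```

The real serializers may order columns, strip values, or phrase entity-matching pairs differently; the sketch only shows the role of sep_tok and nan_tok.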

Usage

Call read_data() with the path to a dataset directory that contains train.csv, test.csv, and optionally validation.csv in the HazyResearch fm_data_tasks format.
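Before calling read_data(), it can help to confirm that the directory has the expected layout. The following is a minimal sketch using only the standard library; find_split_files is a hypothetical helper that is not part of FlexLLMGen, and it assumes the file names listed above.

```python
from pathlib import Path
from typing import Dict, Optional


def find_split_files(data_dir: str) -> Dict[str, Optional[Path]]:
    """Hypothetical pre-flight check, not part of FlexLLMGen."""
    root = Path(data_dir)
    required = ("train", "test")
    splits: Dict[str, Optional[Path]] = {}
    for split in ("train", "test", "validation"):
        path = root / f"{split}.csv"
        if path.is_file():
            splits[split] = path
        elif split in required:
            # train.csv and test.csv are expected to be present.
            raise FileNotFoundError(f"missing required split file: {path}")
        else:
            # validation.csv is optional, so record its absence.
            splits[split] = None
    return splits
```

A caller would run find_split_files(data_dir) first and pass the same data_dir on to read_data() only if no exception is raised.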

Code Reference

Source: flexllmgen/apps/data_wrangle/utils/data_utils.py, Lines: 456-500
Signature:

def read_data(
    data_dir: str,
    class_balanced: bool = False,
    add_instruction: bool = False,
    task_instruction_idx: int = 0,
    max_train_samples: int = -1,
    max_train_percent: float = -1,
    sep_tok: str = ".",
    nan_tok: str = "nan",
):
    """Read in data where each directory is unique for a task.

    Args:
        data_dir: Path to dataset directory with train/test/validation CSVs
        class_balanced: Balance training examples by class
        add_instruction: Add task instruction to serialized text
        task_instruction_idx: Index of instruction variant to use
        max_train_samples: Limit training samples (fraction 0-1)
        max_train_percent: Alternative training sample limit
        sep_tok: Separator between attribute-value pairs (default ".")
        nan_tok: Token for missing values (default "nan")
    Returns:
        Dict[str, pd.DataFrame] with "train", "test", "validation" keys
    """

Import:

from flexllmgen.apps.data_wrangle.utils.data_utils import read_data

I/O Contract

Inputs

Name	Type	Required	Description
data_dir	str	Yes	Path to dataset directory
class_balanced	bool	No	Balance by class (default False)
add_instruction	bool	No	Add task instruction (default False)
task_instruction_idx	int	No	Instruction variant (default 0)
max_train_samples	int	No	Training sample limit; the docstring describes it as a fraction 0-1 despite the int annotation (default -1)
max_train_percent	float	No	Training percent limit (default -1)
sep_tok	str	No	Separator token (default ".")
nan_tok	str	No	Missing value token (default "nan")
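To make the class_balanced option concrete, here is a minimal sketch of one common balancing strategy: downsample every class to the size of the smallest class, then shuffle. This page does not state which strategy read_data() uses, so the approach, the function name balance_and_shuffle, and the seed parameter are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def balance_and_shuffle(
    rows: List[Dict[str, str]], seed: int = 0
) -> List[Dict[str, str]]:
    """Hypothetical sketch: downsample each class to the smallest class size."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        by_label[row["label_str"]].append(row)
    if not by_label:
        return []
    # Every class keeps as many rows as the rarest class has.
    target = min(len(group) for group in by_label.values())
    balanced: List[Dict[str, str]] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    # Shuffle so that classes are interleaved rather than grouped.
    rng.shuffle(balanced)
    return balanced


rows = [{"text": f"row {i}", "label_str": "No"} for i in range(5)]
rows += [{"text": f"match {i}", "label_str": "Yes"} for i in range(2)]
balanced = balance_and_shuffle(rows)
print(len(balanced))  # 4: two "Yes" rows and two "No" rows
```

Downsampling discards majority-class rows; other strategies, such as oversampling the minority class, would keep all rows at the cost of duplicates.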

Outputs

Dict[str, pd.DataFrame] with keys "train", "test", and "validation". Each DataFrame has a "text" column (serialized row text) and a "label_str" column (ground-truth label string).
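The output contract above can be checked with a small helper. This is a sketch under stated assumptions: missing_columns is a hypothetical function, not part of FlexLLMGen, and it relies only on each value exposing a columns attribute that supports membership tests, as a pandas DataFrame does.

```python
from typing import Any, List, Mapping

EXPECTED_COLUMNS = ("text", "label_str")


def missing_columns(data: Mapping[str, Any]) -> List[str]:
    """Return 'split.column' names absent from the dict of frames."""
    problems: List[str] = []
    for split, frame in data.items():
        for col in EXPECTED_COLUMNS:
            # Works for any object whose .columns supports `in`.
            if col not in frame.columns:
                problems.append(f"{split}.{col}")
    return problems
```

Calling missing_columns(read_data(data_dir)) would be expected to return an empty list when every split carries both columns.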

Usage Examples

from flexllmgen.apps.data_wrangle.utils.data_utils import read_data

# Load entity matching dataset
data = read_data(
    data_dir="fm_data_tasks/data/entity_matching/structured/Amazon-Google",
    class_balanced=True,
    sep_tok=".",
    nan_tok="nan"
)

train_df = data["train"]   # DataFrame with "text" and "label_str" columns
test_df = data["test"]

# Each row: text="title: iPhone 12. brand: Apple. price: $799", label_str="Yes"
print(train_df.iloc[0]["text"])
print(train_df.iloc[0]["label_str"])

Related Pages

Principle:FMInference_FlexLLMGen_Dataset_Loading_And_Serialization
