Implementation:FMInference FlexLLMGen Read Data
Metadata
| Field | Value |
|---|---|
| Sources | [FlexLLMGen](https://github.com/FMInference/FlexLLMGen) |
| Domains | Data_Preprocessing, NLP |
| Last updated | 2026-02-09 00:00 GMT |
Overview
A concrete tool for loading and serializing structured datasets for LLM-based data wrangling, provided by the FlexLLMGen data wrangling application.
Description
read_data() loads the train, test, and validation CSV splits from a data directory and looks up the task type in constants.DATA2TASK. It then delegates to a task-specific serializer: serialize_row for general tasks, serialize_match_pair for entity matching, and serialize_imputation for data imputation. Finally, it applies optional class balancing, shuffles the training data, and returns a dict of DataFrames, each with "text" and "label_str" columns.
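The serialization step can be pictured with a small sketch. The helpers below are hypothetical stand-ins, not the functions in data_utils.py: they assume a row is a plain mapping of column name to value, and the "Record A is / Record B is" wording for pairs is an illustrative assumption rather than the prompt FlexLLMGen actually emits.

```python
from typing import Mapping, Optional


def serialize_row_sketch(
    row: Mapping[str, Optional[object]],
    sep_tok: str = ".",
    nan_tok: str = "nan",
) -> str:
    """Hypothetical stand-in for serialize_row: join "column: value" pairs."""
    parts = []
    for col, value in row.items():
        # Treat None and float NaN (NaN != NaN) as missing values.
        is_missing = value is None or (isinstance(value, float) and value != value)
        text = nan_tok if is_missing else str(value).strip()
        parts.append(f"{col}: {text}")
    # The separator token is followed by a space between attribute pairs.
    return f"{sep_tok} ".join(parts)


def serialize_match_pair_sketch(
    left: Mapping[str, Optional[object]],
    right: Mapping[str, Optional[object]],
    sep_tok: str = ".",
    nan_tok: str = "nan",
) -> str:
    """Hypothetical stand-in for serialize_match_pair: serialize both records."""
    a = serialize_row_sketch(left, sep_tok, nan_tok)
    b = serialize_row_sketch(right, sep_tok, nan_tok)
    # Assumed wording; the real template may differ.
    return f"Record A is {a}{sep_tok} Record B is {b}"
```

In this sketch, sep_tok and nan_tok play the roles described for the read_data() parameters of the same names: one separates attribute-value pairs, the other replaces missing values.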
Usage
Call with the path to a dataset directory containing train.csv, test.csv, and optionally validation.csv in the HazyResearch fm_data_tasks format.
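Before calling read_data(), it can help to confirm that the directory has the layout described above. The following is a minimal sketch using only the standard library; find_split_files is a hypothetical helper, not part of FlexLLMGen, and it assumes the file names exactly as listed in this section.

```python
from pathlib import Path
from typing import Dict, Optional

# File names as described in this section; validation.csv is optional.
REQUIRED_SPLITS = ("train", "test")
OPTIONAL_SPLITS = ("validation",)


def find_split_files(data_dir: str) -> Dict[str, Optional[Path]]:
    """Return the CSV path for each split, or None for a missing optional split."""
    root = Path(data_dir)
    found: Dict[str, Optional[Path]] = {}
    for split in REQUIRED_SPLITS:
        path = root / f"{split}.csv"
        if not path.is_file():
            raise FileNotFoundError(f"missing required split file: {path}")
        found[split] = path
    for split in OPTIONAL_SPLITS:
        path = root / f"{split}.csv"
        found[split] = path if path.is_file() else None
    return found
```

Failing early with a clear message about the missing file is usually easier to debug than an error raised from deep inside CSV parsing.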
Code Reference
- Source: flexllmgen/apps/data_wrangle/utils/data_utils.py, Lines: 456-500
- Signature:
def read_data(
    data_dir: str,
    class_balanced: bool = False,
    add_instruction: bool = False,
    task_instruction_idx: int = 0,
    max_train_samples: int = -1,
    max_train_percent: float = -1,
    sep_tok: str = ".",
    nan_tok: str = "nan",
):
    """Read in data where each directory is unique for a task.
    Args:
        data_dir: Path to dataset directory with train/test/validation CSVs
        class_balanced: Balance training examples by class
        add_instruction: Add task instruction to serialized text
        task_instruction_idx: Index of instruction variant to use
        max_train_samples: Limit training samples (fraction 0-1)
        max_train_percent: Alternative training sample limit
        sep_tok: Separator between attribute-value pairs (default ".")
        nan_tok: Token for missing values (default "nan")
    Returns:
        Dict[str, pd.DataFrame] with "train", "test", "validation" keys
    """
- Import:
from flexllmgen.apps.data_wrangle.utils.data_utils import read_data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_dir | str | Yes | Path to dataset directory |
| class_balanced | bool | No | Balance by class (default False) |
| add_instruction | bool | No | Add task instruction (default False) |
| task_instruction_idx | int | No | Instruction variant (default 0) |
| max_train_samples | int | No | Training sample limit; the docstring describes it as a fraction between 0 and 1 despite the int annotation (default -1) |
| max_train_percent | float | No | Training percent limit (default -1) |
| sep_tok | str | No | Separator token (default ".") |
| nan_tok | str | No | Missing value token (default "nan") |
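The class_balanced and max_train_samples options can be illustrated with a small sketch. This page does not say which balancing strategy read_data() uses, so the code below assumes one common choice, downsampling every class to the size of the smallest one, and treats the sample limit as a fraction as the docstring suggests. Both helpers and the fixed seed are hypothetical, and rows are plain dicts rather than DataFrame rows.

```python
import random
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, str]


def balance_by_label(
    rows: List[Row], label_key: str = "label_str", seed: int = 42
) -> List[Row]:
    """Downsample each class to the smallest class size, then shuffle."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    if not by_label:
        return []
    per_class = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced: List[Row] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, per_class))
    rng.shuffle(balanced)
    return balanced


def take_fraction(rows: list, fraction: float) -> list:
    """Keep the leading fraction of rows; a non-positive value keeps all rows."""
    if fraction <= 0:
        return list(rows)
    if fraction > 1:
        raise ValueError("fraction must be in (0, 1]")
    return rows[: int(fraction * len(rows))]
```

Downsampling discards majority-class rows; oversampling the minority class is an alternative when training data is scarce.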
Outputs
Dict[str, pd.DataFrame] with keys "train", "test", and "validation". Each DataFrame has a "text" column (serialized row text) and a "label_str" column (ground-truth label string).
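A caller may want to verify this contract before building prompts. The sketch below is a hypothetical checker, not part of FlexLLMGen: it only relies on each split exposing a .columns attribute, as a pandas DataFrame does, and it assumes a split may be None when no file was available.

```python
from typing import Any, Iterable, List, Mapping

EXPECTED_COLUMNS = ("text", "label_str")


def missing_columns(
    splits: Mapping[str, Any],
    expected: Iterable[str] = EXPECTED_COLUMNS,
) -> List[str]:
    """Return "split.column" entries for every expected column that is absent."""
    expected = tuple(expected)
    problems: List[str] = []
    for name, frame in splits.items():
        if frame is None:  # assumption: an absent split may be represented as None
            continue
        present = set(frame.columns)
        for col in expected:
            if col not in present:
                problems.append(f"{name}.{col}")
    return problems
```

An empty list means every loaded split carries both columns named in the contract.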
Usage Examples
from flexllmgen.apps.data_wrangle.utils.data_utils import read_data
# Load entity matching dataset
data = read_data(
    data_dir="fm_data_tasks/data/entity_matching/structured/Amazon-Google",
    class_balanced=True,
    sep_tok=".",
    nan_tok="nan",
)
train_df = data["train"] # DataFrame with "text" and "label_str" columns
test_df = data["test"]
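# Illustrative row shape only: text="title: iPhone 12. brand: Apple. price: $799", label_str="Yes"
# (for entity matching, the text covers both records of the pair via serialize_match_pair)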
print(train_df.iloc[0]["text"])
print(train_df.iloc[0]["label_str"])