
Implementation:FMInference FlexLLMGen Read Data

From Leeroopedia
Revision as of 14:56, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/FMInference_FlexLLMGen_Read_Data.md)


Metadata

Field | Value
Sources | FlexLLMGen (https://github.com/FMInference/FlexLLMGen)
Domains | Data_Preprocessing, NLP
Last updated | 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the FlexLLMGen data wrangling application, for loading and serializing structured datasets for LLM-based data wrangling.

Description

read_data() loads the train, test, and validation CSV splits from a data directory and looks up the task type in constants.DATA2TASK. It then delegates to the task-specific serializer: serialize_row for general tasks, serialize_match_pair for entity matching, and serialize_imputation for data imputation. Finally, it applies optional class balancing, shuffles the training data, and returns a dict of DataFrames with "text" and "label_str" columns.
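The flow described above can be sketched on a toy DataFrame. This is a minimal sketch and not the FlexLLMGen code: serialize_row_sketch and balance_and_shuffle are hypothetical helpers, the "attribute: value" text layout follows the comment in the Usage Examples section, and downsampling every class to the size of the smallest one with a fixed seed is an assumption about what class balancing means here.

```python
import pandas as pd


def serialize_row_sketch(row: pd.Series, columns: list, sep_tok: str = ".", nan_tok: str = "nan") -> str:
    """Hypothetical serializer: joins 'column: value' pairs with sep_tok."""
    parts = []
    for col in columns:
        value = row[col]
        # Missing values are replaced by the placeholder token.
        text = nan_tok if pd.isna(value) else str(value)
        parts.append(f"{col}: {text}")
    return f"{sep_tok} ".join(parts)


def balance_and_shuffle(train: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Assumed balancing: downsample each class to the smallest class, then shuffle."""
    per_class = train["label_str"].value_counts().min()
    balanced = pd.concat(
        [group.sample(n=per_class, random_state=seed) for _, group in train.groupby("label_str")]
    )
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data standing in for a train.csv split (made up for illustration).
raw = pd.DataFrame(
    {
        "title": ["iPhone 12", "Pixel 5", "Galaxy S21", None],
        "brand": ["Apple", "Google", "Samsung", "Nokia"],
        "label_str": ["Yes", "No", "No", "No"],
    }
)
raw["text"] = raw.apply(lambda r: serialize_row_sketch(r, ["title", "brand"]), axis=1)
train = balance_and_shuffle(raw[["text", "label_str"]])
print(train)
```

After balancing, the toy training frame holds one "Yes" row and one "No" row, each with the two columns the page describes.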

Usage

Call read_data() with the path to a dataset directory that contains train.csv, test.csv, and optionally validation.csv in the HazyResearch fm_data_tasks format.
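Before calling read_data(), it can help to confirm that a directory has the expected layout. The helper below is a hypothetical pre-flight check, not part of FlexLLMGen; treating train.csv and test.csv as required and validation.csv as optional follows the description above, and whether read_data() itself raises on a missing file is not stated here.

```python
import tempfile
from pathlib import Path
from typing import Dict

SPLITS = ("train", "test", "validation")
REQUIRED = ("train", "test")  # assumption: validation.csv is the only optional split


def find_split_files(data_dir: str) -> Dict[str, Path]:
    """Return the split CSVs present in data_dir; fail if a required one is missing."""
    root = Path(data_dir)
    found = {s: root / f"{s}.csv" for s in SPLITS if (root / f"{s}.csv").is_file()}
    missing = [s for s in REQUIRED if s not in found]
    if missing:
        raise FileNotFoundError(f"{data_dir} is missing required split(s): {missing}")
    return found


# Demo on a throwaway directory that has no validation split.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train.csv", "test.csv"):
        (Path(tmp) / name).write_text("col_a,label\nx,Yes\n")
    found = find_split_files(tmp)
    print(sorted(found))
```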

Code Reference

  • Source: flexllmgen/apps/data_wrangle/utils/data_utils.py, Lines: 456-500
  • Signature:
def read_data(
    data_dir: str,
    class_balanced: bool = False,
    add_instruction: bool = False,
    task_instruction_idx: int = 0,
    max_train_samples: int = -1,
    max_train_percent: float = -1,
    sep_tok: str = ".",
    nan_tok: str = "nan",
):
    """Read in data where each directory is unique for a task.

    Args:
        data_dir: Path to dataset directory with train/test/validation CSVs
        class_balanced: Balance training examples by class
        add_instruction: Add task instruction to serialized text
        task_instruction_idx: Index of instruction variant to use
        max_train_samples: Limit training samples (fraction 0-1)
        max_train_percent: Alternative training sample limit
        sep_tok: Separator between attribute-value pairs (default ".")
        nan_tok: Token for missing values (default "nan")
    Returns:
        Dict[str, pd.DataFrame] with "train", "test", "validation" keys
    """
  • Import:
from flexllmgen.apps.data_wrangle.utils.data_utils import read_data

I/O Contract

Inputs

Name | Type | Required | Description
data_dir | str | Yes | Path to dataset directory
class_balanced | bool | No | Balance by class (default False)
add_instruction | bool | No | Add task instruction (default False)
task_instruction_idx | int | No | Instruction variant (default 0)
max_train_samples | int | No | Training sample limit; annotated as int, although the docstring describes it as a fraction 0-1 (default -1)
max_train_percent | float | No | Training percent limit (default -1)
sep_tok | str | No | Separator token (default ".")
nan_tok | str | No | Missing value token (default "nan")
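The docstring describes max_train_samples as a fraction between 0 and 1. One way such a limit could be applied to an already shuffled training frame is sketched below. limit_train_fraction is a hypothetical helper, and treating a non-positive value as "no limit" is an assumption based on the -1 default, not a statement about the library's behavior.

```python
import pandas as pd


def limit_train_fraction(train: pd.DataFrame, fraction: float) -> pd.DataFrame:
    """Keep the leading `fraction` of rows; non-positive values mean no limit (assumed)."""
    if fraction <= 0:
        return train
    if fraction > 1.0:
        raise ValueError("fraction must be between 0 and 1")
    keep = int(fraction * len(train))
    return train.iloc[:keep]


# Toy training frame with ten rows (made up for illustration).
toy_train = pd.DataFrame(
    {"text": [f"row {i}" for i in range(10)], "label_str": ["Yes", "No"] * 5}
)
subset = limit_train_fraction(toy_train, 0.5)
print(len(subset))
```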

Outputs

Dict[str, pd.DataFrame] with keys "train", "test", and "validation". Each DataFrame has a "text" column (serialized row text) and a "label_str" column (ground-truth label string).
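Downstream code can guard against a malformed result by checking this contract explicitly. The sketch below is a hypothetical validator (check_splits is not a FlexLLMGen function) and is run on a stand-in dict so it works without the library installed.

```python
from typing import Dict

import pandas as pd

EXPECTED_COLUMNS = {"text", "label_str"}


def check_splits(data: Dict[str, pd.DataFrame]) -> None:
    """Raise ValueError if any split lacks the 'text' or 'label_str' column."""
    for split, frame in data.items():
        missing = EXPECTED_COLUMNS - set(frame.columns)
        if missing:
            raise ValueError(f"split {split!r} is missing columns: {sorted(missing)}")


# Stand-in for the dict returned by read_data() (values made up for illustration).
example = {
    "train": pd.DataFrame({"text": ["title: iPhone 12. brand: Apple"], "label_str": ["Yes"]}),
    "test": pd.DataFrame({"text": ["title: Pixel 5. brand: Google"], "label_str": ["No"]}),
}
check_splits(example)  # passes silently when every split has both columns
```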

Usage Examples

from flexllmgen.apps.data_wrangle.utils.data_utils import read_data

# Load entity matching dataset
data = read_data(
    data_dir="fm_data_tasks/data/entity_matching/structured/Amazon-Google",
    class_balanced=True,
    sep_tok=".",
    nan_tok="nan"
)

train_df = data["train"]   # DataFrame with "text" and "label_str" columns
test_df = data["test"]

# Illustrative row shape only; the exact text depends on the task's serializer:
# text="title: iPhone 12. brand: Apple. price: $799", label_str="Yes"
print(train_df.iloc[0]["text"])
print(train_df.iloc[0]["label_str"])
