Implementation:Evidentlyai Evidently Legacy Data Loader

From Leeroopedia
Revision as of 12:28, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Evidentlyai_Evidently_Legacy_Data_Loader.md)
Knowledge Sources
Domains: ML Monitoring, Data Loading, Data Pipeline
Last Updated: 2026-02-14 12:00 GMT

Overview

DataLoader reads CSV files into pandas DataFrames for the Evidently legacy pipeline, with a configurable row-sampling strategy (none, nth-row, or random).

Description

This module defines a data loading subsystem with four main components and a few internal helper functions:

SamplingOptions (dataclass) -- configures how rows are sampled during loading:

  • type -- sampling strategy: "none" (load all), "nth" (every nth row), or "random" (random sampling). Default: "none".
  • random_seed -- seed for reproducible random sampling. Default: 1.
  • ratio -- probability ratio for random sampling (0.0 to 1.0). Default: 1.0.
  • n -- interval for nth-row sampling. Default: 1.

DataOptions (dataclass) -- configures CSV parsing:

  • date_column -- column name to parse as datetime. Default: "datetime".
  • separator -- CSV field separator. Default: ",".
  • header -- whether the CSV has a header row. Default: True.
  • column_names -- explicit column names, or None to infer from data.

DataLoader -- the main class with a single load method that:

  • Reads a CSV file using pd.read_csv
  • Applies a skiprows function based on the sampling options
  • Parses the date column if specified
  • Handles the header row based on DataOptions.header
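
The steps above can be pictured as a mapping from the option fields to pd.read_csv keyword arguments. The following is a minimal sketch under stated assumptions, not the library's code: DataOptionsSketch and read_csv_kwargs are hypothetical names, and the exact arguments the real load method passes are not confirmed by this page. Only sep, header, names, parse_dates, and skiprows are used, all of which are documented pd.read_csv parameters.

```python
import dataclasses
from typing import Any, Callable, Dict, List, Optional


@dataclasses.dataclass
class DataOptionsSketch:
    """Hypothetical mirror of DataOptions, using the defaults listed above."""

    date_column: Optional[str] = "datetime"
    separator: str = ","
    header: bool = True
    column_names: Optional[List[str]] = None


def read_csv_kwargs(
    options: DataOptionsSketch,
    skiprows: Optional[Callable[[int], bool]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments that could be passed to pd.read_csv (assumed mapping)."""
    return {
        "sep": options.separator,
        # Row 0 holds the column names when a header is present; None means no header row.
        "header": 0 if options.header else None,
        "names": options.column_names,
        # Parse the single date column, or disable date parsing when it is unset.
        "parse_dates": [options.date_column] if options.date_column else False,
        # A callable here is evaluated per row index; True means "skip this row".
        "skiprows": skiprows,
    }


# Headerless, tab-separated file with explicit column names and no date column.
tsv_kwargs = read_csv_kwargs(
    DataOptionsSketch(
        date_column=None,
        separator="\t",
        header=False,
        column_names=["col1", "col2", "col3"],
    )
)
```

With pandas installed, the result could be used as pd.read_csv(filename, **tsv_kwargs).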

RandomizedSkipRows -- an internal class that implements chunk-based random row selection. It generates random boolean arrays in chunks of CHUNK_SIZE (1000) rows for memory-efficient random sampling.
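
To illustrate the chunk-based idea, here is a rough sketch that draws keep/skip decisions one chunk at a time, so only CHUNK_SIZE booleans are held in memory at once. It is an illustration under assumptions, not the actual class: ChunkedRandomSkip is a hypothetical name, it uses Python's random module rather than whatever array library the real implementation relies on, it treats ratio as the probability of keeping a row, and it assumes row 0 is a header that is never skipped.

```python
import random

CHUNK_SIZE = 1000  # same chunk size as named in the text


class ChunkedRandomSkip:
    """Hypothetical stand-in for RandomizedSkipRows, built on the stdlib."""

    def __init__(self, ratio: float, random_seed: int):
        self._rng = random.Random(random_seed)  # seeded for reproducibility
        self._ratio = ratio
        self._chunk_start = 0
        self._keep = self._next_chunk()

    def _next_chunk(self):
        # One boolean per row in the chunk: True means "keep this row".
        return [self._rng.random() < self._ratio for _ in range(CHUNK_SIZE)]

    def skiprows(self, row_index: int) -> bool:
        # Assumption: row 0 is the header line and is never skipped.
        if row_index == 0:
            return False
        # Assumption: indices arrive in non-decreasing order, as when a
        # reader scans a file from top to bottom.
        while row_index >= self._chunk_start + CHUNK_SIZE:
            self._chunk_start += CHUNK_SIZE
            self._keep = self._next_chunk()
        return not self._keep[row_index - self._chunk_start]


sampler = ChunkedRandomSkip(ratio=0.5, random_seed=42)
kept_rows = [i for i in range(1, 21) if not sampler.skiprows(i)]
```

Because each row is decided independently, the share of rows kept is only approximately equal to ratio, while a fixed seed makes the selection repeatable across runs.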

Internal helper functions:

  • _skiprows(sampling_options) -- resolves the sampling type to a callable skip function or None
  • __simple(sampling_options) -- creates a skip function for nth-row sampling (keeps rows where row_idx % n == 1)
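
As a rough illustration of the nth-row rule, the sketch below builds a skip predicate of the kind pd.read_csv accepts for its skiprows argument (True means "skip this row index"). The names make_nth_skip and apply_skip are hypothetical. The condition (row_index - 1) % n == 0 matches row_idx % n == 1 for n >= 2 and keeps every row when n = 1; keeping row 0 as the header, and this handling of n = 1, are assumptions rather than confirmed library behavior.

```python
from typing import Callable, List


def make_nth_skip(n: int) -> Callable[[int], bool]:
    """Return a predicate that is True for row indices to skip."""
    if n < 1:
        raise ValueError("n must be a positive integer")

    def skip(row_index: int) -> bool:
        # Assumption: row 0 is the header and is always kept.
        if row_index == 0:
            return False
        # Keep data rows 1, n + 1, 2n + 1, ...; skip everything else.
        return (row_index - 1) % n != 0

    return skip


def apply_skip(lines: List[str], skip: Callable[[int], bool]) -> List[str]:
    """Mimic a reader that drops every line whose index the predicate rejects."""
    return [line for index, line in enumerate(lines) if not skip(index)]


lines = ["id"] + [str(i) for i in range(1, 11)]  # header plus data rows 1..10
sampled = apply_skip(lines, make_nth_skip(5))
```

Here sampled holds the header plus data rows 1 and 6.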

Usage

Use DataLoader when loading CSV data files for Evidently analysis, particularly when you need to sample large datasets for faster processing or development iteration.

Code Reference

Source Location

Signature

@dataclasses.dataclass
class SamplingOptions:
    type: str = "none"
    random_seed: int = 1
    ratio: float = 1.0
    n: int = 1

@dataclasses.dataclass
class DataOptions:
    date_column: str
    separator: str
    header: bool
    column_names: Optional[List[str]]

    def __init__(self, date_column="datetime", separator=",", header=True, column_names=None):
        ...

class DataLoader:
    def __init__(self): ...
    def load(
        self,
        filename: str,
        data_options: DataOptions,
        sampling_options: SamplingOptions = None,
    ) -> pd.DataFrame: ...

CHUNK_SIZE = 1000

class RandomizedSkipRows:
    def __init__(self, ratio: float, random_seed: int): ...
    def skiprows(self, row_index: int) -> bool: ...

Import

from evidently.legacy.runner.loader import DataLoader
from evidently.legacy.runner.loader import DataOptions
from evidently.legacy.runner.loader import SamplingOptions

I/O Contract

Inputs

Name             | Type            | Required | Description
filename         | str             | Yes      | Path to the CSV file to load.
data_options     | DataOptions     | Yes      | CSV parsing configuration (date column, separator, header, column names).
sampling_options | SamplingOptions | No       | Row sampling configuration. Defaults to no sampling.

Outputs

Name   | Type         | Description
return | pd.DataFrame | A pandas DataFrame containing the loaded and optionally sampled data.

Usage Examples

from evidently.legacy.runner.loader import DataLoader, DataOptions, SamplingOptions

loader = DataLoader()

# Load entire CSV with default options
data_options = DataOptions(date_column="datetime", separator=",")
df = loader.load("data/train.csv", data_options)

# Load with nth-row sampling (every 5th row)
sampling = SamplingOptions(type="nth", n=5)
df_sampled = loader.load("data/train.csv", data_options, sampling_options=sampling)

# Load with random sampling (roughly 50% of rows; the seed makes the selection reproducible)
sampling = SamplingOptions(type="random", ratio=0.5, random_seed=42)
df_random = loader.load("data/train.csv", data_options, sampling_options=sampling)

# Load a CSV without header and with custom separator
data_options = DataOptions(
    date_column=None,
    separator="\t",
    header=False,
    column_names=["col1", "col2", "col3"],
)
df = loader.load("data/raw.tsv", data_options)
